# StringClassifier

StringClassifier is a library to classify an unknown text against a set of known
texts. The classifier uses the [Levenshtein Distance] algorithm to determine
which of the known texts most closely matches the unknown text. The Levenshtein
Distance is normalized into a "confidence percentage" between 0.0 and 1.0, where
1.0 indicates an exact match and 0.0 indicates a complete mismatch.

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

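To make the idea of a normalized distance concrete, here is a minimal,
self-contained sketch. The names `levenshtein` and `confidence` are
hypothetical, and dividing by the length of the longer text is only one common
way to normalize; the formula the library actually uses may differ.

```go
package main

import "fmt"

// levenshtein returns the edit distance between s and t, counted in runes.
// It keeps only two rows of the dynamic-programming table.
func levenshtein(s, t string) int {
	a, b := []rune(s), []rune(t)
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := prev[j-1] + cost // substitution (or match)
			if v := prev[j] + 1; v < best {
				best = v // deletion
			}
			if v := cur[j-1] + 1; v < best {
				best = v // insertion
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// confidence maps the distance into [0.0, 1.0].
// ASSUMPTION: normalizing by the longer text's length is illustrative only.
func confidence(s, t string) float64 {
	longest := len([]rune(s))
	if n := len([]rune(t)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty texts are treated as an exact match
	}
	return 1.0 - float64(levenshtein(s, t))/float64(longest)
}

func main() {
	// prints: 3 0.57
	fmt.Printf("%d %.2f\n", levenshtein("kitten", "sitting"), confidence("kitten", "sitting"))
}
```

With this normalization, identical texts score 1.0 and texts that share no
characters at any position score 0.0.
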
## Types of matching

There are two kinds of matching algorithms the string classifier can perform:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes many inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.

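As a rough illustration of what such normalizing functions do, the sketch below
chains two of them over a pair of texts that differ only in case and line
wrapping. The `normalizeFunc` type and the `flattenWhitespace` and `normalize`
helpers are hypothetical stand-ins written for this example; they are not the
library's own implementations.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFunc is a hypothetical type for this sketch: any function that
// maps a string to its normalized form.
type normalizeFunc func(string) string

// flattenWhitespace collapses each run of whitespace (spaces, tabs, newlines)
// into a single space and trims the ends.
func flattenWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// normalize applies the given functions to s in order.
func normalize(s string, funcs ...normalizeFunc) string {
	for _, f := range funcs {
		s = f(s)
	}
	return s
}

func main() {
	known := "permission is hereby granted, free of charge"
	unknown := "Permission is hereby granted,\n    free of charge"

	k := normalize(known, flattenWhitespace, strings.ToLower)
	u := normalize(unknown, flattenWhitespace, strings.ToLower)

	// prints: true
	fmt.Println(k == u)
}
```

Because the same functions are applied to both sides, differences in case and
whitespace no longer count against the confidence.
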
### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
  m := sc.NearestMatch(unknown)
  log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}
```

### Multiple matching {#multiple}

Multiple matching identifies all of the known texts which may exist in the
unknown text. It can detect a known text inside an unknown text even if
extraneous text surrounds it. As with nearest matching, a confidence percentage
for each match is given.

Example:

```go
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
  log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

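One way to picture how a known text can be found inside a larger unknown text
is a brute-force sliding window: score every window of the known text's length
and keep the best offset if it clears a threshold. The sketch below does that.
It is an illustration under stated assumptions, not the library's algorithm:
the `match` and `knownText` types, the `multipleMatch` function, the threshold
parameter, and the use of rune offsets are all hypothetical.

```go
package main

import "fmt"

// match and knownText are hypothetical types for this sketch.
type match struct {
	Name       string
	Confidence float64
	Offset     int // rune offset into the unknown text (an assumption)
}

type knownText struct{ Name, Text string }

// editDistance returns the Levenshtein distance between two rune slices.
func editDistance(a, b []rune) int {
	prev := make([]int, len(b)+1)
	cur := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			best := prev[j-1] + cost
			if v := prev[j] + 1; v < best {
				best = v
			}
			if v := cur[j-1] + 1; v < best {
				best = v
			}
			cur[j] = best
		}
		prev, cur = cur, prev
	}
	return prev[len(b)]
}

// multipleMatch reports, for each known text, its best-scoring window in
// unknown, provided the score reaches threshold. Brute force: every offset
// is scored, which is far too slow for large inputs.
func multipleMatch(known []knownText, unknown string, threshold float64) []match {
	u := []rune(unknown)
	var out []match
	for _, k := range known {
		kr := []rune(k.Text)
		if len(kr) == 0 || len(kr) > len(u) {
			continue
		}
		best := match{Name: k.Name, Confidence: -1}
		for off := 0; off+len(kr) <= len(u); off++ {
			d := editDistance(kr, u[off:off+len(kr)])
			if c := 1 - float64(d)/float64(len(kr)); c > best.Confidence {
				best.Confidence, best.Offset = c, off
			}
		}
		if best.Confidence >= threshold {
			out = append(out, best)
		}
	}
	return out
}

func main() {
	known := []knownText{{"greeting", "hello world"}, {"farewell", "goodbye moon"}}
	// prints:
	//   greeting conf=1.00 offset=3
	//   farewell conf=1.00 offset=18
	for _, m := range multipleMatch(known, "xx hello world yy goodbye moon zz", 0.8) {
		fmt.Printf("%s conf=%.2f offset=%d\n", m.Name, m.Confidence, m.Offset)
	}
}
```

The surrounding `xx`, `yy`, and `zz` play the role of extraneous text: they do
not lower the score because only the best-aligned window is compared.
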
## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.