# StringClassifier

StringClassifier is a library to classify an unknown text against a set of known
texts. The classifier uses the [Levenshtein Distance] algorithm to determine
which of the known texts most closely matches the unknown text. The Levenshtein
Distance is normalized into a "confidence percentage" between 0 and 1, where 1.0
indicates an exact match and 0.0 indicates a complete mismatch.

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

## Types of matching

There are two kinds of matching algorithms the string classifier can perform:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes many inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.
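
To illustrate why normalization matters for the confidence value, here is a minimal, standalone sketch that does not use the library. The helper names (`flattenWhitespace`, `levenshtein`, `confidence`) are hypothetical, and dividing the distance by the longer text's length is only one plausible way to map it into the 0 to 1 range; the library's actual formula may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenWhitespace is a hypothetical normalizer: it collapses every run of
// whitespace into a single space and trims both ends.
func flattenWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// levenshtein returns the edit distance between a and b, counted in runes,
// using the classic two-row dynamic programming formulation.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j] + 1 // deletion
			if ins := curr[j-1] + 1; ins < best {
				best = ins // insertion
			}
			if sub := prev[j-1] + cost; sub < best {
				best = sub // substitution or match
			}
			curr[j] = best
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// confidence maps the distance into [0, 1] by dividing by the length of the
// longer text. This normalization is an assumption made for illustration.
func confidence(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty texts match exactly
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	known := "Permission is hereby granted"
	unknown := "permission   is\nhereby  granted"

	normalize := func(s string) string {
		return strings.ToLower(flattenWhitespace(s))
	}

	fmt.Printf("raw confidence:        %.2f\n", confidence(known, unknown))
	fmt.Printf("normalized confidence: %.2f\n",
		confidence(normalize(known), normalize(unknown)))
}
```

After normalization the two texts are identical, so the second confidence is 1.00, while the raw comparison is penalized for differences in case and spacing alone.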

### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
	m := sc.NearestMatch(unknown)
	log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}
```

### Multiple matching {#multiple}

Multiple matching identifies all of the known texts which may exist in the
unknown text. It can also detect a known text in an unknown text even if there's
extraneous text around the known text. As with nearest matching, a confidence
percentage for each match is given.

Example:

```go
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
	log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

## Disclaimer

This is not an official Google product (experimental or otherwise), it is just
code that happens to be owned by Google.