# StringClassifier

StringClassifier is a library to classify an unknown text against a set of known
texts. The classifier uses the [Levenshtein Distance] algorithm to determine
which of the known texts most closely matches the unknown text. The Levenshtein
Distance is normalized into a "confidence percentage" between 0 and 1, where 1.0
indicates an exact match and 0.0 indicates a complete mismatch.

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

## Types of matching

There are two kinds of matching algorithms the string classifier can perform:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes a lot of inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
// Normalizers are applied to both the known and the unknown texts.
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.

### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
// IdentifyText logs the known text that most closely matches unknown.
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
	m := sc.NearestMatch(unknown)
	log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}
```

### Multiple matching {#multiple}

Multiple matching identifies all of the known texts which may exist in the
unknown text. It can also detect a known text inside an unknown text even if the
known text is surrounded by extraneous text. As with nearest matching, a
confidence percentage for each match is given.
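
The confidence percentage reported for each match can be pictured as a
Levenshtein distance scaled by text length. The following is a minimal sketch of
that idea, not the library's actual code: the helper names `levenshtein` and
`confidence` are hypothetical, and dividing by the length of the longer text is
an assumed normalization that may differ from what the classifier does
internally.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, counted in runes.
// Hypothetical helper for illustration only.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// confidence maps the edit distance into [0, 1], where 1.0 is an exact match.
// Assumption: the distance is divided by the length of the longer text.
func confidence(known, unknown string) float64 {
	longest := len([]rune(known))
	if n := len([]rune(unknown)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty texts match exactly
	}
	return 1.0 - float64(levenshtein(known, unknown))/float64(longest)
}

func main() {
	fmt.Printf("distance:   %d\n", levenshtein("kitten", "sitting"))
	fmt.Printf("confidence: %.2f\n", confidence("kitten", "sitting"))
}
```

Because the edit distance never exceeds the length of the longer text, this
assumed formula always yields a value between 0.0 and 1.0.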

Example:

```go
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
	log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.