# StringClassifier

StringClassifier is a library to classify an unknown text against a set of known
texts. The classifier uses the [Levenshtein Distance] algorithm to determine
which of the known texts most closely matches the unknown text. The Levenshtein
Distance is normalized into a "confidence percentage" between 0.0 and 1.0, where
1.0 indicates an exact match and 0.0 indicates a complete mismatch.

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

## Types of matching

There are two kinds of matching algorithms the string classifier can perform:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes many inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.

### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
// IdentifyText logs the known text that most closely matches unknown.
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
  m := sc.NearestMatch(unknown)
  log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}
```

### Multiple matching {#multiple}

Multiple matching identifies all of the known texts which may exist in the
unknown text. It can also detect a known text in an unknown text even if there's
extraneous text around the known text. As with nearest matching, a confidence
percentage for each match is given.

Example:

```go
// sc, name, and unknown are the same values used in the nearest matching example.
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
  log.Printf("  %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.