# StringClassifier

StringClassifier is a library that classifies an unknown text against a set of
known texts. The classifier uses the [Levenshtein Distance] algorithm to
determine which of the known texts most closely matches the unknown text. The
Levenshtein Distance is normalized into a "confidence percentage" between 0.0
and 1.0, where 1.0 indicates an exact match and 0.0 indicates a complete
mismatch.

[Levenshtein Distance]: https://en.wikipedia.org/wiki/Levenshtein_distance

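To make the confidence value concrete, here is a minimal, self-contained sketch of one way to turn an edit distance into a score between 0.0 and 1.0. The exact normalization the library uses is not stated here, so dividing by the length of the longer text is an assumption. The names `levenshtein`, `confidence`, and `min3` are hypothetical and are not part of the library's API.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the number of single-rune insertions, deletions,
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// confidence maps the distance onto [0.0, 1.0] by dividing by the longer
// length. This normalization is an assumption, not taken from the library.
func confidence(a, b string) float64 {
	longest := len([]rune(a))
	if n := len([]rune(b)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 1.0 // two empty texts are an exact match
	}
	return 1.0 - float64(levenshtein(a, b))/float64(longest)
}

func main() {
	fmt.Printf("%.2f\n", confidence("kitten", "sitting")) // 3 edits over 7 runes: 0.57
	fmt.Printf("%.2f\n", confidence("abc", "abc"))        // 1.00
	fmt.Printf("%.2f\n", confidence("abc", "xyz"))        // 0.00
}
```

Under this assumed normalization, identical texts score 1.0 and texts that differ at every position score 0.0, which matches the endpoints described above.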
## Types of matching

There are two kinds of matching algorithms the string classifier can perform:

1. [Nearest matching](#nearest), and
2. [Multiple matching](#multiple).

### Normalization

To get the best match, normalizing functions can be applied to the texts. For
example, flattening whitespace removes a lot of inconsequential formatting
differences that would otherwise lower the matching confidence percentage.

```go
// Create a classifier that flattens whitespace and lowercases all texts.
sc := stringclassifier.New(stringclassifier.FlattenWhitespace, strings.ToLower)
```

The normalizing functions are run on all the known texts that are added to the
classifier. They're also run on the unknown text before classification.

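To illustrate why normalization raises confidence, the sketch below applies a chain of normalizing functions to two texts that differ only in whitespace and letter case. It is a stand-alone approximation: `normalizeFunc`, `flattenWhitespace`, and `normalize` are hypothetical names, and this `flattenWhitespace` (which also trims leading and trailing whitespace) is not the library's `FlattenWhitespace`.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFunc is a hypothetical name for a function that rewrites a text
// before comparison.
type normalizeFunc func(string) string

// flattenWhitespace collapses every run of whitespace into a single space
// and trims the ends. It is a stand-in, not the library's FlattenWhitespace.
func flattenWhitespace(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// normalize applies each function in order and returns the result.
func normalize(s string, fns ...normalizeFunc) string {
	for _, fn := range fns {
		s = fn(s)
	}
	return s
}

func main() {
	known := normalize("Permission is hereby\n    granted", flattenWhitespace, strings.ToLower)
	unknown := normalize("  PERMISSION is\thereby granted\n", flattenWhitespace, strings.ToLower)
	fmt.Printf("%q\n%q\n", known, unknown)
	fmt.Println(known == unknown) // true
}
```

Without normalization, the line breaks, tabs, and capital letters would each count as edits and lower the score even though the two texts say the same thing.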
### Nearest matching {#nearest}

A nearest match returns the name of the known text that most closely matches the
full unknown text. This is most useful when the unknown text doesn't have
extraneous text around it.

Example:

```go
// IdentifyText logs the known text that most closely matches unknown.
// name labels the unknown text in the log output.
func IdentifyText(sc *stringclassifier.Classifier, name, unknown string) {
	m := sc.NearestMatch(unknown)
	log.Printf("The nearest match to %q is %q (confidence: %v)", name, m.Name, m.Confidence)
}
```

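Conceptually, nearest matching scores the unknown text against every known text and keeps the highest score. The sketch below shows only that selection step and is not the library's implementation. The types `knownText` and `match` and the function `nearest` are hypothetical, and `sharedPrefixScore` is a deliberately simple stand-in for a Levenshtein-based confidence.

```go
package main

import "fmt"

// knownText and match are hypothetical types used only in this sketch.
type knownText struct{ Name, Text string }

type match struct {
	Name       string
	Confidence float64
}

// sharedPrefixScore is a stand-in score: the length of the common prefix
// divided by the longer length. A real classifier would use edit distance.
func sharedPrefixScore(a, b string) float64 {
	longest := len(a)
	if len(b) > longest {
		longest = len(b)
	}
	if longest == 0 {
		return 1.0
	}
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return float64(n) / float64(longest)
}

// nearest returns the known text with the highest score against unknown.
// The bool is false when there are no known texts to compare against.
func nearest(known []knownText, unknown string, score func(a, b string) float64) (match, bool) {
	var best match
	found := false
	for _, k := range known {
		c := score(k.Text, unknown)
		if !found || c > best.Confidence {
			best = match{Name: k.Name, Confidence: c}
			found = true
		}
	}
	return best, found
}

func main() {
	known := []knownText{
		{"mit", "permission is hereby granted"},
		{"apache", "licensed under the apache license"},
	}
	if m, ok := nearest(known, "permission is hereby granted!", sharedPrefixScore); ok {
		fmt.Printf("nearest: %q (confidence: %.2f)\n", m.Name, m.Confidence)
	}
}
```

Because every known text is compared against the whole unknown text, any extra material around the part of interest lowers all of the scores, which is why this mode suits inputs without extraneous text.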
### Multiple matching {#multiple}

Multiple matching identifies all of the known texts that may exist in the
unknown text. It can detect a known text inside the unknown text even if there's
extraneous text around it. As with nearest matching, a confidence percentage is
given for each match.

Example:

```go
// List every known text found in unknown, with its confidence and offset.
log.Printf("The text %q contains:", name)
for _, m := range sc.MultipleMatch(unknown, false) {
	log.Printf(" %q (conf: %v, offset: %v)", m.Name, m.Confidence, m.Offset)
}
```

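One naive way to picture multiple matching is to slide each known text across the unknown text and report the best-scoring window along with its offset. The sketch below does that and nothing more. It is not the library's algorithm, all names in it are hypothetical, and `equalByteRatio` is a simple stand-in for a Levenshtein-based confidence.

```go
package main

import "fmt"

// knownText and match are hypothetical types used only in this sketch.
type knownText struct{ Name, Text string }

type match struct {
	Name       string
	Confidence float64
	Offset     int
}

// equalByteRatio is a stand-in score: the fraction of positions at which two
// equal-length, non-empty strings hold the same byte.
func equalByteRatio(a, b string) float64 {
	same := 0
	for i := 0; i < len(a); i++ {
		if a[i] == b[i] {
			same++
		}
	}
	return float64(same) / float64(len(a))
}

// multipleMatch slides each known text across unknown and reports the
// best-scoring window per known text when it reaches threshold.
func multipleMatch(known []knownText, unknown string, threshold float64) []match {
	var out []match
	for _, k := range known {
		n := len(k.Text)
		if n == 0 || n > len(unknown) {
			continue // nothing to slide, or the window cannot fit
		}
		best := match{Name: k.Name, Confidence: -1}
		for off := 0; off+n <= len(unknown); off++ {
			if c := equalByteRatio(k.Text, unknown[off:off+n]); c > best.Confidence {
				best.Confidence, best.Offset = c, off
			}
		}
		if best.Confidence >= threshold {
			out = append(out, best)
		}
	}
	return out
}

func main() {
	known := []knownText{
		{"greeting", "hello world"},
		{"farewell", "goodbye world"},
	}
	for _, m := range multipleMatch(known, "xx hello world yy", 0.9) {
		fmt.Printf("%q (conf: %v, offset: %v)\n", m.Name, m.Confidence, m.Offset)
	}
}
```

Here only `greeting` is reported, at offset 3, because the surrounding `xx` and `yy` fall outside its best window, while no window scores high enough for `farewell`.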
## Disclaimer

This is not an official Google product (experimental or otherwise); it is just
code that happens to be owned by Google.