# License Classifier v2

This is a substantial revision of the license classifier with a focus on improved accuracy and performance.

## Glossary

- corpus dictionary - contains all the unique tokens stored in the corpus of
documents to match. Any tokens in the target document that aren't in the corpus
dictionary are mapped to an invalid value.

- document - an internal-only data type that contains sequenced token information
for source or target content during matching.

- source content - a body of text that can be matched by the scanner.

- target content - the argument to Match that is scanned for matches with source
content.

- indexed document - an internal-only data type that maps a document to the
corpus dictionary, resulting in a compressed representation suitable for fast
text searching and mapping operations. An indexed document is necessarily
tightly coupled to its corpus.

- frequency table - a lookup table holding, for each token, the number of times
it appears in content. Used for fast filtering of target content against
different source contents.

- q-gram - a substring of content of length q tokens used to efficiently match
ranges of text. For background on the q-gram algorithms used, please see
[Indexing Methods for Approximate String Matching](https://users.dcc.uchile.cl/~gnavarro/ps/deb01.pdf).

- searchset - a data structure that uses q-grams to identify ranges of text in
the target that correspond to a range of text in the source. The searchset
algorithms compensate for the allowable error in matching text exactly, dealing
with additional or missing tokens.
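
To make the glossary terms concrete, here is a minimal sketch of a corpus dictionary, an indexed document, a frequency table, and q-grams. It is not the classifier's implementation: the names `dictionary`, `invalidID`, `buildDictionary`, `index`, `frequencies`, and `qgrams` are hypothetical, and the tokenizer here simply lowercases and splits on whitespace.

```go
package main

import (
	"fmt"
	"strings"
)

// invalidID marks tokens that are absent from the corpus dictionary.
// The name and the value -1 are assumptions made for this sketch.
const invalidID = -1

// dictionary is a hypothetical corpus dictionary: unique token -> integer ID.
type dictionary map[string]int

// buildDictionary assigns an ID to every unique token in the source contents.
func buildDictionary(sources ...string) dictionary {
	d := dictionary{}
	for _, s := range sources {
		for _, tok := range strings.Fields(strings.ToLower(s)) {
			if _, ok := d[tok]; !ok {
				d[tok] = len(d)
			}
		}
	}
	return d
}

// index maps content to token IDs, giving a compact "indexed document".
// Tokens missing from the dictionary become invalidID.
func (d dictionary) index(content string) []int {
	var ids []int
	for _, tok := range strings.Fields(strings.ToLower(content)) {
		id, ok := d[tok]
		if !ok {
			id = invalidID
		}
		ids = append(ids, id)
	}
	return ids
}

// frequencies counts how often each token ID occurs in an indexed document.
func frequencies(ids []int) map[int]int {
	f := map[int]int{}
	for _, id := range ids {
		f[id]++
	}
	return f
}

// qgrams returns every run of q consecutive token IDs.
func qgrams(ids []int, q int) [][]int {
	if q <= 0 || len(ids) < q {
		return nil
	}
	out := make([][]int, 0, len(ids)-q+1)
	for i := 0; i+q <= len(ids); i++ {
		out = append(out, ids[i:i+q])
	}
	return out
}

func main() {
	dict := buildDictionary("permission is hereby granted free of charge")

	// "granted," keeps its comma under this naive tokenizer, so it is not in
	// the dictionary and is mapped to invalidID.
	target := dict.index("Permission is hereby granted, free of charge")

	fmt.Println("indexed target:", target)
	fmt.Println("frequency table:", frequencies(target))
	fmt.Println("3-grams:", qgrams(target, 3))
}
```

Because an indexed document only stores IDs, it is meaningful only alongside the dictionary that produced it, which is the coupling the glossary entry describes.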


## Migrating from v1

The API for the classifier versions is quite similar, but there are two key
distinctions to be aware of while migrating usages.

The confidence threshold for the v2 classifier is applied uniformly to results;
it will never return a match whose confidence is lower than the threshold. In
v1, MultipleMatch behaved this way, but NearestMatch would return a value
regardless of the match confidence. v1 callers often verified that the
confidence was above the threshold themselves; this check is no longer necessary.
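
The uniform threshold can be pictured with a small sketch. The `match` type and `applyThreshold` function below are hypothetical stand-ins, not the classifier's API; they only illustrate the kind of filtering that v1 callers performed by hand after NearestMatch and that v2 now applies to every result.

```go
package main

import "fmt"

// match is a hypothetical stand-in for a classifier result; it is not the
// library's Match struct.
type match struct {
	Name       string
	Confidence float64
}

// applyThreshold keeps only candidates whose confidence is at least the
// threshold, so nothing below the threshold is ever returned.
func applyThreshold(candidates []match, threshold float64) []match {
	var kept []match
	for _, m := range candidates {
		if m.Confidence >= threshold {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	// Illustrative values only; they are not measured results.
	candidates := []match{
		{Name: "MIT", Confidence: 0.97},
		{Name: "ISC", Confidence: 0.42},
	}
	for _, m := range applyThreshold(candidates, 0.8) {
		fmt.Printf("%s %.2f\n", m.Name, m.Confidence)
	}
}
```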

The second change is that the classifier now returns all matches against the
supplied corpus. The v1 classifier allowed filtering on header matches via a
boolean field. This can be emulated by creating a license classifier with a
reduced corpus if matching against headers is not desired. Alternatively, the
user can use the MatchType field in the Match struct to filter out unwanted
matches.
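
Filtering on MatchType might look like the sketch below. The `result` type is a hypothetical stand-in that carries only the fields needed here, and the MatchType values `"License"` and `"Header"` are assumptions; check the values your corpus actually produces before relying on them.

```go
package main

import "fmt"

// result is a hypothetical stand-in for the Match struct, reduced to the
// fields this sketch needs.
type result struct {
	Name      string
	MatchType string
}

// dropMatchType returns the results whose MatchType differs from unwanted.
func dropMatchType(results []result, unwanted string) []result {
	var kept []result
	for _, r := range results {
		if r.MatchType != unwanted {
			kept = append(kept, r)
		}
	}
	return kept
}

func main() {
	// The MatchType strings here are assumed values for illustration.
	results := []result{
		{Name: "Apache-2.0", MatchType: "License"},
		{Name: "Apache-2.0", MatchType: "Header"},
	}
	for _, r := range dropMatchType(results, "Header") {
		fmt.Println(r.Name, r.MatchType)
	}
}
```

Filtering after the fact keeps a single classifier and corpus in play; building a reduced corpus instead avoids doing the header matching work at all.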