.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has four columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, the token domain, and the string
literal. The domain and string are quoted, and quote characters within the
domain or string are represented as two quote characters.

This example database contains seven strings, three of which have removal
dates.

.. code-block::

   141c35d5,          ,"","The answer: ""%s"""
   2e668cd6,2019-12-25,"","Jello, world!"
   7a22c974,          ,"metrics","%f"
   7b940e2a,          ,"","Hello %s! %hd %e"
   851beeb6,          ,"","%u %d"
   881436a0,2020-01-01,"","The answer is: %s"
   e13b0f94,2020-04-01,"metrics","%llu"

Legacy CSV databases did not include the domain, so they had only three columns.
These databases are still supported, but their tokens are always placed in the
default domain (``""``).
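
As a rough illustration of the layout above, the following Python sketch parses
a CSV database with the standard ``csv`` module. The ``Entry`` type and the
``parse_csv_database`` function are hypothetical names invented for this
example, not part of pw_tokenizer. The sketch assumes that an absent removal
date is written as a blank or space-padded field, as in the example database.

.. code-block:: python

   import csv
   import io
   from datetime import date
   from typing import Iterator, NamedTuple, Optional


   class Entry(NamedTuple):
       """One row of a CSV token database (hypothetical helper type)."""

       token: int
       date_removed: Optional[date]
       domain: str
       string: str


   def parse_csv_database(text: str) -> Iterator[Entry]:
       """Yields entries from four-column or legacy three-column CSV text."""
       for row in csv.reader(io.StringIO(text)):
           if not row:
               continue  # Skip blank lines.
           if len(row) == 3:
               # Legacy format: no domain column, so use the default domain.
               token, removed, string = row
               domain = ''
           else:
               token, removed, domain, string = row
           removed = removed.strip()
           yield Entry(
               int(token, 16),
               date.fromisoformat(removed) if removed else None,
               domain,
               string,
           )


   EXAMPLE = '\n'.join(
       [
           '141c35d5,          ,"","The answer: ""%s"""',
           '2e668cd6,2019-12-25,"","Jello, world!"',
           'e13b0f94,2020-04-01,"metrics","%llu"',
       ]
   )

   for entry in parse_csv_database(EXAMPLE):
       print(f'{entry.token:08x} {entry.date_removed} {entry.string!r}')

The ``csv`` module already treats a doubled quote inside a quoted field as a
single quote character, so no extra unescaping is needed.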

Binary database format
======================
The binary database format consists of a 16-byte header followed by a series
of 8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next in the same
order as the entries. Strings are stored with null terminators. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

The binary form of the example database is shown below. The dump holds six
entries (it omits the ``7a22c974`` entry and does not record domains), in a
more compact and easily processed form. It takes 141 B, compared with 211 B for
the same six entries in the legacy three-column CSV format.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
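
The layout in the dump can be read back with Python's ``struct`` module. The
sketch below is an illustration, not the pw_tokenizer implementation:
``parse_binary_database`` is a hypothetical name, and the little-endian field
order and the date packing (year in the upper 16 bits, then month, then day)
are assumptions inferred from the hex dump. ``token_database.h`` remains the
authoritative reference.

.. code-block:: python

   import struct
   from datetime import date
   from typing import List, Optional, Tuple

   MAGIC = b'TOKENS\x00\x00'
   NO_DATE = 0xFFFFFFFF


   def parse_binary_database(
       data: bytes,
   ) -> List[Tuple[int, Optional[date], str]]:
       """Returns (token, removal date, string) tuples from database bytes."""
       # 16-byte header: 8-byte magic, 4-byte entry count, 4 reserved bytes.
       magic, count, _reserved = struct.unpack_from('<8sII', data, 0)
       if magic != MAGIC:
           raise ValueError('not a binary token database')

       # Null-terminated strings follow the 8-byte entries, in entry order.
       strings = data[16 + 8 * count:].split(b'\x00')

       entries = []
       for i in range(count):
           token, raw = struct.unpack_from('<II', data, 16 + 8 * i)
           removed = None
           if raw != NO_DATE:
               removed = date(raw >> 16, (raw >> 8) & 0xFF, raw & 0xFF)
           entries.append((token, removed, strings[i].decode()))
       return entries


   # Build a two-entry database by hand, mirroring the first rows of the dump.
   example = (
       struct.pack('<8sII', MAGIC, 2, 0)
       + struct.pack('<II', 0x141C35D5, NO_DATE)
       + struct.pack('<II', 0x2E668CD6, 0x07E30C19)
       + b'The answer: "%s"\x00Jello, world!\x00'
   )

   for token, removed, string in parse_binary_database(example):
       print(f'{token:08x} {removed} {string!r}')

Because entries have a fixed size and strings are stored in entry order, a
reader can walk both tables with a single pass and no index structure.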

.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database
will be searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all
of which will be used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv
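
The recursive search can be approximated with ``pathlib``. The sketch below is
illustrative only; ``find_database_csvs`` is a hypothetical helper, not a
pw_tokenizer API, and the directory tree is a throwaway created for the
demonstration.

.. code-block:: python

   import tempfile
   from pathlib import Path
   from typing import List

   SUFFIX = '.pw_tokenizer.csv'


   def find_database_csvs(directory: Path) -> List[Path]:
       """Recursively collects the CSV files of a directory database."""
       return sorted(p for p in directory.rglob('*' + SUFFIX) if p.is_file())


   # Demonstrate with a temporary directory tree.
   with tempfile.TemporaryDirectory() as tmp:
       root = Path(tmp)
       (root / 'nested').mkdir()
       (root / 'database.pw_tokenizer.csv').touch()
       (root / 'nested' / 'extra.pw_tokenizer.csv').touch()
       (root / 'notes.csv').touch()  # Ignored: wrong suffix.

       found = [p.relative_to(root).as_posix() for p in find_database_csvs(root)]
       print(found)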

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` command in the database CLI consolidates the files into a single CSV.

The database command line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens. It identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards the tokens in that CSV that are
not among the newly added tokens. This helps keep temporary tokens (e.g., from
debug logs) out of the database.
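
The discard rule can be pictured as a filter over the latest local CSV. The
following sketch is not the tool's implementation; it models each CSV as a
dictionary from token to string, ``discard_temporary`` is a hypothetical name,
and the token ``0x0badf00d`` is a made-up value standing in for a temporary
debug log.

.. code-block:: python

   from typing import Dict


   def discard_temporary(
       latest_local_csv: Dict[int, str], newly_added: Dict[int, str]
   ) -> Dict[int, str]:
       """Keeps only entries of the latest local CSV that are still added.

       latest_local_csv models the latest CSV that is absent from the upstream
       commit; newly_added models the tokens found by the current add run.
       """
       return {
           token: string
           for token, string in latest_local_csv.items()
           if token in newly_added
       }


   latest = {0x7B940E2A: 'Hello %s! %hd %e', 0x0BADF00D: 'temporary debug log'}
   added = {0x7B940E2A: 'Hello %s! %hd %e', 0x851BEEB6: '%u %d'}

   kept = discard_temporary(latest, added)
   print([f'{token:08x}' for token in kept])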

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.
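
A JSON input of this kind is simply a flat array of strings. As a minimal
sketch, the Python below writes and re-reads such a file; the file name
``strings.json`` and the example strings are placeholders chosen for
illustration.

.. code-block:: python

   import json
   import tempfile
   from pathlib import Path

   # Strings that are not embedded in any compiled binary (illustrative values).
   strings = ['Battery low: %d%%', 'Sensor %s offline']

   with tempfile.TemporaryDirectory() as tmp:
       json_path = Path(tmp) / 'strings.json'
       json_path.write_text(json.dumps(strings, indent=2))

       # The file holds a flat JSON array of strings.
       loaded = json.loads(json_path.read_text())
       print(loaded)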

.. _module-pw_tokenizer-managing-token-databases:

------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control or for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently supports only binary databases.

.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens already present in the database updates the date removed, if any,
to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` or ``optional_paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps`` or build other
   GN targets first if this is a concern.

CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source
tokenized strings database or creates a new database with artifacts from a
CMake target.

To create a new database, set the ``CREATE`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory.

.. code-block::

   include("$dir_pw_tokenizer/database.cmake")

   pw_tokenizer_database("my_database") {
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   }

To update an existing database, provide the path to the database with
the ``DATABASE`` variable.

.. code-block::

   pw_tokenizer_database("my_database") {
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   }

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two things:

- whether the tokenized data matches the string's arguments (if any), and
- whether and when the string was marked as having been removed from the
  database.
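
One way to picture these two criteria is as a sort key. The sketch below is not
the detokenizer's actual algorithm; the ``Candidate`` type, its fields, and the
exact ordering are assumptions made for illustration.

.. code-block:: python

   from dataclasses import dataclass
   from datetime import date
   from typing import List, Optional


   @dataclass
   class Candidate:
       """A database string that shares its token with others (hypothetical)."""

       string: str
       args_decoded_ok: bool  # Did the data match the string's arguments?
       date_removed: Optional[date]  # None if the string was never removed.


   def rank(candidates: List[Candidate]) -> List[Candidate]:
       """Orders candidates from most to least likely match."""

       def key(c: Candidate):
           return (
               c.args_decoded_ok,  # Successful decodes first.
               c.date_removed is None,  # Then strings never removed.
               c.date_removed or date.min,  # Then the most recently removed.
           )

       return sorted(candidates, key=key, reverse=True)


   candidates = [
       Candidate('Old message %d', True, date(2020, 1, 1)),
       Candidate('Current message %d', True, None),
       Candidate('Unrelated %s %s', False, None),
   ]
   best = rank(candidates)[0]
   print(best.string)

In this model, a string that is still present and whose arguments decode
cleanly outranks both a removed string and one whose arguments do not fit.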

Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.
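
Conceptually, ``mark_removed`` stamps a removal date on strings that no longer
appear in the build, and ``purge`` drops the stamped entries. The sketch below
models that idea with a plain dictionary. It is an illustration under stated
assumptions (the function bodies and the data model are hypothetical), not a
description of how ``database.py`` is implemented.

.. code-block:: python

   from datetime import date
   from typing import Dict, Optional, Set

   # Maps each string to its removal date (None while the string is in use).
   Database = Dict[str, Optional[date]]


   def mark_removed(db: Database, current: Set[str], today: date) -> None:
       """Stamps strings missing from the latest build with a removal date."""
       for string in list(db):
           if string not in current and db[string] is None:
               db[string] = today


   def purge(db: Database) -> Database:
       """Returns a copy of the database without entries marked as removed."""
       return {s: removed for s, removed in db.items() if removed is None}


   db: Database = {'Hello %s! %hd %e': None, 'Jello, world!': None}
   mark_removed(db, current={'Hello %s! %hd %e'}, today=date(2020, 4, 1))
   print(db['Jello, world!'])
   print(sorted(purge(db)))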

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high
(this is known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       |         50%        |          1%      |
+=======+====================+==================+
|   32  |       77000        |        9300      |
+-------+--------------------+------------------+
|   31  |       54000        |        6600      |
+-------+--------------------+------------------+
|   24  |        4800        |         580      |
+-------+--------------------+------------------+
|   16  |         300        |          36      |
+-------+--------------------+------------------+
|    8  |          19        |           3      |
+-------+--------------------+------------------+
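
Figures of this kind can be approximated with the standard birthday-problem
estimate for a uniform random hash. The Python sketch below applies that
estimate; ``strings_for_probability`` is a hypothetical name, and the results
land close to the table's rounded entries rather than reproducing them exactly.

.. code-block:: python

   import math


   def strings_for_probability(token_bits: int, probability: float) -> float:
       """Approximate string count for a given chance of any collision.

       Uses the birthday-problem approximation for a uniform random hash:
       n ~= sqrt(2 * N * ln(1 / (1 - p))), where N = 2**token_bits.
       """
       space = 2**token_bits
       return math.sqrt(2 * space * math.log(1 / (1 - probability)))


   for bits in (32, 31, 24, 16, 8):
       half = strings_for_probability(bits, 0.50)
       one_percent = strings_for_probability(bits, 0.01)
       print(f'{bits:2d} bits: 50% at ~{half:,.0f}, 1% at ~{one_percent:,.0f}')

The count grows with the square root of the token space, which is why each bit
removed from the token shrinks the safe number of strings by a factor of about
1.4.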

Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.