.. _module-pw_tokenizer-token-databases:

===============
Token databases
===============
.. pigweed-module-subpage::
   :name: pw_tokenizer

Token databases store a mapping of tokens to the strings they represent. An ELF
file can be used as a token database, but it only contains the strings for its
exact build. A token database file aggregates tokens from multiple ELF files, so
that a single database can decode tokenized strings from any known ELF.

Token databases contain the token, removal date (if any), and string for each
tokenized string.

----------------------
Token database formats
----------------------
Three token database formats are supported: CSV, binary, and directory. Tokens
may also be read from ELF files or ``.a`` archives, but cannot be written to
these formats.

CSV database format
===================
The CSV database format has four columns: the token in hexadecimal, the removal
date (if any) in year-month-day format, the token domain, and the string
literal. The domain and string are quoted, and quote characters within the
domain or string are represented as two quote characters.

This example database contains seven strings, three of which have removal dates.

.. code-block:: text

   141c35d5,          ,"","The answer: ""%s"""
   2e668cd6,2019-12-25,"","Jello, world!"
   7a22c974,          ,"metrics","%f"
   7b940e2a,          ,"","Hello %s! %hd %e"
   851beeb6,          ,"","%u %d"
   881436a0,2020-01-01,"","The answer is: %s"
   e13b0f94,2020-04-01,"metrics","%llu"

Legacy CSV databases did not include the domain, and so had only three columns.
These databases are still supported, but their tokens are always placed in the
default domain (``""``).
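As a quick illustration of the format, the four-column rows can be parsed with
Python's standard ``csv`` module, which also undoes the doubled quote
characters. The helper below is a hypothetical sketch for illustration only;
real databases should be read with the ``pw_tokenizer`` Python package.

```python
import csv
import io

# A few rows copied from the example database above.
CSV_DATABASE = '''\
141c35d5,          ,"","The answer: ""%s"""
2e668cd6,2019-12-25,"","Jello, world!"
7a22c974,          ,"metrics","%f"
'''

def parse_rows(text):
    """Yields (token, removal_date, domain, string) for each CSV row."""
    for token, date, domain, string in csv.reader(io.StringIO(text)):
        # Tokens are hexadecimal; an unset removal date is blank padding.
        yield int(token, 16), date.strip() or None, domain, string

for entry in parse_rows(CSV_DATABASE):
    print(entry)
```

Note how the doubled quotes round-trip: the first row's string comes back as
``The answer: "%s"``.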

Binary database format
======================
The binary database format consists of a 16-byte header followed by a series of
8-byte entries. Each entry stores the token and the removal date, which is
0xFFFFFFFF if there is none. The string literals are stored next, in the same
order as the entries, each with a null terminator. See
`token_database.h <https://pigweed.googlesource.com/pigweed/pigweed/+/HEAD/pw_tokenizer/public/pw_tokenizer/token_database.h>`_
for full details.

A binary form of the CSV database is shown below. It stores the tokens, removal
dates, and strings in a more compact and more easily processed form, taking
141 B compared with the CSV database's 211 B.

.. code-block:: text

   [header]
   0x00: 454b4f54 0000534e  TOKENS..
   0x08: 00000006 00000000  ........

   [entries]
   0x10: 141c35d5 ffffffff  .5......
   0x18: 2e668cd6 07e30c19  ..f.....
   0x20: 7b940e2a ffffffff  *..{....
   0x28: 851beeb6 ffffffff  ........
   0x30: 881436a0 07e40101  .6......
   0x38: e13b0f94 07e40401  ..;.....

   [string table]
   0x40: 54 68 65 20 61 6e 73 77 65 72 3a 20 22 25 73 22  The answer: "%s"
   0x50: 00 4a 65 6c 6c 6f 2c 20 77 6f 72 6c 64 21 00 48  .Jello, world!.H
   0x60: 65 6c 6c 6f 20 25 73 21 20 25 68 64 20 25 65 00  ello %s! %hd %e.
   0x70: 25 75 20 25 64 00 54 68 65 20 61 6e 73 77 65 72  %u %d.The answer
   0x80: 20 69 73 3a 20 25 73 00 25 6c 6c 75 00            is: %s.%llu.
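The layout above can be exercised with Python's standard ``struct`` module. The
sketch below builds a two-entry database in memory and parses it back. It
follows the layout described in this section (magic string, entry count,
8-byte entries, null-terminated strings); the meaning of the remaining header
bytes is an assumption here, and ``token_database.h`` remains the authoritative
definition.

```python
import struct

# Two entries: (token, removal date encoded as 0xYYYYMMDD, string).
# 0xFFFFFFFF means "no removal date"; 0x07E30C19 is 2019-12-25.
ENTRIES = [
    (0x141C35D5, 0xFFFFFFFF, b'The answer: "%s"'),
    (0x2E668CD6, 0x07E30C19, b"Jello, world!"),
]

# 16-byte header: magic "TOKENS", padding, entry count, reserved word.
blob = struct.pack("<6sHII", b"TOKENS", 0, len(ENTRIES), 0)
blob += b"".join(struct.pack("<II", tok, date) for tok, date, _ in ENTRIES)
blob += b"".join(string + b"\0" for _, _, string in ENTRIES)

# Parse it back: header, then 8-byte entries, then null-terminated strings.
magic, _, count, _ = struct.unpack_from("<6sHII", blob, 0)
strings = blob[16 + 8 * count:].split(b"\0")[:count]
parsed = [
    struct.unpack_from("<II", blob, 16 + 8 * i) + (strings[i],)
    for i in range(count)
]
print(parsed)
```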

.. _module-pw_tokenizer-directory-database-format:

Directory database format
=========================
pw_tokenizer can consume directories of CSV databases. A directory database
is searched recursively for files with a ``.pw_tokenizer.csv`` suffix, all of
which are used for subsequent detokenization lookups.

An example directory database might look something like this:

.. code-block:: text

   directory_token_database
   ├── database.pw_tokenizer.csv
   ├── 9a8906c30d7c4abaa788de5634d2fa25.pw_tokenizer.csv
   └── b9aff81a03ad4d8a82a250a737285454.pw_tokenizer.csv

This format is optimized for storage in a Git repository alongside source code.
The token database commands randomly generate unique file names for the CSVs in
the database to prevent merge conflicts. Running the ``mark_removed`` or
``purge`` commands in the database CLI consolidates the files into a single CSV.
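The recursive lookup described above amounts to a glob over the directory. A
minimal sketch (the real loader lives in the ``pw_tokenizer`` Python package):

```python
import tempfile
from pathlib import Path

def find_database_csvs(root):
    """Recursively finds the CSV fragments of a directory database."""
    return sorted(Path(root).rglob("*.pw_tokenizer.csv"))

# Demonstrate on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "database.pw_tokenizer.csv").touch()
    (Path(tmp) / "nested" / "extra.pw_tokenizer.csv").touch()
    (Path(tmp) / "notes.txt").touch()  # Ignored: wrong suffix.
    names = [path.name for path in find_database_csvs(tmp)]
    print(names)
```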

The database command-line tool supports a ``--discard-temporary
<upstream_commit>`` option for ``add``. In this mode, the tool attempts to
discard temporary tokens: it identifies the latest CSV not present in the
provided ``<upstream_commit>``, and discards any tokens present in that CSV that
are not among the newly added tokens. This helps keep temporary tokens (e.g.,
from debug logs) out of the database.

JSON support
============
While pw_tokenizer doesn't specify a JSON database format, a token database can
be created from a JSON-formatted array of strings. This is useful for side-band
token database generation for strings that are not embedded as parsable tokens
in compiled binaries. See :ref:`module-pw_tokenizer-database-creation` for
instructions on generating a token database from a JSON file.

.. _module-pw_tokenizer-managing-token-databases:

------------------------
Managing token databases
------------------------
Token databases are managed with the ``database.py`` script. This script can be
used to extract tokens from compilation artifacts and manage database files.
Invoke ``database.py`` with ``-h`` for full usage information.

An example ELF file with tokenized logs is provided at
``pw_tokenizer/py/example_binary_with_tokenized_strings.elf``. You can use that
file to experiment with the ``database.py`` commands.

.. _module-pw_tokenizer-database-creation:

Create a database
=================
The ``create`` command makes a new token database from ELF files (.elf, .o, .so,
etc.), archives (.a), existing token databases (CSV or binary), or a JSON file
containing an array of strings.

.. code-block:: sh

   ./database.py create --database DATABASE_NAME ELF_OR_DATABASE_FILE...

Two database output formats are supported: CSV and binary. Provide
``--type binary`` to ``create`` to generate a binary database instead of the
default CSV. CSV databases are great for checking into source control and for
human review. Binary databases are more compact and simpler to parse. The C++
detokenizer library currently supports only binary databases.

.. _module-pw_tokenizer-update-token-database:

Update a database
=================
As new tokenized strings are added, update the database with the ``add``
command.

.. code-block:: sh

   ./database.py add --database DATABASE_NAME ELF_OR_DATABASE_FILE...

This command adds new tokens from ELF files or other databases to the database.
Adding tokens that are already present in the database updates their removal
dates, if any, to the latest.

A CSV token database can be checked into a source repository and updated as code
changes are made. The build system can invoke ``database.py`` to update the
database after each build.

GN integration
==============
Token databases may be updated or created as part of a GN build. The
``pw_tokenizer_database`` template provided by
``$dir_pw_tokenizer/database.gni`` automatically updates an in-source tokenized
strings database or creates a new database with artifacts from one or more GN
targets or other database files.

To create a new database, set the ``create`` variable to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory. To update an existing database, provide the path to the database with
the ``database`` variable.

.. code-block::

   import("//build_overrides/pigweed.gni")

   import("$dir_pw_tokenizer/database.gni")

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     targets = [ "//firmware/image:foo(//targets/my_board:some_toolchain)" ]
     input_databases = [ "other_database.csv" ]
   }

Instead of specifying GN targets, paths or globs to output files may be provided
with the ``paths`` option.

.. code-block::

   pw_tokenizer_database("my_database") {
     database = "database_in_the_source_tree.csv"
     deps = [ ":apps" ]
     optional_paths = [ "$root_build_dir/**/*.elf" ]
   }

.. note::

   The ``paths`` and ``optional_targets`` arguments do not add anything to
   ``deps``, so there is no guarantee that the referenced artifacts will exist
   when the database is updated. Provide ``targets`` or ``deps``, or build
   other GN targets first, if this is a concern.

CMake integration
=================
Token databases may be updated or created as part of a CMake build. The
``pw_tokenizer_database`` function provided by
``$dir_pw_tokenizer/database.cmake`` automatically updates an in-source
tokenized strings database or creates a new database with artifacts from a
CMake target.

To create a new database, set the ``CREATE`` argument to the desired database
type (``"csv"`` or ``"binary"``). The database will be created in the output
directory.

.. code-block::

   include("${dir_pw_tokenizer}/database.cmake")

   pw_tokenizer_database(my_database
     CREATE binary
     TARGET my_target.ext
     DEPS ${deps_list}
   )

To update an existing database, provide the path to the database with the
``DATABASE`` argument.

.. code-block::

   pw_tokenizer_database(my_database
     DATABASE database_in_the_source_tree.csv
     TARGET my_target.ext
     DEPS ${deps_list}
   )

.. _module-pw_tokenizer-collisions:

----------------
Token collisions
----------------
Tokens are calculated with a hash function. It is possible for different
strings to hash to the same token. When this happens, multiple strings will have
the same token in the database, and it may not be possible to unambiguously
decode a token.
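For intuition, pw_tokenizer's default token is a 32-bit hash computed with a
65599-coefficient polynomial over the string's bytes, seeded with the string's
length. The sketch below is a simplified rendering of that scheme, not the
authoritative implementation (see ``pw_tokenizer/py/pw_tokenizer/tokens.py``);
the optional truncation parameter models the C-only hash-length behavior
discussed under "Resolving collisions" below.

```python
def hash_65599(string, hash_length=None):
    """Simplified 65599-coefficient string hash, truncated to 32 bits.

    Illustrative only; use the pw_tokenizer Python package for real tokens.
    """
    data = string.encode()
    hash_value = len(data)  # The full length is hashed even when truncating.
    coefficient = 65599
    for byte in data[:hash_length]:
        hash_value = (hash_value + coefficient * byte) % 2**32
        coefficient = (coefficient * 65599) % 2**32
    return hash_value

# When only a prefix is hashed (as in C), equal-length strings sharing that
# prefix collide artificially even though their full hashes differ.
print(hash_65599("Hello, world!") == hash_65599("Hello, earth!"))        # False
print(hash_65599("Hello, world!", 7) == hash_65599("Hello, earth!", 7))  # True
```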

The detokenization tools attempt to resolve collisions automatically. Collisions
are resolved based on two factors:

- whether the tokenized data matches the string's arguments (if any), and
- if and when the string was marked as having been removed from the database.

Resolving collisions
====================
Collisions may occur occasionally. Run the command
``python -m pw_tokenizer.database report <database>`` to see information about a
token database, including any collisions.

If there are collisions, take the following steps to resolve them.

- Change one of the colliding strings slightly to give it a new token.
- In C (not C++), artificial collisions may occur if strings longer than
  ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` are hashed. If this is happening, consider
  setting ``PW_TOKENIZER_CFG_C_HASH_LENGTH`` to a larger value. See
  ``pw_tokenizer/public/pw_tokenizer/config.h``.
- Run the ``mark_removed`` command with the latest version of the build
  artifacts to mark missing strings as removed. This deprioritizes them in
  collision resolution.

  .. code-block:: sh

     python -m pw_tokenizer.database mark_removed --database <database> <ELF files>

  The ``purge`` command may be used to delete these tokens from the database.

Probability of collisions
=========================
Hashes of any size have a collision risk. The probability of at least one
collision occurring for a given number of strings is unintuitively high (this is
known as the `birthday problem
<https://en.wikipedia.org/wiki/Birthday_problem>`_). If fewer than 32 bits are
used for tokens, the probability of collisions increases substantially.

This table shows the approximate number of strings that can be hashed to have a
1% or 50% probability of at least one collision (assuming a uniform, random
hash).

+-------+---------------------------------------+
| Token | Collision probability by string count |
| bits  +--------------------+------------------+
|       |         50%        |          1%      |
+=======+====================+==================+
|   32  |       77000        |        9300      |
+-------+--------------------+------------------+
|   31  |       54000        |        6600      |
+-------+--------------------+------------------+
|   24  |        4800        |         580      |
+-------+--------------------+------------------+
|   16  |         300        |          36      |
+-------+--------------------+------------------+
|    8  |          19        |           3      |
+-------+--------------------+------------------+

Keep this table in mind when masking tokens (see
:ref:`module-pw_tokenizer-masks`). 16 bits might be acceptable when
tokenizing a small set of strings, such as module names, but won't be suitable
for large sets of strings, like log messages.
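The table's values follow from the usual birthday-bound approximation: with an
``N``-value token space, the probability of at least one collision among ``n``
strings is roughly ``1 - exp(-n^2 / (2N))``, so the string count for a target
probability ``p`` is about ``sqrt(2 * N * ln(1 / (1 - p)))``. A quick sketch
reproducing the table (the approximation is loose for very small counts, such
as the 8-bit column):

```python
import math

def strings_for_probability(token_bits, probability):
    """Approximate string count that yields the given collision probability."""
    space = 2 ** token_bits
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))

for bits in (32, 31, 24, 16, 8):
    fifty = strings_for_probability(bits, 0.50)
    one = strings_for_probability(bits, 0.01)
    print(f"{bits:2} bits: 50% at ~{fifty:,.0f} strings, 1% at ~{one:,.0f}")
```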
314