<html>
<head>
<title>pcre2serialize specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2serialize man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONCERNS</a>
<li><a name="TOC3" href="#SEC3">SAVING COMPILED PATTERNS</a>
<li><a name="TOC4" href="#SEC4">RE-USING PRECOMPILED PATTERNS</a>
<li><a name="TOC5" href="#SEC5">AUTHOR</a>
<li><a name="TOC6" href="#SEC6">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS</a><br>
<P>
<b>int32_t pcre2_serialize_decode(pcre2_code **<i>codes</i>,</b>
<b>  int32_t <i>number_of_codes</i>, const uint8_t *<i>bytes</i>,</b>
<b>  pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_encode(const pcre2_code **<i>codes</i>,</b>
<b>  int32_t <i>number_of_codes</i>, uint8_t **<i>serialized_bytes</i>,</b>
<b>  PCRE2_SIZE *<i>serialized_size</i>, pcre2_general_context *<i>gcontext</i>);</b>
<br>
<br>
<b>void pcre2_serialize_free(uint8_t *<i>bytes</i>);</b>
<br>
<br>
<b>int32_t pcre2_serialize_get_number_of_codes(const uint8_t *<i>bytes</i>);</b>
<br>
<br>
If you are running an application that uses a large number of regular
expression patterns, it may be useful to store them in a precompiled form
instead of having to compile them every time the application is run. However,
if you are using the just-in-time optimization feature, it is not possible to
save and reload the JIT data, because it is position-dependent. The host on
which the patterns are reloaded must be running the same version of PCRE2, with
the same code unit width, and must also have the same endianness, pointer width
and PCRE2_SIZE type. For example, patterns compiled on a 32-bit system using
PCRE2's 16-bit library cannot be reloaded on a 64-bit system, nor can they be
reloaded using the 8-bit library.
</P>
<P>
Note that "serialization" in PCRE2 does not convert compiled patterns to an
abstract format like Java or .NET serialization. The serialized output is
really just a bytecode dump, which is why it can only be reloaded in the same
environment as the one that created it. Hence the restrictions mentioned above.
Applications that are not statically linked with a fixed version of PCRE2 must
be prepared to recompile patterns from their sources, in order to be immune to
PCRE2 upgrades.
</P>
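<P>
One way an application might decide whether to reload a saved byte stream or
recompile from source is to store a small description of the saving host next
to the data and compare it at load time. The sketch below is an
application-level assumption, not part of PCRE2: the names
<b>env_fingerprint</b>, <b>make_fingerprint()</b> and
<b>fingerprint_matches()</b> are hypothetical, and it assumes that PCRE2_SIZE
is <b>size_t</b>, which is the usual definition.
</P>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical application-level record stored alongside the serialized
   bytes. It is not a PCRE2 structure; it only describes the saving host. */
typedef struct {
    uint32_t byte_order_mark;   /* reads back differently if endianness differs */
    uint8_t  pointer_bytes;     /* sizeof(void *) on the saving host */
    uint8_t  size_type_bytes;   /* sizeof(size_t), assumed to be PCRE2_SIZE */
    uint8_t  code_unit_bits;    /* 8, 16 or 32, supplied by the caller */
    char     pcre2_version[32]; /* version string supplied by the caller */
} env_fingerprint;

/* Describes the current host for the given library width and version. */
env_fingerprint make_fingerprint(int code_unit_bits, const char *version)
{
    env_fingerprint fp;
    memset(&fp, 0, sizeof fp);
    fp.byte_order_mark = 0x01020304u;
    fp.pointer_bytes = (uint8_t)sizeof(void *);
    fp.size_type_bytes = (uint8_t)sizeof(size_t);
    fp.code_unit_bits = (uint8_t)code_unit_bits;
    /* The buffer was zeroed, so the copy is always NUL-terminated. */
    strncpy(fp.pcre2_version, version, sizeof fp.pcre2_version - 1);
    return fp;
}

/* Returns 1 when every recorded property agrees, 0 otherwise. */
int fingerprint_matches(const env_fingerprint *saved,
                        const env_fingerprint *current)
{
    return saved->byte_order_mark == current->byte_order_mark
        && saved->pointer_bytes == current->pointer_bytes
        && saved->size_type_bytes == current->size_type_bytes
        && saved->code_unit_bits == current->code_unit_bits
        && strncmp(saved->pcre2_version, current->pcre2_version,
                   sizeof saved->pcre2_version) == 0;
}
```

<P>
The version string could be taken from <b>pcre2_config()</b> with
PCRE2_CONFIG_VERSION. When the saved fingerprint does not match the current
one, the application would skip the byte stream and recompile its patterns.
</P>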
<br><a name="SEC2" href="#TOC1">SECURITY CONCERNS</a><br>
<P>
The facility for saving and restoring compiled patterns is intended for use
within individual applications. As such, the data supplied to
<b>pcre2_serialize_decode()</b> is expected to be trusted data, not data from
arbitrary external sources. There is only some simple consistency checking, not
complete validation of what is being re-loaded. Corrupted data may cause
undefined results. For example, if the length field of a pattern in the
serialized data is corrupted, the deserializing code may read beyond the end of
the byte stream that is passed to it.
</P>
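<P>
Because the decoder performs only simple checks, an application may want to
detect accidental corruption itself before handing the bytes over. The sketch
below assumes the application stored the length and an FNV-1a checksum of the
byte stream when it was saved. The function names are hypothetical and nothing
here is provided by PCRE2. A checksum of this kind catches truncation and
random damage only; it does not make data from an untrusted source safe to
decode.
</P>

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a hash over a byte buffer. */
uint32_t fnv1a32(const uint8_t *data, size_t length)
{
    uint32_t hash = 2166136261u;          /* FNV offset basis */
    for (size_t i = 0; i < length; i++) {
        hash ^= data[i];
        hash *= 16777619u;                /* FNV prime */
    }
    return hash;
}

/* Returns 1 if the buffer has the recorded length and checksum, else 0.
   stored_length and stored_checksum are values the application saved
   alongside the serialized bytes (an assumption of this sketch). */
int blob_looks_intact(const uint8_t *data, size_t actual_length,
                      size_t stored_length, uint32_t stored_checksum)
{
    if (data == NULL || actual_length != stored_length)
        return 0;
    return fnv1a32(data, actual_length) == stored_checksum;
}
```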
<br><a name="SEC3" href="#TOC1">SAVING COMPILED PATTERNS</a><br>
<P>
Before compiled patterns can be saved they must be serialized, which in PCRE2
means converting the pattern to a stream of bytes. A single byte stream may
contain any number of compiled patterns, but they must all use the same
character tables. A single copy of the tables is included in the byte stream
(its size is 1088 bytes). For more details of character tables, see the
<a href="pcre2api.html#localesupport">section on locale support</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
The function <b>pcre2_serialize_encode()</b> creates a serialized byte stream
from a list of compiled patterns. Its first two arguments specify the list,
being a pointer to a vector of pointers to compiled patterns, and the length of
the vector. The third and fourth arguments point to variables which are set to
point to the created byte stream and its length, respectively. The final
argument is a pointer to a general context, which can be used to specify custom
memory management functions. If this argument is NULL, <b>malloc()</b> is used
to obtain memory for the byte stream. The yield of the function is the number
of serialized patterns, or one of the following negative error codes:
<pre>
  PCRE2_ERROR_BADDATA      the number of patterns is zero or less
  PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
  PCRE2_ERROR_NOMEMORY     memory allocation failed
  PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
  PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC means either that a pattern's code has been corrupted, or
that a slot in the vector does not point to a compiled pattern.
</P>
<P>
Once a set of patterns has been serialized you can save the data in any
appropriate manner. Here is sample code that compiles two patterns and writes
them to a file. It assumes that the variable <i>fd</i> refers to a file that is
open for output. The error checking that should be present in a real
application has been omitted for simplicity.
<pre>
  int errorcode;
  uint8_t *bytes;
  PCRE2_SIZE erroroffset;
  PCRE2_SIZE bytescount;
  pcre2_code *list_of_codes[2];
  list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
    PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
  list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
    PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
  errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes, 2,
    &bytes, &bytescount, NULL);
  errorcode = fwrite(bytes, 1, bytescount, fd);
</pre>
Note that the serialized data is binary data that may contain any of the 256
possible byte values. On systems that make a distinction between binary and
non-binary data, be sure that the file is opened for binary output.
</P>
<P>
Serializing a set of patterns leaves the original patterns untouched, so they
can still be used for matching. Their memory must eventually be freed in the
usual way by calling <b>pcre2_code_free()</b>. When you have finished with the
byte stream, it too must be freed by calling <b>pcre2_serialize_free()</b>. If
this function is called with a NULL argument, it returns immediately without
doing anything.
</P>
<br><a name="SEC4" href="#TOC1">RE-USING PRECOMPILED PATTERNS</a><br>
<P>
In order to re-use a set of saved patterns you must first make the serialized
byte stream available in main memory (for example, by reading from a file). The
management of this memory block is up to the application. You can use the
<b>pcre2_serialize_get_number_of_codes()</b> function to find out how many
compiled patterns are in the serialized data without actually decoding the
patterns:
<pre>
  uint8_t *bytes = &#60;serialized data&#62;;
  int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
</pre>
The <b>pcre2_serialize_decode()</b> function reads a byte stream and recreates
the compiled patterns in new memory blocks, setting pointers to them in a
vector. The first two arguments are a pointer to a suitable vector and its
length, and the third argument points to a byte stream. The final argument is a
pointer to a general context, which can be used to specify custom memory
management functions for the decoded patterns. If this argument is NULL,
<b>malloc()</b> and <b>free()</b> are used. After deserialization, the byte
stream is no longer needed and can be discarded.
<pre>
  pcre2_code *list_of_codes[2];
  uint8_t *bytes = &#60;serialized data&#62;;
  int32_t number_of_codes =
    pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
</pre>
If the vector is not large enough for all the patterns in the byte stream, it
is filled with those that fit, and the remainder are ignored. The yield of the
function is the number of decoded patterns, or one of the following negative
error codes:
<pre>
  PCRE2_ERROR_BADDATA            second argument is zero or less
  PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
  PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
  PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
  PCRE2_ERROR_NOMEMORY           memory allocation failed
  PCRE2_ERROR_NULL               first or third argument is NULL
</pre>
PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was compiled
on a system with different endianness.
</P>
<P>
Decoded patterns can be used for matching in the usual way, and must be freed
by calling <b>pcre2_code_free()</b>. However, be aware that there is a potential
race issue if you are using multiple patterns that were decoded from a single
byte stream in a multithreaded application. A single copy of the character
tables is used by all the decoded patterns and a reference count is used to
arrange for its memory to be automatically freed when the last pattern is
freed, but there is no locking on this reference count. Therefore, if you want
to call <b>pcre2_code_free()</b> for these patterns in different threads, you
must arrange your own locking, and ensure that <b>pcre2_code_free()</b> cannot
be called by two threads at the same time.
</P>
<P>
If a pattern was processed by <b>pcre2_jit_compile()</b> before being
serialized, the JIT data is discarded and so is no longer available after a
save/restore cycle. You can, however, process a restored pattern with
<b>pcre2_jit_compile()</b> if you wish.
</P>
<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 June 2018
<br>
Copyright &copy; 1997-2018 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>