<html>
<head>
<title>pcre2 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2 man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">INTRODUCTION</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONSIDERATIONS</a>
<li><a name="TOC3" href="#SEC3">USER DOCUMENTATION</a>
<li><a name="TOC4" href="#SEC4">AUTHOR</a>
<li><a name="TOC5" href="#SEC5">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">INTRODUCTION</a><br>
<P>
PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. After
nearly two decades, the limitations of the original API were making development
increasingly difficult. The new API is more extensible, and it was simplified
by abolishing the separate "study" optimizing function; in PCRE2, patterns are
automatically optimized where possible. Since forking from PCRE1, the code has
been extensively refactored and new features introduced. The old library is now
obsolete and is no longer maintained.
</P>
<P>
As well as Perl-style regular expression patterns, some features that appeared
in Python and the original PCRE before they appeared in Perl are available
using the Python syntax. There is also some support for one or two .NET and
Oniguruma syntax items, and there are options for requesting some minor changes
that give better ECMAScript (aka JavaScript) compatibility.
</P>
<P>
The source code for PCRE2 can be compiled to support strings of 8-bit, 16-bit,
or 32-bit code units, which means that up to three separate libraries may be
installed, one for each code unit size. The code unit size is not related to
the bit size of the underlying hardware. In a 64-bit environment that also
supports 32-bit applications, versions of PCRE2 that are compiled in both
64-bit and 32-bit modes may be needed.
</P>
<P>
The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively. In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running
<pre>
  pcre2test -C
</PRE>
</P>
<P>
The three libraries contain identical sets of functions, with names ending in
_8, _16, or _32, respectively (for example, <b>pcre2_compile_8()</b>). However,
by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or 32, a program that uses just
one code unit width can be written using generic names such as
<b>pcre2_compile()</b>, and the documentation is written assuming that this is
the case.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE2 contains an
alternative function that matches the same compiled patterns in a different
way. In certain circumstances, the alternative function has some advantages.
For a discussion of the two matching algorithms, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<P>
Details of exactly which Perl regular expression features are and are not
supported by PCRE2 are given in separate documents. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and
<a href="pcre2compat.html"><b>pcre2compat</b></a>
pages. There is a syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page.
</P>
<P>
Some features of PCRE2 can be included, excluded, or changed when the library
is built. The
<a href="pcre2_config.html"><b>pcre2_config()</b></a>
function makes it possible for a client to discover which features are
available. The features themselves are described in the
<a href="pcre2build.html"><b>pcre2build</b></a>
page. Documentation about building PCRE2 for various operating systems can be
found in the
<a href="README.txt"><b>README</b></a>
and
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b></a>
files in the source distribution.
</P>
<P>
The libraries contain a number of undocumented internal functions and data
tables that are used by more than one of the exported external functions, but
which are not intended for use by external callers. Their names all begin with
"_pcre2", which hopefully will not provoke any name clashes. In some
environments, it is possible to control which external symbols are exported
when a shared library is built, and in these cases the undocumented symbols are
not exported.
</P>
<br><a name="SEC2" href="#TOC1">SECURITY CONSIDERATIONS</a><br>
<P>
If you are using PCRE2 in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern. For example, an
8-bit pattern that begins with "(*UTF)" turns on UTF-8 mode, which interprets
patterns and subjects as strings of UTF-8 code units instead of individual
8-bit characters. This causes both the pattern and any data against which it is
matched to be checked for UTF-8 validity. If the data string is very long, such
a check might use enough resources to degrade your application's performance.
</P>
<P>
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes a compile time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
The use of Unicode properties for character types such as \d can also be
enabled from within the pattern, by specifying "(*UCP)". This feature can be
disallowed by setting the PCRE2_NEVER_UCP option.
</P>
<P>
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks.
</P>
<P>
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
application to lock out the use of \C, causing a compile-time error if it is
encountered. It is also possible to build PCRE2 with the use of \C permanently
disabled.
</P>
<P>
Performance can also suffer when a pattern that has a very large search tree is
run against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page. There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
The user documentation for PCRE2 comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
the descriptions of the <b>pcre2grep</b> and <b>pcre2test</b> programs are in
files called <b>pcre2grep.txt</b> and <b>pcre2test.txt</b>, respectively. The
remaining sections, except for the <b>pcre2demo</b> section (which is a program
listing), and the short pages for individual functions, are concatenated in
<b>pcre2.txt</b>, for ease of searching. The sections are as follows:
<pre>
  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
  pcre2callout       details of the pattern callout feature
  pcre2compat        discussion of Perl compatibility
  pcre2convert       details of pattern conversion functions
  pcre2demo          a demonstration C program that uses PCRE2
  pcre2grep          description of the <b>pcre2grep</b> command (8-bit only)
  pcre2jit           discussion of just-in-time optimization support
  pcre2limits        details of size and other limits
  pcre2matching      discussion of the two matching algorithms
  pcre2partial       details of the partial matching facility
  pcre2pattern       syntax and semantics of supported regular expression patterns
  pcre2perform       discussion of performance issues
  pcre2posix         the POSIX-compatible C API for the 8-bit library
  pcre2sample        discussion of the pcre2demo program
  pcre2serialize     details of pattern serialization
  pcre2syntax        quick syntax reference
  pcre2test          description of the <b>pcre2test</b> command
  pcre2unicode       discussion of Unicode and UTF support
</pre>
In the "man" and HTML formats, there is also a short page for each C library
function, listing its arguments and results.
</P>
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<P>
Putting an actual email address here is a spam magnet. If you want to email me,
use my two names separated by a dot at gmail.com.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 August 2021
<br>
Copyright &copy; 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>