<html>
<head>
<title>pcre2pattern specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2pattern man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION DETAILS</a>
<li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
<li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
<li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">BACKSLASH</a>
<li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
<li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a>
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE CODE UNIT</a>
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
<li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
<li><a name="TOC12" href="#SEC12">VERTICAL BAR</a>
<li><a name="TOC13" href="#SEC13">INTERNAL OPTION SETTING</a>
<li><a name="TOC14" href="#SEC14">GROUPS</a>
<li><a name="TOC15" href="#SEC15">DUPLICATE GROUP NUMBERS</a>
<li><a name="TOC16" href="#SEC16">NAMED CAPTURE GROUPS</a>
<li><a name="TOC17" href="#SEC17">REPETITION</a>
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
<li><a name="TOC24" href="#SEC24">COMMENTS</a>
<li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
<li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
<li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
<li><a name="TOC30" href="#SEC30">SEE ALSO</a>
<li><a name="TOC31" href="#SEC31">AUTHOR</a>
<li><a name="TOC32" href="#SEC32">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P>
The syntax and semantics of the regular expressions that are supported by PCRE2
are described in detail below. There is a quick-reference syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
PCRE2 also supports some alternative regular expression syntax (which does not
conflict with the Perl syntax) in order to provide some compatibility with
regular expressions in Python, .NET, and Oniguruma.
</P>
<P>
Perl's regular expressions are described in its own documentation, and regular
expressions in general are covered in a number of books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This description of
PCRE2's regular expressions is intended as reference material.
</P>
<P>
This document discusses the regular expression patterns that are supported by
PCRE2 when its main matching function, <b>pcre2_match()</b>, is used. PCRE2 also
has an alternative matching function, <b>pcre2_dfa_match()</b>, which matches
using a different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The advantages and
disadvantages of the alternative function, and how it differs from the normal
function, are discussed in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
<P>
A number of options that can be passed to <b>pcre2_compile()</b> can also be set
by special items at the start of a pattern. These are not Perl-compatible, but
are provided to make these options accessible to pattern writers who are not
able to change the program that processes the pattern. Any number of these
items may appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
</P>
<br><b>
UTF support
</b><br>
<P>
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
setting a UTF mode affects pattern matching is mentioned in several places
below. There is also a summary of features in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
appearance in a pattern causes an error.
</P>
<br><b>
Unicode property support
</b><br>
<P>
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup
table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
These behaviours can be changed within the pattern; see the section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
</P>
<br><b>
Locking out empty string matching
</b><br>
<P>
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
</P>
<br><b>
Disabling auto-possessification
</b><br>
<P>
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
possessive when what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling start-up optimizations
</b><br>
<P>
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling automatic anchoring
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling JIT compilation
</b><br>
<P>
If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
Setting match resource limits
</b><br>
<P>
The <b>pcre2_match()</b> function contains a counter that is incremented every
time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
limit on this counter, which therefore limits the amount of computing resource
used for a match. The maximum depth of nested backtracking can also be limited;
this indirectly restricts the amount of heap memory that is used, but there is
also an explicit memory limit that can be set.
</P>
<P>
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees. A common example is a pattern with nested
unlimited repeats applied to a long string that does not match. When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre>
  (*LIMIT_HEAP=d)
  (*LIMIT_MATCH=d)
  (*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kibibytes (units of 1024 bytes).
</P>
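<P>
The precedence rule for limit items can be modelled in a few lines. The
following Python sketch is not PCRE2 code: the function name
effective_limits and the dictionary keys are hypothetical, and only the
"lowest value wins, the caller's value is a ceiling" rule described above is
modelled. The item names are those listed in this section.
</P>

```python
import re

# Start-of-pattern item names taken from this section. Everything else here
# (function name, dictionary keys) is hypothetical and for illustration only.
_LIMIT_NAMES = {"LIMIT_HEAP", "LIMIT_MATCH", "LIMIT_DEPTH", "LIMIT_RECURSION"}
_OTHER_NAMES = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT", "CR", "LF", "CRLF",
    "ANYCRLF", "ANY", "NUL", "BSR_ANYCRLF", "BSR_UNICODE",
}
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")


def effective_limits(pattern, caller_limits):
    """Return the limits in force after reading the leading items."""
    limits = dict(caller_limits)
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None:
            break  # first thing that is not a (*...) item ends the scan
        name, value = item.groups()
        if name in _LIMIT_NAMES and value is not None:
            if name == "LIMIT_RECURSION":  # older name for LIMIT_DEPTH
                name = "LIMIT_DEPTH"
            # A pattern item can lower a limit but never raise it.
            limits[name] = min(limits[name], int(value))
        elif name not in _OTHER_NAMES or value is not None:
            break  # not a start-of-pattern item known to this sketch
        pos = item.end()
    return limits


caller = {"LIMIT_HEAP": 20000, "LIMIT_MATCH": 1000, "LIMIT_DEPTH": 250}
print(effective_limits("(*LIMIT_MATCH=500)(*UTF)(*LIMIT_MATCH=2000)abc", caller))
```

<P>
In this model a request for LIMIT_MATCH=2000 has no effect when the caller
allows 1000, and a repeated setting keeps the smaller of the two values.
</P>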
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
</P>
<P>
The heap limit applies only when the <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
to JIT. The match limit is used (but in a different way) when JIT is being
used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
usage by those matching functions. The depth limit is ignored by JIT but is
relevant for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
<a name="newlines"></a></P>
<br><b>
Newline conventions
</b><br>
<P>
PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, any
Unicode newline sequence, or the NUL character (binary zero). The
<a href="pcre2api.html"><b>pcre2api</b></a>
page has
<a href="pcre2api.html#newlines">further discussion</a>
about newlines, and shows how to set the newline convention when calling
<b>pcre2_compile()</b>.
</P>
<P>
It is also possible to specify a newline convention by starting a pattern
string with one of the following sequences:
<pre>
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
  (*NUL)       the NUL character (binary zero)
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
<pre>
  (*CR)a.b
</pre>
changes the convention to CR. That pattern matches "a\nb" because LF is no
longer a newline. If more than one of these settings is present, the last one
is used.
</P>
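<P>
The "last setting wins" rule and the (*CR)a.b example can be illustrated with
a small model. This Python sketch is not PCRE2 code; newline_convention and
dot_matches are hypothetical names, and dot_matches covers only the two
single-character conventions CR and LF, with PCRE2_DOTALL assumed unset.
</P>

```python
import re

# Item names taken from this section; the helper names are hypothetical.
_NEWLINE_NAMES = {"CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL"}
_OTHER_NAMES = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT", "BSR_ANYCRLF",
    "BSR_UNICODE",
}
_ITEM = re.compile(r"\(\*([A-Z_]+)\)")


def newline_convention(pattern, default="LF"):
    """Return the convention selected by the leading items of a pattern."""
    convention = default
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None:
            break
        name = item.group(1)
        if name in _NEWLINE_NAMES:
            convention = name  # a later setting replaces an earlier one
        elif name not in _OTHER_NAMES:
            break
        pos = item.end()
    return convention


def dot_matches(char, convention):
    """Model of an unescaped dot for the CR and LF conventions only."""
    newline = {"CR": "\r", "LF": "\n"}[convention]
    return char != newline


chosen = newline_convention("(*CR)a.b")
print(chosen, dot_matches("\n", chosen))
```

<P>
With the convention changed to CR, the model's dot accepts a linefeed, which
is why the pattern in the example above matches "a\nb".
</P>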
<P>
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
opening brace. However, it does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility. However, this can be changed; see the next section and the
description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
</P>
<br><b>
Specifying what \R matches
</b><br>
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. This effect can also be achieved by starting a pattern with
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
corresponding to PCRE2_BSR_UNICODE.
</P>
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
</P>
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
<P>
A regular expression is a pattern that is matched against a subject string from
left to right. Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
<pre>
  The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
<b>pcre2_compile()</b> or set by (?r) within the pattern).
</P>
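<P>
The two extra case equivalences come from Unicode case folding, which can be
observed outside PCRE2. The following Python sketch uses Python's own Unicode
tables as a stand-in; it does not call PCRE2, and case_equivalent is a
hypothetical helper name.
</P>

```python
# Python's str.casefold() applies Unicode case folding. It is used here only
# as a stand-in to show why K and S gain non-ASCII case partners.
KELVIN_SIGN = "\u212a"
LONG_S = "\u017f"


def case_equivalent(a, b):
    """Return True if two characters fold to the same string."""
    return a.casefold() == b.casefold()


for ascii_letter, partner in (("K", KELVIN_SIGN), ("S", LONG_S)):
    print(ascii_letter, hex(ord(partner)), case_equivalent(ascii_letter, partner))
```

<P>
A caseless pattern containing K or S may therefore match non-ASCII input in
UTF or UCP mode, which is the behaviour that PCRE2_EXTRA_CASELESS_RESTRICT is
described above as switching off.
</P>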
<P>
The power of regular expressions comes from the ability to include wild cards,
character classes, alternatives, and repetitions in the pattern. These are
encoded in the pattern by the use of <i>metacharacters</i>, which do not stand
for themselves but instead are interpreted in some special way.
</P>
<P>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized within square brackets. Outside square brackets, the metacharacters
are as follows:
<pre>
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start group or control verb
  )      end group or control verb
  *      0 or more quantifier
  +      1 or more quantifier; also "possessive quantifier"
  ?      0 or 1 quantifier; also quantifier minimizer
  {      potential start of min/max quantifier
</pre>
Brace characters { and } are also used to enclose data for constructions such
as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
tab characters that follow { or precede } are allowed and are ignored. In the
case of quantifiers, they may also appear before or after the comma. The
exception to this is \u{...} which is an ECMAScript compatibility feature
that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript
does not ignore such white space; it causes the item to be interpreted as
literal.
</P>
<P>
Part of a pattern that is in square brackets is called a "character class". In
a character class the only metacharacters are:
<pre>
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (if followed by POSIX syntax)
  ]      terminates the character class
</pre>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern is ignored, except in a character class or within a \Q...\E
sequence. In addition, all characters between a # outside a character class
and the next newline, inclusive, are ignored. An escaping backslash can be
used to include a white space or a #
character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
same applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
<P>
The backslash character has several uses. Firstly, if it is followed by a
character that is not a digit or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you must write \* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
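<P>
Because a backslash before any non-alphanumeric character is always safe, a
program that builds patterns from untrusted text can escape such characters
unconditionally. The following Python sketch shows one way of doing so;
escape_literal is a hypothetical helper, not part of the PCRE2 API, and it
leaves characters with code points above 127 untouched.
</P>

```python
def escape_literal(text):
    """Backslash-escape every ASCII character that is not a letter or digit."""
    out = []
    for ch in text:
        # Letters and digits must not be escaped: after a backslash they
        # can have a special meaning. Any other ASCII character is safe.
        if ch.isascii() and not ch.isalnum():
            out.append("\\")
        out.append(ch)
    return "".join(out)


print(escape_literal("1+1=2 (approx.)"))
```

<P>
The helper escapes more than is strictly needed (for example the = sign),
which is harmless under the rule described above.
</P>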
<P>
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
</P>
<P>
If you want to treat all characters in a sequence as literals, you can do so by
putting them between \Q and \E. Note that this includes white space even when
the PCRE2_EXTENDED option is set so that most other white space is ignored. The
behaviour is different from Perl in that $ and @ are handled as literals in
\Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable
interpolation. Also, Perl does "double-quotish backslash interpolation" on any
backslashes between \Q and \E which, its documentation says, "may lead to
confusing results". PCRE2 treats a backslash between \Q and \E just like any
other character. Note the following examples:
<pre>
  Pattern            PCRE2 matches  Perl matches

  \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
  \QA\B\E            A\B            A\B
  \Q\\E              \              \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is then
not terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P>
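<P>
The \Q...\E rules can be checked against the table above with a small model
that rewrites a pattern into an equivalent one without \Q and \E. This Python
sketch is not how PCRE2 is implemented; expand_quoted is a hypothetical name,
and the sketch ignores character classes, so it does not reproduce the error
case for an isolated \Q inside a class.
</P>

```python
def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as individually escaped literal characters."""
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if quoting:
            if pair == "\\E":
                quoting = False  # end of the literal run
                i += 2
            else:
                ch = pattern[i]
                # Inside the run every character, backslash included, is
                # literal; escape non-alphanumeric ASCII so it stays so.
                out.append("\\" + ch if ch.isascii() and not ch.isalnum() else ch)
                i += 1
        elif pair == "\\Q":
            quoting = True
            i += 2
        elif pair == "\\E":
            i += 2  # an isolated \E is ignored
        elif pattern[i] == "\\":
            out.append(pair)  # ordinary escape pair, copied unchanged
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    # Reaching the end while still quoting behaves as if \E were present.
    return "".join(out)


for example in (r"\Qabc$xyz\E", r"\Qabc\$xyz\E", r"\QA\B\E", r"\Q\\E"):
    print(example, "becomes", expand_quoted(example))
```

<P>
Each rewritten pattern matches the text shown in the "PCRE2 matches" column of
the table.
</P>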
<br><b>
Non-printing characters
</b><br>
<P>
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
instead of the binary character it represents. In an ASCII or Unicode
environment, these escapes are as follows:
<pre>
  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is a non-control ASCII character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          linefeed (hex 0A)
  \r          carriage return (hex 0D) (but see below)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or backreference
  \o{ddd..}   character with octal code ddd..
  \xhh        character with hex code hh
  \x{hhh..}   character with hex code hhh..
  \N{U+hhh..} character with Unicode hex code point hhh..
</pre>
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than a
hexadecimal digit appears between \x{ and }, or if there is no terminating },
an error occurs.
</P>
<P>
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \x or by an octal sequence. There is no difference in the way
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
However, using the braced versions does make such sequences easier to read.
</P>
<P>
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
by \u, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
</P>
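<P>
The PCRE2_ALT_BSUX rules for \x and \u can be sketched as follows. This is an
illustrative Python model only, not PCRE2 code; the function name
alt_bsux_escape and its return convention are assumptions made for this
example.
</P>

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def alt_bsux_escape(after_backslash):
    """Model \\x and \\u under PCRE2_ALT_BSUX (hypothetical helper).

    Takes the text that follows a backslash and returns a tuple
    (code_point, characters_consumed_after_the_backslash).
    """
    lead = after_backslash[:1]
    width = {"x": 2, "u": 4}.get(lead)
    if width is None:
        raise ValueError("only \\x and \\u are modelled in this sketch")
    digits = after_backslash[1:1 + width]
    if len(digits) == width and all(c in HEX_DIGITS for c in digits):
        # Exactly two (\x) or four (\u) hex digits: a character escape.
        return int(digits, 16), 1 + width
    # Anything else, including \x{...}, is a literal "x" or "u".
    return ord(lead), 1
```

<P>
For instance, alt_bsux_escape("x41") yields the code point for "A", whereas
alt_bsux_escape("x{41}") yields a literal "x" because the braced form is not
recognized in this mode.
</P>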
<P>
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\u{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits, but unlike other places that
also use curly brackets, spaces are not allowed and would result in the string
being interpreted as a literal. This syntax is from ECMAScript 6.
</P>
<P>
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \N is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
</P>
<P>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
</P>
<P>
A compile-time error occurs if \c is not followed by a character whose ASCII
code point is in the range 32 to 126. The precise effect of \cx is as follows:
if x is a lower case letter, it is converted to upper case. Then bit 6 of the
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is
41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is
3B).
</P>
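<P>
The arithmetic behind \cx in an ASCII environment can be checked with a few
lines of Python. This is a sketch of the rule as described above, not PCRE2's
implementation; the function name control_escape is hypothetical.
</P>

```python
def control_escape(x):
    """Code value produced by \\cx in an ASCII environment (sketch)."""
    code = ord(x)
    if not 32 <= code <= 126:
        raise ValueError("\\c must be followed by a printable ASCII character")
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40
```

<P>
With this model, control_escape("A") is hex 01, control_escape("{") is hex 3B,
and control_escape("?") is hex 7F (DEL).
</P>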
<P>
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
\f, \n, \r, and \t generate the appropriate EBCDIC code values. The \c
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
^, _, or ?. Any other character provokes a compile-time error. The sequence
\c@ encodes character code 0; after \c the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \cG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
</P>
<P>
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used. Thus the sequence \0\x\015
specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
</P>
<P>
The escape \o must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as
octal numbers greater than 0777, and it also allows octal numbers and
backreferences to be unambiguously specified.
</P>
<P>
For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{...} or \x{...} to specify numerical
character code points, and \g{...} to specify backreferences. The following
paragraphs describe the old, ambiguous syntax.
</P>
<P>
The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed over time, causing PCRE2 also to change.
</P>
<P>
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capture groups in the expression, the
entire sequence is taken as a <i>backreference</i>. A description of how this
works is given
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
Otherwise, up to three octal digits are read to form a character code.
</P>
<P>
Inside a character class, PCRE2 handles \8 and \9 as the literal characters
"8" and "9", and otherwise reads up to three octal digits following the
backslash, using them to generate a data character. Any subsequent digits stand
for themselves. For example, outside a character class:
<pre>
  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40 previous capture groups
  \7     is always a backreference
  \11    might be a backreference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a backreference, otherwise the character with octal code 113
  \377   might be a backreference, otherwise the value 255 (decimal)
  \81    is always a backreference
</pre>
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
</P>
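<P>
The decision procedure for a backslash followed by digits, outside a character
class, can be modelled roughly as below. This Python sketch follows the rules
stated above and is not taken from the PCRE2 source; the function name, the
tuple it returns, and the group_count parameter (the number of capture groups
that precede the escape) are assumptions for illustration. Range limits on the
resulting character value are not checked here.
</P>

```python
OCTAL = "01234567"


def classify_backslash_digits(digits, group_count):
    """Classify the run of digits following a backslash (sketch).

    Returns ("backref", number, "") or ("char", code, trailing_literal_digits).
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected a non-empty run of decimal digits")

    if digits[0] == "0":
        # \0 followed by up to two further octal digits.
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL:
            n += 1
        return ("char", int(digits[:n], 8), digits[n:])

    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= group_count:
        return ("backref", number, "")

    # Otherwise read up to three octal digits; the rest stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL:
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])
```

<P>
For example, classify_backslash_digits("11", 3) reports a tab character,
whereas the same digits with group_count set to 11 report a backreference.
</P>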
<br><b>
Constraints on character values
</b><br>
<P>
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point
</pre>
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of <b>pcre2_compile()</b> by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
and UTF-32 modes, because these values are not representable in UTF-16.
</P>
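<P>
The limits in the table above can be expressed as a small predicate. The
following Python sketch is only a model of the documented constraints; the
function name and its parameters (in particular allow_surrogate_escapes, which
stands in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) are hypothetical.
</P>

```python
SURROGATES = range(0xD800, 0xE000)  # 0xd800 to 0xdfff inclusive


def escape_value_allowed(value, code_unit_width, utf,
                         allow_surrogate_escapes=False):
    """Return True if a numeric escape may specify this value (sketch)."""
    if code_unit_width not in (8, 16, 32):
        raise ValueError("code unit width must be 8, 16, or 32")
    if value < 0:
        return False
    if not utf:
        # Non-UTF modes: limited only by the code unit width.
        return value <= (1 << code_unit_width) - 1
    if value > 0x10FFFF:
        return False
    if value in SURROGATES:
        # The surrogate check can be disabled, but not in UTF-16 mode.
        return allow_surrogate_escapes and code_unit_width != 16
    return True
```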
<br><b>
Escape sequences in character classes
</b><br>
<P>
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08).
</P>
<P>
When not followed by an opening brace, \N is not allowed in a character class.
\B, \R, and \X are not special inside a character class. Like other
unrecognized alphabetic escape sequences, they cause an error. Outside a
character class, these sequences have different meanings.
</P>
<br><b>
Unsupported escape sequences
</b><br>
<P>
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
character, and \u can be used to define a character by code point, as
described above.
</P>
<br><b>
Absolute and relative backreferences
</b><br>
<P>
The sequence \g followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \g{name}. Backreferences are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
</P>
<br><b>
Absolute and relative subroutine calls
</b><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes is an alternative
syntax for referencing a capture group as a subroutine. Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a
<a href="#groupsassubroutines">subroutine</a>
call.
<a name="genericchartypes"></a></P>
<br><b>
Generic character types
</b><br>
<P>
Another use of backslash is for specifying generic character types:
<pre>
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
</pre>
The \N escape sequence has the same meaning as
<a href="#fullstopdot">the "." metacharacter</a>
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \N. Note that when \N is followed by an opening brace it has a
different meaning. See the section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair. The sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, because
there is no character to match.
</P>
<P>
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32), which are defined as white space in the "C" locale. This list may
vary if locale-specific matching is taking place. For example, in some locales
the "non-breaking space" character (\xA0) is recognized as white space, and in
others the VT character is not.
</P>
<P>
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE2's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
<a href="pcre2api.html#localesupport">"Locale support"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page). For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales with
Unicode is discouraged.
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
\s, or \w, and always match \D, \S, and \W, although this may be different
for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
determine character types, as follows:
<pre>
  \d  any character that matches \p{Nd} (decimal digit)
  \s  any character that matches \p{Z} or \h or \v
  \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
</pre>
The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
test for underscore with a test for \p{Pc} (connector punctuation) happened in
PCRE2 release 10.43. This brings PCRE2 into line with Perl.
</P>
<P>
The upper case escapes match the inverse sets of characters. Note that \d
matches only decimal digits, whereas \w matches any Unicode digit, as well as
other character categories. Note also that PCRE2_UCP affects \b and \B
because they are defined in terms of \w and \W. Matching these sequences
is noticeably slower when PCRE2_UCP is set.
</P>
<P>
The effect of PCRE2_UCP on any one of these escape sequences can be negated by
the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, and
PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set and reset within
a pattern by means of an internal option setting
<a href="#internaloptions">(see below).</a>
</P>
<P>
The sequences \h, \H, \v, and \V, in contrast to the other sequences, which
match only ASCII characters by default, always match a specific list of code
points, whether or not PCRE2_UCP is set. The horizontal space characters are:
<pre>
  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space
</pre>
The vertical space characters are:
<pre>
  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator
</pre>
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
are relevant.
<a name="newlineseq"></a></P>
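<P>
The two fixed lists for \h and \v can be written down as plain sets, which
also makes it easy to see which members remain relevant in 8-bit non-UTF mode.
This is a Python sketch built from the lists above; the names
HORIZONTAL_SPACE, VERTICAL_SPACE, matches_h, matches_v, and relevant_in_8bit
are hypothetical and are not part of PCRE2.
</P>

```python
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def matches_h(ch):
    """True if the single character ch is in the \\h list."""
    return ord(ch) in HORIZONTAL_SPACE


def matches_v(ch):
    """True if the single character ch is in the \\v list."""
    return ord(ch) in VERTICAL_SPACE


def relevant_in_8bit(code_points):
    """Members that can occur in 8-bit non-UTF mode (below 256)."""
    return sorted(cp for cp in code_points if cp < 256)
```

<P>
In this model, relevant_in_8bit(HORIZONTAL_SPACE) leaves only HT, space, and
the non-break space, and relevant_in_8bit(VERTICAL_SPACE) leaves LF, VT, FF,
CR, and NEL.
</P>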
<br><b>
Newline sequences
</b><br>
<P>
Outside a character class, by default, the escape sequence \R matches any
Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
following:
<pre>
  (?&#62;\r\n|\n|\x0b|\f|\r|\x85)
</pre>
This is an example of an "atomic group", details of which are given
<a href="#atomicgroup">below.</a>
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
</P>
<P>
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
</P>
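<P>
The default behaviour of \R at a given position can be approximated by the
following Python sketch. It is a model of the description above rather than
PCRE2 code, and the function name and the eight_bit_non_utf flag are
assumptions. Because CRLF is tested first and the function never falls back to
matching CR alone, it mirrors the atomic treatment of the two-character
sequence.
</P>

```python
SINGLE_NEWLINES_8BIT = "\n\x0b\f\r\x85"   # LF, VT, FF, CR, NEL
EXTRA_WIDE_NEWLINES = "\u2028\u2029"      # LS, PS


def newline_sequence_length(subject, pos, eight_bit_non_utf=False):
    """Length of the newline sequence \\R would match at pos, or 0 (sketch)."""
    if subject.startswith("\r\n", pos):
        # CRLF is taken as one indivisible unit.
        return 2
    singles = SINGLE_NEWLINES_8BIT
    if not eight_bit_non_utf:
        singles += EXTRA_WIDE_NEWLINES
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```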
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. (BSR is an abbreviation for "backslash R".) This can be made
the default when PCRE2 is built; if this is the case, the other behaviour can
be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
these settings by starting a pattern string with one of the following
sequences:
<pre>
  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence
</pre>
These override the default and the options given to the compiling function.
Note that these special settings, which are not Perl-compatible, are recognized
only at the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be combined
with a change of newline convention; for example, a pattern can start with:
<pre>
  (*ANY)(*BSR_ANYCRLF)
</pre>
They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
character class, \R is treated as an unrecognized escape sequence, and causes
an error.
<a name="uniextseq"></a></P>
<br><b>
Unicode character properties
</b><br>
<P>
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available. They
can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type.
</P>
<P>
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<P>
The extra escape sequences that provide property support are:
<pre>
  \p{<i>xx</i>}   a character with the <i>xx</i> property
  \P{<i>xx</i>}   a character without the <i>xx</i> property
  \X       a Unicode extended grapheme cluster
</pre>
The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored. There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
newline), Bidi_Class, a number of binary (yes/no) properties, and some special
PCRE2 properties (described
<a href="#extraprops">below).</a>
Certain other Perl properties such as "InMusicalSymbols" are not supported by
PCRE2. Note that \P{Any} does not match any characters, so always causes a
match failure.
</P>
<br><b>
Script properties for \p and \P
</b><br>
<P>
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
</P>
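<P>
The three script syntax forms can be reduced to a common (type, name) pair, as
in the Python sketch below. It is a model only: the helper names loose and
parse_script_property are hypothetical, and applying the "loose matching"
normalization (case folding and removal of spaces, hyphens, and underscores)
to the property type as well as to the script name is an assumption made for
this example. No check is made that the script name actually exists.
</P>

```python
import re

# Accepted spellings of the property type, after loose normalization.
PROPERTY_TYPES = {
    "sc": "sc",
    "script": "sc",
    "scx": "scx",
    "scriptextensions": "scx",
}


def loose(name):
    """Lower-case a name and drop spaces, hyphens, and underscores."""
    return "".join(c for c in name.lower() if c not in " -_")


def parse_script_property(body):
    """Parse the text between the braces of \\p{...} for a script (sketch)."""
    parts = re.split(r"[:=]", body, maxsplit=1)
    if len(parts) == 1:
        # A bare script name is treated as a script extensions test.
        return ("scx", loose(parts[0]))
    kind = PROPERTY_TYPES.get(loose(parts[0]))
    if kind is None:
        raise ValueError("not a script property type: " + parts[0])
    return (kind, loose(parts[1]))
```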
<P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of recognized script names and their 4-character abbreviations can be obtained
by running this command:
<pre>
  pcre2test -LS

</PRE>
</P>
<br><b>
The general category property for \p and \P
</b><br>
<P>
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the property
name. For example, \p{^Lu} is the same as \P{Lu}.
</P>
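<P>
The equivalence of \p{^Lu} and \P{Lu} can be illustrated by normalizing both
spellings to a (negated, name) pair. The Python sketch below is not PCRE2 code
and its names are hypothetical. Treating \P{^...} as a double negation that
cancels out is an assumption of this example; the text above does not cover
that case.
</P>

```python
import re

# Braced form only: \p{name}, \P{name}, optionally with a leading circumflex.
PROPERTY_ESCAPE = re.compile(r"\\([pP])\{(\^?)([^{}^]+)\}")


def parse_property_escape(text):
    """Return (negated, property_name) for a braced \\p or \\P escape."""
    m = PROPERTY_ESCAPE.fullmatch(text)
    if m is None:
        raise ValueError("not a braced property escape: " + text)
    letter, caret, name = m.groups()
    # Upper-case P negates; so does a circumflex. Two negations cancel
    # (an assumption of this sketch).
    negated = (letter == "P") != (caret == "^")
    return (negated, name)
```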
<P>
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter. In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these two
examples have the same effect:
<pre>
  \p{L}
  \pL
</pre>
The following general category property codes are supported:
<pre>
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
920*22dc650dSSadaf Ebrahimi  So    Other symbol
921*22dc650dSSadaf Ebrahimi
922*22dc650dSSadaf Ebrahimi  Z     Separator
923*22dc650dSSadaf Ebrahimi  Zl    Line separator
924*22dc650dSSadaf Ebrahimi  Zp    Paragraph separator
925*22dc650dSSadaf Ebrahimi  Zs    Space separator
926*22dc650dSSadaf Ebrahimi</pre>
927*22dc650dSSadaf EbrahimiThe special property LC, which has the synonym L&, is also supported: it
928*22dc650dSSadaf Ebrahimimatches a character that has the Lu, Ll, or Lt property, in other words, a
929*22dc650dSSadaf Ebrahimiletter that is not classified as a modifier or "other".
930*22dc650dSSadaf Ebrahimi</P>
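<P>
The rules above (two-letter codes, one-letter groups, the LC synonym, and
circumflex negation) can be modelled in a few lines. The following is only an
illustrative sketch in Python, not PCRE2 code: the function name
<b>has_general_category()</b> is hypothetical, and it relies on Python's
<b>unicodedata</b> tables, whose Unicode version may differ from the one PCRE2
was built with.
</P>

```python
import unicodedata


def has_general_category(ch, prop):
    # Hypothetical model of \p{prop} for general category codes only.
    negated = prop.startswith("^")      # \p{^Lu} is the same as \P{Lu}
    if negated:
        prop = prop[1:]
    cat = unicodedata.category(ch)      # always a two-letter code, e.g. "Lu"
    if prop in ("LC", "L&"):            # cased letter: Lu, Ll, or Lt
        result = cat in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:                # one letter covers every code that starts with it
        result = cat.startswith(prop)
    else:
        result = cat == prop
    return result != negated
```

<P>
For instance, under this model U+02B0 (a modifier letter, Lm) satisfies L but
not LC, whereas U+01C5 (a title case letter, Lt) satisfies both.
</P>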
<P>
The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other
character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
However, they are not valid in Unicode strings and so cannot be tested by PCRE2
in UTF mode, unless UTF validity checking has been turned off (see the
discussion of PCRE2_NO_UTF_CHECK in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page).
</P>
<P>
The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE2, nor is it permitted to prefix any of these
properties with "Is".
</P>
<P>
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
</P>
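<P>
The Cs and Cn cases can be observed outside PCRE2 as well. The sketch below
uses Python's <b>unicodedata</b> module purely as an illustration; the helper
<b>is_valid_utf8_text()</b> is a hypothetical name, and the choice of U+0378 as
an unassigned code point assumes a Unicode table in which it is still
unassigned.
</P>

```python
import unicodedata

surrogate = "\ud800"     # a lone surrogate; Python strings can hold one
unassigned = "\u0378"    # assumed to be absent from the Unicode table in use

surrogate_category = unicodedata.category(surrogate)
unassigned_category = unicodedata.category(unassigned)


def is_valid_utf8_text(s):
    # A string containing a surrogate cannot be encoded as valid UTF-8,
    # which mirrors why Cs cannot be tested in UTF mode with checking on.
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```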
<P>
Specifying caseless matching does not affect these escape sequences. For
example, \p{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl.
</P>
<br><b>
Binary (yes/no) properties for \p and \P
</b><br>
<P>
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
  pcre2test -LP

</PRE>
</P>
<br><b>
The Bidi_Class property for \p and \P
</b><br>
<P>
<pre>
  \p{Bidi_Class:&#60;class&#62;}   matches a character with the given class
  \p{BC:&#60;class&#62;}           matches a character with the given class
</pre>
The recognized classes are:
<pre>
  AL          Arabic letter
  AN          Arabic number
  B           paragraph separator
  BN          boundary neutral
  CS          common separator
  EN          European number
  ES          European separator
  ET          European terminator
  FSI         first strong isolate
  L           left-to-right
  LRE         left-to-right embedding
  LRI         left-to-right isolate
  LRO         left-to-right override
  NSM         non-spacing mark
  ON          other neutral
  PDF         pop directional format
  PDI         pop directional isolate
  R           right-to-left
  RLE         right-to-left embedding
  RLI         right-to-left isolate
  RLO         right-to-left override
  S           segment separator
  WS          white space
</pre>
An equals sign may be used instead of a colon. The class names are
case-insensitive; only the short names listed above are recognized.
</P>
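<P>
As an illustration of the syntax just described, the following Python sketch
models a Bidi_Class test using <b>unicodedata.bidirectional()</b>, which
returns the same short class names. The function <b>has_bidi_class()</b> is a
hypothetical helper, not part of PCRE2, and comparing the property name itself
without regard to case is an assumption of this sketch.
</P>

```python
import unicodedata


def has_bidi_class(ch, spec):
    # spec is the text between the braces, e.g. "Bidi_Class:AL" or "bc=al".
    sep = ":" if ":" in spec else "="    # an equals sign may replace the colon
    name, _, wanted = spec.partition(sep)
    if name.lower() not in ("bidi_class", "bc"):
        raise ValueError("not a Bidi_Class property: " + spec)
    # Class names are case-insensitive; only short names are compared.
    return unicodedata.bidirectional(ch) == wanted.upper()
```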
<br><b>
Extended grapheme clusters
</b><br>
<P>
The \X escape matches any number of Unicode characters that form an "extended
grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties. PCRE2 uses only the
Extended Pictographic property.
</P>
<P>
\X always matches at least one character. Then it decides whether to add
additional characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
</P>
<P>
2. Do not end between CR and LF; otherwise end after any control character.
</P>
<P>
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks or the zero-width
joiner (ZWJ) character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
joiner) sequences. An emoji ZWJ sequence consists of a character with the
Extended_Pictographic property, optionally followed by one or more characters
with the Extend property, followed by the ZWJ character, followed by another
Extended_Pictographic character.
</P>
<P>
7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
</P>
<P>
8. Otherwise, end the cluster.
<a name="extraprops"></a></P>
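<P>
To make the shape of these rules concrete, here is a deliberately reduced
Python sketch that applies only rules 1, 2, 4, 7, and 8. It is not how PCRE2
implements \X: Hangul, prepend, and emoji ZWJ handling (rules 3, 5, and 6) are
omitted, "control" is approximated by general category Cc, and "extending
character or spacing mark" is approximated by any M category, which is not
exactly the Unicode grapheme breaking property. All names are hypothetical.
</P>

```python
import unicodedata

ZWJ = "\u200d"


def is_ri(ch):
    # Regional indicator symbols occupy U+1F1E6 to U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_control(ch):
    # Approximation: only general category Cc (includes CR and LF).
    return unicodedata.category(ch) == "Cc"


def simple_clusters(s):
    clusters = []
    i = 0
    while i < len(s):
        j = i + 1                          # at least one character is always taken
        ri_run = 1 if is_ri(s[i]) else 0   # RI characters seen just before position j
        while j < len(s):                  # rule 1: end at the end of the subject
            prev, nxt = s[j - 1], s[j]
            if prev == "\r" and nxt == "\n":
                pass                       # rule 2: do not end between CR and LF
            elif is_control(prev):
                break                      # rule 2: end after any control character
            elif unicodedata.category(nxt).startswith("M") or nxt == ZWJ:
                pass                       # rule 4 (approximated)
            elif is_ri(prev) and is_ri(nxt) and ri_run % 2 == 1:
                pass                       # rule 7: keep flag pairs together
            else:
                break                      # rule 8: otherwise end the cluster
            ri_run = ri_run + 1 if is_ri(nxt) else 0
            j += 1
        clusters.append(s[i:j])
        i = j
    return clusters
```

<P>
Under this reduced model, "e" followed by U+0301 stays in one cluster, CRLF is
a single cluster, and a run of three regional indicators splits into a pair
followed by a single character.
</P>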
<br><b>
PCRE2's additional properties
</b><br>
<P>
As well as the standard Unicode properties described above, PCRE2 supports four
more that make it possible to convert traditional escape sequences such as \w
and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
properties internally when PCRE2_UCP is set. However, they may also be used
explicitly. These properties are:
<pre>
  Xan   Any alphanumeric character
  Xps   Any POSIX space character
  Xsp   Any Perl space character
  Xwd   Any Perl "word" character
</pre>
Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
those that match Mn (non-spacing mark) or Pc (connector punctuation, which
includes underscore).
</P>
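<P>
The four definitions above translate directly into predicates on the general
category. The following Python sketch is a model of those definitions only,
with hypothetical function names; results depend on the Unicode version of
Python's <b>unicodedata</b> module rather than on PCRE2's own tables.
</P>

```python
import unicodedata


def xan(ch):
    # L (letter) or N (number)
    return unicodedata.category(ch)[0] in ("L", "N")


def xps(ch):
    # tab, linefeed, vertical tab, form feed, carriage return, or any Z (separator)
    return ch in ("\t", "\n", "\v", "\f", "\r") or unicodedata.category(ch).startswith("Z")


xsp = xps  # Xsp is the same as Xps


def xwd(ch):
    # Xan plus Mn (non-spacing mark) and Pc (connector punctuation, e.g. underscore)
    return xan(ch) or unicodedata.category(ch) in ("Mn", "Pc")
```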
<P>
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
languages. These are the characters $, @, ` (grave accent), and all characters
with Unicode code points greater than or equal to U+00A0, except for the
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
where H is a hexadecimal digit. Note that the Xuc property does not match these
sequences but the characters that they represent.)
<a name="resetmatchstart"></a></P>
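<P>
Because Xuc is defined purely by code point, it can be sketched without any
Unicode tables. The Python function below is a hypothetical model of the
definition given above, not PCRE2's implementation.
</P>

```python
def xuc(ch):
    # $, @, and grave accent are the only characters below U+00A0 that qualify.
    if ch in ("$", "@", "`"):
        return True
    cp = ord(ch)
    # Everything from U+00A0 upwards qualifies, except the surrogates.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```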
<br><b>
Resetting the match start
</b><br>
<P>
In normal use, the escape sequence \K causes any previously matched characters
not to be included in the final matched sequence that is returned. For example,
the pattern:
<pre>
  foo\Kbar
</pre>
matches "foobar", but reports that it has matched "bar". \K does not interact
with anchoring in any way. The pattern:
<pre>
  ^foo\Kbar
</pre>
matches only when the subject begins with "foobar" (in single line mode),
though it again reports the matched string as "bar". This feature is similar to
a lookbehind assertion
<a href="#lookbehind">(described below),</a>
but the part of the pattern that precedes \K is not constrained to match a
limited number of characters, as is required for a lookbehind assertion. The
use of \K does not interfere with the setting of
<a href="#group">captured substrings.</a>
For example, when the pattern
<pre>
  (foo)\Kbar
</pre>
matches "foobar", the first substring is still set to "foo".
</P>
<P>
From version 5.32.0 Perl forbids the use of \K in lookaround assertions. From
release 10.38 PCRE2 also forbids this by default. However, the
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
<b>pcre2_compile()</b> to re-enable the previous behaviour. When this option is
set, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern:
<pre>
  (?&#60;=\Kfoo)bar
</pre>
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
</b><br>
<P>
The final use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in a match,
without consuming any characters from the subject string. The use of
groups for more complicated assertions is described
<a href="#bigassertions">below.</a>
The backslashed assertions are:
<pre>
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
</pre>
Inside a character class, \b has a different meaning; it matches the backspace
character. If any other of these assertions appears in a character class, an
"invalid escape sequence" error is generated.
</P>
<P>
A word boundary is a position in the subject string where the current character
and the previous character do not both match \w or \W (i.e. one matches
\w and the other matches \W), or the start or end of the string if the
first or last character matches \w, respectively. When PCRE2 is built with
Unicode support, the meanings of \w and \W can be changed by setting the
PCRE2_UCP option. When this is done, it also affects \b and \B. Neither PCRE2
nor Perl has a separate "start of word" or "end of word" metasequence. However,
whatever follows \b normally determines which it is. For example, the fragment
\ba matches "a" at the start of a word.
</P>
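<P>
The word boundary definition can be restated as a small predicate. The Python
sketch below is a model only, with hypothetical names; it assumes the default
(non-UCP) meaning of \w, taken here to be ASCII letters, digits, and
underscore.
</P>

```python
def is_word_char(ch):
    # Assumed default \w: ASCII letters, digits, and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")


def at_word_boundary(subject, pos):
    # Outside the string counts as "not a word character", which gives the
    # start-of-string and end-of-string cases described above.
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after


def word_boundaries(subject):
    return [p for p in range(len(subject) + 1) if at_word_boundary(subject, p)]
```

<P>
In this model \B is simply the negation of <b>at_word_boundary()</b>.
</P>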
<P>
The \A, \Z, and \z assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the <i>startoffset</i>
argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
start at a point other than the beginning of the subject, \A can never match.
The difference between \Z and \z is that \Z matches before a newline at the
end of the string as well as at the very end, whereas \z matches only at the
end.
</P>
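<P>
The difference between \Z and \z comes down to one extra position. The
following Python sketch models the three assertions as predicates on a
position; the function names are hypothetical and the newline is assumed to be
a single LF character.
</P>

```python
def matches_bsA(subject, pos):
    # \A: only the very start of the subject.
    return pos == 0


def matches_bsz(subject, pos):
    # \z: only the very end of the subject.
    return pos == len(subject)


def matches_bsZ(subject, pos):
    # \Z: the very end, or just before a newline that ends the subject.
    at_end = pos == len(subject)
    before_final_newline = pos == len(subject) - 1 and subject[pos] == "\n"
    return at_end or before_final_newline


def positions(pred, subject):
    return [p for p in range(len(subject) + 1) if pred(subject, p)]
```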
<P>
The \G assertion is true only when the current matching position is at the
start point of the matching process, as specified by the <i>startoffset</i>
argument of <b>pcre2_match()</b>. It differs from \A when the value of
<i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
with appropriate arguments, you can mimic Perl's /g option, and it is in this
kind of implementation where \G can be useful.
</P>
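<P>
The repeated-call technique can be illustrated by analogy in Python. Python's
<b>re</b> module has no \G, but <b>Pattern.match(string, pos)</b> only succeeds
if the match begins exactly at <i>pos</i>, which plays a role similar to a
leading \G combined with a start offset. This is a sketch of the looping idea
only, not of the PCRE2 API; the token pattern and function name are
hypothetical.
</P>

```python
import re

TOKEN = re.compile(r"\d+|[-+*/]")   # hypothetical token grammar


def tokenize(text):
    pos = 0
    tokens = []
    while pos < len(text):
        # Anchored at pos: the scan never skips ahead to a later match.
        m = TOKEN.match(text, pos)
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()               # the next call starts where this match ended
    return tokens, pos
```

<P>
With input "12+x3" the loop stops at offset 3 instead of skipping over "x" to
find "3", which is the contiguity that an anchored-at-start-offset match
enforces.
</P>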
<P>
Note, however, that PCRE2's implementation of \G, being true at the starting
character of the matching process, is subtly different from Perl's, which
defines it as true at the end of the previous match. In Perl, these can be
different when the previously matched string was empty. Because PCRE2 does just
one match at a time, it cannot reproduce this behaviour.
</P>
<P>
If all the alternatives of a pattern begin with \G, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
</P>
<br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
<P>
The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any
characters from the subject string. These two metacharacters are concerned with
matching the starts and ends of lines. If the newline convention is set so that
only the two-character sequence CRLF is recognized as a newline, isolated CR
and LF characters are treated as ordinary data characters, and are not
recognized as newlines.
</P>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at
the start of the subject string. If the <i>startoffset</i> argument of
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
circumflex has an entirely different meaning
<a href="#characterclass">(see below).</a>
</P>
<P>
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
</P>
<P>
The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
that it does not actually match the newline. Dollar need not be the last
character of the pattern if a number of alternatives are involved, but it
should be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
</P>
<P>
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion.
</P>
<P>
The meanings of the circumflex and dollar metacharacters are changed if the
PCRE2_MULTILINE option is set. When this is the case, a dollar character
matches before any newlines in the string, as well as at the very end, and a
circumflex matches immediately after internal newlines as well as at the start
of the subject string. It does not match after a newline that ends the string,
for compatibility with Perl. However, this can be changed by setting the
PCRE2_ALT_CIRCUMFLEX option.
</P>
<P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
\n represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
</P>
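<P>
The positions at which circumflex and dollar can be true, as described in this
section, can be summarized in a small model. The Python sketch below is not
PCRE2 code and its names are hypothetical. It assumes a single LF newline, a
start offset of zero, and that PCRE2_NOTBOL, PCRE2_NOTEOL, and
PCRE2_ALT_CIRCUMFLEX are all unset.
</P>

```python
def circumflex_matches(s, pos, multiline=False):
    if pos == 0:
        return True
    # Multiline: after an internal newline, but not after one that ends the string.
    return multiline and s[pos - 1] == "\n" and pos < len(s)


def dollar_matches(s, pos, multiline=False, dollar_endonly=False):
    if pos == len(s):
        return True
    if multiline:
        # Before any newline; dollar_endonly is ignored in multiline mode.
        return s[pos] == "\n"
    if dollar_endonly:
        return False
    # Default: also immediately before a newline that ends the string.
    return pos == len(s) - 1 and s[pos] == "\n"


def positions(pred, s, **options):
    return [p for p in range(len(s) + 1) if pred(s, p, **options)]
```

<P>
For the subject "def\nabc", the model gives circumflex positions 0 and 4 in
multiline mode but only 0 otherwise, which is why /^abc$/ can match there only
in multiline mode.
</P>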
<P>
When the newline convention (see
<a href="#newlines">"Newline conventions"</a>
above) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
</P>
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
<a name="fullstopdot"></a></P>
<br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
<P>
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
line. One or more characters may be specified as line terminators (see
<a href="#newlines">"Newline conventions"</a>
above).
</P>
<P>
Dot never matches a single line-ending character. When the two-character
sequence CRLF is the only line ending, dot does not match CR if it is
immediately followed by LF, but otherwise it matches all characters (including
isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
of CR or LF match dot. When all Unicode line endings are being recognized, dot
does not match CR or LF or any of the other line ending characters.
</P>
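<P>
The way dot depends on the newline convention can be modelled as follows. This
Python sketch covers only three conventions (a single LF, CRLF only, and
ANYCRLF) with PCRE2_DOTALL unset; the function name and the convention strings
are hypothetical labels chosen for the example, not PCRE2 identifiers.
</P>

```python
def dot_matches(s, pos, newline="LF"):
    ch = s[pos]
    if newline == "LF":
        # A single-character line ending is never matched by dot.
        return ch != "\n"
    if newline == "CRLF":
        # Only a CR immediately followed by LF is refused; isolated CR and LF match.
        return not (ch == "\r" and s[pos + 1:pos + 2] == "\n")
    if newline == "ANYCRLF":
        # No occurrence of CR or LF matches dot.
        return ch not in ("\r", "\n")
    raise ValueError("convention not covered by this sketch: " + newline)


def dot_positions(s, newline):
    return [p for p in range(len(s)) if dot_matches(s, p, newline)]
```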
<P>
The behaviour of dot with regard to newlines can be changed. If the
PCRE2_DOTALL option is set, a dot matches any one character, without exception.
If the two-character sequence CRLF is present in the subject string, it takes
two dots to match it.
</P>
<P>
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
<P>
The escape sequence \N when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
</P>
<P>
When \N is followed by an opening brace it has a different meaning. See the
section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
<P>
Outside a character class, the escape sequence \C matches any one code unit,
whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used.
</P>
<P>
Because \C breaks up characters into individual code units, matching one unit
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
</P>
1335*22dc650dSSadaf Ebrahimi<P>
1336*22dc650dSSadaf EbrahimiAn application can lock out the use of \C by setting the
1337*22dc650dSSadaf EbrahimiPCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
1338*22dc650dSSadaf Ebrahimibuild PCRE2 with the use of \C permanently disabled.
1339*22dc650dSSadaf Ebrahimi</P>
<P>
PCRE2 does not allow \C to appear in lookbehind assertions
<a href="#lookbehind">(described below)</a>
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
<b>pcre2_dfa_match()</b> nor the JIT optimizer supports \C in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
</P>
<P>
In the 32-bit library, however, \C is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
</P>
<P>
In general, the \C escape sequence is best avoided. However, one way of using
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
<pre>
  (?| (?=[\x00-\x7f])(\C) |
      (?=[\x80-\x{7ff}])(\C)(\C) |
      (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
      (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
</pre>
In this example, a group that starts with (?| resets the capturing parentheses
numbers in each alternative (see
<a href="#dupgroupnumber">"Duplicate Group Numbers"</a>
below). The assertions at the start of each branch check the next UTF-8
character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate number of
\C groups.
<a name="characterclass"></a></P>
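<P>
As a rough cross-check of the code point ranges used in the four branches of
the pattern above, the following Python sketch computes how many bytes the
UTF-8 encoding of a code point uses, which is the number of \C items each
branch needs. This is not PCRE2 code, and the helper name <b>utf8_length</b> is
made up for illustration.
</P>

```python
def utf8_length(code_point):
    """Return the number of bytes in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))


# Boundaries of the ranges tested by the four lookahead assertions.
boundaries = (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF)
lengths = {cp: utf8_length(cp) for cp in boundaries}

for cp, n in lengths.items():
    print(f"U+{cp:04X}: {n} byte(s)")
```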
<br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special by default.
If a closing square bracket is required as a member of the class, it should be
the first data character in the class (after an initial circumflex, if present)
or escaped with a backslash. This means that, by default, an empty class cannot
be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
square bracket at the start does end the (empty) class.
</P>
<P>
A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless the
first character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a circumflex
is actually required as a member of the class, ensure it is not the first
character, or escape it with a backslash.
</P>
<P>
For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
</P>
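<P>
The difference between a negated class and an assertion can be sketched with
Python's <b>re</b> module, which is used here only as a stand-in because it
behaves the same way for these simple patterns; it is not PCRE2.
</P>

```python
import re

# A negated class must consume a character, so it cannot match at the
# end of the subject string.
at_end = re.search(r"a[^b]", "a")
with_next = re.search(r"a[^b]", "ac")

# A negative lookahead is an assertion: it consumes nothing, so it can
# succeed at the end of the string.
lookahead = re.search(r"a(?!b)", "a")

print(at_end)             # None
print(with_next.group())  # ac
print(lookahead.group())  # a
```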
<P>
Characters in a class may be specified by their code points using \o, \x, or
\N{U+hh..} in the usual way. When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
</P>
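<P>
A minimal illustration of this point, again using Python's <b>re</b> module as
a stand-in rather than PCRE2 itself: a dot does not match a newline by default,
but a negated class does, with or without the multiline flag.
</P>

```python
import re

dot_match = re.fullmatch(r".", "\n")                    # dot skips newline
class_match = re.fullmatch(r"[^a]", "\n")               # class does not care
class_match_ml = re.fullmatch(r"[^a]", "\n", re.MULTILINE)

print(dot_match is None)           # True
print(class_match is not None)     # True
print(class_match_ml is not None)  # True
```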
<P>
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
\S, \v, \V, \w, and \W may appear in a character class, and add the
characters that they match to the class. For example, [\dABCDEF] matches any
hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
\d, \s, \w and their upper case partners, just as it does when they appear
outside a character class, as described in the section entitled
<a href="#genericchartypes">"Generic character types"</a>
above. The escape sequence \b has a different meaning inside a character
class; it matches the backspace character. The sequences \B, \R, and \X are
not special inside a character class. Like any other unrecognized escape
sequences, they cause an error. The same is true for \N when not followed by
an opening brace.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
</P>
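<P>
The [b-d-z] example can be tried out with Python's <b>re</b> module, which
happens to parse this particular class the same way. It is used here only as an
illustration and is not PCRE2.
</P>

```python
import re

hyphen_class = re.compile(r"[b-d-z]")

# Test a few single characters against the class.
matched = [ch for ch in "abcde-yz" if hyphen_class.fullmatch(ch)]
print(matched)  # ['b', 'c', 'd', '-', 'z']
```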
<P>
Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or before or after a character type escape such as \d or \H.
However, unless the hyphen is the last character in the class, Perl outputs a
warning in its warning mode, as this is most likely a user error. As PCRE2 has
no facility for warning, an error is given in these cases.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range. A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
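<P>
The two interpretations described above can be observed with Python's
<b>re</b> module, which treats these two patterns in the same way. This is only
a sketch using a different regex engine, not a test of PCRE2.
</P>

```python
import re

unescaped = re.compile(r"[W-]46]")   # class {W, -} followed by literal "46]"
escaped = re.compile(r"[W-\]46]")    # class: range W..] plus "4" and "6"

unescaped_hits = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]
escaped_hits = [ch for ch in "VWXZ]^46-" if escaped.fullmatch(ch)]

print(unescaped_hits)  # ['W46]', '-46]']
print(escaped_hits)    # ['W', 'X', 'Z', ']', '4', '6']
```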
<P>
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\000-\037]. Ranges can include any characters that are valid for the
current mode. In any UTF mode, the so-called "surrogate" characters (those
whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
surrogates, are always permitted.
</P>
<P>
There is a special case in EBCDIC environments for ranges whose end points are
both specified as literal letters in the same case. For compatibility with
Perl, EBCDIC code points within the range that are not letters are omitted. For
example, [h-k] matches only four characters, even though the codes for h and k
are 0x88 and 0x92, a range of 11 code points. However, if the range is
specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
are included.
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
<P>
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore,
whereas [\w] includes underscore. A positive character class should be read as
"something OR something OR ..." and a negative class as "NOT something AND NOT
something AND NOT ...".
</P>
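<P>
The "NOT something AND NOT something" reading of [^\W_] can be checked with a
short Python sketch. Python's <b>re</b> module is used as a stand-in for PCRE2,
with the ASCII flag set so that \w is restricted to ASCII letters, digits, and
underscore.
</P>

```python
import re

# "Not a non-word character AND not underscore" leaves letters and digits.
letters_digits = re.compile(r"[^\W_]", re.ASCII)
word_chars = re.compile(r"[\w]", re.ASCII)

restricted = [ch for ch in "aZ5_- " if letters_digits.fullmatch(ch)]
unrestricted = [ch for ch in "aZ5_- " if word_chars.fullmatch(ch)]

print(restricted)    # ['a', 'Z', '5']
print(unrestricted)  # ['a', 'Z', '5', '_']
```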
<P>
The only metacharacters that are recognized in character classes are backslash,
hyphen (only where it can be interpreted as specifying a range), circumflex
(only at the start), opening square bracket (only when it can be interpreted as
introducing a POSIX class name, or for a special compatibility feature - see
the next two sections), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
</P>
<br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
<P>
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
this notation. For example,
<pre>
  [01[:alpha:]%]
</pre>
matches "0", "1", any alphabetic character, or "%". The supported class names
are:
<pre>
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits and space
  space    white space (the same as \s from PCRE 8.34)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
</pre>
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). If locale-specific matching is taking place, the list of space
characters may be different; there may be fewer or more of them. "Space" and
\s match the same set of characters, as do "word" and \w.
</P>
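<P>
Python's <b>re</b> module does not support the POSIX class notation, but its
ASCII-only \s is documented as matching the same six characters as the default
list above. The following sketch enumerates them; it illustrates the list only
and does not exercise PCRE2.
</P>

```python
import re

# Default "space" code points as listed above: HT, LF, VT, FF, CR, space.
DEFAULT_SPACE = [9, 10, 11, 12, 13, 32]

# Code points below 128 that Python's ASCII-only \s matches.
ascii_space = [cp for cp in range(128)
               if re.fullmatch(r"\s", chr(cp), re.ASCII)]

print(ascii_space)
print(ascii_space == DEFAULT_SPACE)
```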
<P>
The name "word" is a Perl extension, and "blank" is a GNU extension that Perl
has supported since version 5.8. Another Perl extension is negation, which is
indicated by a ^ character after the colon. For example,
<pre>
  [12[:^digit:]]
</pre>
matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
By default, characters with values greater than 127 do not match any of the
POSIX character classes, although this may be different for characters in the
range 128-255 when locale-specific matching is happening. However, in UCP mode,
unless certain options are set (see below), some of the classes are changed so
that Unicode character properties are used. This is achieved by replacing
POSIX classes with other sequences, as follows:
<pre>
  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:cntrl:]  becomes  \p{Cc}
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}
</pre>
Negated versions, such as [:^alpha:] use \P instead of \p. Four other POSIX
classes are handled specially in UCP mode:
</P>
<P>
[:graph:]
This matches characters that have glyphs that mark the page when printed. In
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
properties, except for:
<pre>
  U+061C           Arabic Letter Mark
  U+180E           Mongolian Vowel Separator
  U+2066 - U+2069  Various "isolate"s
</pre>
</P>
<P>
[:print:]
This matches the same characters as [:graph:] plus space characters that are
not controls, that is, characters with the Zs property.
</P>
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
[:xdigit:]
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
versions of those characters, whose Unicode code points start at U+FF10. This
is a change that was made in PCRE2 release 10.43 for Perl compatibility.
</P>
<P>
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
with code points less than 256.
</P>
<P>
There are two options that can be used to restrict the POSIX classes to ASCII
characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
(?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
(?aP) and (?-aP) set and unset both these options for consistency.
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
syntax [[:&#60;:]] and [[:&#62;:]] is used for matching "start of word" and "end of
word". PCRE2 treats these items as follows:
<pre>
  [[:&#60;:]]  is converted to  \b(?=\w)
  [[:&#62;:]]  is converted to  \b(?&#60;=\w)
</pre>
Only these exact character sequences are recognized. A sequence such as
[a[:&#60;:]b] provokes an error for an unrecognized POSIX class name. This support
is not compatible with Perl. It is provided to help migrations from other
environments, and is best not used in any new patterns. Note that \b matches
at the start and the end of a word (see
<a href="#smallassertions">"Simple assertions"</a>
above), and in a Perl-style pattern the preceding or following character
normally shows which is wanted, without the need for the assertions that are
used above in order to give exactly the POSIX behaviour. Note also that the
PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
it also affects these POSIX sequences.
</P>
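<P>
The two replacement sequences shown above are ordinary assertions, so their
effect can be sketched in any engine that supports \b and lookaround. The
following uses Python's <b>re</b> module as a stand-in for PCRE2; the subject
string is an arbitrary example.
</P>

```python
import re

text = "hi there"

# \b(?=\w): a word boundary followed by a word character (start of word).
starts = [m.start() for m in re.finditer(r"\b(?=\w)", text)]

# \b(?<=\w): a word boundary preceded by a word character (end of word).
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", text)]

print(starts)  # [0, 3]
print(ends)    # [2, 8]
```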
<br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
<P>
Vertical bar characters are used to separate alternative patterns. For example,
the pattern
<pre>
  gilbert|sullivan
</pre>
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string). The matching
process tries each alternative in turn, from left to right, and the first one
that succeeds is used. If the alternatives are within a group
<a href="#group">(defined below),</a>
"succeeds" means matching the rest of the main pattern as well as the
alternative in the group.
<a name="internaloptions"></a></P>
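<P>
The left-to-right rule, and what "succeeds" means inside a group, can be seen
in a small Python sketch. Python's <b>re</b> module is a backtracking matcher
that follows the same rule for these patterns; it is used for illustration only
and is not PCRE2.
</P>

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes matching the rest of the pattern,
# so the matcher moves on to the second alternative here.
in_group = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"a|", "") is not None

print(first)     # a
print(in_group)  # ab
print(empty_ok)  # True
```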
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of several options can be changed within a pattern by a sequence
of letters enclosed between "(?" and ")". The following are Perl-compatible,
and are described in detail in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The option letters are:
<pre>
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
  n  for PCRE2_NO_AUTO_CAPTURE
  s  for PCRE2_DOTALL
  x  for PCRE2_EXTENDED
  xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching. It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting
either one cancels the effects of both of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
</P>
<P>
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Letters may follow the circumflex to cause some options to
be re-instated, but a hyphen may not appear.
</P>
<P>
Some PCRE2-specific options can be changed by the same mechanism using these
pairs or individual letters:
<pre>
  aD for PCRE2_EXTRA_ASCII_BSD
  aS for PCRE2_EXTRA_ASCII_BSS
  aW for PCRE2_EXTRA_ASCII_BSW
  aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
  aT for PCRE2_EXTRA_ASCII_DIGIT
  r  for PCRE2_EXTRA_CASELESS_RESTRICT
  J  for PCRE2_DUPNAMES
  U  for PCRE2_UNGREEDY
</pre>
However, except for 'r', these are not unset by (?^), which is equivalent to
(?-imnrsx). If 'a' is not followed by any of the upper case letters shown
above, it sets (or unsets) all the ASCII options.
</P>
<P>
PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
restrictions for POSIX classes.
</P>
<P>
When one of these option changes occurs at top level (that is, not inside group
parentheses), the change applies until a subsequent change, or the end of the
pattern. An option change within a group (see below for a description of
groups) affects only that part of the group that follows it. At the end of the
group these options are reset to the state they were before the group. For
example,
<pre>
  (a(?i)b)c
</pre>
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not set
externally). Any changes made in one alternative do carry on into subsequent
branches within the same group. For example,
<pre>
  (a(?i)b|c)
</pre>
matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.
</P>
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing group (see the next section), the option letters may
appear between the "?" and the ":". Thus the two patterns
<pre>
  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings.
</P>
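<P>
The scoped form can be tried with Python's <b>re</b> module, which supports the
(?i:...) syntax. This is only an analogue of the PCRE2 behaviour: recent Python
versions reject a bare (?i) that is not at the very start of a pattern, so the
earlier (a(?i)b)c example is written here with a scoped group instead.
</P>

```python
import re

# Caseless matching applies to both alternatives inside the group.
days = re.compile(r"(?i:saturday|sunday)")
day_hits = [s for s in ("Saturday", "SUNDAY", "monday") if days.fullmatch(s)]

# Only "b" is matched caselessly; "a" and "c" remain case-sensitive.
scoped = re.compile(r"(a(?i:b))c")
scoped_hits = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]

print(day_hits)     # ['Saturday', 'SUNDAY']
print(scoped_hits)  # ['abc', 'aBc']
```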
<P>
<b>Note:</b> There are other PCRE2-specific options, applying to the whole
pattern, which can be set by the application when the compiling function is
called. In addition, the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been defaulted.
Details are given in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
to set UTF and Unicode property modes; they are equivalent to setting the
PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of the
(*UTF) and (*UCP) sequences.
<a name="group"></a></P>
<br><a name="SEC14" href="#TOC1">GROUPS</a><br>
<P>
Groups are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a group does two things:
<br>
<br>
1. It localizes a set of alternatives. For example, the pattern
<pre>
  cat(aract|erpillar|)
</pre>
matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
match "cataract", "erpillar" or an empty string.
<br>
<br>
2. It creates a "capture group". This means that, when the whole pattern
matches, the portion of the subject string that matched the group is passed
back to the caller, separately from the portion that matched the whole pattern.
(This applies only to the traditional matching function; the DFA matching
function does not support capturing.)
</P>
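<P>
The effect of the parentheses on the set of alternatives in point 1 can be
checked with a short sketch. Python's <b>re</b> module is used as a stand-in
for PCRE2, and the word list is chosen only for illustration.
</P>

```python
import re

grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

words = ["cataract", "caterpillar", "cat", "erpillar", ""]

grouped_hits = [w for w in words if grouped.fullmatch(w)]
ungrouped_hits = [w for w in words if ungrouped.fullmatch(w)]

print(grouped_hits)    # ['cataract', 'caterpillar', 'cat']
print(ungrouped_hits)  # ['cataract', 'erpillar', '']
```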
<P>
Opening parentheses are counted from left to right (starting from 1) to obtain
numbers for capture groups. For example, if the string "the red king" is
matched against the pattern
<pre>
  the ((red|white) (king|queen))
</pre>
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
</P>
1765*22dc650dSSadaf Ebrahimi<P>
1766*22dc650dSSadaf EbrahimiThe fact that plain parentheses fulfil two functions is not always helpful.
1767*22dc650dSSadaf EbrahimiThere are often times when grouping is required without capturing. If an
1768*22dc650dSSadaf Ebrahimiopening parenthesis is followed by a question mark and a colon, the group
1769*22dc650dSSadaf Ebrahimidoes not do any capturing, and is not counted when computing the number of any
1770*22dc650dSSadaf Ebrahimisubsequent capture groups. For example, if the string "the white queen"
1771*22dc650dSSadaf Ebrahimiis matched against the pattern
1772*22dc650dSSadaf Ebrahimi<pre>
1773*22dc650dSSadaf Ebrahimi  the ((?:red|white) (king|queen))
1774*22dc650dSSadaf Ebrahimi</pre>
1775*22dc650dSSadaf Ebrahimithe captured substrings are "white queen" and "queen", and are numbered 1 and
1776*22dc650dSSadaf Ebrahimi2. The maximum number of capture groups is 65535.
1777*22dc650dSSadaf Ebrahimi</P>
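<P>
A short check of the non-capturing form, again as a sketch that assumes
Python's <b>re</b> module handles (?:...) in the same way for this pattern:
</P>

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
match = pattern.fullmatch("the white queen")

group_count = pattern.groups   # the (?:...) group is not counted
captures = match.groups()      # substrings for groups 1 and 2
print(group_count, captures)
```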
1778*22dc650dSSadaf Ebrahimi<P>
1779*22dc650dSSadaf EbrahimiAs a convenient shorthand, if any option settings are required at the start of
1780*22dc650dSSadaf Ebrahimia non-capturing group, the option letters may appear between the "?" and the
1781*22dc650dSSadaf Ebrahimi":". Thus the two patterns
1782*22dc650dSSadaf Ebrahimi<pre>
1783*22dc650dSSadaf Ebrahimi  (?i:saturday|sunday)
1784*22dc650dSSadaf Ebrahimi  (?:(?i)saturday|sunday)
1785*22dc650dSSadaf Ebrahimi</pre>
1786*22dc650dSSadaf Ebrahimimatch exactly the same set of strings. Because alternative branches are tried
1787*22dc650dSSadaf Ebrahimifrom left to right, and options are not reset until the end of the group is
1788*22dc650dSSadaf Ebrahimireached, an option setting in one branch does affect subsequent branches, so
1789*22dc650dSSadaf Ebrahimithe above patterns match "SUNDAY" as well as "Saturday".
1790*22dc650dSSadaf Ebrahimi<a name="dupgroupnumber"></a></P>
1791*22dc650dSSadaf Ebrahimi<br><a name="SEC15" href="#TOC1">DUPLICATE GROUP NUMBERS</a><br>
1792*22dc650dSSadaf Ebrahimi<P>
1793*22dc650dSSadaf EbrahimiPerl 5.10 introduced a feature whereby each alternative in a group uses the
1794*22dc650dSSadaf Ebrahimisame numbers for its capturing parentheses. Such a group starts with (?| and is
1795*22dc650dSSadaf Ebrahimiitself a non-capturing group. For example, consider this pattern:
1796*22dc650dSSadaf Ebrahimi<pre>
1797*22dc650dSSadaf Ebrahimi  (?|(Sat)ur|(Sun))day
1798*22dc650dSSadaf Ebrahimi</pre>
1799*22dc650dSSadaf EbrahimiBecause the two alternatives are inside a (?| group, both sets of capturing
1800*22dc650dSSadaf Ebrahimiparentheses are numbered one. Thus, when the pattern matches, you can look
1801*22dc650dSSadaf Ebrahimiat captured substring number one, whichever alternative matched. This construct
1802*22dc650dSSadaf Ebrahimiis useful when you want to capture part, but not all, of one of a number of
1803*22dc650dSSadaf Ebrahimialternatives. Inside a (?| group, parentheses are numbered as usual, but the
1804*22dc650dSSadaf Ebrahiminumber is reset at the start of each branch. The numbers of any capturing
1805*22dc650dSSadaf Ebrahimiparentheses that follow the whole group start after the highest number used in
1806*22dc650dSSadaf Ebrahimiany branch. The following example is taken from the Perl documentation. The
1807*22dc650dSSadaf Ebrahiminumbers underneath show in which buffer the captured content will be stored.
1808*22dc650dSSadaf Ebrahimi<pre>
1809*22dc650dSSadaf Ebrahimi  # before  ---------------branch-reset----------- after
1810*22dc650dSSadaf Ebrahimi  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1811*22dc650dSSadaf Ebrahimi  # 1            2         2  3        2     3     4
1812*22dc650dSSadaf Ebrahimi</pre>
1813*22dc650dSSadaf EbrahimiA backreference to a capture group uses the most recent value that is set for
1814*22dc650dSSadaf Ebrahimithe group. The following pattern matches "abcabc" or "defdef":
1815*22dc650dSSadaf Ebrahimi<pre>
1816*22dc650dSSadaf Ebrahimi  /(?|(abc)|(def))\1/
1817*22dc650dSSadaf Ebrahimi</pre>
1818*22dc650dSSadaf EbrahimiIn contrast, a subroutine call to a capture group always refers to the
1819*22dc650dSSadaf Ebrahimifirst one in the pattern with the given number. The following pattern matches
1820*22dc650dSSadaf Ebrahimi"abcabc" or "defabc":
1821*22dc650dSSadaf Ebrahimi<pre>
1822*22dc650dSSadaf Ebrahimi  /(?|(abc)|(def))(?1)/
1823*22dc650dSSadaf Ebrahimi</pre>
1824*22dc650dSSadaf EbrahimiA relative reference such as (?-1) is no different: it is just a convenient way
1825*22dc650dSSadaf Ebrahimiof computing an absolute group number.
1826*22dc650dSSadaf Ebrahimi</P>
1827*22dc650dSSadaf Ebrahimi<P>
1828*22dc650dSSadaf EbrahimiIf a
1829*22dc650dSSadaf Ebrahimi<a href="#conditions">condition test</a>
for whether a group has matched refers to a non-unique number, the test is
1831*22dc650dSSadaf Ebrahimitrue if any group with that number has matched.
1832*22dc650dSSadaf Ebrahimi</P>
1833*22dc650dSSadaf Ebrahimi<P>
1834*22dc650dSSadaf EbrahimiAn alternative approach to using this "branch reset" feature is to use
1835*22dc650dSSadaf Ebrahimiduplicate named groups, as described in the next section.
1836*22dc650dSSadaf Ebrahimi</P>
1837*22dc650dSSadaf Ebrahimi<br><a name="SEC16" href="#TOC1">NAMED CAPTURE GROUPS</a><br>
1838*22dc650dSSadaf Ebrahimi<P>
1839*22dc650dSSadaf EbrahimiIdentifying capture groups by number is simple, but it can be very hard to keep
1840*22dc650dSSadaf Ebrahimitrack of the numbers in complicated patterns. Furthermore, if an expression is
1841*22dc650dSSadaf Ebrahimimodified, the numbers may change. To help with this difficulty, PCRE2 supports
1842*22dc650dSSadaf Ebrahimithe naming of capture groups. This feature was not added to Perl until release
1843*22dc650dSSadaf Ebrahimi5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
1844*22dc650dSSadaf Ebrahimiusing the Python syntax. PCRE2 supports both the Perl and the Python syntax.
1845*22dc650dSSadaf Ebrahimi</P>
1846*22dc650dSSadaf Ebrahimi<P>
1847*22dc650dSSadaf EbrahimiIn PCRE2, a capture group can be named in one of three ways: (?&#60;name&#62;...) or
1848*22dc650dSSadaf Ebrahimi(?'name'...) as in Perl, or (?P&#60;name&#62;...) as in Python. Names may be up to 128
1849*22dc650dSSadaf Ebrahimicode units long. When PCRE2_UTF is not set, they may contain only ASCII
1850*22dc650dSSadaf Ebrahimialphanumeric characters and underscores, but must start with a non-digit. When
1851*22dc650dSSadaf EbrahimiPCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
1852*22dc650dSSadaf Ebrahimiletter or Unicode decimal digit. In other words, group names must match one of
1853*22dc650dSSadaf Ebrahimithese patterns:
1854*22dc650dSSadaf Ebrahimi<pre>
1855*22dc650dSSadaf Ebrahimi  ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
1856*22dc650dSSadaf Ebrahimi  ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
1857*22dc650dSSadaf Ebrahimi</pre>
1858*22dc650dSSadaf EbrahimiReferences to capture groups from other parts of the pattern, such as
1859*22dc650dSSadaf Ebrahimi<a href="#backreferences">backreferences,</a>
1860*22dc650dSSadaf Ebrahimi<a href="#recursion">recursion,</a>
1861*22dc650dSSadaf Ebrahimiand
1862*22dc650dSSadaf Ebrahimi<a href="#conditions">conditions,</a>
1863*22dc650dSSadaf Ebrahimican all be made by name as well as by number.
1864*22dc650dSSadaf Ebrahimi</P>
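<P>
The naming rule for the case where PCRE2_UTF is not set can be expressed as a
small validator. This is only a sketch of the rule as stated above, not code
taken from PCRE2: the function name is_valid_group_name and the constant
MAX_NAME_LENGTH are hypothetical, and fullmatch is used in place of the ^ and
\z anchors.
</P>

```python
import re

MAX_NAME_LENGTH = 128  # limit quoted in the text, in code units

# First character: underscore or ASCII letter; the rest may include digits.
_NAME_RULE = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")


def is_valid_group_name(name: str) -> bool:
    """Check a name against the non-UTF rule described in the text."""
    # For ASCII-only names, one character is one code unit.
    return (len(name) <= MAX_NAME_LENGTH
            and _NAME_RULE.fullmatch(name) is not None)


print(is_valid_group_name("year"), is_valid_group_name("1st"))
```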
1865*22dc650dSSadaf Ebrahimi<P>
1866*22dc650dSSadaf EbrahimiNamed capture groups are allocated numbers as well as names, exactly as
1867*22dc650dSSadaf Ebrahimiif the names were not present. In both PCRE2 and Perl, capture groups
1868*22dc650dSSadaf Ebrahimiare primarily identified by numbers; any names are just aliases for these
1869*22dc650dSSadaf Ebrahiminumbers. The PCRE2 API provides function calls for extracting the complete
1870*22dc650dSSadaf Ebrahiminame-to-number translation table from a compiled pattern, as well as
1871*22dc650dSSadaf Ebrahimiconvenience functions for extracting captured substrings by name.
1872*22dc650dSSadaf Ebrahimi</P>
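<P>
The idea that a name is an alias for a number can be seen in any engine that
exposes a name-to-number table. The sketch below uses Python's <b>re</b>
module, whose groupindex attribute plays a comparable role; it does not show
the PCRE2 functions, and the example pattern is an assumption made for
illustration.
</P>

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Name-to-number table for the compiled pattern.
name_table = dict(pattern.groupindex)

match = pattern.fullmatch("2024-05")
# Looking a group up by name or by number yields the same substring.
same_value = match.group("year") == match.group(1)
print(name_table, same_value)
```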
1873*22dc650dSSadaf Ebrahimi<P>
1874*22dc650dSSadaf Ebrahimi<b>Warning:</b> When more than one capture group has the same number, as
1875*22dc650dSSadaf Ebrahimidescribed in the previous section, a name given to one of them applies to all
1876*22dc650dSSadaf Ebrahimiof them. Perl allows identically numbered groups to have different names.
1877*22dc650dSSadaf EbrahimiConsider this pattern, where there are two capture groups, both numbered 1:
1878*22dc650dSSadaf Ebrahimi<pre>
1879*22dc650dSSadaf Ebrahimi  (?|(?&#60;AA&#62;aa)|(?&#60;BB&#62;bb))
1880*22dc650dSSadaf Ebrahimi</pre>
1881*22dc650dSSadaf EbrahimiPerl allows this, with both names AA and BB as aliases of group 1. Thus, after
1882*22dc650dSSadaf Ebrahimia successful match, both names yield the same value (either "aa" or "bb").
1883*22dc650dSSadaf Ebrahimi</P>
1884*22dc650dSSadaf Ebrahimi<P>
1885*22dc650dSSadaf EbrahimiIn an attempt to reduce confusion, PCRE2 does not allow the same group number
1886*22dc650dSSadaf Ebrahimito be associated with more than one name. The example above provokes a
1887*22dc650dSSadaf Ebrahimicompile-time error. However, there is still scope for confusion. Consider this
1888*22dc650dSSadaf Ebrahimipattern:
1889*22dc650dSSadaf Ebrahimi<pre>
1890*22dc650dSSadaf Ebrahimi  (?|(?&#60;AA&#62;aa)|(bb))
1891*22dc650dSSadaf Ebrahimi</pre>
1892*22dc650dSSadaf EbrahimiAlthough the second group number 1 is not explicitly named, the name AA is
1893*22dc650dSSadaf Ebrahimistill an alias for any group 1. Whether the pattern matches "aa" or "bb", a
1894*22dc650dSSadaf Ebrahimireference by name to group AA yields the matched string.
1895*22dc650dSSadaf Ebrahimi</P>
1896*22dc650dSSadaf Ebrahimi<P>
1897*22dc650dSSadaf EbrahimiBy default, a name must be unique within a pattern, except that duplicate names
1898*22dc650dSSadaf Ebrahimiare permitted for groups with the same number, for example:
1899*22dc650dSSadaf Ebrahimi<pre>
1900*22dc650dSSadaf Ebrahimi  (?|(?&#60;AA&#62;aa)|(?&#60;AA&#62;bb))
1901*22dc650dSSadaf Ebrahimi</pre>
1902*22dc650dSSadaf EbrahimiThe duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
1903*22dc650dSSadaf Ebrahimioption at compile time, or by the use of (?J) within the pattern, as described
1904*22dc650dSSadaf Ebrahimiin the section entitled
1905*22dc650dSSadaf Ebrahimi<a href="#internaloptions">"Internal Option Setting"</a>
1906*22dc650dSSadaf Ebrahimiabove.
1907*22dc650dSSadaf Ebrahimi</P>
1908*22dc650dSSadaf Ebrahimi<P>
1909*22dc650dSSadaf EbrahimiDuplicate names can be useful for patterns where only one instance of the named
1910*22dc650dSSadaf Ebrahimicapture group can match. Suppose you want to match the name of a weekday,
1911*22dc650dSSadaf Ebrahimieither as a 3-letter abbreviation or as the full name, and in both cases you
1912*22dc650dSSadaf Ebrahimiwant to extract the abbreviation. This pattern (ignoring the line breaks) does
1913*22dc650dSSadaf Ebrahimithe job:
1914*22dc650dSSadaf Ebrahimi<pre>
1915*22dc650dSSadaf Ebrahimi  (?J)
1916*22dc650dSSadaf Ebrahimi  (?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
1917*22dc650dSSadaf Ebrahimi  (?&#60;DN&#62;Tue)(?:sday)?|
1918*22dc650dSSadaf Ebrahimi  (?&#60;DN&#62;Wed)(?:nesday)?|
1919*22dc650dSSadaf Ebrahimi  (?&#60;DN&#62;Thu)(?:rsday)?|
1920*22dc650dSSadaf Ebrahimi  (?&#60;DN&#62;Sat)(?:urday)?
1921*22dc650dSSadaf Ebrahimi</pre>
1922*22dc650dSSadaf EbrahimiThere are five capture groups, but only one is ever set after a match. The
convenience functions for extracting the data by name return the substring for
1924*22dc650dSSadaf Ebrahimithe first (and in this example, the only) group of that name that matched. This
1925*22dc650dSSadaf Ebrahimisaves searching to find which numbered group it was. (An alternative way of
1926*22dc650dSSadaf Ebrahimisolving this problem is to use a "branch reset" group, as described in the
1927*22dc650dSSadaf Ebrahimiprevious section.)
1928*22dc650dSSadaf Ebrahimi</P>
1929*22dc650dSSadaf Ebrahimi<P>
1930*22dc650dSSadaf EbrahimiIf you make a backreference to a non-unique named group from elsewhere in the
1931*22dc650dSSadaf Ebrahimipattern, the groups to which the name refers are checked in the order in which
1932*22dc650dSSadaf Ebrahimithey appear in the overall pattern. The first one that is set is used for the
1933*22dc650dSSadaf Ebrahimireference. For example, this pattern matches both "foofoo" and "barbar" but not
1934*22dc650dSSadaf Ebrahimi"foobar" or "barfoo":
1935*22dc650dSSadaf Ebrahimi<pre>
1936*22dc650dSSadaf Ebrahimi  (?J)(?:(?&#60;n&#62;foo)|(?&#60;n&#62;bar))\k&#60;n&#62;
</pre>
1939*22dc650dSSadaf Ebrahimi</P>
1940*22dc650dSSadaf Ebrahimi<P>
1941*22dc650dSSadaf EbrahimiIf you make a subroutine call to a non-unique named group, the one that
1942*22dc650dSSadaf Ebrahimicorresponds to the first occurrence of the name is used. In the absence of
1943*22dc650dSSadaf Ebrahimiduplicate numbers this is the one with the lowest number.
1944*22dc650dSSadaf Ebrahimi</P>
1945*22dc650dSSadaf Ebrahimi<P>
1946*22dc650dSSadaf EbrahimiIf you use a named reference in a condition
1947*22dc650dSSadaf Ebrahimitest (see the
1948*22dc650dSSadaf Ebrahimi<a href="#conditions">section about conditions</a>
1949*22dc650dSSadaf Ebrahimibelow), either to check whether a capture group has matched, or to check for
1950*22dc650dSSadaf Ebrahimirecursion, all groups with the same name are tested. If the condition is true
1951*22dc650dSSadaf Ebrahimifor any one of them, the overall condition is true. This is the same behaviour
1952*22dc650dSSadaf Ebrahimias testing by number. For further details of the interfaces for handling named
1953*22dc650dSSadaf Ebrahimicapture groups, see the
1954*22dc650dSSadaf Ebrahimi<a href="pcre2api.html"><b>pcre2api</b></a>
1955*22dc650dSSadaf Ebrahimidocumentation.
1956*22dc650dSSadaf Ebrahimi</P>
1957*22dc650dSSadaf Ebrahimi<br><a name="SEC17" href="#TOC1">REPETITION</a><br>
1958*22dc650dSSadaf Ebrahimi<P>
1959*22dc650dSSadaf EbrahimiRepetition is specified by quantifiers, which may follow any one of these
1960*22dc650dSSadaf Ebrahimiitems:
1961*22dc650dSSadaf Ebrahimi<pre>
1962*22dc650dSSadaf Ebrahimi  a literal data character
1963*22dc650dSSadaf Ebrahimi  the dot metacharacter
1964*22dc650dSSadaf Ebrahimi  the \C escape sequence
1965*22dc650dSSadaf Ebrahimi  the \R escape sequence
1966*22dc650dSSadaf Ebrahimi  the \X escape sequence
1967*22dc650dSSadaf Ebrahimi  any escape sequence that matches a single character
1968*22dc650dSSadaf Ebrahimi  a character class
1969*22dc650dSSadaf Ebrahimi  a backreference
1970*22dc650dSSadaf Ebrahimi  a parenthesized group (including lookaround assertions)
1971*22dc650dSSadaf Ebrahimi  a subroutine call (recursive or otherwise)
1972*22dc650dSSadaf Ebrahimi</pre>
1973*22dc650dSSadaf EbrahimiIf a quantifier does not follow a repeatable item, an error occurs. The
1974*22dc650dSSadaf Ebrahimigeneral repetition quantifier specifies a minimum and maximum number of
1975*22dc650dSSadaf Ebrahimipermitted matches by giving two numbers in curly brackets (braces), separated
1976*22dc650dSSadaf Ebrahimiby a comma. The numbers must be less than 65536, and the first must be less
1977*22dc650dSSadaf Ebrahimithan or equal to the second. For example,
1978*22dc650dSSadaf Ebrahimi<pre>
1979*22dc650dSSadaf Ebrahimi  z{2,4}
1980*22dc650dSSadaf Ebrahimi</pre>
1981*22dc650dSSadaf Ebrahimimatches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
1982*22dc650dSSadaf Ebrahimicharacter. If the second number is omitted, but the comma is present, there is
1983*22dc650dSSadaf Ebrahimino upper limit; if the second number and the comma are both omitted, the
1984*22dc650dSSadaf Ebrahimiquantifier specifies an exact number of required matches. Thus
1985*22dc650dSSadaf Ebrahimi<pre>
1986*22dc650dSSadaf Ebrahimi  [aeiou]{3,}
1987*22dc650dSSadaf Ebrahimi</pre>
1988*22dc650dSSadaf Ebrahimimatches at least 3 successive vowels, but may match many more, whereas
1989*22dc650dSSadaf Ebrahimi<pre>
1990*22dc650dSSadaf Ebrahimi  \d{8}
1991*22dc650dSSadaf Ebrahimi</pre>
1992*22dc650dSSadaf Ebrahimimatches exactly 8 digits. If the first number is omitted, the lower limit is
1993*22dc650dSSadaf Ebrahimitaken as zero; in this case the upper limit must be present.
1994*22dc650dSSadaf Ebrahimi<pre>
1995*22dc650dSSadaf Ebrahimi  X{,4} is interpreted as X{0,4}
1996*22dc650dSSadaf Ebrahimi</pre>
1997*22dc650dSSadaf EbrahimiThis is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
1998*22dc650dSSadaf Ebrahimiearlier versions such a sequence was not interpreted as a quantifier. Other
1999*22dc650dSSadaf Ebrahimiregular expression engines may behave either way.
2000*22dc650dSSadaf Ebrahimi</P>
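<P>
The quantifier forms above can be tried out with a few subjects. This sketch
assumes Python's <b>re</b> module, which accepts the same brace syntax for
these cases and also reads {,4} as {0,4}; the helper name full is hypothetical.
</P>

```python
import re


def full(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


results = {
    "z{2,4}": [full(r"z{2,4}", s) for s in ("z", "zz", "zzzz", "zzzzz")],
    "[aeiou]{3,}": [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")],
    r"\d{8}": [full(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")],
    "X{,4}": [full(r"X{,4}", s) for s in ("", "XXXX", "XXXXX")],
}
for pattern, outcome in results.items():
    print(pattern, outcome)
```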
2001*22dc650dSSadaf Ebrahimi<P>
2002*22dc650dSSadaf EbrahimiIf the characters that follow an opening brace do not match the syntax of a
2003*22dc650dSSadaf Ebrahimiquantifier, the brace is taken as a literal character. In particular, this
2004*22dc650dSSadaf Ebrahimimeans that {,} is a literal string of three characters.
2005*22dc650dSSadaf Ebrahimi</P>
2006*22dc650dSSadaf Ebrahimi<P>
2007*22dc650dSSadaf EbrahimiNote that not every opening brace is potentially the start of a quantifier
2008*22dc650dSSadaf Ebrahimibecause braces are used in other items such as \N{U+345} or \k{name}.
2009*22dc650dSSadaf Ebrahimi</P>
2010*22dc650dSSadaf Ebrahimi<P>
2011*22dc650dSSadaf EbrahimiIn UTF modes, quantifiers apply to characters rather than to individual code
2012*22dc650dSSadaf Ebrahimiunits. Thus, for example, \x{100}{2} matches two characters, each of
2013*22dc650dSSadaf Ebrahimiwhich is represented by a two-byte sequence in a UTF-8 string. Similarly,
2014*22dc650dSSadaf Ebrahimi\X{3} matches three Unicode extended grapheme clusters, each of which may be
2015*22dc650dSSadaf Ebrahimiseveral code units long (and they may be of different lengths).
2016*22dc650dSSadaf Ebrahimi</P>
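<P>
The difference between repeating a character and repeating a code unit can be
sketched with Python's <b>re</b> module. The analogy is loose and is an
assumption of this example: a text pattern stands in for a UTF mode, and a
byte pattern stands in for 8-bit matching without PCRE2_UTF.
</P>

```python
import re

char = "\u0100"                  # the character U+0100
encoded = char.encode("utf-8")   # its UTF-8 form, two code units

# Text pattern: {2} applies to the whole character.
text_match = re.fullmatch("\u0100{2}", char * 2)

# Byte pattern: {2} applies only to the final byte, 0x80.
byte_pattern = re.compile(encoded + b"{2}")
repeats_last_byte = byte_pattern.fullmatch(b"\xc4\x80\x80")
repeats_character = byte_pattern.fullmatch(encoded * 2)

print(len(encoded), bool(text_match),
      bool(repeats_last_byte), bool(repeats_character))
```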
2017*22dc650dSSadaf Ebrahimi<P>
2018*22dc650dSSadaf EbrahimiThe quantifier {0} is permitted, causing the expression to behave as if the
2019*22dc650dSSadaf Ebrahimiprevious item and the quantifier were not present. This may be useful for
2020*22dc650dSSadaf Ebrahimicapture groups that are referenced as
2021*22dc650dSSadaf Ebrahimi<a href="#groupsassubroutines">subroutines</a>
2022*22dc650dSSadaf Ebrahimifrom elsewhere in the pattern (but see also the section entitled
2023*22dc650dSSadaf Ebrahimi<a href="#subdefine">"Defining capture groups for use by reference only"</a>
2024*22dc650dSSadaf Ebrahimibelow). Except for parenthesized groups, items that have a {0} quantifier are
2025*22dc650dSSadaf Ebrahimiomitted from the compiled pattern.
2026*22dc650dSSadaf Ebrahimi</P>
2027*22dc650dSSadaf Ebrahimi<P>
2028*22dc650dSSadaf EbrahimiFor convenience, the three most common quantifiers have single-character
2029*22dc650dSSadaf Ebrahimiabbreviations:
2030*22dc650dSSadaf Ebrahimi<pre>
2031*22dc650dSSadaf Ebrahimi  *    is equivalent to {0,}
2032*22dc650dSSadaf Ebrahimi  +    is equivalent to {1,}
2033*22dc650dSSadaf Ebrahimi  ?    is equivalent to {0,1}
2034*22dc650dSSadaf Ebrahimi</pre>
2035*22dc650dSSadaf EbrahimiIt is possible to construct infinite loops by following a group that can match
2036*22dc650dSSadaf Ebrahimino characters with a quantifier that has no upper limit, for example:
2037*22dc650dSSadaf Ebrahimi<pre>
2038*22dc650dSSadaf Ebrahimi  (a?)*
2039*22dc650dSSadaf Ebrahimi</pre>
2040*22dc650dSSadaf EbrahimiEarlier versions of Perl and PCRE1 used to give an error at compile time for
2041*22dc650dSSadaf Ebrahimisuch patterns. However, because there are cases where this can be useful, such
2042*22dc650dSSadaf Ebrahimipatterns are now accepted, but whenever an iteration of such a group matches no
2043*22dc650dSSadaf Ebrahimicharacters, matching moves on to the next item in the pattern instead of
2044*22dc650dSSadaf Ebrahimirepeatedly matching an empty string. This does not prevent backtracking into
2045*22dc650dSSadaf Ebrahimiany of the iterations if a subsequent item fails to match.
2046*22dc650dSSadaf Ebrahimi</P>
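<P>
The abbreviations and the handling of a group that can match an empty string
can be checked as follows. The sketch assumes Python's <b>re</b> module, which
likewise accepts (a?)* and terminates; the list names are hypothetical.
</P>

```python
import re

SUBJECTS = ["", "a", "aa", "aaa"]
PAIRS = [("a*", "a{0,}"), ("a+", "a{1,}"), ("a?", "a{0,1}")]

# Each abbreviation should accept exactly the same subjects as its long form.
agreement = [
    all(bool(re.fullmatch(short, s)) == bool(re.fullmatch(long_form, s))
        for s in SUBJECTS)
    for short, long_form in PAIRS
]

# A group that can match nothing, under an unlimited repeat, still terminates.
whole = re.fullmatch(r"(a?)*", "aaa")
followed = re.match(r"(a?)*b", "aab")
empty = re.match(r"(a?)*", "bbb")
print(agreement, whole.group(0), followed.group(0), repr(empty.group(0)))
```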
2047*22dc650dSSadaf Ebrahimi<P>
2048*22dc650dSSadaf EbrahimiBy default, quantifiers are "greedy", that is, they match as much as possible
2049*22dc650dSSadaf Ebrahimi(up to the maximum number of permitted repetitions), without causing the rest
2050*22dc650dSSadaf Ebrahimiof the pattern to fail. The classic example of where this gives problems is in
2051*22dc650dSSadaf Ebrahimitrying to match comments in C programs. These appear between /* and */ and
2052*22dc650dSSadaf Ebrahimiwithin the comment, individual * and / characters may appear. An attempt to
2053*22dc650dSSadaf Ebrahimimatch C comments by applying the pattern
2054*22dc650dSSadaf Ebrahimi<pre>
2055*22dc650dSSadaf Ebrahimi  /\*.*\*/
2056*22dc650dSSadaf Ebrahimi</pre>
2057*22dc650dSSadaf Ebrahimito the string
2058*22dc650dSSadaf Ebrahimi<pre>
2059*22dc650dSSadaf Ebrahimi  /* first comment */  not comment  /* second comment */
2060*22dc650dSSadaf Ebrahimi</pre>
2061*22dc650dSSadaf Ebrahimifails, because it matches the entire string owing to the greediness of the .*
2062*22dc650dSSadaf Ebrahimiitem. However, if a quantifier is followed by a question mark, it ceases to be
2063*22dc650dSSadaf Ebrahimigreedy, and instead matches the minimum number of times possible, so the
2064*22dc650dSSadaf Ebrahimipattern
2065*22dc650dSSadaf Ebrahimi<pre>
2066*22dc650dSSadaf Ebrahimi  /\*.*?\*/
2067*22dc650dSSadaf Ebrahimi</pre>
2068*22dc650dSSadaf Ebrahimidoes the right thing with C comments. The meaning of the various quantifiers is
2069*22dc650dSSadaf Ebrahiminot otherwise changed, just the preferred number of matches. Do not confuse
2070*22dc650dSSadaf Ebrahimithis use of question mark with its use as a quantifier in its own right.
2071*22dc650dSSadaf EbrahimiBecause it has two uses, it can sometimes appear doubled, as in
2072*22dc650dSSadaf Ebrahimi<pre>
2073*22dc650dSSadaf Ebrahimi  \d??\d
2074*22dc650dSSadaf Ebrahimi</pre>
2075*22dc650dSSadaf Ebrahimiwhich matches one digit by preference, but can match two if that is the only
2076*22dc650dSSadaf Ebrahimiway the rest of the pattern matches.
2077*22dc650dSSadaf Ebrahimi</P>
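<P>
The C comment example can be reproduced with a short script. This is a sketch
that assumes Python's <b>re</b> module, whose greedy and lazy quantifiers
behave in the same way for these patterns.
</P>

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Lazy: .*? stops at the first */ that allows a match.
lazy = re.findall(r"/\*.*?\*/", subject)

# The doubled question mark: a lazy optional digit followed by a digit.
prefers_one = re.match(r"\d??\d", "12").group(0)
takes_two_if_needed = re.fullmatch(r"\d??\d", "12").group(0)

print(greedy == subject, lazy, prefers_one, takes_two_if_needed)
```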
2078*22dc650dSSadaf Ebrahimi<P>
2079*22dc650dSSadaf EbrahimiIf the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2080*22dc650dSSadaf Ebrahimithe quantifiers are not greedy by default, but individual ones can be made
2081*22dc650dSSadaf Ebrahimigreedy by following them with a question mark. In other words, it inverts the
2082*22dc650dSSadaf Ebrahimidefault behaviour.
2083*22dc650dSSadaf Ebrahimi</P>
2084*22dc650dSSadaf Ebrahimi<P>
2085*22dc650dSSadaf EbrahimiWhen a parenthesized group is quantified with a minimum repeat count that
2086*22dc650dSSadaf Ebrahimiis greater than 1 or with a limited maximum, more memory is required for the
2087*22dc650dSSadaf Ebrahimicompiled pattern, in proportion to the size of the minimum or maximum.
2088*22dc650dSSadaf Ebrahimi</P>
2089*22dc650dSSadaf Ebrahimi<P>
2090*22dc650dSSadaf EbrahimiIf a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
2091*22dc650dSSadaf Ebrahimito Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2092*22dc650dSSadaf Ebrahimiimplicitly anchored, because whatever follows will be tried against every
2093*22dc650dSSadaf Ebrahimicharacter position in the subject string, so there is no point in retrying the
2094*22dc650dSSadaf Ebrahimioverall match at any position after the first. PCRE2 normally treats such a
2095*22dc650dSSadaf Ebrahimipattern as though it were preceded by \A.
2096*22dc650dSSadaf Ebrahimi</P>
2097*22dc650dSSadaf Ebrahimi<P>
2098*22dc650dSSadaf EbrahimiIn cases where it is known that the subject string contains no newlines, it is
2099*22dc650dSSadaf Ebrahimiworth setting PCRE2_DOTALL in order to obtain this optimization, or
2100*22dc650dSSadaf Ebrahimialternatively, using ^ to indicate anchoring explicitly.
2101*22dc650dSSadaf Ebrahimi</P>
2102*22dc650dSSadaf Ebrahimi<P>
2103*22dc650dSSadaf EbrahimiHowever, there are some cases where the optimization cannot be used. When .*
2104*22dc650dSSadaf Ebrahimiis inside capturing parentheses that are the subject of a backreference
2105*22dc650dSSadaf Ebrahimielsewhere in the pattern, a match at the start may fail where a later one
2106*22dc650dSSadaf Ebrahimisucceeds. Consider, for example:
2107*22dc650dSSadaf Ebrahimi<pre>
2108*22dc650dSSadaf Ebrahimi  (.*)abc\1
2109*22dc650dSSadaf Ebrahimi</pre>
If the subject is "xyz123abc123", the match point is the fourth character. For
2111*22dc650dSSadaf Ebrahimithis reason, such a pattern is not implicitly anchored.
2112*22dc650dSSadaf Ebrahimi</P>
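<P>
The match point in this example can be confirmed with a short script. The
sketch assumes Python's <b>re</b> module; its offsets count from zero, so the
fourth character is offset 3.
</P>

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = match.start()   # zero-based offset of the match point
captured = match.group(1)      # what (.*) held when the match succeeded
print(start_offset, captured, match.group(0))
```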
2113*22dc650dSSadaf Ebrahimi<P>
2114*22dc650dSSadaf EbrahimiAnother case where implicit anchoring is not applied is when the leading .* is
2115*22dc650dSSadaf Ebrahimiinside an atomic group. Once again, a match at the start may fail where a later
2116*22dc650dSSadaf Ebrahimione succeeds. Consider this pattern:
2117*22dc650dSSadaf Ebrahimi<pre>
2118*22dc650dSSadaf Ebrahimi  (?&#62;.*?a)b
2119*22dc650dSSadaf Ebrahimi</pre>
2120*22dc650dSSadaf EbrahimiIt matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
2122*22dc650dSSadaf EbrahimiPCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
2123*22dc650dSSadaf Ebrahimi</P>
2124*22dc650dSSadaf Ebrahimi<P>
2125*22dc650dSSadaf EbrahimiWhen a capture group is repeated, the value captured is the substring that
2126*22dc650dSSadaf Ebrahimimatched the final iteration. For example, after
2127*22dc650dSSadaf Ebrahimi<pre>
2128*22dc650dSSadaf Ebrahimi  (tweedle[dume]{3}\s*)+
2129*22dc650dSSadaf Ebrahimi</pre>
has matched "tweedledum tweedledee", the value of the captured substring is
2131*22dc650dSSadaf Ebrahimi"tweedledee". However, if there are nested capture groups, the corresponding
2132*22dc650dSSadaf Ebrahimicaptured values may have been set in previous iterations. For example, after
2133*22dc650dSSadaf Ebrahimi<pre>
2134*22dc650dSSadaf Ebrahimi  (a|(b))+
2135*22dc650dSSadaf Ebrahimi</pre>
matches "aba", the value of the second captured substring is "b".
2137*22dc650dSSadaf Ebrahimi<a name="atomicgroup"></a></P>
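<P>
Both statements about repeated capture groups can be checked with a short
script. This sketch assumes Python's <b>re</b> module, which keeps the value
from the final iteration and does not clear nested groups between iterations.
</P>

```python
import re

outer = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = outer.group(1)   # value from the final repeat only

nested = re.fullmatch(r"(a|(b))+", "aba")
nested_groups = nested.groups()   # group 2 was set in an earlier iteration
print(last_iteration, nested_groups)
```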
2138*22dc650dSSadaf Ebrahimi<br><a name="SEC18" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
2139*22dc650dSSadaf Ebrahimi<P>
2140*22dc650dSSadaf EbrahimiWith both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
2141*22dc650dSSadaf Ebrahimirepetition, failure of what follows normally causes the repeated item to be
2142*22dc650dSSadaf Ebrahimire-evaluated to see if a different number of repeats allows the rest of the
2143*22dc650dSSadaf Ebrahimipattern to match. Sometimes it is useful to prevent this, either to change the
nature of the match, or to cause it to fail earlier than it otherwise might, when
2145*22dc650dSSadaf Ebrahimithe author of the pattern knows there is no point in carrying on.
2146*22dc650dSSadaf Ebrahimi</P>
2147*22dc650dSSadaf Ebrahimi<P>
2148*22dc650dSSadaf EbrahimiConsider, for example, the pattern \d+foo when applied to the subject line
2149*22dc650dSSadaf Ebrahimi<pre>
2150*22dc650dSSadaf Ebrahimi  123456bar
2151*22dc650dSSadaf Ebrahimi</pre>
2152*22dc650dSSadaf EbrahimiAfter matching all 6 digits and then failing to match "foo", the normal
2153*22dc650dSSadaf Ebrahimiaction of the matcher is to try again with only 5 digits matching the \d+
2154*22dc650dSSadaf Ebrahimiitem, and then with 4, and so on, before ultimately failing. "Atomic grouping"
2155*22dc650dSSadaf Ebrahimi(a term taken from Jeffrey Friedl's book) provides the means for specifying
2156*22dc650dSSadaf Ebrahimithat once a group has matched, it is not to be re-evaluated in this way.
2157*22dc650dSSadaf Ebrahimi</P>
2158*22dc650dSSadaf Ebrahimi<P>
2159*22dc650dSSadaf EbrahimiIf we use atomic grouping for the previous example, the matcher gives up
2160*22dc650dSSadaf Ebrahimiimmediately on failing to match "foo" the first time. The notation is a kind of
2161*22dc650dSSadaf Ebrahimispecial parenthesis, starting with (?&#62; as in this example:
2162*22dc650dSSadaf Ebrahimi<pre>
2163*22dc650dSSadaf Ebrahimi  (?&#62;\d+)foo
2164*22dc650dSSadaf Ebrahimi</pre>
2165*22dc650dSSadaf EbrahimiPerl 5.28 introduced an experimental alphabetic form starting with (* which may
2166*22dc650dSSadaf Ebrahimibe easier to remember:
2167*22dc650dSSadaf Ebrahimi<pre>
2168*22dc650dSSadaf Ebrahimi  (*atomic:\d+)foo
2169*22dc650dSSadaf Ebrahimi</pre>
2170*22dc650dSSadaf EbrahimiThis kind of parenthesized group "locks up" the part of the pattern it contains
2171*22dc650dSSadaf Ebrahimionce it has matched, and a failure further into the pattern is prevented from
2172*22dc650dSSadaf Ebrahimibacktracking into it. Backtracking past it to previous items, however, works as
2173*22dc650dSSadaf Ebrahiminormal.
2174*22dc650dSSadaf Ebrahimi</P>
2175*22dc650dSSadaf Ebrahimi<P>
2176*22dc650dSSadaf EbrahimiAn alternative description is that a group of this type matches exactly the
2177*22dc650dSSadaf Ebrahimistring of characters that an identical standalone pattern would match, if
2178*22dc650dSSadaf Ebrahimianchored at the current point in the subject string.
2179*22dc650dSSadaf Ebrahimi</P>
2180*22dc650dSSadaf Ebrahimi<P>
2181*22dc650dSSadaf EbrahimiAtomic groups are not capture groups. Simple cases such as the above example
2182*22dc650dSSadaf Ebrahimican be thought of as a maximizing repeat that must swallow everything it can.
2183*22dc650dSSadaf EbrahimiSo, while both \d+ and \d+? are prepared to adjust the number of digits they
2184*22dc650dSSadaf Ebrahimimatch in order to make the rest of the pattern match, (?&#62;\d+) can only match
2185*22dc650dSSadaf Ebrahimian entire sequence of digits.
2186*22dc650dSSadaf Ebrahimi</P>
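<P>
The difference between an ordinary repeat and an atomic group shows up when
the following item could only match by taking back a digit. The sketch below
assumes Python's <b>re</b> module, which accepts the (?&#62; syntax from
version 3.11. For older versions it falls back to a lookahead followed by a
backreference, which is a different technique that emulates the atomic group
and is not how PCRE2 implements it.
</P>

```python
import re
import sys

if sys.version_info >= (3, 11):
    ATOMIC_DIGITS = r"(?>\d+)"
else:
    # Emulation: a lookahead is not re-entered once it has matched, and
    # the backreference then consumes exactly what it captured.
    ATOMIC_DIGITS = r"(?:(?=(\d+))\1)"

ordinary = re.compile(r"\d+5")
atomic = re.compile(ATOMIC_DIGITS + "5")
atomic_foo = re.compile(ATOMIC_DIGITS + "foo")

# \d+ gives back the final 5 so that the literal 5 can match.
ordinary_result = ordinary.search("12345")
# The atomic group keeps all the digits, so the literal 5 cannot match.
atomic_result = atomic.search("12345")

print(bool(ordinary_result), bool(atomic_result),
      bool(atomic_foo.search("123456foo")),
      bool(atomic_foo.search("123456bar")))
```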
2187*22dc650dSSadaf Ebrahimi<P>
2188*22dc650dSSadaf EbrahimiAtomic groups in general can of course contain arbitrarily complicated
expressions, and can be nested. However, when the content of an atomic
group is just a single repeated item, as in the example above, a simpler
notation, called a "possessive quantifier", can be used. This consists of an
2192*22dc650dSSadaf Ebrahimiadditional + character following a quantifier. Using this notation, the
2193*22dc650dSSadaf Ebrahimiprevious example can be rewritten as
2194*22dc650dSSadaf Ebrahimi<pre>
2195*22dc650dSSadaf Ebrahimi  \d++foo
2196*22dc650dSSadaf Ebrahimi</pre>
2197*22dc650dSSadaf EbrahimiNote that a possessive quantifier can be used with an entire group, for
2198*22dc650dSSadaf Ebrahimiexample:
2199*22dc650dSSadaf Ebrahimi<pre>
2200*22dc650dSSadaf Ebrahimi  (abc|xyz){2,3}+
2201*22dc650dSSadaf Ebrahimi</pre>
2202*22dc650dSSadaf EbrahimiPossessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
2203*22dc650dSSadaf Ebrahimioption is ignored. They are a convenient notation for the simpler forms of
2204*22dc650dSSadaf Ebrahimiatomic group. However, there is no difference in the meaning of a possessive
2205*22dc650dSSadaf Ebrahimiquantifier and the equivalent atomic group, though there may be a performance
2206*22dc650dSSadaf Ebrahimidifference; possessive quantifiers should be slightly faster.
2207*22dc650dSSadaf Ebrahimi</P>
2208*22dc650dSSadaf Ebrahimi<P>
2209*22dc650dSSadaf EbrahimiThe possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2210*22dc650dSSadaf EbrahimiJeffrey Friedl originated the idea (and the name) in the first edition of his
2211*22dc650dSSadaf Ebrahimibook. Mike McCloskey liked it, so implemented it when he built Sun's Java
2212*22dc650dSSadaf Ebrahimipackage, and PCRE1 copied it from there. It found its way into Perl at release
2213*22dc650dSSadaf Ebrahimi5.10.
2214*22dc650dSSadaf Ebrahimi</P>
2215*22dc650dSSadaf Ebrahimi<P>
2216*22dc650dSSadaf EbrahimiPCRE2 has an optimization that automatically "possessifies" certain simple
2217*22dc650dSSadaf Ebrahimipattern constructs. For example, the sequence A+B is treated as A++B because
2218*22dc650dSSadaf Ebrahimithere is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by setting the PCRE2_NO_AUTO_POSSESS option, or by
starting the pattern with (*NO_AUTO_POSSESS).
2221*22dc650dSSadaf Ebrahimi</P>
2222*22dc650dSSadaf Ebrahimi<P>
2223*22dc650dSSadaf EbrahimiWhen a pattern contains an unlimited repeat inside a group that can itself be
2224*22dc650dSSadaf Ebrahimirepeated an unlimited number of times, the use of an atomic group is the only
2225*22dc650dSSadaf Ebrahimiway to avoid some failing matches taking a very long time indeed. The pattern
2226*22dc650dSSadaf Ebrahimi<pre>
2227*22dc650dSSadaf Ebrahimi  (\D+|&#60;\d+&#62;)*[!?]
2228*22dc650dSSadaf Ebrahimi</pre>
2229*22dc650dSSadaf Ebrahimimatches an unlimited number of substrings that either consist of non-digits, or
2230*22dc650dSSadaf Ebrahimidigits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
2231*22dc650dSSadaf Ebrahimiquickly. However, if it is applied to
2232*22dc650dSSadaf Ebrahimi<pre>
2233*22dc650dSSadaf Ebrahimi  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2234*22dc650dSSadaf Ebrahimi</pre>
2235*22dc650dSSadaf Ebrahimiit takes a long time before reporting failure. This is because the string can
2236*22dc650dSSadaf Ebrahimibe divided between the internal \D+ repeat and the external * repeat in a
2237*22dc650dSSadaf Ebrahimilarge number of ways, and all have to be tried. (The example uses [!?] rather
2238*22dc650dSSadaf Ebrahimithan a single character at the end, because both PCRE2 and Perl have an
2239*22dc650dSSadaf Ebrahimioptimization that allows for fast failure when a single character is used. They
2240*22dc650dSSadaf Ebrahimiremember the last single character that is required for a match, and fail early
2241*22dc650dSSadaf Ebrahimiif it is not present in the string.) If the pattern is changed so that it uses
2242*22dc650dSSadaf Ebrahimian atomic group, like this:
2243*22dc650dSSadaf Ebrahimi<pre>
2244*22dc650dSSadaf Ebrahimi  ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
2245*22dc650dSSadaf Ebrahimi</pre>
2246*22dc650dSSadaf Ebrahimisequences of non-digits cannot be broken, and failure happens quickly.
2247*22dc650dSSadaf Ebrahimi<a name="backreferences"></a></P>
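<P>
To get a feel for "a large number of ways", the following rough sketch counts
how many ways a run of n non-digits can be cut into consecutive non-empty
pieces, each piece standing for one \D+ iteration of the outer group. It is
plain Python arithmetic, not a model of PCRE2's actual matching loop, and the
function name is hypothetical. The count doubles with every extra character,
which is why the non-atomic pattern takes so long to fail.
</P>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n):
    # Number of ways to divide n characters into consecutive non-empty pieces.
    if n == 0:
        return 1  # nothing left to divide: exactly one way
    # The first piece takes k characters; the remainder is divided recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (4, 10, 20):
    print(n, count_splits(n))
```

With the atomic group, each run of non-digits can be matched in only one way,
so this growth does not occur.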
2248*22dc650dSSadaf Ebrahimi<br><a name="SEC19" href="#TOC1">BACKREFERENCES</a><br>
2249*22dc650dSSadaf Ebrahimi<P>
2250*22dc650dSSadaf EbrahimiOutside a character class, a backslash followed by a digit greater than 0 (and
2251*22dc650dSSadaf Ebrahimipossibly further digits) is a backreference to a capture group earlier (that
2252*22dc650dSSadaf Ebrahimiis, to its left) in the pattern, provided there have been that many previous
2253*22dc650dSSadaf Ebrahimicapture groups.
2254*22dc650dSSadaf Ebrahimi</P>
2255*22dc650dSSadaf Ebrahimi<P>
2256*22dc650dSSadaf EbrahimiHowever, if the decimal number following the backslash is less than 8, it is
2257*22dc650dSSadaf Ebrahimialways taken as a backreference, and causes an error only if there are not that
2258*22dc650dSSadaf Ebrahimimany capture groups in the entire pattern. In other words, the group that is
2259*22dc650dSSadaf Ebrahimireferenced need not be to the left of the reference for numbers less than 8. A
2260*22dc650dSSadaf Ebrahimi"forward backreference" of this type can make sense when a repetition is
2261*22dc650dSSadaf Ebrahimiinvolved and the group to the right has participated in an earlier iteration.
2262*22dc650dSSadaf Ebrahimi</P>
2263*22dc650dSSadaf Ebrahimi<P>
2264*22dc650dSSadaf EbrahimiIt is not possible to have a numerical "forward backreference" to a group whose
2265*22dc650dSSadaf Ebrahiminumber is 8 or more using this syntax because a sequence such as \50 is
2266*22dc650dSSadaf Ebrahimiinterpreted as a character defined in octal. See the subsection entitled
2267*22dc650dSSadaf Ebrahimi"Non-printing characters"
2268*22dc650dSSadaf Ebrahimi<a href="#digitsafterbackslash">above</a>
2269*22dc650dSSadaf Ebrahimifor further details of the handling of digits following a backslash. Other
2270*22dc650dSSadaf Ebrahimiforms of backreferencing do not suffer from this restriction. In particular,
2271*22dc650dSSadaf Ebrahimithere is no problem when named capture groups are used (see below).
2272*22dc650dSSadaf Ebrahimi</P>
2273*22dc650dSSadaf Ebrahimi<P>
2274*22dc650dSSadaf EbrahimiAnother way of avoiding the ambiguity inherent in the use of digits following a
2275*22dc650dSSadaf Ebrahimibackslash is to use the \g escape sequence. This escape must be followed by a
2276*22dc650dSSadaf Ebrahimisigned or unsigned number, optionally enclosed in braces. These examples are
2277*22dc650dSSadaf Ebrahimiall identical:
2278*22dc650dSSadaf Ebrahimi<pre>
2279*22dc650dSSadaf Ebrahimi  (ring), \1
2280*22dc650dSSadaf Ebrahimi  (ring), \g1
2281*22dc650dSSadaf Ebrahimi  (ring), \g{1}
2282*22dc650dSSadaf Ebrahimi</pre>
2283*22dc650dSSadaf EbrahimiAn unsigned number specifies an absolute reference without the ambiguity that
2284*22dc650dSSadaf Ebrahimiis present in the older syntax. It is also useful when literal digits follow
2285*22dc650dSSadaf Ebrahimithe reference. A signed number is a relative reference. Consider this example:
2286*22dc650dSSadaf Ebrahimi<pre>
2287*22dc650dSSadaf Ebrahimi  (abc(def)ghi)\g{-1}
2288*22dc650dSSadaf Ebrahimi</pre>
The sequence \g{-1} is a reference to the capture group whose number is one
less than the number of the next group to be started, so in this example (where
the next group would be numbered 3) it is equivalent to \2, and \g{-2} would
be equivalent to \1. Note that if this construct is inside a capture group,
that group is included in the count, so in this example \g{-2} also refers to
group 1:
2295*22dc650dSSadaf Ebrahimi<pre>
2296*22dc650dSSadaf Ebrahimi  (A)(\g{-2}B)
2297*22dc650dSSadaf Ebrahimi</pre>
2298*22dc650dSSadaf EbrahimiThe use of relative references can be helpful in long patterns, and also in
2299*22dc650dSSadaf Ebrahimipatterns that are created by joining together fragments that contain references
2300*22dc650dSSadaf Ebrahimiwithin themselves.
2301*22dc650dSSadaf Ebrahimi</P>
2302*22dc650dSSadaf Ebrahimi<P>
2303*22dc650dSSadaf EbrahimiThe sequence \g{+1} is a reference to the next capture group that is started
2304*22dc650dSSadaf Ebrahimiafter this item, and \g{+2} refers to the one after that, and so on. This kind
2305*22dc650dSSadaf Ebrahimiof forward reference can be useful in patterns that repeat. Perl does not
2306*22dc650dSSadaf Ebrahimisupport the use of + in this way.
2307*22dc650dSSadaf Ebrahimi</P>
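<P>
The numbering rules for relative references can be sketched as a small
calculation. The helper names below are hypothetical, and the group counter is
deliberately simplified: it ignores character classes, \Q...\E quoting and
(?| groups. The sketch only shows the arithmetic described above; it is not how
PCRE2 itself parses a pattern.
</P>

```python
import re

# Finds escaped characters (to skip them) and parentheses that open a
# capture group: plain "(", or the named forms (?<n>, (?P<n> and (?'n'.
_TOKEN = re.compile(r"\\.|\((?![?*])|\(\?P?<(?![=!])|\(\?'", re.S)


def groups_opened(prefix):
    # Count the capture groups opened in the pattern text before a reference.
    return sum(1 for m in _TOKEN.finditer(prefix) if m.group()[0] != "\\")


def resolve_relative(prefix, offset):
    # Absolute group number of a relative reference placed right after prefix.
    opened = groups_opened(prefix)
    if offset < 0:
        number = opened + 1 + offset  # counted back from the next group
    elif offset > 0:
        number = opened + offset      # counted forward from the last opened group
    else:
        raise ValueError("a relative reference needs a non-zero offset")
    if number < 1:
        raise ValueError("reference to a non-existent group")
    return number


print(resolve_relative("(abc(def)ghi)", -1))  # reference follows two groups
print(resolve_relative("(A)(", -2))           # reference sits inside group 2
print(resolve_relative("(a)", +1))            # forward reference
```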
2308*22dc650dSSadaf Ebrahimi<P>
2309*22dc650dSSadaf EbrahimiA backreference matches whatever actually most recently matched the capture
2310*22dc650dSSadaf Ebrahimigroup in the current subject string, rather than anything at all that matches
2311*22dc650dSSadaf Ebrahimithe group (see
2312*22dc650dSSadaf Ebrahimi<a href="#groupsassubroutines">"Groups as subroutines"</a>
2313*22dc650dSSadaf Ebrahimibelow for a way of doing that). So the pattern
2314*22dc650dSSadaf Ebrahimi<pre>
2315*22dc650dSSadaf Ebrahimi  (sens|respons)e and \1ibility
2316*22dc650dSSadaf Ebrahimi</pre>
2317*22dc650dSSadaf Ebrahimimatches "sense and sensibility" and "response and responsibility", but not
2318*22dc650dSSadaf Ebrahimi"sense and responsibility". If caseful matching is in force at the time of the
2319*22dc650dSSadaf Ebrahimibackreference, the case of letters is relevant. For example,
2320*22dc650dSSadaf Ebrahimi<pre>
2321*22dc650dSSadaf Ebrahimi  ((?i)rah)\s+\1
2322*22dc650dSSadaf Ebrahimi</pre>
2323*22dc650dSSadaf Ebrahimimatches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
2324*22dc650dSSadaf Ebrahimicapture group is matched caselessly.
2325*22dc650dSSadaf Ebrahimi</P>
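<P>
The two patterns above can be tried out with Python's standard <b>re</b> module,
used here only as a stand-in for PCRE2. One difference is an assumption of this
sketch: recent Python versions reject an unscoped (?i) in the middle of a
pattern, so the scoped form (?i:rah) is used instead. The backreference itself
lies outside that scope and is therefore caseful, as in the PCRE2 example.
</P>

```python
import re

# A backreference repeats the text that was captured, not the sub-pattern.
title = re.compile(r"(sens|respons)e and \1ibility")
title_results = {
    s: bool(title.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
}

# The group is matched caselessly, but the backreference compares exactly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")}

print(title_results)
print(rah_results)
```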
2326*22dc650dSSadaf Ebrahimi<P>
There are several different ways of writing backreferences to named capture
groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the
original Perl syntax is \k&#60;name&#62; or \k'name'. All of these are now supported
by both Perl and PCRE2. Perl 5.10's unified backreference syntax, in which \g
can be used for both numeric and named references, is also supported by PCRE2.
We could rewrite the above example in any of the following ways:
2333*22dc650dSSadaf Ebrahimi<pre>
2334*22dc650dSSadaf Ebrahimi  (?&#60;p1&#62;(?i)rah)\s+\k&#60;p1&#62;
2335*22dc650dSSadaf Ebrahimi  (?'p1'(?i)rah)\s+\k{p1}
2336*22dc650dSSadaf Ebrahimi  (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
2337*22dc650dSSadaf Ebrahimi  (?&#60;p1&#62;(?i)rah)\s+\g{p1}
2338*22dc650dSSadaf Ebrahimi</pre>
2339*22dc650dSSadaf EbrahimiA capture group that is referenced by name may appear in the pattern before or
2340*22dc650dSSadaf Ebrahimiafter the reference.
2341*22dc650dSSadaf Ebrahimi</P>
2342*22dc650dSSadaf Ebrahimi<P>
2343*22dc650dSSadaf EbrahimiThere may be more than one backreference to the same group. If a group has not
2344*22dc650dSSadaf Ebrahimiactually been used in a particular match, backreferences to it always fail by
2345*22dc650dSSadaf Ebrahimidefault. For example, the pattern
2346*22dc650dSSadaf Ebrahimi<pre>
2347*22dc650dSSadaf Ebrahimi  (a|(bc))\2
2348*22dc650dSSadaf Ebrahimi</pre>
2349*22dc650dSSadaf Ebrahimialways fails if it starts to match "a" rather than "bc". However, if the
2350*22dc650dSSadaf EbrahimiPCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
2351*22dc650dSSadaf Ebrahimiunset value matches an empty string.
2352*22dc650dSSadaf Ebrahimi</P>
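<P>
The default behaviour for an unset group can be checked with a short sketch.
It uses Python's <b>re</b> module as a stand-in, which also fails a backreference
to an unset group; the effect of PCRE2_MATCH_UNSET_BACKREF is not reproduced
here.
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 took part in the match, so \2 can repeat "bc".
used = pattern.fullmatch("bcbc")

# The first alternative matched "a", group 2 is unset, so \2 fails.
unused = pattern.search("aa")

print(used.group(0) if used else None, unused)
```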
2353*22dc650dSSadaf Ebrahimi<P>
2354*22dc650dSSadaf EbrahimiBecause there may be many capture groups in a pattern, all digits following a
2355*22dc650dSSadaf Ebrahimibackslash are taken as part of a potential backreference number. If the pattern
2356*22dc650dSSadaf Ebrahimicontinues with a digit character, some delimiter must be used to terminate the
2357*22dc650dSSadaf Ebrahimibackreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
2358*22dc650dSSadaf Ebrahimican be white space. Otherwise, the \g{} syntax or an empty comment (see
2359*22dc650dSSadaf Ebrahimi<a href="#comments">"Comments"</a>
2360*22dc650dSSadaf Ebrahimibelow) can be used.
2361*22dc650dSSadaf Ebrahimi</P>
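<P>
The need for a delimiter can be illustrated with Python's <b>re</b> module, used
here as a stand-in for PCRE2. Its verbose flag plays the role that
PCRE2_EXTENDED plays in the text above, and it also accepts an empty (?#)
comment; it has no \g{} syntax inside patterns, so that form is not shown.
</P>

```python
import re

# Without a delimiter the digits run together: \10 is read as a reference
# to group 10, which does not exist here, so compilation fails.
try:
    re.compile(r"(a)\10")
    ran_together = False
except re.error:
    ran_together = True

# In extended (verbose) mode, white space terminates the reference.
spaced = re.compile(r"(a)\1 0", re.VERBOSE)

# In any mode, an empty comment separates the reference from the digit.
commented = re.compile(r"(a)\1(?#)0")

print(ran_together, bool(spaced.fullmatch("aa0")), bool(commented.fullmatch("aa0")))
```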
2362*22dc650dSSadaf Ebrahimi<br><b>
2363*22dc650dSSadaf EbrahimiRecursive backreferences
2364*22dc650dSSadaf Ebrahimi</b><br>
2365*22dc650dSSadaf Ebrahimi<P>
2366*22dc650dSSadaf EbrahimiA backreference that occurs inside the group to which it refers fails when the
2367*22dc650dSSadaf Ebrahimigroup is first used, so, for example, (a\1) never matches. However, such
2368*22dc650dSSadaf Ebrahimireferences can be useful inside repeated groups. For example, the pattern
2369*22dc650dSSadaf Ebrahimi<pre>
2370*22dc650dSSadaf Ebrahimi  (a|b\1)+
2371*22dc650dSSadaf Ebrahimi</pre>
2372*22dc650dSSadaf Ebrahimimatches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
2373*22dc650dSSadaf Ebrahimithe group, the backreference matches the character string corresponding to the
2374*22dc650dSSadaf Ebrahimiprevious iteration. In order for this to work, the pattern must be such that
2375*22dc650dSSadaf Ebrahimithe first iteration does not need to match the backreference. This can be done
2376*22dc650dSSadaf Ebrahimiusing alternation, as in the example above, or by a quantifier with a minimum
2377*22dc650dSSadaf Ebrahimiof zero.
2378*22dc650dSSadaf Ebrahimi</P>
2379*22dc650dSSadaf Ebrahimi<P>
In versions of PCRE2 earlier than 10.25, backreferences of this type used to
2381*22dc650dSSadaf Ebrahimicause the group that they reference to be treated as an
2382*22dc650dSSadaf Ebrahimi<a href="#atomicgroup">atomic group.</a>
2383*22dc650dSSadaf EbrahimiThis restriction no longer applies, and backtracking into such groups can occur
2384*22dc650dSSadaf Ebrahimias normal.
2385*22dc650dSSadaf Ebrahimi<a name="bigassertions"></a></P>
2386*22dc650dSSadaf Ebrahimi<br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
2387*22dc650dSSadaf Ebrahimi<P>
2388*22dc650dSSadaf EbrahimiAn assertion is a test on the characters following or preceding the current
2389*22dc650dSSadaf Ebrahimimatching point that does not consume any characters. The simple assertions
2390*22dc650dSSadaf Ebrahimicoded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
2391*22dc650dSSadaf Ebrahimi<a href="#smallassertions">above.</a>
2392*22dc650dSSadaf Ebrahimi</P>
2393*22dc650dSSadaf Ebrahimi<P>
2394*22dc650dSSadaf EbrahimiMore complicated assertions are coded as parenthesized groups. There are two
2395*22dc650dSSadaf Ebrahimikinds: those that look ahead of the current position in the subject string, and
2396*22dc650dSSadaf Ebrahimithose that look behind it, and in each case an assertion may be positive (must
2397*22dc650dSSadaf Ebrahimimatch for the assertion to be true) or negative (must not match for the
2398*22dc650dSSadaf Ebrahimiassertion to be true). An assertion group is matched in the normal way,
2399*22dc650dSSadaf Ebrahimiand if it is true, matching continues after it, but with the matching position
2400*22dc650dSSadaf Ebrahimiin the subject string reset to what it was before the assertion was processed.
2401*22dc650dSSadaf Ebrahimi</P>
2402*22dc650dSSadaf Ebrahimi<P>
2403*22dc650dSSadaf EbrahimiThe Perl-compatible lookaround assertions are atomic. If an assertion is true,
2404*22dc650dSSadaf Ebrahimibut there is a subsequent matching failure, there is no backtracking into the
2405*22dc650dSSadaf Ebrahimiassertion. However, there are some cases where non-atomic assertions can be
2406*22dc650dSSadaf Ebrahimiuseful. PCRE2 has some support for these, described in the section entitled
2407*22dc650dSSadaf Ebrahimi<a href="#nonatomicassertions">"Non-atomic assertions"</a>
2408*22dc650dSSadaf Ebrahimibelow, but they are not Perl-compatible.
2409*22dc650dSSadaf Ebrahimi</P>
2410*22dc650dSSadaf Ebrahimi<P>
2411*22dc650dSSadaf EbrahimiA lookaround assertion may appear as the condition in a
2412*22dc650dSSadaf Ebrahimi<a href="#conditions">conditional group</a>
2413*22dc650dSSadaf Ebrahimi(see below). In this case, the result of matching the assertion determines
2414*22dc650dSSadaf Ebrahimiwhich branch of the condition is followed.
2415*22dc650dSSadaf Ebrahimi</P>
2416*22dc650dSSadaf Ebrahimi<P>
2417*22dc650dSSadaf EbrahimiAssertion groups are not capture groups. If an assertion contains capture
2418*22dc650dSSadaf Ebrahimigroups within it, these are counted for the purposes of numbering the capture
2419*22dc650dSSadaf Ebrahimigroups in the whole pattern. Within each branch of an assertion, locally
2420*22dc650dSSadaf Ebrahimicaptured substrings may be referenced in the usual way. For example, a sequence
2421*22dc650dSSadaf Ebrahimisuch as (.)\g{-1} can be used to check that two adjacent characters are the
2422*22dc650dSSadaf Ebrahimisame.
2423*22dc650dSSadaf Ebrahimi</P>
2424*22dc650dSSadaf Ebrahimi<P>
2425*22dc650dSSadaf EbrahimiWhen a branch within an assertion fails to match, any substrings that were
2426*22dc650dSSadaf Ebrahimicaptured are discarded (as happens with any pattern branch that fails to
2427*22dc650dSSadaf Ebrahimimatch). A negative assertion is true only when all its branches fail to match;
2428*22dc650dSSadaf Ebrahimithis means that no captured substrings are ever retained after a successful
2429*22dc650dSSadaf Ebrahiminegative assertion. When an assertion contains a matching branch, what happens
2430*22dc650dSSadaf Ebrahimidepends on the type of assertion.
2431*22dc650dSSadaf Ebrahimi</P>
2432*22dc650dSSadaf Ebrahimi<P>
2433*22dc650dSSadaf EbrahimiFor a positive assertion, internally captured substrings in the successful
2434*22dc650dSSadaf Ebrahimibranch are retained, and matching continues with the next pattern item after
2435*22dc650dSSadaf Ebrahimithe assertion. For a negative assertion, a matching branch means that the
2436*22dc650dSSadaf Ebrahimiassertion is not true. If such an assertion is being used as a condition in a
2437*22dc650dSSadaf Ebrahimi<a href="#conditions">conditional group</a>
2438*22dc650dSSadaf Ebrahimi(see below), captured substrings are retained, because matching continues with
2439*22dc650dSSadaf Ebrahimithe "no" branch of the condition. For other failing negative assertions,
2440*22dc650dSSadaf Ebrahimicontrol passes to the previous backtracking point, thus discarding any captured
2441*22dc650dSSadaf Ebrahimistrings within the assertion.
2442*22dc650dSSadaf Ebrahimi</P>
2443*22dc650dSSadaf Ebrahimi<P>
2444*22dc650dSSadaf EbrahimiMost assertion groups may be repeated; though it makes no sense to assert the
2445*22dc650dSSadaf Ebrahimisame thing several times, the side effect of capturing in positive assertions
2446*22dc650dSSadaf Ebrahimimay occasionally be useful. However, an assertion that forms the condition for
2447*22dc650dSSadaf Ebrahimia conditional group may not be quantified. PCRE2 used to restrict the
2448*22dc650dSSadaf Ebrahimirepetition of assertions, but from release 10.35 the only restriction is that
2449*22dc650dSSadaf Ebrahimian unlimited maximum repetition is changed to be one more than the minimum. For
2450*22dc650dSSadaf Ebrahimiexample, {3,} is treated as {3,4}.
2451*22dc650dSSadaf Ebrahimi</P>
2452*22dc650dSSadaf Ebrahimi<br><b>
2453*22dc650dSSadaf EbrahimiAlphabetic assertion names
2454*22dc650dSSadaf Ebrahimi</b><br>
2455*22dc650dSSadaf Ebrahimi<P>
2456*22dc650dSSadaf EbrahimiTraditionally, symbolic sequences such as (?= and (?&#60;= have been used to
2457*22dc650dSSadaf Ebrahimispecify lookaround assertions. Perl 5.28 introduced some experimental
2458*22dc650dSSadaf Ebrahimialphabetic alternatives which might be easier to remember. They all start with
2459*22dc650dSSadaf Ebrahimi(* instead of (? and must be written using lower case letters. PCRE2 supports
2460*22dc650dSSadaf Ebrahimithe following synonyms:
2461*22dc650dSSadaf Ebrahimi<pre>
2462*22dc650dSSadaf Ebrahimi  (*positive_lookahead:  or (*pla: is the same as (?=
2463*22dc650dSSadaf Ebrahimi  (*negative_lookahead:  or (*nla: is the same as (?!
2464*22dc650dSSadaf Ebrahimi  (*positive_lookbehind: or (*plb: is the same as (?&#60;=
2465*22dc650dSSadaf Ebrahimi  (*negative_lookbehind: or (*nlb: is the same as (?&#60;!
2466*22dc650dSSadaf Ebrahimi</pre>
2467*22dc650dSSadaf EbrahimiFor example, (*pla:foo) is the same assertion as (?=foo). In the following
2468*22dc650dSSadaf Ebrahimisections, the various assertions are described using the original symbolic
2469*22dc650dSSadaf Ebrahimiforms.
2470*22dc650dSSadaf Ebrahimi</P>
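<P>
The table of synonyms can be expressed as a simple lookup. The sketch below is
a hypothetical helper, not part of PCRE2: it rewrites the alphabetic names into
their symbolic forms by plain text substitution, which is enough for patterns
that do not contain an escaped or quoted "(*" sequence.
</P>

```python
import re

# Alphabetic assertion names and the symbolic sequences they stand for.
ALPHA_TO_SYMBOLIC = {
    "positive_lookahead": "(?=", "pla": "(?=",
    "negative_lookahead": "(?!", "nla": "(?!",
    "positive_lookbehind": "(?<=", "plb": "(?<=",
    "negative_lookbehind": "(?<!", "nlb": "(?<!",
}

_ALPHA = re.compile(r"\(\*(" + "|".join(ALPHA_TO_SYMBOLIC) + r"):")


def to_symbolic(pattern):
    # Replace each "(*name:" opener with the matching symbolic opener.
    return _ALPHA.sub(lambda m: ALPHA_TO_SYMBOLIC[m.group(1)], pattern)


print(to_symbolic("(*pla:foo)"))
print(to_symbolic("(*negative_lookbehind:foo)bar"))
```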
2471*22dc650dSSadaf Ebrahimi<br><b>
2472*22dc650dSSadaf EbrahimiLookahead assertions
2473*22dc650dSSadaf Ebrahimi</b><br>
2474*22dc650dSSadaf Ebrahimi<P>
2475*22dc650dSSadaf EbrahimiLookahead assertions start with (?= for positive assertions and (?! for
2476*22dc650dSSadaf Ebrahiminegative assertions. For example,
2477*22dc650dSSadaf Ebrahimi<pre>
2478*22dc650dSSadaf Ebrahimi  \w+(?=;)
2479*22dc650dSSadaf Ebrahimi</pre>
2480*22dc650dSSadaf Ebrahimimatches a word followed by a semicolon, but does not include the semicolon in
2481*22dc650dSSadaf Ebrahimithe match, and
2482*22dc650dSSadaf Ebrahimi<pre>
2483*22dc650dSSadaf Ebrahimi  foo(?!bar)
2484*22dc650dSSadaf Ebrahimi</pre>
2485*22dc650dSSadaf Ebrahimimatches any occurrence of "foo" that is not followed by "bar". Note that the
2486*22dc650dSSadaf Ebrahimiapparently similar pattern
2487*22dc650dSSadaf Ebrahimi<pre>
2488*22dc650dSSadaf Ebrahimi  (?!foo)bar
2489*22dc650dSSadaf Ebrahimi</pre>
2490*22dc650dSSadaf Ebrahimidoes not find an occurrence of "bar" that is preceded by something other than
2491*22dc650dSSadaf Ebrahimi"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
2492*22dc650dSSadaf Ebrahimi(?!foo) is always true when the next three characters are "bar". A
2493*22dc650dSSadaf Ebrahimilookbehind assertion is needed to achieve the other effect.
2494*22dc650dSSadaf Ebrahimi</P>
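<P>
The three lookahead patterns above, and the lookbehind form (?&#60;!foo)bar that
achieves "the other effect", can be compared side by side. The sketch uses
Python's <b>re</b> module as a stand-in for PCRE2, and the subject strings are
invented for the purpose.
</P>

```python
import re

# The semicolon must follow the word but is not part of the match.
word = re.search(r"\w+(?=;)", "count; total").group()

# Only the "foo" that is not followed by "bar" is found.
foos = re.findall(r"foo(?!bar)", "foobar foobaz")

# (?!foo) is tested at the position where "bar" starts, where the next
# three characters are "bar", so the assertion is always true there.
ahead = re.search(r"(?!foo)bar", "foobar")

# A negative lookbehind inspects what precedes "bar" instead.
behind = re.search(r"(?<!foo)bar", "foobar")

print(word, foos, ahead.span() if ahead else None, behind)
```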
2495*22dc650dSSadaf Ebrahimi<P>
2496*22dc650dSSadaf EbrahimiIf you want to force a matching failure at some point in a pattern, the most
2497*22dc650dSSadaf Ebrahimiconvenient way to do it is with (?!) because an empty string always matches, so
2498*22dc650dSSadaf Ebrahimian assertion that requires there not to be an empty string must always fail.
2499*22dc650dSSadaf EbrahimiThe backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
2500*22dc650dSSadaf Ebrahimi<a name="lookbehind"></a></P>
2501*22dc650dSSadaf Ebrahimi<br><b>
2502*22dc650dSSadaf EbrahimiLookbehind assertions
2503*22dc650dSSadaf Ebrahimi</b><br>
2504*22dc650dSSadaf Ebrahimi<P>
2505*22dc650dSSadaf EbrahimiLookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
2506*22dc650dSSadaf Ebrahiminegative assertions. For example,
2507*22dc650dSSadaf Ebrahimi<pre>
2508*22dc650dSSadaf Ebrahimi  (?&#60;!foo)bar
2509*22dc650dSSadaf Ebrahimi</pre>
2510*22dc650dSSadaf Ebrahimidoes find an occurrence of "bar" that is not preceded by "foo". The contents of
2511*22dc650dSSadaf Ebrahimia lookbehind assertion are restricted such that there must be a known maximum
2512*22dc650dSSadaf Ebrahimito the lengths of all the strings it matches. There are two cases:
2513*22dc650dSSadaf Ebrahimi</P>
2514*22dc650dSSadaf Ebrahimi<P>
2515*22dc650dSSadaf EbrahimiIf every top-level alternative matches a fixed length, for example
2516*22dc650dSSadaf Ebrahimi<pre>
2517*22dc650dSSadaf Ebrahimi  (?&#60;=colour|color)
2518*22dc650dSSadaf Ebrahimi</pre>
2519*22dc650dSSadaf Ebrahimithere is a limit of 65535 characters to the lengths, which do not have to be
2520*22dc650dSSadaf Ebrahimithe same, as this example demonstrates. This is the only kind of lookbehind
2521*22dc650dSSadaf Ebrahimisupported by PCRE2 versions earlier than 10.43 and by the alternative matching
2522*22dc650dSSadaf Ebrahimifunction <b>pcre2_dfa_match()</b>.
2523*22dc650dSSadaf Ebrahimi</P>
2524*22dc650dSSadaf Ebrahimi<P>
2525*22dc650dSSadaf EbrahimiIn PCRE2 10.43 and later, <b>pcre2_match()</b> supports lookbehind assertions in
2526*22dc650dSSadaf Ebrahimiwhich one or more top-level alternatives can match more than one string length,
2527*22dc650dSSadaf Ebrahimifor example
2528*22dc650dSSadaf Ebrahimi<pre>
2529*22dc650dSSadaf Ebrahimi  (?&#60;=colou?r)
2530*22dc650dSSadaf Ebrahimi</pre>
2531*22dc650dSSadaf EbrahimiThe maximum matching length for any branch of the lookbehind is limited to a
2532*22dc650dSSadaf Ebrahimivalue set by the calling program (default 255 characters). Unlimited repetition
2533*22dc650dSSadaf Ebrahimi(for example \d*) is not supported. In some cases, the escape sequence \K
2534*22dc650dSSadaf Ebrahimi<a href="#resetmatchstart">(see above)</a>
2535*22dc650dSSadaf Ebrahimican be used instead of a lookbehind assertion at the start of a pattern to get
2536*22dc650dSSadaf Ebrahimiround the length limit restriction.
2537*22dc650dSSadaf Ebrahimi</P>
2538*22dc650dSSadaf Ebrahimi<P>
2539*22dc650dSSadaf EbrahimiIn UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
2540*22dc650dSSadaf Ebrahimisingle code unit even in a UTF mode) to appear in lookbehind assertions,
2541*22dc650dSSadaf Ebrahimibecause it makes it impossible to calculate the length of the lookbehind. The
2542*22dc650dSSadaf Ebrahimi\X and \R escapes, which can match different numbers of code units, are never
2543*22dc650dSSadaf Ebrahimipermitted in lookbehinds.
2544*22dc650dSSadaf Ebrahimi</P>
2545*22dc650dSSadaf Ebrahimi<P>
2546*22dc650dSSadaf Ebrahimi<a href="#groupsassubroutines">"Subroutine"</a>
2547*22dc650dSSadaf Ebrahimicalls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2548*22dc650dSSadaf Ebrahimias the called capture group matches a limited-length string. However,
2549*22dc650dSSadaf Ebrahimi<a href="#recursion">recursion,</a>
2550*22dc650dSSadaf Ebrahimithat is, a "subroutine" call into a group that is already active,
2551*22dc650dSSadaf Ebrahimiis not supported.
2552*22dc650dSSadaf Ebrahimi</P>
2553*22dc650dSSadaf Ebrahimi<P>
2554*22dc650dSSadaf EbrahimiPCRE2 supports backreferences in lookbehinds, but only if certain conditions
2555*22dc650dSSadaf Ebrahimiare met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no
2556*22dc650dSSadaf Ebrahimiuse of (?| in the pattern (it creates duplicate group numbers), and if the
2557*22dc650dSSadaf Ebrahimibackreference is by name, the name must be unique. Of course, the referenced
2558*22dc650dSSadaf Ebrahimigroup must itself match a limited length substring. The following pattern
2559*22dc650dSSadaf Ebrahimimatches words containing at least two characters that begin and end with the
2560*22dc650dSSadaf Ebrahimisame character:
2561*22dc650dSSadaf Ebrahimi<pre>
2562*22dc650dSSadaf Ebrahimi   \b(\w)\w++(?&#60;=\1)
2563*22dc650dSSadaf Ebrahimi</PRE>
2564*22dc650dSSadaf Ebrahimi</P>
2565*22dc650dSSadaf Ebrahimi<P>
2566*22dc650dSSadaf EbrahimiPossessive quantifiers can be used in conjunction with lookbehind assertions to
2567*22dc650dSSadaf Ebrahimispecify efficient matching at the end of subject strings. Consider a simple
2568*22dc650dSSadaf Ebrahimipattern such as
2569*22dc650dSSadaf Ebrahimi<pre>
2570*22dc650dSSadaf Ebrahimi  abcd$
2571*22dc650dSSadaf Ebrahimi</pre>
2572*22dc650dSSadaf Ebrahimiwhen applied to a long string that does not match. Because matching proceeds
2573*22dc650dSSadaf Ebrahimifrom left to right, PCRE2 will look for each "a" in the subject and then see if
2574*22dc650dSSadaf Ebrahimiwhat follows matches the rest of the pattern. If the pattern is specified as
2575*22dc650dSSadaf Ebrahimi<pre>
2576*22dc650dSSadaf Ebrahimi  ^.*abcd$
2577*22dc650dSSadaf Ebrahimi</pre>
2578*22dc650dSSadaf Ebrahimithe initial .* matches the entire string at first, but when this fails (because
2579*22dc650dSSadaf Ebrahimithere is no following "a"), it backtracks to match all but the last character,
2580*22dc650dSSadaf Ebrahimithen all but the last two characters, and so on. Once again the search for "a"
2581*22dc650dSSadaf Ebrahimicovers the entire string, from right to left, so we are no better off. However,
2582*22dc650dSSadaf Ebrahimiif the pattern is written as
2583*22dc650dSSadaf Ebrahimi<pre>
2584*22dc650dSSadaf Ebrahimi  ^.*+(?&#60;=abcd)
2585*22dc650dSSadaf Ebrahimi</pre>
2586*22dc650dSSadaf Ebrahimithere can be no backtracking for the .*+ item because of the possessive
2587*22dc650dSSadaf Ebrahimiquantifier; it can match only the entire string. The subsequent lookbehind
2588*22dc650dSSadaf Ebrahimiassertion does a single test on the last four characters. If it fails, the
2589*22dc650dSSadaf Ebrahimimatch fails immediately. For long strings, this approach makes a significant
2590*22dc650dSSadaf Ebrahimidifference to the processing time.
2591*22dc650dSSadaf Ebrahimi</P>
2592*22dc650dSSadaf Ebrahimi<br><b>
2593*22dc650dSSadaf EbrahimiUsing multiple assertions
2594*22dc650dSSadaf Ebrahimi</b><br>
2595*22dc650dSSadaf Ebrahimi<P>
2596*22dc650dSSadaf EbrahimiSeveral assertions (of any sort) may occur in succession. For example,
2597*22dc650dSSadaf Ebrahimi<pre>
2598*22dc650dSSadaf Ebrahimi  (?&#60;=\d{3})(?&#60;!999)foo
2599*22dc650dSSadaf Ebrahimi</pre>
2600*22dc650dSSadaf Ebrahimimatches "foo" preceded by three digits that are not "999". Notice that each of
2601*22dc650dSSadaf Ebrahimithe assertions is applied independently at the same point in the subject
2602*22dc650dSSadaf Ebrahimistring. First there is a check that the previous three characters are all
2603*22dc650dSSadaf Ebrahimidigits, and then there is a check that the same three characters are not "999".
This pattern does <i>not</i> match "foo" preceded by six characters, the first
three of which are digits and the last three of which are not "999". For
example, it doesn't match "123abcfoo". A pattern to do that is
2607*22dc650dSSadaf Ebrahimi<pre>
2608*22dc650dSSadaf Ebrahimi  (?&#60;=\d{3}...)(?&#60;!999)foo
2609*22dc650dSSadaf Ebrahimi</pre>
2610*22dc650dSSadaf EbrahimiThis time the first assertion looks at the preceding six characters, checking
2611*22dc650dSSadaf Ebrahimithat the first three are digits, and then the second assertion checks that the
2612*22dc650dSSadaf Ebrahimipreceding three characters are not "999".
2613*22dc650dSSadaf Ebrahimi</P>
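<P>
Both patterns can be tried against a few subjects. The sketch uses Python's
<b>re</b> module as a stand-in for PCRE2 (both lookbehinds have a fixed length,
which <b>re</b> requires), and the subject strings are chosen for illustration.
</P>

```python
import re

# Both assertions inspect the same three characters before "foo".
three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion inspects six characters, the second only the last three.
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ("123foo", "999foo", "123abcfoo", "123999foo")
three_results = [bool(three.search(s)) for s in subjects]
six_results = [bool(six.search(s)) for s in subjects]

print(three_results)
print(six_results)
```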
2614*22dc650dSSadaf Ebrahimi<P>
2615*22dc650dSSadaf EbrahimiAssertions can be nested in any combination. For example,
2616*22dc650dSSadaf Ebrahimi<pre>
2617*22dc650dSSadaf Ebrahimi  (?&#60;=(?&#60;!foo)bar)baz
2618*22dc650dSSadaf Ebrahimi</pre>
2619*22dc650dSSadaf Ebrahimimatches an occurrence of "baz" that is preceded by "bar" which in turn is not
2620*22dc650dSSadaf Ebrahimipreceded by "foo", while
2621*22dc650dSSadaf Ebrahimi<pre>
2622*22dc650dSSadaf Ebrahimi  (?&#60;=\d{3}(?!999)...)foo
2623*22dc650dSSadaf Ebrahimi</pre>
2624*22dc650dSSadaf Ebrahimiis another pattern that matches "foo" preceded by three digits and any three
2625*22dc650dSSadaf Ebrahimicharacters that are not "999".
2626*22dc650dSSadaf Ebrahimi<a name="nonatomicassertions"></a></P>
2627*22dc650dSSadaf Ebrahimi<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
2628*22dc650dSSadaf Ebrahimi<P>
2629*22dc650dSSadaf EbrahimiTraditional lookaround assertions are atomic. That is, if an assertion is true,
2630*22dc650dSSadaf Ebrahimibut there is a subsequent matching failure, there is no backtracking into the
2631*22dc650dSSadaf Ebrahimiassertion. However, there are some cases where non-atomic positive assertions
2632*22dc650dSSadaf Ebrahimican be useful. PCRE2 provides these using the following syntax:
2633*22dc650dSSadaf Ebrahimi<pre>
2634*22dc650dSSadaf Ebrahimi  (*non_atomic_positive_lookahead:  or (*napla: or (?*
2635*22dc650dSSadaf Ebrahimi  (*non_atomic_positive_lookbehind: or (*naplb: or (?&#60;*
2636*22dc650dSSadaf Ebrahimi</pre>
2637*22dc650dSSadaf EbrahimiConsider the problem of finding the right-most word in a string that also
2638*22dc650dSSadaf Ebrahimiappears earlier in the string, that is, it must appear at least twice in total.
2639*22dc650dSSadaf EbrahimiThis pattern returns the required result as captured substring 1:
2640*22dc650dSSadaf Ebrahimi<pre>
2641*22dc650dSSadaf Ebrahimi  ^(?x)(*napla: .* \b(\w++)) (?&#62; .*? \b\1\b ){2}
2642*22dc650dSSadaf Ebrahimi</pre>
2643*22dc650dSSadaf EbrahimiFor a subject such as "word1 word2 word3 word2 word3 word4" the result is
2644*22dc650dSSadaf Ebrahimi"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
2645*22dc650dSSadaf Ebrahimi"x" option, which causes white space (introduced for readability) to be
2646*22dc650dSSadaf Ebrahimiignored. Inside the assertion, the greedy .* at first consumes the entire
2647*22dc650dSSadaf Ebrahimistring, but then has to backtrack until the rest of the assertion can match a
2648*22dc650dSSadaf Ebrahimiword, which is captured by group 1. In other words, when the assertion first
2649*22dc650dSSadaf Ebrahimisucceeds, it captures the right-most word in the string.
2650*22dc650dSSadaf Ebrahimi</P>
<P>
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
</P>
<P>
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
</P>
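<P>
The search order described above can be mimicked procedurally. The following
Python sketch is only an illustration of that order, not of how PCRE2 is
implemented: Python's <b>re</b> module has no non-atomic assertions, and the
function name rightmost_repeated_word is hypothetical.
</P>

```python
import re


def rightmost_repeated_word(subject):
    """Return the right-most word that occurs at least twice, or None."""
    # Candidate words in the order the backtracking lookahead would capture
    # them: the right-most word first, then the second-last word, and so on.
    candidates = [m.group() for m in re.finditer(r"\b\w+", subject)]
    for word in reversed(candidates):
        # Stand-in for (?> .*? \b\1\b ){2}: count whole-word occurrences.
        occurrences = re.findall(r"\b" + re.escape(word) + r"\b", subject)
        if len(occurrences) >= 2:
            return word
    return None


print(rightmost_repeated_word("word1 word2 word3 word2 word3 word4"))
```

<P>
For the subject used in the text, the loop rejects "word4" (one occurrence) and
then accepts "word3", which corresponds to the single backtrack into the
assertion.
</P>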
<P>
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
</P>
<P>
There is one exception to backtracking into a non-atomic assertion. If an
(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
is, a subsequent match failure cannot backtrack into the assertion.
</P>
<P>
Non-atomic assertions are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>. They are supported by JIT, but only if they do not
contain any control verbs such as (*ACCEPT). (This may change in the future.)
Note that assertions that appear as conditions for
<a href="#conditions">conditional groups</a>
(see below) must be atomic.
</P>
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P>
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
commonly used together, and because some diacritical and other marks are used
with multiple scripts, it is not that simple. There is a full description of
the rules that PCRE2 uses in the section entitled
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
script run. After a failure, normal backtracking occurs. Script runs can be
used to detect spoofing attacks using characters that look the same, but are
from different scripts. The string "paypal.com" is an infamous example, where
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
the matched characters in a sequence of non-spaces that follow white space are
a script run:
<pre>
  \s+(*sr:\S+)
</pre>
To be sure that they are all from the Latin script (for example), a lookahead
can be used:
<pre>
  \s+(?=\p{Latin})(*sr:\S+)
</pre>
This works as long as the first character is expected to be a character in that
script, and not (for example) punctuation, which is allowed with any script. If
this is not the case, a more creative lookahead is needed. For example, if
digits, underscore, and dots are permitted at the start:
<pre>
  \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)

</PRE>
</P>
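<P>
To make the spoofing scenario concrete, the following Python sketch flags
strings whose letters appear to come from more than one script. It is a crude
stand-in and does not implement the PCRE2 script run rules: Python's standard
library does not expose the Unicode Script property, so the sketch assumes that
the first word of a character's Unicode name (for example "LATIN" or
"CYRILLIC") is a good enough label. The function names are hypothetical.
</P>

```python
import unicodedata


def rough_scripts(text):
    """Collect a crude script label for each alphabetic character."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Heuristic (an assumption of this sketch): Unicode names start
            # with the script name, e.g. "LATIN SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_like_single_script(text):
    """True if the letters in text appear to come from at most one script."""
    return len(rough_scripts(text)) <= 1


genuine = "paypal.com"
spoofed = "p\u0430yp\u0430l.com"  # U+0430 is CYRILLIC SMALL LETTER A
print(looks_like_single_script(genuine), looks_like_single_script(spoofed))
```

<P>
Digits and punctuation are ignored here, which loosely mirrors the remark that
punctuation is allowed with any script. Scripts that are legitimately used
together are not handled.
</P>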
<P>
In many cases, backtracking into a script run pattern fragment is not
desirable. The script run can employ an atomic group to prevent this. Because
this is a common requirement, a shorthand notation is provided by
(*atomic_script_run: or (*asr:
<pre>
  (*asr:...) is the same as (*sr:(?&#62;...))
</pre>
Note that the atomic group is inside the script run. Putting it outside would
not prevent backtracking into the script run pattern.
</P>
<P>
Support for script runs is not available if PCRE2 is compiled without Unicode
support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>, because they use the same mechanism as capturing
parentheses.
</P>
<P>
<b>Warning:</b> The (*ACCEPT) control verb
<a href="#acceptverb">(see below)</a>
should not be used within a script run group, because it causes an immediate
exit from the group, bypassing the script run checking.
<a name="conditions"></a></P>
<br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
<P>
It is possible to cause the matching process to obey a pattern fragment
conditionally or to choose between two alternative fragments, depending on
the result of an assertion, or whether a specific capture group has
already been matched. The two possible forms of conditional group are:
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
</pre>
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
string (it always matches). If there are more than two alternatives in the
group, a compile-time error occurs. Each of the two alternatives may itself
contain nested groups of any form, including conditional groups; the
restriction to two alternatives applies only at the level of the condition
itself. This pattern fragment is an example where the alternatives are complex:
<pre>
  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

</PRE>
</P>
<P>
There are five kinds of condition: references to capture groups, references to
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
</P>
<br><b>
Checking for a used capture group by number
</b><br>
<P>
If the text between the parentheses consists of a sequence of digits, the
condition is true if a capture group of that number has previously matched. If
there is more than one capture group with the same number (see the earlier
<a href="#dupgroupnumber">section about duplicate group numbers</a>),
the condition is true if any of them have matched. An alternative notation,
which is a PCRE2 extension, not supported by Perl, is to precede the digits
with a plus or minus sign. In this case, the group number is relative rather
than absolute. The most recently opened capture group (which could be enclosing
this condition) can be referenced by (?(-1), the next most recent by (?(-2),
and so on. Inside loops it can also make sense to refer to subsequent groups.
The next capture group to be opened can be referenced as (?(+1), and so on. The
value zero in any of these forms is not used; it provokes a compile-time error.
</P>
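<P>
The arithmetic behind relative group numbers can be sketched as follows. This
is an illustration in Python, not PCRE2 code: the function name and its
groups_opened parameter are hypothetical, duplicate group numbers are ignored,
and the error raised for a reference to a group that cannot exist is an
assumption of the sketch.
</P>

```python
def resolve_relative_group(offset, groups_opened):
    """Turn a signed relative group reference into an absolute group number.

    offset        -- the signed number, e.g. -1 for (?(-1) or +1 for (?(+1)
    groups_opened -- how many capture groups have been opened to the left of
                     the reference (hypothetical bookkeeping for this sketch)
    """
    if offset == 0:
        # The text states that zero provokes a compile-time error.
        raise ValueError("a relative value of zero is not allowed")
    if offset < 0:
        number = groups_opened + 1 + offset  # -1 is the most recently opened
    else:
        number = groups_opened + offset      # +1 is the next group to open
    if number < 1:
        raise ValueError("reference to a group that does not exist")
    return number


# With three groups already opened: -1 is group 3, -2 is group 2, +1 is group 4.
print(resolve_relative_group(-1, 3),
      resolve_relative_group(-2, 3),
      resolve_relative_group(+1, 3))
```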
<P>
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
three parts for ease of discussion:
<pre>
  ( \( )?    [^()]+    (?(1) \) )
</pre>
The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The second part
matches one or more characters that are not parentheses. The third part is a
conditional group that tests whether or not the first capture group
matched. If it did, that is, if the subject started with an opening
parenthesis, the condition is true, and so the yes-pattern is executed and a
closing parenthesis is required. Otherwise, since no-pattern is not present,
the conditional group matches nothing. In other words, this pattern matches a
sequence of non-parentheses, optionally enclosed in parentheses.
</P>
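<P>
The behaviour is easy to try out. The sketch below uses Python's <b>re</b>
module rather than PCRE2 itself; <b>re</b> accepts the same conditional syntax
for numbered groups, and re.VERBOSE stands in for PCRE2_EXTENDED. Whole-string
matching is used so that unbalanced subjects are rejected instead of matching
a substring.
</P>

```python
import re

# Same three parts as in the text: optional "(", the body, conditional ")".
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

for subject in ["abc", "(abc)", "(abc", "abc)"]:
    matched = pattern.fullmatch(subject) is not None
    print(f"{subject!r}: {'match' if matched else 'no match'}")
```

<P>
The first two subjects match; the last two do not, because a parenthesis
appears on one side only.
</P>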
<P>
If you were embedding this pattern in a larger one, you could use a relative
reference:
<pre>
  ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
</pre>
This makes the fragment independent of the parentheses in the larger pattern.
</P>
<br><b>
Checking for a used capture group by name
</b><br>
<P>
Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
capture group by name. For compatibility with earlier versions of PCRE1, which
had this facility before Perl, the syntax (?(name)...) is also recognized.
Note, however, that undelimited names consisting of the letter R followed by
digits are ambiguous (see the following section). Rewriting the above example
to use a named group gives this:
<pre>
  (?&#60;OPEN&#62; \( )?    [^()]+    (?(&#60;OPEN&#62;) \) )
</pre>
If the name used in a condition of this kind is a duplicate, the test is
applied to all groups of the same name, and is true if any one of them has
matched.
</P>
<br><b>
Checking for pattern recursion
</b><br>
<P>
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
<a href="#recursion">"Recursive patterns"</a>
and
<a href="#groupsassubroutines">"Groups as subroutines"</a>
below for details of recursion and subroutine calls.
</P>
<P>
If a condition is the string (R), and there is no capture group with the name
R, the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any capture group. If digits follow the letter R,
and there is no group with that name, the condition is true if the most recent
call is into a group with the given number, which must exist somewhere in the
overall pattern. This is a contrived example that is equivalent to a+b:
<pre>
  ((?(R1)a+|(?1)b))
</pre>
However, in both cases, if there is a capture group with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?&#60;R1&#62;) to the above pattern completely changes its meaning.
</P>
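<P>
Why the contrived pattern behaves like a+b can be seen by tracing it by hand.
The Python sketch below does that with an ordinary function: the boolean
parameter plays the role of the (R1) test. Python's <b>re</b> module has no
recursion conditions, so this is an emulation under that assumption, and the
function name group1 is hypothetical.
</P>

```python
import re


def group1(subject, pos=0, called_as_group1=False):
    """Emulate ((?(R1)a+|(?1)b)); return the end offset or None."""
    if called_as_group1:
        # (R1) is true: we were reached through the (?1) call, so use a+.
        m = re.compile(r"a+").match(subject, pos)
        return m.end() if m else None
    # Top level: (R1) is false, so take the no-pattern, (?1)b.
    end = group1(subject, pos, called_as_group1=True)
    if end is not None and subject.startswith("b", end):
        return end + 1
    return None


# The emulation agrees with the plain pattern a+b on these subjects.
for subject in ["ab", "aaab", "b", "aaa"]:
    expected = re.match(r"a+b", subject)
    assert group1(subject) == (expected.end() if expected else None)
```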
<P>
If a name preceded by ampersand follows the letter R, for example:
<pre>
  (?(R&name)...)
</pre>
the condition is true if the most recent recursion is into a group of that name
(which must exist within the pattern).
</P>
<P>
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all groups of the same name, and is true if any one of
them is the most recent recursion.
</P>
<P>
At "top level", all these recursion test conditions are false.
<a name="subdefine"></a></P>
<br><b>
Defining capture groups for use by reference only
</b><br>
<P>
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the rest of the conditional group. It is always skipped if
control reaches this point in the pattern; the idea of DEFINE is that it can be
used to define subroutines that can be referenced from elsewhere. (The use of
<a href="#groupsassubroutines">subroutines</a>
is described below.) For example, a pattern to match an IPv4 address such as
"192.168.23.245" could be written like this (ignore white space and line
breaks):
<pre>
  (?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b
</pre>
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
pattern is skipped because DEFINE acts like a false condition. The rest of the
pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end.
</P>
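<P>
The "define once, reference several times" idea can be approximated in regex
flavours that lack DEFINE by building the pattern from a string fragment. The
Python sketch below does this with the <b>re</b> module, which is not PCRE2 and
has no subroutine calls; the names BYTE and IPV4 are hypothetical, and string
substitution is a different technique that only happens to give the same
matches here.
</P>

```python
import re

# Written once, like the "byte" group inside (?(DEFINE)...).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Each {BYTE} substitution stands in for a (?&byte) subroutine call.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

print(bool(IPV4.fullmatch("192.168.23.245")))
print(bool(IPV4.fullmatch("192.168.23.256")))
```

<P>
Unlike a real subroutine call, the substitution copies the fragment four times
into the compiled pattern.
</P>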
<br><b>
Checking the PCRE2 version
</b><br>
<P>
Programs that link with a PCRE2 library can check the version by calling
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or "&#62;=" and a version number.
For example:
<pre>
  (?(VERSION&#62;=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater than or equal to
10.4, or "no" otherwise. The fractional part of the version number may not
contain more than two digits.
</P>
<br><b>
Assertion conditions
</b><br>
<P>
If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind
assertion. However, it must be a traditional atomic assertion, not one of the
<a href="#nonatomicassertions">non-atomic assertions.</a>
</P>
<P>
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
<pre>
  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
</pre>
The condition is a positive lookahead assertion that matches an optional
sequence of non-letters followed by a letter. In other words, it tests for the
presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
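<P>
Python's <b>re</b> module does not accept an assertion as a condition, but the
same choice can be written as a pair of alternatives guarded by a positive and
a negative copy of the lookahead: (?(?=A)X|Y) selects the same branch as
(?:(?=A)X|(?!A)Y). The sketch below applies that rewriting. It is an
illustration in a different regex flavour, and the equivalence is claimed only
for which strings match, not for captured substrings.
</P>

```python
import re

pattern = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # condition true: yes-pattern
      | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # condition false: no-pattern
    )
    """,
    re.VERBOSE,
)

for subject in ["12-abc-34", "12-34-56", "12-ab-34"]:
    print(subject, pattern.fullmatch(subject) is not None)
```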
<P>
When an assertion that is a condition contains capture groups, any
capturing that occurs in a matching branch is retained afterwards, for both
positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
for which captures are retained only for positive assertions that succeed.)
<a name="comments"></a></P>
<br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character
class, nor in the middle of any other sequence of related characters such as
(?: or a group name or number. The characters that make up a comment play
no part in the pattern matching.
</P>
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
also introduces a comment, which in this case continues to immediately after
the next newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by an option passed to the
compiling function or by a special sequence at the start of the pattern, as
described in the section entitled
<a href="#newlines">"Newline conventions"</a>
above. Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do not
count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
default newline convention (a single linefeed character) is in force:
<pre>
  abc #comment \n still comment
</pre>
On encountering the # character, <b>pcre2_compile()</b> skips along, looking for
a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
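<P>
The distinction between an escape sequence and a real linefeed is easy to get
wrong when a pattern is written inside a host-language string. The sketch below
uses Python's <b>re</b> module with re.VERBOSE as a stand-in for
PCRE2_EXTENDED; it shows Python's behaviour, which follows the same rule for
this case, and is not a test of PCRE2 itself.
</P>

```python
import re

# Raw string: the pattern contains a backslash followed by "n", not a newline,
# so the comment runs to the end of the pattern and only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed, which ends the comment, so the
# pattern is equivalent to "abcdef".
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next closing parenthesis.
inline = re.compile(r"ab(?#a comment)c")

print(bool(escaped.fullmatch("abc")),
      bool(literal.fullmatch("abcdef")),
      bool(inline.fullmatch("abc")))
```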
<br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
be done is to use a pattern that matches up to some fixed depth of nesting. It
is not possible to handle an arbitrary nesting depth.
</P>
<P>
For some time, Perl has provided a facility that allows regular expressions to
recurse (amongst other things). It does this by interpolating Perl code in the
expression at run time, and the code can refer to the expression itself. A Perl
pattern using code interpolation to solve the parentheses problem can be
created like this:
<pre>
  $re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
</pre>
The (?p{...}) item interpolates Perl code at run time, and in this case refers
recursively to the pattern in which it appears.
</P>
<P>
Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual capture group recursion. After its introduction in PCRE1 and Python,
this kind of recursion was subsequently introduced into Perl at release 5.10.
</P>
<P>
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive subroutine call of the capture group of the
given number, provided that it occurs inside that group. (If not, it is a
<a href="#groupsassubroutines">non-recursive subroutine</a>
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
</P>
<P>
This PCRE2 pattern solves the nested parentheses problem (assume the
PCRE2_EXTENDED option is set so that white space is ignored):
<pre>
  \( ( [^()]++ | (?R) )* \)
</pre>
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
Finally there is a closing parenthesis. Note the use of a possessive quantifier
to avoid backtracking into sequences of non-parentheses.
</P>
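<P>
The three steps just described map directly onto a small recursive function.
The Python sketch below is a hand-written matcher, not a use of PCRE2 (Python's
<b>re</b> module has no pattern recursion); the function name match_balanced is
hypothetical. It returns the offset just past the closing parenthesis, or None
if the text at the given position is not correctly parenthesized.
</P>

```python
def match_balanced(subject, pos=0):
    """Emulate \\( ( [^()]++ | (?R) )* \\) anchored at pos."""
    # First: an opening parenthesis.
    if not subject.startswith("(", pos):
        return None
    pos += 1
    # Then: any number of non-parenthesis runs or nested balanced groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1           # Finally: the closing parenthesis.
        if ch == "(":
            end = match_balanced(subject, pos)   # the (?R) call
            if end is None:
                return None
            pos = end
        else:
            pos += 1                 # one more character of a [^()]++ run
    return None                      # ran out of subject before ")"


print(match_balanced("(ab(cd)ef)"), match_balanced("(ab(cd"))
```

<P>
Because the function never revisits a character it has consumed, it behaves
like the possessive form of the pattern and fails quickly on unbalanced input.
</P>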
<P>
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
<pre>
  ( \( ( [^()]++ | (?1) )* \) )
</pre>
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern.
</P>
<P>
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
is made easier by the use of relative references. Instead of (?1) in the
pattern above you can write (?-2) to refer to the second most recently opened
parentheses preceding the recursion. In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
</P>
<P>
Be aware, however, that if
<a href="#dupgroupnumber">duplicate capture group numbers</a>
are in use, relative references refer to the earliest group with the
appropriate number. Consider, for example:
<pre>
  (?|(a)|(b)) (c) (?-2)
</pre>
The first two capture groups (a) and (b) are both numbered 1, and group (c)
is number 2. When the reference (?-2) is encountered, the second most recently
opened group has the number 1, but it is the first such group (the (a)
group) to which the recursion refers. This would be the same if an absolute
reference (?1) was used. In other words, relative references are just a
shorthand for computing a group number.
</P>
<P>
It is also possible to refer to subsequent capture groups, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
<a href="#groupsassubroutines">non-recursive subroutine</a>
calls, as described in the next section.
</P>
<P>
An alternative approach is to use named parentheses. The Perl syntax for this
is (?&name); PCRE1's earlier syntax (?P&#62;name) is also supported. We could
rewrite the above example as follows:
<pre>
  (?&#60;pn&#62; \( ( [^()]++ | (?&pn) )* \) )
</pre>
If there is more than one group with the same name, the earliest one is
used.
</P>
3073*22dc650dSSadaf Ebrahimi<P>
3074*22dc650dSSadaf EbrahimiThe example pattern that we have been looking at contains nested unlimited
3075*22dc650dSSadaf Ebrahimirepeats, and so the use of a possessive quantifier for matching strings of
3076*22dc650dSSadaf Ebrahiminon-parentheses is important when applying the pattern to strings that do not
3077*22dc650dSSadaf Ebrahimimatch. For example, when this pattern is applied to
3078*22dc650dSSadaf Ebrahimi<pre>
3079*22dc650dSSadaf Ebrahimi  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3080*22dc650dSSadaf Ebrahimi</pre>
3081*22dc650dSSadaf Ebrahimiit yields "no match" quickly. However, if a possessive quantifier is not used,
3082*22dc650dSSadaf Ebrahimithe match runs for a very long time indeed because there are so many different
3083*22dc650dSSadaf Ebrahimiways the + and * repeats can carve up the subject, and all have to be tested
3084*22dc650dSSadaf Ebrahimibefore failure can be reported.
3085*22dc650dSSadaf Ebrahimi</P>
3086*22dc650dSSadaf Ebrahimi<P>
3087*22dc650dSSadaf EbrahimiAt the end of a match, the values of capturing parentheses are those from
3088*22dc650dSSadaf Ebrahimithe outermost level. If you want to obtain intermediate values, a callout
3089*22dc650dSSadaf Ebrahimifunction can be used (see below and the
3090*22dc650dSSadaf Ebrahimi<a href="pcre2callout.html"><b>pcre2callout</b></a>
3091*22dc650dSSadaf Ebrahimidocumentation). If the pattern above is matched against
3092*22dc650dSSadaf Ebrahimi<pre>
3093*22dc650dSSadaf Ebrahimi  (ab(cd)ef)
3094*22dc650dSSadaf Ebrahimi</pre>
3095*22dc650dSSadaf Ebrahimithe value for the inner capturing parentheses (numbered 2) is "ef", which is
3096*22dc650dSSadaf Ebrahimithe last value taken on at the top level. If a capture group is not matched at
3097*22dc650dSSadaf Ebrahimithe top level, its final captured value is unset, even if it was (temporarily)
3098*22dc650dSSadaf Ebrahimiset at a deeper level during the matching process.
3099*22dc650dSSadaf Ebrahimi</P>
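<P>
The capture rule can be made concrete with a hand-written emulation. The Python
sketch below does not use PCRE2; it mimics the named-group pattern
(?&#60;pn&#62; \( ( [^()]++ | (?&pn) )* \) ) for a match that starts at offset
zero, and the function names are hypothetical. Group 2 is recorded only at
depth 0, which models the way values set inside a recursion are discarded when
the recursion returns.
</P>

```python
def match_group(subject, pos, depth, captures):
    # Hand-written emulation of: (?<pn> \( ( [^()]++ | (?&pn) )* \) )
    open_pos = pos
    if subject[pos:pos + 1] != "(":
        return None
    pos += 1
    while True:
        start = pos
        # First alternative of group 2: a possessive run of non-parentheses.
        while pos != len(subject) and subject[pos] not in "()":
            pos += 1
        if pos == start:
            # Second alternative of group 2: a recursive call of group pn.
            end = match_group(subject, pos, depth + 1, captures)
            if end is None:
                break  # neither alternative matched, so the * repeat stops
            pos = end
        if depth == 0:
            # Only top-level iterations leave a lasting value in group 2.
            captures[2] = subject[start:pos]
    if subject[pos:pos + 1] != ")":
        return None
    if depth == 0:
        captures[1] = subject[open_pos:pos + 1]
    return pos + 1


def match_balanced(subject):
    """Return the final captures as a dict, or None if there is no match."""
    captures = {}
    end = match_group(subject, 0, 0, captures)
    return captures if end is not None else None
```

<P>
Under these assumptions, match_balanced("(ab(cd)ef)") leaves "ef" in group 2,
and match_balanced("()") leaves group 2 unset because group 2 never matches at
the top level.
</P>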
3100*22dc650dSSadaf Ebrahimi<P>
3101*22dc650dSSadaf EbrahimiDo not confuse the (?R) item with the condition (R), which tests for recursion.
3102*22dc650dSSadaf EbrahimiConsider this pattern, which matches text in angle brackets, allowing for
3103*22dc650dSSadaf Ebrahimiarbitrary nesting. Only digits are allowed in nested brackets (that is, when
3104*22dc650dSSadaf Ebrahimirecursing), whereas any characters are permitted at the outer level.
3105*22dc650dSSadaf Ebrahimi<pre>
3106*22dc650dSSadaf Ebrahimi  &#60; (?: (?(R) \d++  | [^&#60;&#62;]*+) | (?R)) * &#62;
3107*22dc650dSSadaf Ebrahimi</pre>
3108*22dc650dSSadaf EbrahimiIn this pattern, (?(R) is the start of a conditional group, with two different
3109*22dc650dSSadaf Ebrahimialternatives for the recursive and non-recursive cases. The (?R) item is the
3110*22dc650dSSadaf Ebrahimiactual recursive call.
3111*22dc650dSSadaf Ebrahimi<a name="recursiondifference"></a></P>
3112*22dc650dSSadaf Ebrahimi<br><b>
3113*22dc650dSSadaf EbrahimiDifferences in recursion processing between PCRE2 and Perl
3114*22dc650dSSadaf Ebrahimi</b><br>
3115*22dc650dSSadaf Ebrahimi<P>
3116*22dc650dSSadaf EbrahimiSome former differences between PCRE2 and Perl no longer exist.
3117*22dc650dSSadaf Ebrahimi</P>
3118*22dc650dSSadaf Ebrahimi<P>
3119*22dc650dSSadaf EbrahimiBefore release 10.30, recursion processing in PCRE2 differed from Perl in that
3120*22dc650dSSadaf Ebrahimia recursive subroutine call was always treated as an atomic group. That is,
3121*22dc650dSSadaf Ebrahimionce it had matched some of the subject string, it was never re-entered, even
3122*22dc650dSSadaf Ebrahimiif it contained untried alternatives and there was a subsequent matching
3123*22dc650dSSadaf Ebrahimifailure. (Historical note: PCRE implemented recursion before Perl did.)
3124*22dc650dSSadaf Ebrahimi</P>
3125*22dc650dSSadaf Ebrahimi<P>
3126*22dc650dSSadaf EbrahimiStarting with release 10.30, recursive subroutine calls are no longer treated
3127*22dc650dSSadaf Ebrahimias atomic. That is, they can be re-entered to try unused alternatives if there
3128*22dc650dSSadaf Ebrahimiis a matching failure later in the pattern. This is now compatible with the way
3129*22dc650dSSadaf EbrahimiPerl works. If you want a subroutine call to be atomic, you must explicitly
3130*22dc650dSSadaf Ebrahimienclose it in an atomic group.
3131*22dc650dSSadaf Ebrahimi</P>
3132*22dc650dSSadaf Ebrahimi<P>
3133*22dc650dSSadaf EbrahimiSupporting backtracking into recursions simplifies certain types of recursive
3134*22dc650dSSadaf Ebrahimipattern. For example, this pattern matches palindromic strings:
3135*22dc650dSSadaf Ebrahimi<pre>
3136*22dc650dSSadaf Ebrahimi  ^((.)(?1)\2|.?)$
3137*22dc650dSSadaf Ebrahimi</pre>
3138*22dc650dSSadaf EbrahimiThe second branch in the group matches a single central character in the
3139*22dc650dSSadaf Ebrahimipalindrome when there are an odd number of characters, or nothing when there
3140*22dc650dSSadaf Ebrahimiare an even number of characters, but in order to work it has to be able to try
3141*22dc650dSSadaf Ebrahimithe second case when the rest of the pattern match fails. If you want to match
3142*22dc650dSSadaf Ebrahimitypical palindromic phrases, the pattern has to ignore all non-word characters,
3143*22dc650dSSadaf Ebrahimiwhich can be done like this:
3144*22dc650dSSadaf Ebrahimi<pre>
3145*22dc650dSSadaf Ebrahimi  ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
3146*22dc650dSSadaf Ebrahimi</pre>
3147*22dc650dSSadaf EbrahimiIf run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
3148*22dc650dSSadaf Ebrahimiman, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
3149*22dc650dSSadaf Ebrahimiavoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that it appears to have gone into a loop.
3152*22dc650dSSadaf Ebrahimi</P>
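<P>
The need for backtracking into the recursion can be seen in a small emulation.
The Python sketch below is not PCRE2 code: it is a hand-written model of
^((.)(?1)\2|.?)$ in which a generator yields each possible end offset of
group 1 in the order a backtracking matcher would try them. The
<b>atomic</b> flag is an assumption added for illustration; it keeps only the
first result of each recursive call, which approximates the behaviour before
release 10.30. The sketch ignores the fact that dot does not normally match a
newline.
</P>

```python
from itertools import islice


def group1(subject, pos, atomic):
    """Yield every end offset at which group 1 can finish, in trial order."""
    if pos != len(subject):
        first = subject[pos]                      # (.) sets group 2
        inner = group1(subject, pos + 1, atomic)  # (?1) recursive call
        if atomic:
            inner = islice(inner, 1)              # no re-entry after success
        for end in inner:
            if subject.startswith(first, end):    # \2 backreference
                yield end + 1
        yield pos + 1                             # .? matching one character
    yield pos                                     # .? matching nothing


def is_palindrome(subject, atomic=False):
    """Emulate ^((.)(?1)\\2|.?)$ against the whole subject."""
    return any(end == len(subject) for end in group1(subject, 0, atomic))
```

<P>
With the default setting, is_palindrome("abba") is true. With atomic=True the
same call is false, because the innermost recursion cannot be re-entered to
match the empty string after it has matched a single character.
</P>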
3153*22dc650dSSadaf Ebrahimi<P>
3154*22dc650dSSadaf EbrahimiAnother way in which PCRE2 and Perl used to differ in their recursion
3155*22dc650dSSadaf Ebrahimiprocessing is in the handling of captured values. Formerly in Perl, when a
3156*22dc650dSSadaf Ebrahimigroup was called recursively or as a subroutine (see the next section), it
3157*22dc650dSSadaf Ebrahimihad no access to any values that were captured outside the recursion, whereas
3158*22dc650dSSadaf Ebrahimiin PCRE2 these values can be referenced. Consider this pattern:
3159*22dc650dSSadaf Ebrahimi<pre>
3160*22dc650dSSadaf Ebrahimi  ^(.)(\1|a(?2))
3161*22dc650dSSadaf Ebrahimi</pre>
3162*22dc650dSSadaf EbrahimiThis pattern matches "bab". The first capturing parentheses match "b", then in
3163*22dc650dSSadaf Ebrahimithe second group, when the backreference \1 fails to match "b", the second
3164*22dc650dSSadaf Ebrahimialternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds. This match used to fail in Perl, but it
works in later versions (5.024 was tested).
3167*22dc650dSSadaf Ebrahimi<a name="groupsassubroutines"></a></P>
3168*22dc650dSSadaf Ebrahimi<br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
3169*22dc650dSSadaf Ebrahimi<P>
3170*22dc650dSSadaf EbrahimiIf the syntax for a recursive group call (either by number or by name) is used
3171*22dc650dSSadaf Ebrahimioutside the parentheses to which it refers, it operates a bit like a subroutine
3172*22dc650dSSadaf Ebrahimiin a programming language. More accurately, PCRE2 treats the referenced group
3173*22dc650dSSadaf Ebrahimias an independent subpattern which it tries to match at the current matching
3174*22dc650dSSadaf Ebrahimiposition. The called group may be defined before or after the reference. A
3175*22dc650dSSadaf Ebrahiminumbered reference can be absolute or relative, as in these examples:
3176*22dc650dSSadaf Ebrahimi<pre>
3177*22dc650dSSadaf Ebrahimi  (...(absolute)...)...(?2)...
3178*22dc650dSSadaf Ebrahimi  (...(relative)...)...(?-1)...
3179*22dc650dSSadaf Ebrahimi  (...(?+1)...(relative)...
3180*22dc650dSSadaf Ebrahimi</pre>
3181*22dc650dSSadaf EbrahimiAn earlier example pointed out that the pattern
3182*22dc650dSSadaf Ebrahimi<pre>
3183*22dc650dSSadaf Ebrahimi  (sens|respons)e and \1ibility
3184*22dc650dSSadaf Ebrahimi</pre>
3185*22dc650dSSadaf Ebrahimimatches "sense and sensibility" and "response and responsibility", but not
3186*22dc650dSSadaf Ebrahimi"sense and responsibility". If instead the pattern
3187*22dc650dSSadaf Ebrahimi<pre>
3188*22dc650dSSadaf Ebrahimi  (sens|respons)e and (?1)ibility
3189*22dc650dSSadaf Ebrahimi</pre>
3190*22dc650dSSadaf Ebrahimiis used, it does match "sense and responsibility" as well as the other two
3191*22dc650dSSadaf Ebrahimistrings. Another example is given in the discussion of DEFINE above.
3192*22dc650dSSadaf Ebrahimi</P>
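<P>
The difference between a backreference and a subroutine call can be checked
with Python's <b>re</b> module. This is a sketch under a stated assumption:
<b>re</b> has no (?1) syntax, so the subroutine call is emulated by writing the
group's pattern out a second time, which has the same effect for this example.
The names BACKREF, SUBROUTINE and matching_subjects are hypothetical.
</P>

```python
import re

# Backreference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine-style call, emulated by repeating the group's pattern, because
# Python's re module has no (?1) syntax.
SUBROUTINE = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

SUBJECTS = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]


def matching_subjects(pattern):
    """Return the subjects that the compiled pattern matches in full."""
    return [s for s in SUBJECTS if pattern.fullmatch(s)]
```

<P>
matching_subjects(BACKREF) returns the first two subjects only, whereas
matching_subjects(SUBROUTINE) returns all three.
</P>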
3193*22dc650dSSadaf Ebrahimi<P>
3194*22dc650dSSadaf EbrahimiLike recursions, subroutine calls used to be treated as atomic, but this
3195*22dc650dSSadaf Ebrahimichanged at PCRE2 release 10.30, so backtracking into subroutine calls can now
3196*22dc650dSSadaf Ebrahimioccur. However, any capturing parentheses that are set during the subroutine
3197*22dc650dSSadaf Ebrahimicall revert to their previous values afterwards.
3198*22dc650dSSadaf Ebrahimi</P>
3199*22dc650dSSadaf Ebrahimi<P>
3200*22dc650dSSadaf EbrahimiProcessing options such as case-independence are fixed when a group is
3201*22dc650dSSadaf Ebrahimidefined, so if it is used as a subroutine, such options cannot be changed for
3202*22dc650dSSadaf Ebrahimidifferent calls. For example, consider this pattern:
3203*22dc650dSSadaf Ebrahimi<pre>
3204*22dc650dSSadaf Ebrahimi  (abc)(?i:(?-1))
3205*22dc650dSSadaf Ebrahimi</pre>
3206*22dc650dSSadaf EbrahimiIt matches "abcabc". It does not match "abcABC" because the change of
3207*22dc650dSSadaf Ebrahimiprocessing option does not affect the called group.
3208*22dc650dSSadaf Ebrahimi</P>
3209*22dc650dSSadaf Ebrahimi<P>
3210*22dc650dSSadaf EbrahimiThe behaviour of
3211*22dc650dSSadaf Ebrahimi<a href="#backtrackcontrol">backtracking control verbs</a>
3212*22dc650dSSadaf Ebrahimiin groups when called as subroutines is described in the section entitled
3213*22dc650dSSadaf Ebrahimi<a href="#btsub">"Backtracking verbs in subroutines"</a>
3214*22dc650dSSadaf Ebrahimibelow.
3215*22dc650dSSadaf Ebrahimi<a name="onigurumasubroutines"></a></P>
3216*22dc650dSSadaf Ebrahimi<br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
3217*22dc650dSSadaf Ebrahimi<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes is an alternative
3220*22dc650dSSadaf Ebrahimisyntax for calling a group as a subroutine, possibly recursively. Here are two
3221*22dc650dSSadaf Ebrahimiof the examples used above, rewritten using this syntax:
3222*22dc650dSSadaf Ebrahimi<pre>
3223*22dc650dSSadaf Ebrahimi  (?&#60;pn&#62; \( ( (?&#62;[^()]+) | \g&#60;pn&#62; )* \) )
3224*22dc650dSSadaf Ebrahimi  (sens|respons)e and \g'1'ibility
3225*22dc650dSSadaf Ebrahimi</pre>
3226*22dc650dSSadaf EbrahimiPCRE2 supports an extension to Oniguruma: if a number is preceded by a
3227*22dc650dSSadaf Ebrahimiplus or a minus sign it is taken as a relative reference. For example:
3228*22dc650dSSadaf Ebrahimi<pre>
3229*22dc650dSSadaf Ebrahimi  (abc)(?i:\g&#60;-1&#62;)
3230*22dc650dSSadaf Ebrahimi</pre>
3231*22dc650dSSadaf EbrahimiNote that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
3232*22dc650dSSadaf Ebrahimisynonymous. The former is a backreference; the latter is a subroutine call.
3233*22dc650dSSadaf Ebrahimi</P>
3234*22dc650dSSadaf Ebrahimi<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
3235*22dc650dSSadaf Ebrahimi<P>
3236*22dc650dSSadaf EbrahimiPerl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3237*22dc650dSSadaf Ebrahimicode to be obeyed in the middle of matching a regular expression. This makes it
3238*22dc650dSSadaf Ebrahimipossible, amongst other things, to extract different substrings that match the
3239*22dc650dSSadaf Ebrahimisame pair of parentheses when there is a repetition.
3240*22dc650dSSadaf Ebrahimi</P>
3241*22dc650dSSadaf Ebrahimi<P>
3242*22dc650dSSadaf EbrahimiPCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3243*22dc650dSSadaf Ebrahimicode. The feature is called "callout". The caller of PCRE2 provides an external
3244*22dc650dSSadaf Ebrahimifunction by putting its entry point in a match context using the function
3245*22dc650dSSadaf Ebrahimi<b>pcre2_set_callout()</b>, and then passing that context to <b>pcre2_match()</b>
3246*22dc650dSSadaf Ebrahimior <b>pcre2_dfa_match()</b>. If no match context is passed, or if the callout
3247*22dc650dSSadaf Ebrahimientry point is set to NULL, callouts are disabled.
3248*22dc650dSSadaf Ebrahimi</P>
3249*22dc650dSSadaf Ebrahimi<P>
3250*22dc650dSSadaf EbrahimiWithin a regular expression, (?C&#60;arg&#62;) indicates a point at which the external
3251*22dc650dSSadaf Ebrahimifunction is to be called. There are two kinds of callout: those with a
3252*22dc650dSSadaf Ebrahiminumerical argument and those with a string argument. (?C) on its own with no
3253*22dc650dSSadaf Ebrahimiargument is treated as (?C0). A numerical argument allows the application to
3254*22dc650dSSadaf Ebrahimidistinguish between different callouts. String arguments were added for release
3255*22dc650dSSadaf Ebrahimi10.20 to make it possible for script languages that use PCRE2 to embed short
3256*22dc650dSSadaf Ebrahimiscripts within patterns in a similar way to Perl.
3257*22dc650dSSadaf Ebrahimi</P>
3258*22dc650dSSadaf Ebrahimi<P>
3259*22dc650dSSadaf EbrahimiDuring matching, when PCRE2 reaches a callout point, the external function is
3260*22dc650dSSadaf Ebrahimicalled. It is provided with the number or string argument of the callout, the
3261*22dc650dSSadaf Ebrahimiposition in the pattern, and one item of data that is also set in the match
3262*22dc650dSSadaf Ebrahimiblock. The callout function may cause matching to proceed, to backtrack, or to
3263*22dc650dSSadaf Ebrahimifail.
3264*22dc650dSSadaf Ebrahimi</P>
3265*22dc650dSSadaf Ebrahimi<P>
3266*22dc650dSSadaf EbrahimiBy default, PCRE2 implements a number of optimizations at matching time, and
3267*22dc650dSSadaf Ebrahimione side-effect is that sometimes callouts are skipped. If you need all
3268*22dc650dSSadaf Ebrahimipossible callouts to happen, you need to set options that disable the relevant
3269*22dc650dSSadaf Ebrahimioptimizations. More details, including a complete description of the
3270*22dc650dSSadaf Ebrahimiprogramming interface to the callout function, are given in the
3271*22dc650dSSadaf Ebrahimi<a href="pcre2callout.html"><b>pcre2callout</b></a>
3272*22dc650dSSadaf Ebrahimidocumentation.
3273*22dc650dSSadaf Ebrahimi</P>
3274*22dc650dSSadaf Ebrahimi<br><b>
3275*22dc650dSSadaf EbrahimiCallouts with numerical arguments
3276*22dc650dSSadaf Ebrahimi</b><br>
3277*22dc650dSSadaf Ebrahimi<P>
3278*22dc650dSSadaf EbrahimiIf you just want to have a means of identifying different callout points, put a
3279*22dc650dSSadaf Ebrahiminumber less than 256 after the letter C. For example, this pattern has two
3280*22dc650dSSadaf Ebrahimicallout points:
3281*22dc650dSSadaf Ebrahimi<pre>
3282*22dc650dSSadaf Ebrahimi  (?C1)abc(?C2)def
3283*22dc650dSSadaf Ebrahimi</pre>
3284*22dc650dSSadaf EbrahimiIf the PCRE2_AUTO_CALLOUT flag is passed to <b>pcre2_compile()</b>, numerical
3285*22dc650dSSadaf Ebrahimicallouts are automatically installed before each item in the pattern. They are
3286*22dc650dSSadaf Ebrahimiall numbered 255. If there is a conditional group in the pattern whose
3287*22dc650dSSadaf Ebrahimicondition is an assertion, an additional callout is inserted just before the
3288*22dc650dSSadaf Ebrahimicondition. An explicit callout may also be set at this position, as in this
3289*22dc650dSSadaf Ebrahimiexample:
3290*22dc650dSSadaf Ebrahimi<pre>
3291*22dc650dSSadaf Ebrahimi  (?(?C9)(?=a)abc|def)
3292*22dc650dSSadaf Ebrahimi</pre>
3293*22dc650dSSadaf EbrahimiNote that this applies only to assertion conditions, not to other types of
3294*22dc650dSSadaf Ebrahimicondition.
3295*22dc650dSSadaf Ebrahimi</P>
3296*22dc650dSSadaf Ebrahimi<br><b>
3297*22dc650dSSadaf EbrahimiCallouts with string arguments
3298*22dc650dSSadaf Ebrahimi</b><br>
3299*22dc650dSSadaf Ebrahimi<P>
3300*22dc650dSSadaf EbrahimiA delimited string may be used instead of a number as a callout argument. The
3301*22dc650dSSadaf Ebrahimistarting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
3302*22dc650dSSadaf Ebrahimithe same as the start, except for {, where the ending delimiter is }. If the
3303*22dc650dSSadaf Ebrahimiending delimiter is needed within the string, it must be doubled. For
3304*22dc650dSSadaf Ebrahimiexample:
3305*22dc650dSSadaf Ebrahimi<pre>
3306*22dc650dSSadaf Ebrahimi  (?C'ab ''c'' d')xyz(?C{any text})pqr
3307*22dc650dSSadaf Ebrahimi</pre>
3308*22dc650dSSadaf EbrahimiThe doubling is removed before the string is passed to the callout function.
3309*22dc650dSSadaf Ebrahimi<a name="backtrackcontrol"></a></P>
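<P>
The delimiter rules can be sketched as a small parser. The Python function
below is an illustration based only on the rules stated above; it is not
PCRE2's own parser, which may apply further checks, and the names CLOSERS and
parse_callout_string are hypothetical.
</P>

```python
# Opening delimiters listed in the text, mapped to their closing forms.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(item):
    """Return the argument of a string callout such as (?C'text')."""
    if not item.startswith("(?C") or not item.endswith(")"):
        raise ValueError("not a callout item")
    body = item[3:-1]
    if not body or body[0] not in CLOSERS:
        raise ValueError("missing or unsupported opening delimiter")
    closer = CLOSERS[body[0]]
    text, i = [], 1
    while i < len(body):
        if body[i] == closer:
            if body[i + 1:i + 2] == closer:  # doubled delimiter: keep one
                text.append(closer)
                i += 2
                continue
            if i == len(body) - 1:           # final delimiter ends the string
                return "".join(text)
            raise ValueError("text after closing delimiter")
        text.append(body[i])
        i += 1
    raise ValueError("unterminated string")
```

<P>
For the first callout in the example above, the function returns the string
ab 'c' d, with each doubled quote reduced to a single one.
</P>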
3310*22dc650dSSadaf Ebrahimi<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
3311*22dc650dSSadaf Ebrahimi<P>
3312*22dc650dSSadaf EbrahimiThere are a number of special "Backtracking Control Verbs" (to use Perl's
3313*22dc650dSSadaf Ebrahimiterminology) that modify the behaviour of backtracking during matching. They
3314*22dc650dSSadaf Ebrahimiare generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
3315*22dc650dSSadaf Ebrahimiand may behave differently depending on whether or not a name argument is
3316*22dc650dSSadaf Ebrahimipresent. The names are not required to be unique within the pattern.
3317*22dc650dSSadaf Ebrahimi</P>
3318*22dc650dSSadaf Ebrahimi<P>
3319*22dc650dSSadaf EbrahimiBy default, for compatibility with Perl, a name is any sequence of characters
3320*22dc650dSSadaf Ebrahimithat does not include a closing parenthesis. The name is not processed in
3321*22dc650dSSadaf Ebrahimiany way, and it is not possible to include a closing parenthesis in the name.
3322*22dc650dSSadaf EbrahimiThis can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
3323*22dc650dSSadaf Ebrahimiis no longer Perl-compatible.
3324*22dc650dSSadaf Ebrahimi</P>
3325*22dc650dSSadaf Ebrahimi<P>
3326*22dc650dSSadaf EbrahimiWhen PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
3327*22dc650dSSadaf Ebrahimiand only an unescaped closing parenthesis terminates the name. However, the
3328*22dc650dSSadaf Ebrahimionly backslash items that are permitted are \Q, \E, and sequences such as
3329*22dc650dSSadaf Ebrahimi\x{100} that define character code points. Character type escapes such as \d
3330*22dc650dSSadaf Ebrahimiare faulted.
3331*22dc650dSSadaf Ebrahimi</P>
3332*22dc650dSSadaf Ebrahimi<P>
3333*22dc650dSSadaf EbrahimiA closing parenthesis can be included in a name either as \) or between \Q
3334*22dc650dSSadaf Ebrahimiand \E. In addition to backslash processing, if the PCRE2_EXTENDED or
3335*22dc650dSSadaf EbrahimiPCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
3336*22dc650dSSadaf Ebrahimiskipped, and #-comments are recognized, exactly as in the rest of the pattern.
3337*22dc650dSSadaf EbrahimiPCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
3338*22dc650dSSadaf EbrahimiPCRE2_ALT_VERBNAMES is also set.
3339*22dc650dSSadaf Ebrahimi</P>
3340*22dc650dSSadaf Ebrahimi<P>
3341*22dc650dSSadaf EbrahimiThe maximum length of a name is 255 in the 8-bit library and 65535 in the
3342*22dc650dSSadaf Ebrahimi16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3343*22dc650dSSadaf Ebrahimiparenthesis immediately follows the colon, the effect is as if the colon were
3344*22dc650dSSadaf Ebrahiminot there. Any number of these verbs may occur in a pattern. Except for
3345*22dc650dSSadaf Ebrahimi(*ACCEPT), they may not be quantified.
3346*22dc650dSSadaf Ebrahimi</P>
3347*22dc650dSSadaf Ebrahimi<P>
3348*22dc650dSSadaf EbrahimiSince these verbs are specifically related to backtracking, most of them can be
3349*22dc650dSSadaf Ebrahimiused only when the pattern is to be matched using the traditional matching
3350*22dc650dSSadaf Ebrahimifunction, because that uses a backtracking algorithm. With the exception of
3351*22dc650dSSadaf Ebrahimi(*FAIL), which behaves like a failing negative assertion, the backtracking
3352*22dc650dSSadaf Ebrahimicontrol verbs cause an error if encountered by the DFA matching function.
3353*22dc650dSSadaf Ebrahimi</P>
3354*22dc650dSSadaf Ebrahimi<P>
3355*22dc650dSSadaf EbrahimiThe behaviour of these verbs in
3356*22dc650dSSadaf Ebrahimi<a href="#btrepeat">repeated groups,</a>
3357*22dc650dSSadaf Ebrahimi<a href="#btassert">assertions,</a>
3358*22dc650dSSadaf Ebrahimiand in
3359*22dc650dSSadaf Ebrahimi<a href="#btsub">capture groups called as subroutines</a>
3360*22dc650dSSadaf Ebrahimi(whether or not recursively) is documented below.
3361*22dc650dSSadaf Ebrahimi<a name="nooptimize"></a></P>
3362*22dc650dSSadaf Ebrahimi<br><b>
3363*22dc650dSSadaf EbrahimiOptimizations that affect backtracking verbs
3364*22dc650dSSadaf Ebrahimi</b><br>
3365*22dc650dSSadaf Ebrahimi<P>
3366*22dc650dSSadaf EbrahimiPCRE2 contains some optimizations that are used to speed up matching by running
3367*22dc650dSSadaf Ebrahimisome checks at the start of each match attempt. For example, it may know the
3368*22dc650dSSadaf Ebrahimiminimum length of matching subject, or that a particular character must be
3369*22dc650dSSadaf Ebrahimipresent. When one of these optimizations bypasses the running of a match, any
3370*22dc650dSSadaf Ebrahimiincluded backtracking verbs will not, of course, be processed. You can suppress
3371*22dc650dSSadaf Ebrahimithe start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3372*22dc650dSSadaf Ebrahimiwhen calling <b>pcre2_compile()</b>, or by starting the pattern with
3373*22dc650dSSadaf Ebrahimi(*NO_START_OPT). There is more discussion of this option in the section
3374*22dc650dSSadaf Ebrahimientitled
3375*22dc650dSSadaf Ebrahimi<a href="pcre2api.html#compiling">"Compiling a pattern"</a>
3376*22dc650dSSadaf Ebrahimiin the
3377*22dc650dSSadaf Ebrahimi<a href="pcre2api.html"><b>pcre2api</b></a>
3378*22dc650dSSadaf Ebrahimidocumentation.
3379*22dc650dSSadaf Ebrahimi</P>
3380*22dc650dSSadaf Ebrahimi<P>
Experiments with Perl suggest that it too has similar optimizations, and that,
as in PCRE2, turning them off can change the result of a match.
3383*22dc650dSSadaf Ebrahimi<a name="acceptverb"></a></P>
3384*22dc650dSSadaf Ebrahimi<br><b>
3385*22dc650dSSadaf EbrahimiVerbs that act immediately
3386*22dc650dSSadaf Ebrahimi</b><br>
3387*22dc650dSSadaf Ebrahimi<P>
3388*22dc650dSSadaf EbrahimiThe following verbs act as soon as they are encountered.
3389*22dc650dSSadaf Ebrahimi<pre>
3390*22dc650dSSadaf Ebrahimi   (*ACCEPT) or (*ACCEPT:NAME)
3391*22dc650dSSadaf Ebrahimi</pre>
3392*22dc650dSSadaf EbrahimiThis verb causes the match to end successfully, skipping the remainder of the
3393*22dc650dSSadaf Ebrahimipattern. However, when it is inside a capture group that is called as a
3394*22dc650dSSadaf Ebrahimisubroutine, only that group is ended successfully. Matching then continues
at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
3396*22dc650dSSadaf Ebrahimiassertion succeeds; in a negative assertion, the assertion fails.
3397*22dc650dSSadaf Ebrahimi</P>
3398*22dc650dSSadaf Ebrahimi<P>
3399*22dc650dSSadaf EbrahimiIf (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
3400*22dc650dSSadaf Ebrahimiexample:
3401*22dc650dSSadaf Ebrahimi<pre>
3402*22dc650dSSadaf Ebrahimi  A((?:A|B(*ACCEPT)|C)D)
3403*22dc650dSSadaf Ebrahimi</pre>
3404*22dc650dSSadaf EbrahimiThis matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
3405*22dc650dSSadaf Ebrahimithe outer parentheses.
3406*22dc650dSSadaf Ebrahimi</P>
3407*22dc650dSSadaf Ebrahimi<P>
3408*22dc650dSSadaf Ebrahimi(*ACCEPT) is the only backtracking verb that is allowed to be quantified
3409*22dc650dSSadaf Ebrahimibecause an ungreedy quantification with a minimum of zero acts only when a
3410*22dc650dSSadaf Ebrahimibacktrack happens. Consider, for example,
3411*22dc650dSSadaf Ebrahimi<pre>
3412*22dc650dSSadaf Ebrahimi  (A(*ACCEPT)??B)C
3413*22dc650dSSadaf Ebrahimi</pre>
3414*22dc650dSSadaf Ebrahimiwhere A, B, and C may be complex expressions. After matching "A", the matcher
3415*22dc650dSSadaf Ebrahimiprocesses "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
3416*22dc650dSSadaf Ebrahimithe match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
3417*22dc650dSSadaf Ebrahimi(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
3418*22dc650dSSadaf Ebrahimi"succeed on backtrack".
3419*22dc650dSSadaf Ebrahimi</P>
3420*22dc650dSSadaf Ebrahimi<P>
3421*22dc650dSSadaf Ebrahimi<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
3422*22dc650dSSadaf Ebrahimiit causes an immediate exit from the group, bypassing the script run checking.
3423*22dc650dSSadaf Ebrahimi<pre>
3424*22dc650dSSadaf Ebrahimi  (*FAIL) or (*FAIL:NAME)
3425*22dc650dSSadaf Ebrahimi</pre>
3426*22dc650dSSadaf EbrahimiThis verb causes a matching failure, forcing backtracking to occur. It may be
3427*22dc650dSSadaf Ebrahimiabbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3428*22dc650dSSadaf Ebrahimidocumentation notes that it is probably useful only when combined with (?{}) or
3429*22dc650dSSadaf Ebrahimi(??{}). Those are, of course, Perl features that are not present in PCRE2. The
3430*22dc650dSSadaf Ebrahiminearest equivalent is the callout feature, as for example in this pattern:
3431*22dc650dSSadaf Ebrahimi<pre>
3432*22dc650dSSadaf Ebrahimi  a+(?C)(*FAIL)
3433*22dc650dSSadaf Ebrahimi</pre>
3434*22dc650dSSadaf EbrahimiA match with the string "aaaa" always fails, but the callout is taken before
3435*22dc650dSSadaf Ebrahimieach backtrack happens (in this example, 10 times).
3436*22dc650dSSadaf Ebrahimi</P>
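<P>
The figure of 10 can be reproduced by counting. The Python sketch below is a
model, not PCRE2 code, and the name <b>count_callouts</b> is hypothetical. It
assumes that no start-of-match optimization skips any attempt, that the
unanchored match is retried at every starting offset, and that a+ first takes
the longest run and then gives back one character per backtrack.
</P>

```python
def count_callouts(subject):
    """Count the callouts taken by a+(?C)(*FAIL) under the stated assumptions."""
    callouts = 0
    for start in range(len(subject) + 1):  # each starting offset in turn
        run = 0
        while subject[start + run:start + run + 1] == "a":
            run += 1
        # a+ tries lengths run, run-1, ..., 1; the callout is reached once
        # for each length before (*FAIL) forces the next backtrack.
        callouts += run
    return callouts
```

<P>
For "aaaa" the starting offsets contribute 4, 3, 2 and 1 callouts, giving 10 in
total.
</P>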
3437*22dc650dSSadaf Ebrahimi<P>
3438*22dc650dSSadaf Ebrahimi(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
3439*22dc650dSSadaf Ebrahimi(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
3440*22dc650dSSadaf Ebrahimithe verb acts.
3441*22dc650dSSadaf Ebrahimi</P>
3442*22dc650dSSadaf Ebrahimi<br><b>
3443*22dc650dSSadaf EbrahimiRecording which path was taken
3444*22dc650dSSadaf Ebrahimi</b><br>
3445*22dc650dSSadaf Ebrahimi<P>
3446*22dc650dSSadaf EbrahimiThere is one verb whose main purpose is to track how a match was arrived at,
3447*22dc650dSSadaf Ebrahimithough it also has a secondary use in conjunction with advancing the match
3448*22dc650dSSadaf Ebrahimistarting point (see (*SKIP) below).
3449*22dc650dSSadaf Ebrahimi<pre>
3450*22dc650dSSadaf Ebrahimi  (*MARK:NAME) or (*:NAME)
3451*22dc650dSSadaf Ebrahimi</pre>
3452*22dc650dSSadaf EbrahimiA name is always required with this verb. For all the other backtracking
3453*22dc650dSSadaf Ebrahimicontrol verbs, a NAME argument is optional.
3454*22dc650dSSadaf Ebrahimi</P>
3455*22dc650dSSadaf Ebrahimi<P>
When a match succeeds, the last-encountered mark name on the
3457*22dc650dSSadaf Ebrahimimatching path is passed back to the caller as described in the section entitled
3458*22dc650dSSadaf Ebrahimi<a href="pcre2api.html#matchotherdata">"Other information about the match"</a>
3459*22dc650dSSadaf Ebrahimiin the
3460*22dc650dSSadaf Ebrahimi<a href="pcre2api.html"><b>pcre2api</b></a>
3461*22dc650dSSadaf Ebrahimidocumentation. This applies to all instances of (*MARK) and other verbs,
3462*22dc650dSSadaf Ebrahimiincluding those inside assertions and atomic groups. However, there are
3463*22dc650dSSadaf Ebrahimidifferences in those cases when (*MARK) is used in conjunction with (*SKIP) as
3464*22dc650dSSadaf Ebrahimidescribed below.
3465*22dc650dSSadaf Ebrahimi</P>
3466*22dc650dSSadaf Ebrahimi<P>
3467*22dc650dSSadaf EbrahimiThe mark name that was last encountered on the matching path is passed back. A
3468*22dc650dSSadaf Ebrahimiverb without a NAME argument is ignored for this purpose. Here is an example of
3469*22dc650dSSadaf Ebrahimi<b>pcre2test</b> output, where the "mark" modifier requests the retrieval and
3470*22dc650dSSadaf Ebrahimioutputting of (*MARK) data:
3471*22dc650dSSadaf Ebrahimi<pre>
3472*22dc650dSSadaf Ebrahimi    re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/mark
3473*22dc650dSSadaf Ebrahimi  data&#62; XY
3474*22dc650dSSadaf Ebrahimi   0: XY
3475*22dc650dSSadaf Ebrahimi  MK: A
  data&#62; XZ
3477*22dc650dSSadaf Ebrahimi   0: XZ
3478*22dc650dSSadaf Ebrahimi  MK: B
3479*22dc650dSSadaf Ebrahimi</pre>
3480*22dc650dSSadaf EbrahimiThe (*MARK) name is tagged with "MK:" in this output, and in this example it
3481*22dc650dSSadaf Ebrahimiindicates which of the two alternatives matched. This is a more efficient way
3482*22dc650dSSadaf Ebrahimiof obtaining this information than putting each alternative in its own
3483*22dc650dSSadaf Ebrahimicapturing parentheses.
3484*22dc650dSSadaf Ebrahimi</P>
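<P>
For comparison, the capture-group approach mentioned above can be sketched with
Python's <b>re</b> module, which has no (*MARK) verb. Each alternative is
placed in its own named group, (?P&#60;A&#62;...) or (?P&#60;B&#62;...), and
the group that took part in the match is read back afterwards. The names
PATTERN and which_alternative are hypothetical, and nothing here measures the
efficiency difference.
</P>

```python
import re

# Each alternative sits in its own named capture group.
PATTERN = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    match = PATTERN.fullmatch(subject)
    # lastgroup is the name of the last capture group that participated.
    return match.lastgroup if match else None
```

<P>
which_alternative("XY") returns "A" and which_alternative("XZ") returns "B",
the same information that the MK: lines report in the output above.
</P>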
3485*22dc650dSSadaf Ebrahimi<P>
3486*22dc650dSSadaf EbrahimiIf a verb with a name is encountered in a positive assertion that is true, the
3487*22dc650dSSadaf Ebrahiminame is recorded and passed back if it is the last-encountered. This does not
3488*22dc650dSSadaf Ebrahimihappen for negative assertions or failing positive assertions.
3489*22dc650dSSadaf Ebrahimi</P>
3490*22dc650dSSadaf Ebrahimi<P>
3491*22dc650dSSadaf EbrahimiAfter a partial match or a failed match, the last encountered name in the
3492*22dc650dSSadaf Ebrahimientire match process is returned. For example:
3493*22dc650dSSadaf Ebrahimi<pre>
3494*22dc650dSSadaf Ebrahimi    re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/mark
3495*22dc650dSSadaf Ebrahimi  data&#62; XP
3496*22dc650dSSadaf Ebrahimi  No match, mark = B
3497*22dc650dSSadaf Ebrahimi</pre>
3498*22dc650dSSadaf EbrahimiNote that in this unanchored example the mark is retained from the match
3499*22dc650dSSadaf Ebrahimiattempt that started at the letter "X" in the subject. Subsequent match
3500*22dc650dSSadaf Ebrahimiattempts starting at "P" and then with an empty string do not get as far as the
3501*22dc650dSSadaf Ebrahimi(*MARK) item, but nevertheless do not reset it.
3502*22dc650dSSadaf Ebrahimi</P>
3503*22dc650dSSadaf Ebrahimi<P>
3504*22dc650dSSadaf EbrahimiIf you are interested in (*MARK) values after failed matches, you should
3505*22dc650dSSadaf Ebrahimiprobably set the PCRE2_NO_START_OPTIMIZE option
3506*22dc650dSSadaf Ebrahimi<a href="#nooptimize">(see above)</a>
3507*22dc650dSSadaf Ebrahimito ensure that the match is always attempted.
3508*22dc650dSSadaf Ebrahimi</P>
3509*22dc650dSSadaf Ebrahimi<br><b>
3510*22dc650dSSadaf EbrahimiVerbs that act after backtracking
3511*22dc650dSSadaf Ebrahimi</b><br>
3512*22dc650dSSadaf Ebrahimi<P>
3513*22dc650dSSadaf EbrahimiThe following verbs do nothing when they are encountered. Matching continues
3514*22dc650dSSadaf Ebrahimiwith what follows, but if there is a subsequent match failure, causing a
3515*22dc650dSSadaf Ebrahimibacktrack to the verb, a failure is forced. That is, backtracking cannot pass
3516*22dc650dSSadaf Ebrahimito the left of the verb. However, when one of these verbs appears inside an
3517*22dc650dSSadaf Ebrahimiatomic group or in a lookaround assertion that is true, its effect is confined
3518*22dc650dSSadaf Ebrahimito that group, because once the group has been matched, there is never any
3519*22dc650dSSadaf Ebrahimibacktracking into it. Backtracking from beyond an assertion or an atomic group
3520*22dc650dSSadaf Ebrahimiignores the entire group, and seeks a preceding backtracking point.
3521*22dc650dSSadaf Ebrahimi</P>
3522*22dc650dSSadaf Ebrahimi<P>
3523*22dc650dSSadaf EbrahimiThese verbs differ in exactly what kind of failure occurs when backtracking
3524*22dc650dSSadaf Ebrahimireaches them. The behaviour described below is what happens when the verb is
3525*22dc650dSSadaf Ebrahiminot in a subroutine or an assertion. Subsequent sections cover these special
3526*22dc650dSSadaf Ebrahimicases.
3527*22dc650dSSadaf Ebrahimi<pre>
3528*22dc650dSSadaf Ebrahimi  (*COMMIT) or (*COMMIT:NAME)
3529*22dc650dSSadaf Ebrahimi</pre>
3530*22dc650dSSadaf EbrahimiThis verb causes the whole match to fail outright if there is a later matching
3531*22dc650dSSadaf Ebrahimifailure that causes backtracking to reach it. Even if the pattern is
3532*22dc650dSSadaf Ebrahimiunanchored, no further attempts to find a match by advancing the starting point
3533*22dc650dSSadaf Ebrahimitake place. If (*COMMIT) is the only backtracking verb that is encountered,
once it has been passed, <b>pcre2_match()</b> is committed to finding a match at
3535*22dc650dSSadaf Ebrahimithe current starting point, or not at all. For example:
3536*22dc650dSSadaf Ebrahimi<pre>
3537*22dc650dSSadaf Ebrahimi  a+(*COMMIT)b
3538*22dc650dSSadaf Ebrahimi</pre>
3539*22dc650dSSadaf EbrahimiThis matches "xxaab" but not "aacaab". It can be thought of as a kind of
3540*22dc650dSSadaf Ebrahimidynamic anchor, or "I've started, so I must finish."
3541*22dc650dSSadaf Ebrahimi</P>
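<P>
The behaviour of this example can be sketched with a small hand-written model
in Python. It does not call PCRE2; the function name <b>search_commit</b> is
hypothetical, and the code hard-wires the single pattern a+(*COMMIT)b purely to
illustrate how passing (*COMMIT) stops the advance to later starting points.
</P>

```python
def search_commit(subject):
    """Toy model of an unanchored search for a+(*COMMIT)b.

    Returns (start, end) of the match, or None. Illustration only:
    this is not PCRE2 and handles no other pattern.
    """
    for start in range(len(subject)):
        pos = start
        # a+ : consume as many "a" characters as possible.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos == start:
            # a+ failed, so (*COMMIT) was never reached; the normal
            # advance to the next starting point applies.
            continue
        # (*COMMIT) has now been passed.
        if pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1)
        # "b" failed and backtracking reaches (*COMMIT): the whole
        # match fails, with no further starting points tried.
        return None
    return None


print(search_commit("xxaab"))   # (2, 5)
print(search_commit("aacaab"))  # None
```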
<P>
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names that are set with
(*MARK), ignoring those set by any of the other backtracking verbs.
</P>
<P>
If there is more than one backtracking verb in a pattern, a different one that
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
match does not always guarantee that a match must be at this starting point.
</P>
<P>
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE2's start-of-match optimizations are turned off, as shown in this
output from <b>pcre2test</b>:
<pre>
    re&#62; /(*COMMIT)abc/
  data&#62; xyzabc
   0: abc
  data&#62;
    re&#62; /(*COMMIT)abc/no_start_optimize
  data&#62; xyzabc
  No match
</pre>
For the first pattern, PCRE2 knows that any match must start with "a", so the
optimization skips along the subject to "a" before applying the pattern to the
first set of data. The match attempt then succeeds. The second pattern disables
the optimization that skips along to the first character. The pattern is now
applied starting at "x", and so the (*COMMIT) causes the match to fail without
trying any other starting points.
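The same effect can be sketched as a toy model in Python. This is an
illustration only, not PCRE2 code: the function name search_commit_abc and its
start_optimize parameter are hypothetical, and only the skip to the first "a"
is modelled, not any other optimization that PCRE2 may apply.

```python
def search_commit_abc(subject, start_optimize=True):
    """Toy model of (*COMMIT)abc, with or without a first-character skip.

    Returns (start, end) of the match, or None. Illustration only.
    """
    start = 0
    if start_optimize:
        # Assumed model of the optimization: jump straight to the
        # first "a", because any match must begin with that character.
        start = subject.find("a")
        if start == -1:
            return None
    # (*COMMIT) is passed as soon as matching begins at `start`, so
    # this is the only starting point that is ever tried.
    if subject.startswith("abc", start):
        return (start, start + 3)
    return None


print(search_commit_abc("xyzabc", start_optimize=True))   # (3, 6)
print(search_commit_abc("xyzabc", start_optimize=False))  # None
```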
<pre>
  (*PRUNE) or (*PRUNE:NAME)
</pre>
This verb causes the match to fail at the current starting position in the
subject if there is a later matching failure that causes backtracking to reach
it. If the pattern is unanchored, the normal "bumpalong" advance to the next
starting character then happens. Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
if there is no match to the right, backtracking cannot cross (*PRUNE). In
simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
<pre>
  (*SKIP)
</pre>
This verb, when given without a name, is like (*PRUNE), except that if the
pattern is unanchored, the "bumpalong" advance is not to the next character,
but to the position in the subject where (*SKIP) was encountered. (*SKIP)
signifies that whatever text was matched leading up to it cannot be part of a
successful match if there is a later mismatch. Consider:
<pre>
  a+(*SKIP)b
</pre>
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
</P>
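<P>
The difference in starting points can be sketched with a toy model in Python.
It does not call PCRE2; the function name <b>starts_tried</b> is hypothetical,
and the code hard-wires the patterns a+(*SKIP)b and a+(*PRUNE)b. The "PRUNE"
case advances by one character, which is also the advance that the text above
describes for the possessive form.
</P>

```python
def starts_tried(subject, verb):
    """Toy model of an unanchored search for a+(*VERB)b.

    `verb` is "SKIP" or "PRUNE". Returns (match, tried), where match
    is (start, end) or None and tried lists every starting position
    that was attempted. Illustration only, not PCRE2.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        pos = start
        # a+ : consume as many "a" characters as possible.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos > start:
            # The verb has been passed at subject position `pos`.
            if pos < len(subject) and subject[pos] == "b":
                return (start, pos + 1), tried
            if verb == "SKIP":
                # Restart where (*SKIP) was encountered.
                start = pos
                continue
        # (*PRUNE), or the verb was never reached: advance by one.
        start += 1
    return None, tried


print(starts_tried("aaaacab", "SKIP"))   # ((5, 7), [0, 4, 5])
print(starts_tried("aaaacab", "PRUNE"))  # ((5, 7), [0, 1, 2, 3, 4, 5])
```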
<P>
If (*SKIP) is used to specify a new starting position that is the same as the
starting position of the current match, or (by being inside a lookbehind)
earlier, the position specified by (*SKIP) is ignored, and instead the normal
"bumpalong" occurs.
<pre>
  (*SKIP:NAME)
</pre>
When (*SKIP) has an associated name, its behaviour is modified. When such a
(*SKIP) is triggered, the previous path through the pattern is searched for the
most recent (*MARK) that has the same name. If one is found, the "bumpalong"
advance is to the subject position that corresponds to that (*MARK) instead of
to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
the (*SKIP) is ignored.
</P>
<P>
The search for a (*MARK) name uses the normal backtracking mechanism, which
means that it does not see (*MARK) settings that are inside atomic groups or
assertions, because they are never re-entered by backtracking. Compare the
following <b>pcre2test</b> examples:
<pre>
    re&#62; /a(?&#62;(*MARK:X))(*SKIP:X)(*F)|(.)/
  data&#62; abc
   0: a
   1: a
  data&#62;
    re&#62; /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
  data&#62; abc
   0: b
   1: b
</pre>
In the first example, the (*MARK) setting is in an atomic group, so it is not
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
the second branch of the pattern to be tried at the first character position.
In the second example, the (*MARK) setting is not in an atomic group. This
allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
matching attempt to start at the second character. This time, the (*MARK) is
never seen because "a" does not match "b", so the matcher immediately jumps to
the second branch of the pattern.
</P>
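<P>
The rules for choosing the position after a (*SKIP) triggers can be summarised
in a short Python sketch. It is a model, not PCRE2 code: the function name
<b>resolve_skip</b>, its parameters, and the "restart"/"ignored" result labels
are all hypothetical. The list of marks stands for the (*MARK) settings that
are visible on the current backtracking path. Applying the "not later than the
current start" rule to the named form is an assumption of this sketch.
</P>

```python
def resolve_skip(start, skip_pos, marks, name=None):
    """Decide what happens when (*SKIP) or (*SKIP:NAME) is triggered.

    start    -- starting position of the current match attempt
    skip_pos -- subject position where (*SKIP) was encountered
    marks    -- visible (name, position) pairs, oldest first
    name     -- the NAME of (*SKIP:NAME), or None for plain (*SKIP)

    Returns ("restart", position) or ("ignored", None).
    """
    if name is None:
        target = skip_pos
    else:
        positions = [pos for mark_name, pos in marks if mark_name == name]
        if not positions:
            # No visible (*MARK) with this name: the verb is ignored
            # and ordinary backtracking continues.
            return ("ignored", None)
        target = positions[-1]  # the most recent matching (*MARK)
    if target <= start:
        # Not beyond the current start: normal bumpalong instead.
        return ("restart", start + 1)
    return ("restart", target)


# Second pcre2test example: the mark at position 1 is visible.
print(resolve_skip(0, 1, [("X", 1)], "X"))  # ('restart', 1)
# First pcre2test example: the mark is hidden in an atomic group.
print(resolve_skip(0, 1, [], "X"))          # ('ignored', None)
```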
<P>
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
names that are set by other backtracking verbs.
<pre>
  (*THEN) or (*THEN:NAME)
</pre>
This verb causes a skip to the next alternative in the innermost enclosing
group that has alternatives when backtracking reaches it. That is, it cancels
any further backtracking within the current alternative. Its name comes from
the observation that it can be used for a pattern-based if-then-else block:
<pre>
  ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
</pre>
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. If that
succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
more alternatives, so there is a backtrack to whatever came before the entire
group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
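<P>
The if-then-else reading can be sketched in Python using the standard
<b>re</b> module, which has no (*THEN) of its own. The function name
<b>if_then_else</b> and the sample branches are hypothetical. The model tries
each condition once and never re-enters it after its body fails; it ignores
any items that follow the group.
</P>

```python
import re


def if_then_else(subject, branches, pos=0):
    """Toy model of ( COND1 (*THEN) BODY1 | COND2 (*THEN) BODY2 | ... ).

    `branches` is a list of (cond, body) regex strings. Returns
    (branch_index, end_position) for the first branch whose condition
    and body both match at `pos`, or None. Illustration only.
    """
    for index, (cond, body) in enumerate(branches):
        cond_match = re.compile(cond).match(subject, pos)
        if cond_match is None:
            continue  # the condition failed: next alternative as usual
        body_match = re.compile(body).match(subject, cond_match.end())
        if body_match is not None:
            return index, body_match.end()
        # The body failed after (*THEN) was passed: move on to the next
        # alternative without trying a shorter match for the condition.
    return None


branches = [(r"a+", r"ab"), (r"a", r"c")]
print(if_then_else("aaab", branches))                # None
print(re.match(r"(?:a+ab|ac)", "aaab") is not None)  # True
print(if_then_else("ac", branches))                  # (1, 2)
```

<P>
In the first call the greedy a+ consumes "aaa", the body "ab" then fails, and
the model moves on instead of letting a+ give back a character. The plain
alternation without (*THEN) matches because that backtrack is allowed.
</P>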
<P>
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
</P>
<P>
A group that does not contain a | character is just a part of the enclosing
alternative; it is not a nested alternation with only one alternative. The
effect of (*THEN) extends beyond such a group to the enclosing alternative.
Consider this pattern, where A, B, etc. are complex pattern fragments that do
not contain any | characters at this level:
<pre>
  A (B(*THEN)C) | D
</pre>
If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the group containing (*THEN) is given an alternative, it
behaves differently:
<pre>
  A (B(*THEN)C | (*FAIL)) | D
</pre>
The effect of (*THEN) is now confined to the inner group. After a failure in C,
matching moves to (*FAIL), which causes the whole group to fail because there
are no more alternatives to try. In this case, matching does backtrack into A.
</P>
<P>
Note that a conditional group is not considered as having two alternatives,
because only one is ever used. In other words, the | character in a conditional
group has a different meaning. Ignoring white space, consider:
<pre>
  ^.*? (?(?=a) a | b(*THEN)c )
</pre>
If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
it initially matches zero characters. The condition (?=a) then fails, the
character "b" is matched, but "c" is not. At this point, matching does not
backtrack to .*? as might perhaps be expected from the presence of the |
character. The conditional group is part of the single alternative that
comprises the whole pattern, and so the match fails. (If there were a backtrack
into .*?, allowing it to match "b", the match would succeed.)
</P>
<P>
The verbs just described provide four different "strengths" of control when
subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
next alternative. (*PRUNE) comes next, failing the match at the current
starting position, but allowing an advance to the next character (for an
unanchored pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the entire match to
fail.
</P>
<br><b>
More than one backtracking verb
</b><br>
<P>
If more than one backtracking verb is present in a pattern, the one that is
backtracked onto first acts. For example, consider this pattern, where A, B,
etc. are complex pattern fragments:
<pre>
  (A(*COMMIT)B(*THEN)C|ABD)
</pre>
If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
the next alternative (ABD) to be tried. This behaviour is consistent, but is
not always the same as Perl's. It means that if two or more backtracking verbs
appear in succession, all but the last of them have no effect. Consider this
example:
<pre>
  ...(*COMMIT)(*PRUNE)...
</pre>
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
it to be triggered, and its action is taken. There can never be a backtrack
onto (*COMMIT).
<a name="btrepeat"></a></P>
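The rule that the verb backtracked onto first is the one that acts can be
sketched in Python. This is a model only: the function name verb_triggered is
hypothetical, and each path is simply the left-to-right list of items in one
alternative, with A, B, C and X standing for arbitrary pattern fragments.

```python
def verb_triggered(path, failing_item):
    """Return the verb that acts when `failing_item` fails to match.

    `path` lists the items of one alternative in the order they are
    matched. Backtracking unwinds from the failing item to the left,
    so the nearest verb on that side acts. Returns None if no verb
    had been passed. Illustration only, not PCRE2.
    """
    index = path.index(failing_item)
    for item in reversed(path[:index]):
        if item.startswith("(*"):
            return item
    return None


first = ["A", "(*COMMIT)", "B", "(*THEN)", "C"]
print(verb_triggered(first, "B"))  # (*COMMIT)
print(verb_triggered(first, "C"))  # (*THEN)

# Verbs in succession: only the last one can ever be reached.
print(verb_triggered(["(*COMMIT)", "(*PRUNE)", "X"], "X"))  # (*PRUNE)
```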
<br><b>
Backtracking verbs in repeated groups
</b><br>
<P>
PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
repeated groups. For example, consider:
<pre>
  /(a(*COMMIT)b)+ac/
</pre>
If the subject is "abac", Perl matches unless its optimizations are disabled,
but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
acts.
<a name="btassert"></a></P>
<br><b>
Backtracking verbs in assertions
</b><br>
<P>
(*FAIL) in any assertion has its normal effect: it forces an immediate
backtrack. The behaviour of the other backtracking verbs depends on whether
the assertion is standalone or is acting as the condition in a conditional
group.
</P>
<P>
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a mark name (if set) are
retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
fail without any further processing; captured substrings and any mark name are
discarded.
</P>
<P>
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
</P>
<P>
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that, for the Perl-compatible assertions, their effect
is confined to the assertion, because Perl lookaround assertions are atomic. A
backtrack that occurs after such an assertion is complete does not jump back
into the assertion. Note in particular that a (*MARK) name that is set in an
assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
</P>
<P>
PCRE2 now supports non-atomic positive assertions, as described in the section
entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
above. These assertions must be standalone (not used as conditions). They are
not Perl-compatible. For these assertions, a later backtrack does jump back
into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
backtracks from later in the pattern.
</P>
<P>
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true.
</P>
<P>
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
causes the condition to be false. However, for both standalone and conditional
negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
the assertion to be true, without considering any further alternative branches.
<a name="btsub"></a></P>
<br><b>
Backtracking verbs in subroutines
</b><br>
<P>
These behaviours occur whether or not the group is called recursively.
</P>
<P>
(*ACCEPT) in a group called as a subroutine causes the subroutine match to
succeed without any further processing. Matching then continues after the
subroutine call. Perl documents this behaviour. Perl's treatment of the other
verbs in subroutines is different in some cases.
</P>
<P>
(*FAIL) in a group called as a subroutine has its normal effect: it forces
an immediate backtrack.
</P>
<P>
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
triggered by being backtracked to in a group called as a subroutine. There is
then a backtrack at the outer level.
</P>
<P>
(*THEN), when triggered, skips to the next alternative in the innermost
enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
</P>
<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 June 2024
<br>
Copyright &copy; 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>