<html>
<head>
<title>pcre2pattern specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2pattern man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION DETAILS</a>
<li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
<li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
<li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">BACKSLASH</a>
<li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
<li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a>
<li><a name="TOC8" href="#SEC8">MATCHING A SINGLE CODE UNIT</a>
<li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
<li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
<li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
<li><a name="TOC12" href="#SEC12">VERTICAL BAR</a>
<li><a name="TOC13" href="#SEC13">INTERNAL OPTION
SETTING</a>
<li><a name="TOC14" href="#SEC14">GROUPS</a>
<li><a name="TOC15" href="#SEC15">DUPLICATE GROUP NUMBERS</a>
<li><a name="TOC16" href="#SEC16">NAMED CAPTURE GROUPS</a>
<li><a name="TOC17" href="#SEC17">REPETITION</a>
<li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
<li><a name="TOC19" href="#SEC19">BACKREFERENCES</a>
<li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
<li><a name="TOC21" href="#SEC21">NON-ATOMIC ASSERTIONS</a>
<li><a name="TOC22" href="#SEC22">SCRIPT RUNS</a>
<li><a name="TOC23" href="#SEC23">CONDITIONAL GROUPS</a>
<li><a name="TOC24" href="#SEC24">COMMENTS</a>
<li><a name="TOC25" href="#SEC25">RECURSIVE PATTERNS</a>
<li><a name="TOC26" href="#SEC26">GROUPS AS SUBROUTINES</a>
<li><a name="TOC27" href="#SEC27">ONIGURUMA SUBROUTINE SYNTAX</a>
<li><a name="TOC28" href="#SEC28">CALLOUTS</a>
<li><a name="TOC29" href="#SEC29">BACKTRACKING CONTROL</a>
<li><a name="TOC30" href="#SEC30">SEE ALSO</a>
<li><a name="TOC31" href="#SEC31">AUTHOR</a>
<li><a name="TOC32" href="#SEC32">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION DETAILS</a><br>
<P>
The syntax and semantics of the regular expressions that are supported by PCRE2
are described in detail below.
There is a quick-reference syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
PCRE2 also supports some alternative regular expression syntax (which does not
conflict with the Perl syntax) in order to provide some compatibility with
regular expressions in Python, .NET, and Oniguruma.
</P>
<P>
Perl's regular expressions are described in its own documentation, and regular
expressions in general are covered in a number of books, some of which have
copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
by O'Reilly, covers regular expressions in great detail. This description of
PCRE2's regular expressions is intended as reference material.
</P>
<P>
This document discusses the regular expression patterns that are supported by
PCRE2 when its main matching function, <b>pcre2_match()</b>, is used. PCRE2 also
has an alternative matching function, <b>pcre2_dfa_match()</b>, which matches
using a different algorithm that is not Perl-compatible. Some of the features
discussed below are not available when DFA matching is used. The advantages and
disadvantages of the alternative function, and how it differs from the normal
function, are discussed in the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
<P>
A number of options that can be passed to <b>pcre2_compile()</b> can also be set
by special items at the start of a pattern. These are not Perl-compatible, but
are provided to make these options accessible to pattern writers who are not
able to change the program that processes the pattern. Any number of these
items may appear, but they must all be together right at the start of the
pattern string, and the letters must be in upper case.
</P>
<br><b>
UTF support
</b><br>
<P>
In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
built to include Unicode support (which is the default). When using UTF strings
you must either call the compiling function with one or both of the PCRE2_UTF
or PCRE2_MATCH_INVALID_UTF options, or the pattern must start with the special
sequence (*UTF), which is equivalent to setting the PCRE2_UTF option. How
setting a UTF mode affects pattern matching is mentioned in several places
below.
There is also a summary of features in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
page.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
option is passed to <b>pcre2_compile()</b>, (*UTF) is not allowed, and its
appearance in a pattern causes an error.
</P>
<br><b>
Unicode property support
</b><br>
<P>
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
instead of recognizing only characters with codes less than 256 via a lookup
table. It also causes upper/lower casing operations to use Unicode properties
for characters with code points greater than 127, even when UTF is not set.
These behaviours can be changed within the pattern; see the section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
Some applications that allow their users to supply patterns may wish to
restrict them for security reasons.
If the PCRE2_NEVER_UCP option is passed to
<b>pcre2_compile()</b>, (*UCP) is not allowed, and its appearance in a pattern
causes an error.
</P>
<br><b>
Locking out empty string matching
</b><br>
<P>
Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
matching function is subsequently called to match the pattern. These options
lock out the matching of empty strings, either entirely, or only at the start
of the subject.
</P>
<br><b>
Disabling auto-possessification
</b><br>
<P>
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
possessive when what follows cannot match the repeated item. For example, by
default a+b is treated as a++b. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling start-up optimizations
</b><br>
<P>
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
PCRE2_NO_START_OPTIMIZE option.
This disables several optimizations for quickly
reaching "no match" results. For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling automatic anchoring
</b><br>
<P>
If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
apply to patterns whose top-level branches all start with .* (match any number
of arbitrary characters). For more details, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><b>
Disabling JIT compilation
</b><br>
<P>
If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
the application to apply the JIT optimization by calling
<b>pcre2_jit_compile()</b> is ignored.
</P>
<br><b>
Setting match resource limits
</b><br>
<P>
The <b>pcre2_match()</b> function contains a counter that is incremented every
time it goes round its main loop. The caller of <b>pcre2_match()</b> can set a
limit on this counter, which therefore limits the amount of computing resource
used for a match.
The maximum depth of nested backtracking can also be limited;
this indirectly restricts the amount of heap memory that is used, but there is
also an explicit memory limit that can be set.
</P>
<P>
These facilities are provided to catch runaway matches that are provoked by
patterns with huge matching trees. A common example is a pattern with nested
unlimited repeats applied to a long string that does not match. When one of
these limits is reached, <b>pcre2_match()</b> gives an error return. The limits
can also be set by items at the start of the pattern of the form
<pre>
  (*LIMIT_HEAP=d)
  (*LIMIT_MATCH=d)
  (*LIMIT_DEPTH=d)
</pre>
where d is any number of decimal digits. However, the value of the setting must
be less than the value set (or defaulted) by the caller of <b>pcre2_match()</b>
for it to have any effect. In other words, the pattern writer can lower the
limits set by the programmer, but not raise them. If there is more than one
setting of one of these limits, the lower value is used. The heap limit is
specified in kibibytes (units of 1024 bytes).
</P>
<P>
Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
still recognized for backwards compatibility.
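</P>
<P>
The rule that a pattern can lower a limit but never raise it can be modelled in
a few lines. The following Python sketch is illustrative only and is not how
PCRE2 is implemented: the function name effective_limit and its regular
expression are hypothetical, and it simply scans items of the form (*NAME) or
(*NAME=d) at the start of the pattern string, taking the smaller of the
caller's limit and each matching setting.

```python
import re

# Hypothetical model for illustration; this is not PCRE2 source code.
# It recognizes any leading item of the form (*NAME) or (*NAME=digits).
_START_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")


def effective_limit(pattern, name, caller_limit):
    """Return the limit in force for `name` after the start-of-pattern items.

    `name` is one of LIMIT_HEAP, LIMIT_MATCH, or LIMIT_DEPTH; the older
    spelling LIMIT_RECURSION is treated as LIMIT_DEPTH.
    """
    if name == "LIMIT_RECURSION":
        name = "LIMIT_DEPTH"
    limit = caller_limit
    pos = 0
    while True:
        item = _START_ITEM.match(pattern, pos)
        if item is None:
            break  # the run of start-of-pattern items has ended
        item_name, value = item.group(1), item.group(2)
        if item_name == "LIMIT_RECURSION":
            item_name = "LIMIT_DEPTH"
        if item_name == name and value is not None:
            # A setting only has an effect when it is lower than the
            # current limit; with several settings the lowest one wins.
            limit = min(limit, int(value))
        pos = item.end()
    return limit
```

Under this model, with a caller limit of 1000, a pattern starting with
(*LIMIT_MATCH=500) runs with a match limit of 500, whereas one starting with
(*LIMIT_MATCH=5000) still runs with 1000.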
</P>
<P>
The heap limit applies only when the <b>pcre2_match()</b> or
<b>pcre2_dfa_match()</b> interpreters are used for matching. It does not apply
to JIT. The match limit is used (but in a different way) when JIT is being
used, or when <b>pcre2_dfa_match()</b> is called, to limit computing resource
usage by those matching functions. The depth limit is ignored by JIT but is
relevant for DFA matching, which uses function recursion for recursions within
the pattern and for lookaround assertions and atomic groups. In this case, the
depth limit controls the depth of such recursion.
<a name="newlines"></a></P>
<br><b>
Newline conventions
</b><br>
<P>
PCRE2 supports six different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (linefeed)
character, the two-character sequence CRLF, any of the three preceding, any
Unicode newline sequence, or the NUL character (binary zero). The
<a href="pcre2api.html"><b>pcre2api</b></a>
page has
<a href="pcre2api.html#newlines">further discussion</a>
about newlines, and shows how to set the newline convention when calling
<b>pcre2_compile()</b>.
</P>
<P>
It is also possible to specify a newline convention by starting a pattern
string with one of the following sequences:
<pre>
  (*CR)        carriage return
  (*LF)        linefeed
  (*CRLF)      carriage return, followed by linefeed
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
  (*NUL)       the NUL character (binary zero)
</pre>
These override the default and the options given to the compiling function. For
example, on a Unix system where LF is the default newline sequence, the pattern
<pre>
  (*CR)a.b
</pre>
changes the convention to CR. That pattern matches "a\nb" because LF is no
longer a newline. If more than one of these settings is present, the last one
is used.
</P>
<P>
The newline convention affects where the circumflex and dollar assertions are
true. It also affects the interpretation of the dot metacharacter when
PCRE2_DOTALL is not set, and the behaviour of \N when not followed by an
opening brace. However, it does not affect what the \R escape sequence
matches. By default, this is any Unicode newline sequence, for Perl
compatibility.
However, this can be changed; see the next section and the
description of \R in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
below. A change of \R setting can be combined with a change of newline
convention.
</P>
<br><b>
Specifying what \R matches
</b><br>
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. This effect can also be achieved by starting a pattern with
(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
corresponding to PCRE2_BSR_UNICODE.
</P>
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
<P>
PCRE2 can be compiled to run in an environment that uses EBCDIC as its
character code instead of ASCII or Unicode (typically a mainframe system). In
the sections below, character code values are ASCII or Unicode; in an EBCDIC
environment these characters may have different code values, and there are no
code points greater than 255.
</P>
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
<P>
A regular expression is a pattern that is matched against a subject string from
left to right.
Most characters stand for themselves in a pattern, and match the
corresponding characters in the subject. As a trivial example, the pattern
<pre>
  The quick brown fox
</pre>
matches a portion of a subject string that is identical to itself. When
caseless matching is specified (the PCRE2_CASELESS option or (?i) within the
pattern), letters are matched independently of case. Note that there are two
ASCII characters, K and S, that, in addition to their lower case ASCII
equivalents, are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F
(long S) respectively when either PCRE2_UTF or PCRE2_UCP is set, unless the
PCRE2_EXTRA_CASELESS_RESTRICT option is in force (either passed to
<b>pcre2_compile()</b> or set by (?r) within the pattern).
</P>
<P>
The power of regular expressions comes from the ability to include wild cards,
character classes, alternatives, and repetitions in the pattern. These are
encoded in the pattern by the use of <i>metacharacters</i>, which do not stand
for themselves but instead are interpreted in some special way.
</P>
<P>
There are two different sets of metacharacters: those that are recognized
anywhere in the pattern except within square brackets, and those that are
recognized within square brackets.
Outside square brackets, the metacharacters
are as follows:
<pre>
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start group or control verb
  )      end group or control verb
  *      0 or more quantifier
  +      1 or more quantifier; also "possessive quantifier"
  ?      0 or 1 quantifier; also quantifier minimizer
  {      potential start of min/max quantifier
</pre>
Brace characters { and } are also used to enclose data for constructions such
as \g{2} or \k{name}. In almost all uses of braces, space and/or horizontal
tab characters that follow { or precede } are allowed and are ignored. In the
case of quantifiers, they may also appear before or after the comma. The
exception to this is \u{...} which is an ECMAScript compatibility feature
that is recognized only when the PCRE2_EXTRA_ALT_BSUX option is set. ECMAScript
does not ignore such white space; it causes the item to be interpreted as
literal.
</P>
<P>
Part of a pattern that is in square brackets is called a "character class".
In a character class the only metacharacters are:
<pre>
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (if followed by POSIX syntax)
  ]      terminates the character class
</pre>
If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
the pattern, other than in a character class, within a \Q...\E sequence, or
between a # outside a character class and the next newline, inclusive, is
ignored. An escaping backslash can be used to include a white space or a #
character as part of the pattern. If the PCRE2_EXTENDED_MORE option is set, the
same applies, but in addition unescaped space and horizontal tab characters are
ignored inside a character class. Note: only these two characters are ignored,
not the full set of pattern white space characters that are ignored outside a
character class. Option settings can be changed within a pattern; see the
section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
below.
</P>
<P>
The following sections describe the use of each of the metacharacters.
</P>
<br><a name="SEC5" href="#TOC1">BACKSLASH</a><br>
<P>
The backslash character has several uses.
Firstly, if it is followed by a
character that is not a digit or a letter, it takes away any special meaning
that character may have. This use of backslash as an escape character applies
both inside and outside character classes.
</P>
<P>
For example, if you want to match a * character, you must write \* in the
pattern. This escaping action applies whether or not the following character
would otherwise be interpreted as a metacharacter, so it is always safe to
precede a non-alphanumeric with backslash to specify that it stands for itself.
In particular, if you want to match a backslash, you write \\.
</P>
<P>
Only ASCII digits and letters have any special meaning after a backslash. All
other characters (in particular, those whose code points are greater than 127)
are treated as literals.
</P>
<P>
If you want to treat all characters in a sequence as literals, you can do so by
putting them between \Q and \E. Note that this includes white space even when
the PCRE2_EXTENDED option is set so that most other white space is ignored. The
behaviour is different from Perl in that $ and @ are handled as literals in
\Q...\E sequences in PCRE2, whereas in Perl, $ and @ cause variable
interpolation.
Also, Perl does "double-quotish backslash interpolation" on any
backslashes between \Q and \E which, its documentation says, "may lead to
confusing results". PCRE2 treats a backslash between \Q and \E just like any
other character. Note the following examples:
<pre>
  Pattern            PCRE2 matches   Perl matches

  \Qabc$xyz\E        abc$xyz         abc followed by the contents of $xyz
  \Qabc\$xyz\E       abc\$xyz        abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
  \QA\B\E            A\B             A\B
  \Q\\E              \               \\E
</pre>
The \Q...\E sequence is recognized both inside and outside character classes.
An isolated \E that is not preceded by \Q is ignored. If \Q is not followed
by \E later in the pattern, the literal interpretation continues to the end of
the pattern (that is, \E is assumed at the end). If the isolated \Q is inside
a character class, this causes an error, because the character class is then
not terminated by a closing square bracket.
<a name="digitsafterbackslash"></a></P>
<br><b>
Non-printing characters
</b><br>
<P>
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner.
There is no restriction on the appearance of
non-printing characters in a pattern, but when a pattern is being prepared by
text editing, it is often easier to use one of the following escape sequences
instead of the binary character it represents. In an ASCII or Unicode
environment, these escapes are as follows:
<pre>
  \a          alarm, that is, the BEL character (hex 07)
  \cx         "control-x", where x is a non-control ASCII character
  \e          escape (hex 1B)
  \f          form feed (hex 0C)
  \n          linefeed (hex 0A)
  \r          carriage return (hex 0D) (but see below)
  \t          tab (hex 09)
  \0dd        character with octal code 0dd
  \ddd        character with octal code ddd, or backreference
  \o{ddd..}   character with octal code ddd..
  \xhh        character with hex code hh
  \x{hhh..}   character with hex code hhh..
  \N{U+hhh..} character with Unicode hex code point hhh..
</pre>
By default, after \x that is not followed by {, from zero to two hexadecimal
digits are read (letters can be in upper or lower case). Any number of
hexadecimal digits may appear between \x{ and }. If a character other than a
hexadecimal digit appears between \x{ and }, or if there is no terminating },
an error occurs.
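</P>
<P>
The default \x rules can be sketched as a small parser. This Python model is an
illustration under stated assumptions, not PCRE2's implementation: the name
parse_x_escape is hypothetical, the position argument is the index just after
"\x", an empty \x{} is rejected (the text above does not say what happens), and
\x with no digits is taken to yield code point 0.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_x_escape(text, pos):
    """Parse what follows "\\x"; `pos` is the index just after the x.

    Returns (code_point, index_after_escape). Hypothetical sketch only.
    """
    if pos < len(text) and text[pos] == "{":
        end = text.find("}", pos + 1)
        if end == -1:
            raise ValueError("no terminating } after \\x{")
        # Space and tab after { or before } are ignored, following the
        # general rule for braces given earlier on this page.
        digits = text[pos + 1:end].strip(" \t")
        # Assumption: an empty \x{} is treated as an error in this sketch.
        if not digits or any(ch not in HEX_DIGITS for ch in digits):
            raise ValueError("non-hexadecimal character in \\x{...}")
        return int(digits, 16), end + 1

    # Unbraced form: read from zero to two hexadecimal digits.
    value = 0
    count = 0
    while count < 2 and pos < len(text) and text[pos] in HEX_DIGITS:
        value = value * 16 + int(text[pos], 16)
        pos += 1
        count += 1
    return value, pos
```

In this model "\xdc" and "\x{dc}" both give code point 220 (hex DC), and in
"\x412" only the first two digits belong to the escape, so the 2 that follows
is an ordinary character.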
</P>
<P>
Characters whose code points are less than 256 can be defined by either of the
two syntaxes for \x or by an octal sequence. There is no difference in the way
they are handled. For example, \xdc is exactly the same as \x{dc} or \334.
However, using the braced versions does make such sequences easier to read.
</P>
<P>
Support is available for some ECMAScript (aka JavaScript) escape sequences via
two compile-time options. If PCRE2_ALT_BSUX is set, the sequence \x followed
by { is not recognized. Only if \x is followed by two hexadecimal digits is it
recognized as a character escape. Otherwise it is interpreted as a literal "x"
character. In this mode, support for code points greater than 255 is provided
by \u, which must be followed by four hexadecimal digits; otherwise it is
interpreted as a literal "u" character.
</P>
<P>
PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in addition,
\u{hhh..} is recognized as the character specified by hexadecimal code point.
There may be any number of hexadecimal digits, but unlike other places that
also use curly brackets, spaces are not allowed and would result in the string
being interpreted as a literal. This syntax is from ECMAScript 6.
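</P>
<P>
The ECMAScript-style rules can be modelled in the same way. The Python sketch
below is hypothetical and is not PCRE2 code: parse_alt_bsux and its extra
argument (standing in for PCRE2_EXTRA_ALT_BSUX) are names invented for this
illustration, and it assumes that a \u{ with no closing } or with no digits
falls back to a literal "u".

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_alt_bsux(text, pos, extra=False):
    """Interpret the 'x' or 'u' at text[pos] that follows a backslash.

    Returns (code_point, index_after_item). When the letter is not a
    valid escape, it is returned as a literal. Hypothetical sketch only.
    """
    letter = text[pos]
    width = 2 if letter == "x" else 4  # \xhh or \uhhhh
    digits = text[pos + 1:pos + 1 + width]
    if len(digits) == width and all(ch in HEX_DIGITS for ch in digits):
        return int(digits, 16), pos + 1 + width

    # \u{hhh..} is recognized only with the "extra" option, and white
    # space inside the braces makes the item a literal.
    if extra and letter == "u" and text[pos + 1:pos + 2] == "{":
        end = text.find("}", pos + 2)
        body = text[pos + 2:end] if end != -1 else ""
        if body and all(ch in HEX_DIGITS for ch in body):
            return int(body, 16), end + 1

    # Anything else: the letter stands for itself.
    return ord(letter), pos + 1
```

Here "\x41" and "\u0041" both give code point 65, while "\x{41}" and "\u041"
yield a literal "x" and a literal "u" respectively, with the remaining
characters left to be read as ordinary pattern text.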
</P>
<P>
The \N{U+hhh..} escape sequence is recognized only when PCRE2 is operating in
UTF mode. Perl also uses \N{name} to specify characters by Unicode name; PCRE2
does not support this. Note that when \N is not followed by an opening brace
(curly bracket) it has an entirely different meaning, matching any character
that is not a newline.
</P>
<P>
There are some legacy applications where the escape sequence \r is expected to
match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option is set, \r in a
pattern is converted to \n so that it matches a LF (linefeed) instead of a CR
(carriage return) character.
</P>
<P>
An error occurs if \c is not followed by a character whose ASCII code point
is in the range 32 to 126. The precise effect of \cx is as follows: if x is a
lower case letter, it is converted to upper case. Then bit 6 of the character
(hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is
5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If
the code unit following \c has a code point less than 32 or greater than 126,
a compile-time error occurs.
</P>
<P>
When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported. \a, \e,
\f, \n, \r, and \t generate the appropriate EBCDIC code values.
The \c
escape is processed as specified for Perl in the <b>perlebcdic</b> document. The
only characters that are allowed after \c are A-Z, a-z, or one of @, [, \, ],
^, _, or ?. Any other character provokes a compile-time error. The sequence
\c@ encodes character code 0; after \c the letters (in either case) encode
characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
(hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95 (hex 5F).
</P>
<P>
Thus, apart from \c?, these escapes generate the same character code values as
they do in an ASCII environment, though the meanings of the values mostly
differ. For example, \cG always generates code value 7, which is BEL in ASCII
but DEL in EBCDIC.
</P>
<P>
The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
because 127 is not a control character in EBCDIC, Perl makes it generate the
APC character. Unfortunately, there are several variants of EBCDIC. In most of
them the APC character has the value 255 (hex FF), but in the one Perl calls
POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
values, PCRE2 makes \c? generate 95; otherwise it generates 255.
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
digits, just those that are present are used.
Thus the sequence \0\x\015
specifies two binary zeros followed by a CR character (code value 13). Make
sure you supply two digits after the initial zero if the pattern character that
follows is itself an octal digit.
</P>
<P>
The escape \o must be followed by a sequence of octal digits, enclosed in
braces. An error occurs if this is not the case. This escape is a recent
addition to Perl; it provides a way of specifying character code points as octal
numbers greater than 0777, and it also allows octal numbers and backreferences
to be unambiguously specified.
</P>
<P>
For greater clarity and unambiguity, it is best to avoid following \ by a
digit greater than zero. Instead, use \o{...} or \x{...} to specify numerical
character code points, and \g{...} to specify backreferences. The following
paragraphs describe the old, ambiguous syntax.
</P>
<P>
The handling of a backslash followed by a digit other than 0 is complicated,
and Perl has changed over time, causing PCRE2 also to change.
</P>
<P>
Outside a character class, PCRE2 reads the digit and any following digits as a
decimal number. If the number is less than 10, begins with the digit 8 or 9, or
if there are at least that many previous capture groups in the expression, the
entire sequence is taken as a <i>backreference</i>.
A description of how this
works is given
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
Otherwise, up to three octal digits are read to form a character code.
</P>
<P>
Inside a character class, PCRE2 handles \8 and \9 as the literal characters
"8" and "9", and otherwise reads up to three octal digits following the
backslash, using them to generate a data character. Any subsequent digits stand
for themselves. For example, outside a character class:
<pre>
  \040   is another way of writing an ASCII space
  \40    is the same, provided there are fewer than 40 previous capture groups
  \7     is always a backreference
  \11    might be a backreference, or another way of writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a backreference, otherwise the character with octal code 113
  \377   might be a backreference, otherwise the value 255 (decimal)
  \81    is always a backreference
</pre>
Note that octal values of 100 or greater that are specified using this syntax
must not be introduced by a leading zero, because no more than three octal
digits are ever read.
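The rule for a backslash followed by a non-zero digit outside a character class
can be sketched as follows. This is a minimal Python model of the rule as
described above, not PCRE2's own code; the function name and its return format
are hypothetical, and the limits on character values in the current mode are
not checked.

```python
def classify_backslash_digits(digits, previous_groups):
    """Model how \\ followed by digits is read outside a character class.

    digits is the run of decimal digits after the backslash; it must start
    with 1-9 (the \\0 case follows a different rule). previous_groups is the
    number of capture groups that precede this point in the pattern.
    Returns (kind, value, leftover), where leftover holds any trailing
    digits that stand for themselves.
    """
    if not digits or digits[0] == "0" or any(c not in "0123456789" for c in digits):
        raise ValueError("expected decimal digits starting with 1-9")
    number = int(digits)
    if number < 10 or digits[0] in "89" or previous_groups >= number:
        return ("backreference", number, "")
    # Otherwise read up to three octal digits as a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    return ("character", int(octal, 8), digits[len(octal):])
```

With no previous capture groups, classify_backslash_digits("40", 0) is read as
a character (the ASCII space), whereas with 40 previous groups the same digits
form a backreference.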
</P>
<br><b>
Constraints on character values
</b><br>
<P>
Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
<pre>
  8-bit non-UTF mode    no greater than 0xff
  16-bit non-UTF mode   no greater than 0xffff
  32-bit non-UTF mode   no greater than 0xffffffff
  All UTF modes         no greater than 0x10ffff and a valid code point
</pre>
Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
so-called "surrogate" code points). The check for these can be disabled by the
caller of <b>pcre2_compile()</b> by setting the option
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
and UTF-32 modes, because these values are not representable in UTF-16.
</P>
<br><b>
Escape sequences in character classes
</b><br>
<P>
All the sequences that define a single character value can be used both inside
and outside character classes. In addition, inside a character class, \b is
interpreted as the backspace character (hex 08).
</P>
<P>
When not followed by an opening brace, \N is not allowed in a character class.
\B, \R, and \X are not special inside a character class.
Like other
unrecognized alphabetic escape sequences, they cause an error. Outside a
character class, these sequences have different meanings.
</P>
<br><b>
Unsupported escape sequences
</b><br>
<P>
In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its string
handler and used to modify the case of following characters. By default, PCRE2
does not support these escape sequences in patterns. However, if either of the
PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U matches a "U"
character, and \u can be used to define a character by code point, as
described above.
</P>
<br><b>
Absolute and relative backreferences
</b><br>
<P>
The sequence \g followed by a signed or unsigned number, optionally enclosed
in braces, is an absolute or relative backreference. A named backreference
can be coded as \g{name}.
Backreferences are discussed
<a href="#backreferences">later,</a>
following the discussion of
<a href="#group">parenthesized groups.</a>
</P>
<br><b>
Absolute and relative subroutine calls
</b><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes is an alternative
syntax for referencing a capture group as a subroutine. Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a
<a href="#groupsassubroutines">subroutine</a>
call.
<a name="genericchartypes"></a></P>
<br><b>
Generic character types
</b><br>
<P>
Another use of backslash is for specifying generic character types:
<pre>
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal white space character
  \H     any character that is not a horizontal white space character
  \N     any character that is not a newline
  \s     any white space character
  \S     any character that is not a white space character
  \v     any vertical white space character
  \V     any character that is not a vertical white space character
  \w     any "word" character
  \W     any "non-word" character
</pre>
The \N escape sequence has the same meaning as
<a href="#fullstopdot">the "." metacharacter</a>
when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
meaning of \N. Note that when \N is followed by an opening brace it has a
different meaning. See the section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair. The sequences can appear both inside and outside character
classes. They each match one character of the appropriate type. If the current
matching point is at the end of the subject string, all of them fail, because
there is no character to match.
</P>
<P>
The default \s characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
space (32), which are defined as white space in the "C" locale. This list may
vary if locale-specific matching is taking place. For example, in some locales
the "non-breaking space" character (\xA0) is recognized as white space, and in
others the VT character is not.
</P>
<P>
A "word" character is an underscore or any character that is a letter or digit.
By default, the definition of letters and digits is controlled by PCRE2's
low-valued character tables, and may vary if locale-specific matching is taking
place (see
<a href="pcre2api.html#localesupport">"Locale support"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page).
For example, in a French locale such as "fr_FR" in Unix-like systems,
or "french" in Windows, some character codes greater than 127 are used for
accented letters, and these are then matched by \w. The use of locales with
Unicode is discouraged.
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
\s, or \w, and always match \D, \S, and \W, although this may be different
for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
determine character types, as follows:
<pre>
  \d  any character that matches \p{Nd} (decimal digit)
  \s  any character that matches \p{Z} or \h or \v
  \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc}
</pre>
The addition of \p{Mn} (non-spacing mark) and the replacement of an explicit
test for underscore with a test for \p{Pc} (connector punctuation) happened in
PCRE2 release 10.43. This brings PCRE2 into line with Perl.
</P>
<P>
The upper case escapes match the inverse sets of characters.
Note that \d
matches only decimal digits, whereas \w matches any Unicode digit, as well as
other character categories. Note also that PCRE2_UCP affects \b and
\B because they are defined in terms of \w and \W. Matching these sequences
is noticeably slower when PCRE2_UCP is set.
</P>
<P>
The effect of PCRE2_UCP on \d, \s, and \w can be negated individually by
the options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, and
PCRE2_EXTRA_ASCII_BSW, respectively. These options can be set and reset within
a pattern by means of an internal option setting
<a href="#internaloptions">(see below).</a>
</P>
<P>
The sequences \h, \H, \v, and \V, in contrast to the other sequences, which
match only ASCII characters by default, always match a specific list of code
points, whether or not PCRE2_UCP is set.
The horizontal space characters are:
<pre>
  U+0009     Horizontal tab (HT)
  U+0020     Space
  U+00A0     Non-break space
  U+1680     Ogham space mark
  U+180E     Mongolian vowel separator
  U+2000     En quad
  U+2001     Em quad
  U+2002     En space
  U+2003     Em space
  U+2004     Three-per-em space
  U+2005     Four-per-em space
  U+2006     Six-per-em space
  U+2007     Figure space
  U+2008     Punctuation space
  U+2009     Thin space
  U+200A     Hair space
  U+202F     Narrow no-break space
  U+205F     Medium mathematical space
  U+3000     Ideographic space
</pre>
The vertical space characters are:
<pre>
  U+000A     Linefeed (LF)
  U+000B     Vertical tab (VT)
  U+000C     Form feed (FF)
  U+000D     Carriage return (CR)
  U+0085     Next line (NEL)
  U+2028     Line separator
  U+2029     Paragraph separator
</pre>
In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
are relevant.
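The two fixed lists above can be written down as plain sets of code points.
The following Python sketch simply transcribes the lists; the names are
hypothetical and nothing here calls PCRE2 itself.

```python
# Code points matched by \h, transcribed from the list above.
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))      # U+2000 (en quad) to U+200A (hair space)
    + [0x202F, 0x205F, 0x3000]
)

# Code points matched by \v, transcribed from the list above.
VERTICAL_SPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

# In 8-bit non-UTF-8 mode only code points below 256 are relevant.
HORIZONTAL_SPACE_8BIT = frozenset(cp for cp in HORIZONTAL_SPACE if cp < 256)
VERTICAL_SPACE_8BIT = frozenset(cp for cp in VERTICAL_SPACE if cp < 256)


def is_horizontal_space(ch):
    """True if the single character ch is in the \\h list."""
    return ord(ch) in HORIZONTAL_SPACE


def is_vertical_space(ch):
    """True if the single character ch is in the \\v list."""
    return ord(ch) in VERTICAL_SPACE
```

Because the sets are fixed, \H and \V correspond to the complements of these
sets regardless of whether PCRE2_UCP is set.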
<a name="newlineseq"></a></P>
<br><b>
Newline sequences
</b><br>
<P>
Outside a character class, by default, the escape sequence \R matches any
Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the
following:
<pre>
  (?>\r\n|\n|\x0b|\f|\r|\x85)
</pre>
This is an example of an "atomic group", details of which are given
<a href="#atomicgroup">below.</a>
This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). Because this is an atomic group, the two-character sequence is
treated as a single unit that cannot be split.
</P>
<P>
In other modes, two additional characters whose code points are greater than 255
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
Unicode support is not needed for these characters to be recognized.
</P>
<P>
It is possible to restrict \R to match only CR, LF, or CRLF (instead of the
complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
at compile time. (BSR is an abbreviation for "backslash R".)
This can be made
the default when PCRE2 is built; if this is the case, the other behaviour can
be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
these settings by starting a pattern string with one of the following
sequences:
<pre>
  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence
</pre>
These override the default and the options given to the compiling function.
Note that these special settings, which are not Perl-compatible, are recognized
only at the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be combined
with a change of newline convention; for example, a pattern can start with:
<pre>
  (*ANY)(*BSR_ANYCRLF)
</pre>
They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
character class, \R is treated as an unrecognized escape sequence, and causes
an error.
<a name="uniextseq"></a></P>
<br><b>
Unicode character properties
</b><br>
<P>
When PCRE2 is built with Unicode support (the default), three additional escape
sequences that match characters with specific properties are available.
They
can be used in any mode, though in 8-bit and 16-bit non-UTF modes these
sequences are of course limited to testing characters whose code points are
less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points
greater than 0x10ffff (the Unicode limit) may be encountered. These are all
treated as being in the Unknown script and with an unassigned type.
</P>
<P>
Matching characters by Unicode property is not fast, because PCRE2 has to do a
multistage table lookup in order to find a character's property. That is why
the traditional escape sequences such as \d and \w do not use Unicode
properties in PCRE2 by default, though you can make them do so by setting the
PCRE2_UCP option or by starting the pattern with (*UCP).
</P>
<P>
The extra escape sequences that provide property support are:
<pre>
  \p{<i>xx</i>}   a character with the <i>xx</i> property
  \P{<i>xx</i>}   a character without the <i>xx</i> property
  \X       a Unicode extended grapheme cluster
</pre>
The property names represented by <i>xx</i> above are not case-sensitive, and in
accordance with Unicode's "loose matching" rules, spaces, hyphens, and
underscores are ignored.
There is support for Unicode script names, Unicode
general category properties, "Any", which matches any character (including
newline), Bidi_Class, a number of binary (yes/no) properties, and some special
PCRE2 properties (described
<a href="#extraprops">below).</a>
Certain other Perl properties such as "InMusicalSymbols" are not supported by
PCRE2. Note that \P{Any} does not match any characters, so always causes a
match failure.
</P>
<br><b>
Script properties for \p and \P
</b><br>
<P>
There are three different syntax forms for matching a script. Each Unicode
character has a basic script and, optionally, a list of other scripts ("Script
Extensions") with which it is commonly used. Using the Adlam script as an
example, \p{sc:Adlam} matches characters whose basic script is Adlam, whereas
\p{scx:Adlam} matches, in addition, characters that have Adlam in their
extensions list. The full names "script" and "script extensions" for the
property types are recognized, and an equals sign is an alternative to the
colon. If a script name is given without a property type, for example,
\p{Adlam}, it is treated as \p{scx:Adlam}. Perl changed to this
interpretation at release 5.26 and PCRE2 changed at release 10.40.
</P>
<P>
Unassigned characters (and in non-UTF 32-bit mode, characters with code points
greater than 0x10FFFF) are assigned the "Unknown" script. Others that are not
part of an identified script are lumped together as "Common". The current list
of recognized script names and their 4-character abbreviations can be obtained
by running this command:
<pre>
  pcre2test -LS

</PRE>
</P>
<br><b>
The general category property for \p and \P
</b><br>
<P>
Each character has exactly one Unicode general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the property
name. For example, \p{^Lu} is the same as \P{Lu}.
</P>
<P>
If only one letter is specified with \p or \P, it includes all the general
category properties that start with that letter.
In this case, in the absence
of negation, the curly brackets in the escape sequence are optional; these two
examples have the same effect:
<pre>
  \p{L}
  \pL
</pre>
The following general category property codes are supported:
<pre>
  C     Other
  Cc    Control
  Cf    Format
  Cn    Unassigned
  Co    Private use
  Cs    Surrogate

  L     Letter
  Ll    Lower case letter
  Lm    Modifier letter
  Lo    Other letter
  Lt    Title case letter
  Lu    Upper case letter

  M     Mark
  Mc    Spacing mark
  Me    Enclosing mark
  Mn    Non-spacing mark

  N     Number
  Nd    Decimal number
  Nl    Letter number
  No    Other number

  P     Punctuation
  Pc    Connector punctuation
  Pd    Dash punctuation
  Pe    Close punctuation
  Pf    Final punctuation
  Pi    Initial punctuation
  Po    Other punctuation
  Ps    Open punctuation

  S     Symbol
  Sc    Currency symbol
  Sk    Modifier symbol
  Sm    Mathematical symbol
  So    Other symbol

  Z     Separator
  Zl    Line separator
  Zp    Paragraph separator
  Zs    Space separator
</pre>
The special property LC, which has the synonym L&, is also supported: it
matches a character that has the Lu, Ll, or Lt property, in other words, a
letter that is not classified as a modifier or "other".
</P>
<P>
The Cs (Surrogate) property applies only to characters whose code points are in
the range U+D800 to U+DFFF. These characters are no different to any other
character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library).
However, they are not valid in Unicode strings and so cannot be tested by PCRE2
in UTF mode, unless UTF validity checking has been turned off (see the
discussion of PCRE2_NO_UTF_CHECK in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page).
</P>
<P>
The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE2, nor is it permitted to prefix any of these
properties with "Is".
</P>
<P>
No character that is in the Unicode table has the Cn (unassigned) property.
Instead, this property is assumed for any code point that is not in the
Unicode table.
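The relationship between one-letter codes, two-letter codes, and the special LC
property can be illustrated with Python's standard unicodedata module, which
reports the same two-letter general category abbreviations. This is only a
sketch: the helper name is hypothetical, the Unicode version bundled with
Python may differ from the one PCRE2 was built with, and unlike PCRE2 the
helper expects the code in exactly the case shown in the table above.

```python
import unicodedata

# The categories covered by the special LC (synonym L&) property.
CASED_LETTER = frozenset(["Lu", "Ll", "Lt"])


def has_general_category(ch, code):
    """Model \\p{code} for a general category code (hypothetical helper).

    A one-letter code such as "L" covers every category that starts with
    that letter; a two-letter code must match exactly; "LC" or "L&" covers
    Lu, Ll, and Lt.
    """
    category = unicodedata.category(ch)
    if code in ("LC", "L&"):
        return category in CASED_LETTER
    if len(code) == 1:
        return category.startswith(code)
    return category == code
```

Negation, as in \P{Lu} or \p{^Lu}, corresponds to taking "not" of the result.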
</P>
<P>
Specifying caseless matching does not affect these escape sequences. For
example, \p{Lu} always matches only upper case letters. This is different from
the behaviour of current versions of Perl.
</P>
<br><b>
Binary (yes/no) properties for \p and \P
</b><br>
<P>
Unicode defines a number of binary properties, that is, properties whose only
values are true or false. You can obtain a list of those that are recognized by
\p and \P, along with their abbreviations, by running this command:
<pre>
  pcre2test -LP

</PRE>
</P>
<br><b>
The Bidi_Class property for \p and \P
</b><br>
<P>
<pre>
  \p{Bidi_Class:&lt;class&gt;}   matches a character with the given class
  \p{BC:&lt;class&gt;}           matches a character with the given class
</pre>
The recognized classes are:
<pre>
  AL    Arabic letter
  AN    Arabic number
  B     paragraph separator
  BN    boundary neutral
  CS    common separator
  EN    European number
  ES    European separator
  ET    European terminator
  FSI   first strong isolate
  L     left-to-right
  LRE   left-to-right embedding
  LRI   left-to-right isolate
  LRO   left-to-right override
  NSM   non-spacing mark
  ON    other neutral
  PDF   pop directional format
  PDI   pop directional isolate
  R     right-to-left
  RLE   right-to-left embedding
  RLI   right-to-left isolate
  RLO   right-to-left override
  S     segment separator
  WS    white space
</pre>
An equals sign may be used instead of a colon. The class names are
case-insensitive; only the short names listed above are recognized.
</P>
<br><b>
Extended grapheme clusters
</b><br>
<P>
The \X escape matches any number of Unicode characters that form an "extended
grapheme cluster", and treats the sequence as an atomic group
<a href="#atomicgroup">(see below).</a>
Unicode supports various kinds of composite character by giving each character
a grapheme breaking property, and having rules that use these properties to
define the boundaries of extended grapheme clusters. The rules are defined in
Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
abandoned the use of some previous properties that had been used for emojis.
Instead it introduced various emoji-specific properties.
PCRE2 uses only the
Extended_Pictographic property.
</P>
<P>
\X always matches at least one character. Then it decides whether to add
further characters according to the following rules for ending a cluster:
</P>
<P>
1. End at the end of the subject string.
</P>
<P>
2. Do not end between CR and LF; otherwise end after any control character.
</P>
<P>
3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
are of five types: L, V, T, LV, and LVT. An L character may be followed by an
L, V, LV, or LVT character; an LV or V character may be followed by a V or T
character; an LVT or T character may be followed only by a T character.
</P>
<P>
4. Do not end before extending characters or spacing marks or the zero-width
joiner (ZWJ) character. Characters with the "mark" property always have the
"extend" grapheme breaking property.
</P>
<P>
5. Do not end after prepend characters.
</P>
<P>
6. Do not end within emoji modifier sequences or emoji ZWJ (zero-width
joiner) sequences.
An emoji ZWJ sequence consists of a character with the
Extended_Pictographic property, optionally followed by one or more characters
with the Extend property, followed by the ZWJ character, followed by another
Extended_Pictographic character.
</P>
<P>
7. Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) characters if there are an odd number of RI characters
before the break point.
</P>
<P>
8. Otherwise, end the cluster.
<a name="extraprops"></a></P>
<br><b>
PCRE2's additional properties
</b><br>
<P>
As well as the standard Unicode properties described above, PCRE2 supports four
more that make it possible to convert traditional escape sequences such as \w
and \s to use Unicode properties. PCRE2 uses these non-standard, non-Perl
properties internally when PCRE2_UCP is set. However, they may also be used
explicitly. These properties are:
<pre>
  Xan   Any alphanumeric character
  Xps   Any POSIX space character
  Xsp   Any Perl space character
  Xwd   Any Perl "word" character
</pre>
Xan matches characters that have either the L (letter) or the N (number)
property.
Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
those that match Mn (non-spacing mark) or Pc (connector punctuation, which
includes underscore).
</P>
<P>
There is another non-standard property, Xuc, which matches any character that
can be represented by a Universal Character Name in C++ and other programming
languages. These are the characters $, @, ` (grave accent), and all characters
with Unicode code points greater than or equal to U+00A0, except for the
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
where H is a hexadecimal digit. Note that the Xuc property does not match these
sequences but the characters that they represent.)
<a name="resetmatchstart"></a></P>
<br><b>
Resetting the match start
</b><br>
<P>
In normal use, the escape sequence \K causes any previously matched characters
not to be included in the final matched sequence that is returned.
For example,
the pattern:
<pre>
  foo\Kbar
</pre>
matches "foobar", but reports that it has matched "bar". \K does not interact
with anchoring in any way. The pattern:
<pre>
  ^foo\Kbar
</pre>
matches only when the subject begins with "foobar" (in single line mode),
though it again reports the matched string as "bar". This feature is similar to
a lookbehind assertion
<a href="#lookbehind">(described below),</a>
but the part of the pattern that precedes \K is not constrained to match a
limited number of characters, as is required for a lookbehind assertion. The
use of \K does not interfere with the setting of
<a href="#group">captured substrings.</a>
For example, when the pattern
<pre>
  (foo)\Kbar
</pre>
matches "foobar", the first substring is still set to "foo".
</P>
<P>
From version 5.32.0 Perl forbids the use of \K in lookaround assertions. From
release 10.38 PCRE2 also forbids this by default. However, the
PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK option can be used when calling
<b>pcre2_compile()</b> to re-enable the previous behaviour.
When this option is
set, \K is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions. Note that when a pattern such as (?=ab\K)
matches, the reported start of the match can be greater than the end of the
match. Using \K in a lookbehind assertion at the start of a pattern can also
lead to odd effects. For example, consider this pattern:
<pre>
  (?&lt;=\Kfoo)bar
</pre>
If the subject is "foobar", a call to <b>pcre2_match()</b> with a starting
offset of 3 succeeds and reports the matching string as "foobar", that is, the
start of the reported match is earlier than where the match started.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
</b><br>
<P>
The final use of backslash is for certain simple assertions. An assertion
specifies a condition that has to be met at a particular point in a match,
without consuming any characters from the subject string.
The use of
groups for more complicated assertions is described
<a href="#bigassertions">below.</a>
The backslashed assertions are:
<pre>
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at the start of the subject
  \Z     matches at the end of the subject
          also matches before a newline at the end of the subject
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject
</pre>
Inside a character class, \b has a different meaning; it matches the backspace
character. If any other of these assertions appears in a character class, an
"invalid escape sequence" error is generated.
</P>
<P>
A word boundary is a position in the subject string where the current character
and the previous character do not both match \w or \W (i.e. one matches
\w and the other matches \W), or the start or end of the string if the
first or last character matches \w, respectively. When PCRE2 is built with
Unicode support, the meanings of \w and \W can be changed by setting the
PCRE2_UCP option. When this is done, it also affects \b and \B. Neither PCRE2
nor Perl has a separate "start of word" or "end of word" metasequence. However,
whatever follows \b normally determines which it is.
For example, the fragment
\ba matches "a" at the start of a word.
</P>
<P>
The \A, \Z, and \z assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
start and end of the subject string, whatever options are set. Thus, they are
independent of multiline mode. These three assertions are not affected by the
PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
circumflex and dollar metacharacters. However, if the <i>startoffset</i>
argument of <b>pcre2_match()</b> is non-zero, indicating that matching is to
start at a point other than the beginning of the subject, \A can never match.
The difference between \Z and \z is that \Z matches before a newline at the
end of the string as well as at the very end, whereas \z matches only at the
end.
</P>
<P>
The \G assertion is true only when the current matching position is at the
start point of the matching process, as specified by the <i>startoffset</i>
argument of <b>pcre2_match()</b>. It differs from \A when the value of
<i>startoffset</i> is non-zero. By calling <b>pcre2_match()</b> multiple times
with appropriate arguments, you can mimic Perl's /g option, and it is in this
kind of implementation where \G can be useful.
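</P>
<P>
The idea of repeated matching from a moving start offset can be sketched in
Python. This is an analogy, not PCRE2 code: Python's re module has no \G, but a
compiled pattern's match(string, pos) method succeeds only at pos, which plays
the role of a pattern that begins with \G and is run with a start offset of
pos. The function name and the token pattern are hypothetical.

```python
import re


def tokenize(subject):
    """Split subject into tokens by repeated anchored matching.

    Each call to token.match(subject, pos) can succeed only at pos,
    in the way a pattern starting with \\G succeeds only at the start
    offset of the matching process.
    """
    token = re.compile(r"\d+|[a-z]+|\s+")
    pos = 0
    tokens = []
    while pos != len(subject):
        m = token.match(subject, pos)  # anchored at pos, not at index 0
        if m is None:
            raise ValueError("no token at offset %d" % pos)
        tokens.append(m.group())
        pos = m.end()  # the next attempt starts where this match ended
    return tokens


if __name__ == "__main__":
    print(tokenize("abc 123 de"))
```

Every alternative in the token pattern matches at least one character, so the
offset always advances and the loop terminates. Python's "^" still refers to
the real start of the string when a non-zero position is given, which is
comparable to the way \A can never match when the start offset is non-zero.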
</P>
<P>
Note, however, that PCRE2's implementation of \G, being true at the starting
character of the matching process, is subtly different from Perl's, which
defines it as true at the end of the previous match. In Perl, these can be
different when the previously matched string was empty. Because PCRE2 does just
one match at a time, it cannot reproduce this behaviour.
</P>
<P>
If all the alternatives of a pattern begin with \G, the expression is anchored
to the starting match position, and the "anchored" flag is set in the compiled
regular expression.
</P>
<br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
<P>
The circumflex and dollar metacharacters are zero-width assertions. That is,
they test for a particular condition being true without consuming any
characters from the subject string. These two metacharacters are concerned with
matching the starts and ends of lines. If the newline convention is set so that
only the two-character sequence CRLF is recognized as a newline, isolated CR
and LF characters are treated as ordinary data characters, and are not
recognized as newlines.
</P>
<P>
Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching point is at
the start of the subject string. If the <i>startoffset</i> argument of
<b>pcre2_match()</b> is non-zero, or if PCRE2_NOTBOL is set, circumflex can
never match if the PCRE2_MULTILINE option is unset. Inside a character class,
circumflex has an entirely different meaning
<a href="#characterclass">(see below).</a>
</P>
<P>
Circumflex need not be the first character of the pattern if a number of
alternatives are involved, but it should be the first thing in each alternative
in which it appears if the pattern is ever to match that branch. If all
possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern. (There are also other constructs that can cause a pattern
to be anchored.)
</P>
<P>
The dollar character is an assertion that is true only if the current matching
point is at the end of the subject string, or immediately before a newline at
the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
that it does not actually match the newline.
Dollar need not be the last
character of the pattern if a number of alternatives are involved, but it
should be the last item in any branch in which it appears. Dollar has no
special meaning in a character class.
</P>
<P>
The meaning of dollar can be changed so that it matches only at the very end of
the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
does not affect the \Z assertion.
</P>
<P>
The meanings of the circumflex and dollar metacharacters are changed if the
PCRE2_MULTILINE option is set. When this is the case, a dollar character
matches before any newlines in the string, as well as at the very end, and a
circumflex matches immediately after internal newlines as well as at the start
of the subject string. It does not match after a newline that ends the string,
for compatibility with Perl. However, this can be changed by setting the
PCRE2_ALT_CIRCUMFLEX option.
</P>
<P>
For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
\n represents a newline) in multiline mode, but not otherwise. Consequently,
patterns that are anchored in single line mode because all branches start with
^ are not anchored in multiline mode, and a match for circumflex is possible
when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero.
The
PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
</P>
<P>
When the newline convention (see
<a href="#newlines">"Newline conventions"</a>
above) recognizes the two-character sequence CRLF as a newline, this is
preferred, even if the single characters CR and LF are also recognized as
newlines. For example, if the newline convention is "any", a multiline mode
circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
CR, even though CR on its own is a valid newline. (It also matches at the very
start of the string, of course.)
</P>
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A it is always anchored, whether or not PCRE2_MULTILINE is set.
<a name="fullstopdot"></a></P>
<br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
<P>
Outside a character class, a dot in the pattern matches any one character in
the subject string except (by default) a character that signifies the end of a
line. One or more characters may be specified as line terminators (see
<a href="#newlines">"Newline conventions"</a>
above).
</P>
<P>
Dot never matches a single line-ending character.
When the two-character
sequence CRLF is the only line ending, dot does not match CR if it is
immediately followed by LF, but otherwise it matches all characters (including
isolated CRs and LFs). When ANYCRLF is selected for line endings, no occurrences
of CR or LF match dot. When all Unicode line endings are being recognized, dot
does not match CR or LF or any of the other line ending characters.
</P>
<P>
The behaviour of dot with regard to newlines can be changed. If the
PCRE2_DOTALL option is set, a dot matches any one character, without exception.
If the two-character sequence CRLF is present in the subject string, it takes
two dots to match it.
</P>
<P>
The handling of dot is entirely independent of the handling of circumflex and
dollar, the only relationship being that they both involve newlines. Dot has no
special meaning in a character class.
</P>
<P>
The escape sequence \N when not followed by an opening brace behaves like a
dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
it matches any character except one that signifies the end of a line.
</P>
<P>
When \N is followed by an opening brace it has a different meaning.
See the
section entitled
<a href="#digitsafterbackslash">"Non-printing characters"</a>
above for details. Perl also uses \N{name} to specify characters by Unicode
name; PCRE2 does not support this.
</P>
<br><a name="SEC8" href="#TOC1">MATCHING A SINGLE CODE UNIT</a><br>
<P>
Outside a character class, the escape sequence \C matches any one code unit,
whether or not a UTF mode is set. In the 8-bit library, one code unit is one
byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
32-bit unit. Unlike a dot, \C always matches line-ending characters. The
feature is provided in Perl in order to match individual bytes in UTF-8 mode,
but it is unclear how it can usefully be used.
</P>
<P>
Because \C breaks up characters into individual code units, matching one unit
with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
with a malformed UTF character. This has undefined results, because PCRE2
assumes that it is matching character by character in a valid UTF string (by
default it checks the subject string's validity at the start of processing
unless the PCRE2_NO_UTF_CHECK or PCRE2_MATCH_INVALID_UTF option is used).
</P>
<P>
An application can lock out the use of \C by setting the
PCRE2_NEVER_BACKSLASH_C option when compiling a pattern.
It is also possible to
build PCRE2 with the use of \C permanently disabled.
</P>
<P>
PCRE2 does not allow \C to appear in lookbehind assertions
<a href="#lookbehind">(described below)</a>
in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
the length of the lookbehind. Neither the alternative matching function
<b>pcre2_dfa_match()</b> nor the JIT optimizer supports \C in these UTF modes.
The former gives a match-time error; the latter fails to optimize and so the
match is always run using the interpreter.
</P>
<P>
In the 32-bit library, however, \C is always supported (when not explicitly
locked out) because it always matches a single code unit, whether or not UTF-32
is specified.
</P>
<P>
In general, the \C escape sequence is best avoided.
However, one way of using
it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
lookahead to check the length of the next character, as in this pattern, which
could be used with a UTF-8 string (ignore white space and line breaks):
<pre>
  (?| (?=[\x00-\x7f])(\C) |
      (?=[\x80-\x{7ff}])(\C)(\C) |
      (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
      (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
</pre>
In this example, a group that starts with (?| resets the capturing parentheses
numbers in each alternative (see
<a href="#dupgroupnumber">"Duplicate Group Numbers"</a>
below). The assertions at the start of each branch check the next UTF-8
character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
character's individual bytes are then captured by the appropriate number of
\C groups.
<a name="characterclass"></a></P>
<br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special by default.
If a closing square bracket is required as a member of the class, it should be
the first data character in the class (after an initial circumflex, if present)
or escaped with a backslash.
This means that, by default, an empty class cannot
be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
square bracket at the start does end the (empty) class.
</P>
<P>
A character class matches a single character in the subject. A matched
character must be in the set of characters defined by the class, unless the
first character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a circumflex
is actually required as a member of the class, ensure it is not the first
character, or escape it with a backslash.
</P>
<P>
For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
</P>
<P>
Characters in a class may be specified by their code points using \o, \x, or
\N{U+hh..} in the usual way.
When caseless matching is set, any letters in a
class represent both their upper case and lower case versions, so for example,
a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. Note that there are two ASCII
characters, K and S, that, in addition to their lower case ASCII equivalents,
are case-equivalent with Unicode U+212A (Kelvin sign) and U+017F (long S)
respectively when either PCRE2_UTF or PCRE2_UCP is set.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and
whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
class such as [^a] always matches one of these characters.
</P>
<P>
The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
\S, \v, \V, \w, and \W may appear in a character class, and add the
characters that they match to the class. For example, [\dABCDEF] matches any
upper case hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the
meanings of \d, \s, \w and their upper case partners, just as it does when they
appear outside a character class, as described in the section entitled
<a href="#genericchartypes">"Generic character types"</a>
above.
The escape sequence \b has a different meaning inside a character
class; it matches the backspace character. The sequences \B, \R, and \X are
not special inside a character class. Like any other unrecognized escape
sequence, they cause an error. The same is true for \N when not followed by
an opening brace.
</P>
<P>
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as
indicating a range, typically as the first or last character in the class,
or immediately after a range. For example, [b-d-z] matches letters in the range
b to d, a hyphen character, or z.
</P>
<P>
Perl treats a hyphen as a literal if it appears before or after a POSIX class
(see below) or before or after a character type escape such as \d or \H.
However, unless the hyphen is the last character in the class, Perl outputs a
warning in its warning mode, as this is most likely a user error. As PCRE2 has
no facility for warning, an error is given in these cases.
</P>
<P>
It is not possible to have the literal character "]" as the end character of a
range.
A pattern such as [W-]46] is interpreted as a class of two characters
("W" and "-") followed by a literal string "46]", so it would match "W46]" or
"-46]". However, if the "]" is escaped with a backslash it is interpreted as
the end of range, so [W-\]46] is interpreted as a class containing a range
followed by two other characters. The octal or hexadecimal representation of
"]" can also be used to end a range.
</P>
<P>
Ranges normally include all code points between the start and end characters,
inclusive. They can also be used for code points specified numerically, for
example [\000-\037]. Ranges can include any characters that are valid for the
current mode. In any UTF mode, the so-called "surrogate" characters (those
whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
this check). However, ranges such as [\x{d7ff}-\x{e000}], which include the
surrogates, are always permitted.
</P>
<P>
There is a special case in EBCDIC environments for ranges whose end points are
both specified as literal letters in the same case. For compatibility with
Perl, EBCDIC code points within the range that are not letters are omitted. For
example, [h-k] matches only four characters, even though the codes for h and k
are 0x88 and 0x92, a range of 11 code points.
However, if the range is
specified numerically, for example, [\x88-\x92] or [h-\x92], all code points
are included.
</P>
<P>
If a range that includes letters is used when caseless matching is set, it
matches the letters in either case. For example, [W-c] is equivalent to
[][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
tables for a French locale are in use, [\xc8-\xcb] matches accented E
characters in both cases.
</P>
<P>
A circumflex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower case type.
For example, the class [^\W_] matches any letter or digit, but not underscore,
whereas [\w] includes underscore. A positive character class should be read as
"something OR something OR ..." and a negative class as "NOT something AND NOT
something AND NOT ...".
</P>
<P>
The only metacharacters that are recognized in character classes are backslash,
hyphen (only where it can be interpreted as specifying a range), circumflex
(only at the start), opening square bracket (only when it can be interpreted as
introducing a POSIX class name, or for a special compatibility feature - see
the next two sections), and the terminating closing square bracket.
However,
escaping other non-alphanumeric characters does no harm.
</P>
<br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
<P>
Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
this notation. For example,
<pre>
  [01[:alpha:]%]
</pre>
matches "0", "1", any alphabetic character, or "%". The supported class names
are:
<pre>
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits and space
  space    white space (the same as \s from PCRE 8.34)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
</pre>
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32).
If locale-specific matching is taking place, the list of space
characters may be different; there may be fewer or more of them. "Space" and
\s match the same set of characters, as do "word" and \w.
</P>
<P>
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
5.8. Another Perl extension is negation, which is indicated by a ^ character
after the colon. For example,
<pre>
  [12[:^digit:]]
</pre>
matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
By default, characters with values greater than 127 do not match any of the
POSIX character classes, although this may be different for characters in the
range 128-255 when locale-specific matching is happening. However, in UCP mode,
unless certain options are set (see below), some of the classes are changed so
that Unicode character properties are used.
This is achieved by replacing
POSIX classes with other sequences, as follows:
<pre>
  [:alnum:]  becomes  \p{Xan}
  [:alpha:]  becomes  \p{L}
  [:blank:]  becomes  \h
  [:cntrl:]  becomes  \p{Cc}
  [:digit:]  becomes  \p{Nd}
  [:lower:]  becomes  \p{Ll}
  [:space:]  becomes  \p{Xps}
  [:upper:]  becomes  \p{Lu}
  [:word:]   becomes  \p{Xwd}
</pre>
Negated versions, such as [:^alpha:] use \P instead of \p. Four other POSIX
classes are handled specially in UCP mode:
</P>
<P>
[:graph:]
This matches characters that have glyphs that mark the page when printed. In
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
properties, except for:
<pre>
  U+061C           Arabic Letter Mark
  U+180E           Mongolian Vowel Separator
  U+2066 - U+2069  Various "isolate"s

</PRE>
</P>
<P>
[:print:]
This matches the same characters as [:graph:] plus space characters that are
not controls, that is, characters with the Zs property.
</P>
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
[:xdigit:]
In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
versions of those characters, whose Unicode code points start at U+FF10. This
is a change that was made in PCRE2 release 10.43 for Perl compatibility.
</P>
<P>
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
with code points less than 256.
</P>
<P>
There are two options that can be used to restrict the POSIX classes to ASCII
characters when PCRE2_UCP is set. The option PCRE2_EXTRA_ASCII_DIGIT affects
just [:digit:] and [:xdigit:]. Within a pattern, this can be set and unset by
(?aT) and (?-aT). The PCRE2_EXTRA_ASCII_POSIX option disables UCP processing
for all POSIX classes, including [:digit:] and [:xdigit:]. Within a pattern,
(?aP) and (?-aP) set and unset both these options for consistency.
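The substitution table and the two ASCII options described above can be modelled as a small lookup. The following Python sketch is illustrative only and is not how PCRE2 is implemented: the function name ucp_substitute and its keyword arguments are hypothetical stand-ins for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT, and the negated form of [:blank:] is assumed to be \H. Classes that are handled specially ([:graph:], [:print:], [:punct:], [:xdigit:]) are simply reported as having no table entry.

```python
# Substitutions copied from the table in this section.
UCP_SUBSTITUTES = {
    "alnum": r"\p{Xan}",
    "alpha": r"\p{L}",
    "blank": r"\h",
    "cntrl": r"\p{Cc}",
    "digit": r"\p{Nd}",
    "lower": r"\p{Ll}",
    "space": r"\p{Xps}",
    "upper": r"\p{Lu}",
    "word": r"\p{Xwd}",
}


def ucp_substitute(posix_class, ascii_posix=False, ascii_digit=False):
    """Return the sequence that replaces a POSIX class in UCP mode, or None.

    posix_class is the inner item, for example "[:^alpha:]".
    ascii_posix and ascii_digit are hypothetical flags that model
    PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT.
    """
    name = posix_class[2:-2]              # strip "[:" and ":]"
    negated = name.startswith("^")
    if negated:
        name = name[1:]
    if ascii_posix or (ascii_digit and name == "digit"):
        return None                       # the class stays ASCII-only
    replacement = UCP_SUBSTITUTES.get(name)
    if replacement is None:
        return None                       # no entry in the simple table
    if negated:
        # \p becomes \P; assumption: \h likewise becomes \H.
        replacement = "\\" + replacement[1].upper() + replacement[2:]
    return replacement
```

For instance, ucp_substitute("[:^alpha:]") gives \P{L}, while ucp_substitute("[:digit:]", ascii_digit=True) gives None because the class is left restricted to ASCII.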
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
word". PCRE2 treats these items as follows:
<pre>
  [[:<:]]  is converted to  \b(?=\w)
  [[:>:]]  is converted to  \b(?<=\w)
</pre>
Only these exact character sequences are recognized. A sequence such as
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support
is not compatible with Perl. It is provided to help migrations from other
environments, and is best not used in any new patterns. Note that \b matches
at the start and the end of a word (see
<a href="#smallassertions">"Simple assertions"</a>
above), and in a Perl-style pattern the preceding or following character
normally shows which is wanted, without the need for the assertions that are
used above in order to give exactly the POSIX behaviour. Note also that the
PCRE2_UCP option changes the meaning of \w (and therefore \b) by default, so
it also affects these POSIX sequences.
</P>
<br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
<P>
Vertical bar characters are used to separate alternative patterns.
For example,
the pattern
<pre>
  gilbert|sullivan
</pre>
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string). The matching
process tries each alternative in turn, from left to right, and the first one
that succeeds is used. If the alternatives are within a group
<a href="#group">(defined below),</a>
"succeeds" means matching the rest of the main pattern as well as the
alternative in the group.
<a name="internaloptions"></a></P>
<br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
<P>
The settings of several options can be changed within a pattern by a sequence
of letters enclosed between "(?" and ")". The following are Perl-compatible,
and are described in detail in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. The option letters are:
<pre>
  i  for PCRE2_CASELESS
  m  for PCRE2_MULTILINE
  n  for PCRE2_NO_AUTO_CAPTURE
  s  for PCRE2_DOTALL
  x  for PCRE2_EXTENDED
  xx for PCRE2_EXTENDED_MORE
</pre>
For example, (?im) sets caseless, multiline matching.
It is also possible to
unset these options by preceding the relevant letters with a hyphen, for
example (?-im). The two "extended" options are not independent; unsetting
either one cancels the effects of both of them.
</P>
<P>
A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
permitted. Only one hyphen may appear in the options string. If a letter
appears both before and after the hyphen, the option is unset. An empty options
setting "(?)" is allowed. Needless to say, it has no effect.
</P>
<P>
If the first character following (? is a circumflex, it causes all of the above
options to be unset. Letters may follow the circumflex to cause some options to
be re-instated, but a hyphen may not appear.
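The rules for option strings given so far (one hyphen at most, a letter on both sides of the hyphen ends up unset, a leading circumflex clears everything first) can be summarised in a short Python sketch. This is a model of the rules as stated here, not PCRE2 code: the function apply_option_setting is hypothetical, it handles only the single letters i, m, n, s and x, and it leaves out the two-letter xx option.

```python
ALLOWED = set("imnsx")   # single-letter Perl-compatible options only


def apply_option_setting(active, setting):
    """Return the options in force after the item "(?" + setting + ")".

    active is a set of option letters currently set; setting is the text
    between "(?" and ")", for example "im-sx" or "^i".
    """
    result = set(active)
    if setting.startswith("^"):
        body = setting[1:]
        if "-" in body:
            raise ValueError("a hyphen may not appear after a circumflex")
        result -= ALLOWED            # unset all of the listed options
        to_set, to_unset = body, ""
    else:
        if setting.count("-") > 1:
            raise ValueError("only one hyphen may appear")
        to_set, _, to_unset = setting.partition("-")
    for letter in to_set + to_unset:
        if letter not in ALLOWED:
            raise ValueError("unknown option letter: " + letter)
    result |= set(to_set)
    result -= set(to_unset)          # unsetting wins if a letter is on both sides
    return result
```

Applying "im-sx" to a state in which s and x are set leaves exactly i and m set, and the empty setting returns the state unchanged.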
</P>
<P>
Some PCRE2-specific options can be changed by the same mechanism using these
pairs or individual letters:
<pre>
  aD for PCRE2_EXTRA_ASCII_BSD
  aS for PCRE2_EXTRA_ASCII_BSS
  aW for PCRE2_EXTRA_ASCII_BSW
  aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT
  aT for PCRE2_EXTRA_ASCII_DIGIT
  r  for PCRE2_EXTRA_CASELESS_RESTRICT
  J  for PCRE2_DUPNAMES
  U  for PCRE2_UNGREEDY
</pre>
However, except for 'r', these are not unset by (?^), which is equivalent to
(?-imnrsx). If 'a' is not followed by any of the upper case letters shown
above, it sets (or unsets) all the ASCII options.
</P>
<P>
PCRE2_EXTRA_ASCII_DIGIT has no additional effect when PCRE2_EXTRA_ASCII_POSIX
is set, but including it in (?aP) means that (?-aP) suppresses all ASCII
restrictions for POSIX classes.
</P>
<P>
When one of these option changes occurs at top level (that is, not inside group
parentheses), the change applies until a subsequent change, or the end of the
pattern. An option change within a group (see below for a description of
groups) affects only that part of the group that follows it. At the end of the
group these options are reset to the state they were in before the group.
For
example,
<pre>
  (a(?i)b)c
</pre>
matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not set
externally). Any changes made in one alternative do carry on into subsequent
branches within the same group. For example,
<pre>
  (a(?i)b|c)
</pre>
matches "ab", "aB", "c", and "C", even though when matching "C" the first
branch is abandoned before the option setting. This is because the effects of
option settings happen at compile time. There would be some very weird
behaviour otherwise.
</P>
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing group (see the next section), the option letters may
appear between the "?" and the ":". Thus the two patterns
<pre>
  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings.
</P>
<P>
<b>Note:</b> There are other PCRE2-specific options, applying to the whole
pattern, which can be set by the application when the compiling function is
called. In addition, the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been defaulted.
Details are given in the section entitled
<a href="#newlineseq">"Newline sequences"</a>
above. There are also the (*UTF) and (*UCP) leading sequences that can be used
to set UTF and Unicode property modes; they are equivalent to setting the
PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, which lock out the use of the
(*UTF) and (*UCP) sequences.
<a name="group"></a></P>
<br><a name="SEC14" href="#TOC1">GROUPS</a><br>
<P>
Groups are delimited by parentheses (round brackets), which can be nested.
Turning part of a pattern into a group does two things:
<br>
<br>
1. It localizes a set of alternatives. For example, the pattern
<pre>
  cat(aract|erpillar|)
</pre>
matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
match "cataract", "erpillar" or an empty string.
<br>
<br>
2. It creates a "capture group". This means that, when the whole pattern
matches, the portion of the subject string that matched the group is passed
back to the caller, separately from the portion that matched the whole pattern.
(This applies only to the traditional matching function; the DFA matching
function does not support capturing.)
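The two effects of grouping listed above can be tried out with a short script. The sketch below uses Python's re module purely as a stand-in, on the assumption that it behaves like PCRE2 for patterns as simple as these; it is not PCRE2, and the helper names whole_match and captured are made up for the illustration.

```python
import re

# With the parentheses, the alternatives are local to the group.
grouped = re.compile(r"cat(aract|erpillar|)")
# Without them, the alternatives span the whole pattern.
ungrouped = re.compile(r"cataract|erpillar|")


def whole_match(regex, subject):
    """Return True when the regex matches the entire subject string."""
    return regex.fullmatch(subject) is not None


def captured(subject):
    """Return what group 1 of the grouped pattern captured, or None."""
    match = grouped.fullmatch(subject)
    return match.group(1) if match else None
```

Here whole_match(grouped, "caterpillar") is true, whole_match(ungrouped, "caterpillar") is false, and captured("cataract") returns the substring "aract" that matched the group.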
</P>
<P>
Opening parentheses are counted from left to right (starting from 1) to obtain
numbers for capture groups. For example, if the string "the red king" is
matched against the pattern
<pre>
  the ((red|white) (king|queen))
</pre>
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3, respectively.
</P>
<P>
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when grouping is required without capturing. If an
opening parenthesis is followed by a question mark and a colon, the group
does not do any capturing, and is not counted when computing the number of any
subsequent capture groups. For example, if the string "the white queen"
is matched against the pattern
<pre>
  the ((?:red|white) (king|queen))
</pre>
the captured substrings are "white queen" and "queen", and are numbered 1 and
2. The maximum number of capture groups is 65535.
</P>
<P>
As a convenient shorthand, if any option settings are required at the start of
a non-capturing group, the option letters may appear between the "?" and the
":".
Thus the two patterns
<pre>
  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)
</pre>
match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the group is
reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
<a name="dupgroupnumber"></a></P>
<br><a name="SEC15" href="#TOC1">DUPLICATE GROUP NUMBERS</a><br>
<P>
Perl 5.10 introduced a feature whereby each alternative in a group uses the
same numbers for its capturing parentheses. Such a group starts with (?| and is
itself a non-capturing group. For example, consider this pattern:
<pre>
  (?|(Sat)ur|(Sun))day
</pre>
Because the two alternatives are inside a (?| group, both sets of capturing
parentheses are numbered one. Thus, when the pattern matches, you can look
at captured substring number one, whichever alternative matched. This construct
is useful when you want to capture part, but not all, of one of a number of
alternatives. Inside a (?| group, parentheses are numbered as usual, but the
number is reset at the start of each branch. The numbers of any capturing
parentheses that follow the whole group start after the highest number used in
any branch.
The following example is taken from the Perl documentation. The
numbers underneath show in which buffer the captured content will be stored.
<pre>
  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4
</pre>
A backreference to a capture group uses the most recent value that is set for
the group. The following pattern matches "abcabc" or "defdef":
<pre>
  /(?|(abc)|(def))\1/
</pre>
In contrast, a subroutine call to a capture group always refers to the
first one in the pattern with the given number. The following pattern matches
"abcabc" or "defabc":
<pre>
  /(?|(abc)|(def))(?1)/
</pre>
A relative reference such as (?-1) is no different: it is just a convenient way
of computing an absolute group number.
</P>
<P>
If a
<a href="#conditions">condition test</a>
for a group's having matched refers to a non-unique number, the test is
true if any group with that number has matched.
</P>
<P>
An alternative approach to using this "branch reset" feature is to use
duplicate named groups, as described in the next section.
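The numbering rule for (?| groups described in this section (reset at the start of each branch, then continue after the highest number used in any branch) can be checked against the Perl example with a small counter. The Python sketch below is an illustration under stated assumptions, not PCRE2's algorithm: the function number_groups is hypothetical, and it understands only plain capturing parentheses, (?: and (?|, with no character classes or other (? items.

```python
def number_groups(pattern):
    """Return the number given to each capturing "(" in pattern order."""
    numbers = []
    count = 0     # highest number handed out on the current branch
    stack = []    # per open group: None, or [start, highest] for a (?| group
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2                        # skip an escaped character
            continue
        if pattern.startswith("(?|", i):
            stack.append([count, count])  # remember where numbering restarts
            i += 3
            continue
        if pattern.startswith("(?:", i):
            stack.append(None)            # non-capturing, not counted
            i += 3
            continue
        if pattern.startswith("(?", i):
            raise ValueError("unsupported group type in this sketch")
        if ch == "(":
            count += 1
            numbers.append(count)
            stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            frame = stack[-1]
            frame[1] = max(frame[1], count)
            count = frame[0]              # new branch: reset the number
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame is not None:
                count = max(frame[1], count)   # continue after the highest
        i += 1
    return numbers
```

Given the Perl example with its white space removed, "(a)(?|x(y)z|(p(q)r)|(t)u(v))(z)", the function returns the same sequence 1, 2, 2, 3, 2, 3, 4 shown under the pattern.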
</P>
<br><a name="SEC16" href="#TOC1">NAMED CAPTURE GROUPS</a><br>
<P>
Identifying capture groups by number is simple, but it can be very hard to keep
track of the numbers in complicated patterns. Furthermore, if an expression is
modified, the numbers may change. To help with this difficulty, PCRE2 supports
the naming of capture groups. This feature was not added to Perl until release
5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0,
using the Python syntax. PCRE2 supports both the Perl and the Python syntax.
</P>
<P>
In PCRE2, a capture group can be named in one of three ways: (?<name>...) or
(?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 128
code units long. When PCRE2_UTF is not set, they may contain only ASCII
alphanumeric characters and underscores, but must start with a non-digit. When
PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode
letter or Unicode decimal digit.
In other words, group names must match one of
these patterns:
<pre>
  ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set
  ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set
</pre>
References to capture groups from other parts of the pattern, such as
<a href="#backreferences">backreferences,</a>
<a href="#recursion">recursion,</a>
and
<a href="#conditions">conditions,</a>
can all be made by name as well as by number.
</P>
<P>
Named capture groups are allocated numbers as well as names, exactly as
if the names were not present. In both PCRE2 and Perl, capture groups
are primarily identified by numbers; any names are just aliases for these
numbers. The PCRE2 API provides function calls for extracting the complete
name-to-number translation table from a compiled pattern, as well as
convenience functions for extracting captured substrings by name.
</P>
<P>
<b>Warning:</b> When more than one capture group has the same number, as
described in the previous section, a name given to one of them applies to all
of them. Perl allows identically numbered groups to have different names.
Consider this pattern, where there are two capture groups, both numbered 1:
<pre>
  (?|(?<AA>aa)|(?<BB>bb))
</pre>
Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
a successful match, both names yield the same value (either "aa" or "bb").
</P>
<P>
In an attempt to reduce confusion, PCRE2 does not allow the same group number
to be associated with more than one name. The example above provokes a
compile-time error. However, there is still scope for confusion. Consider this
pattern:
<pre>
  (?|(?<AA>aa)|(bb))
</pre>
Although the second group number 1 is not explicitly named, the name AA is
still an alias for any group 1. Whether the pattern matches "aa" or "bb", a
reference by name to group AA yields the matched string.
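The basic idea that a name is only an alias for a group number can be sketched
with Python's re module. This is an illustration in Python, not PCRE2 (chosen
only so that the example runs without the PCRE2 library); re accepts the
(?P<name>...) syntax and numbers a named group exactly as if the name were
absent. It has no branch reset groups, so the AA example above is not
reproduced. The pattern and the subject string are hypothetical.

```python
import re

# Hypothetical pattern: two named groups followed by one unnamed group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
match = pattern.fullmatch("2024-05-17")

# Name-to-number table of the compiled pattern (Python's rough analogue of
# the translation table that the PCRE2 API can extract).
name_table = dict(pattern.groupindex)

# A name is only an alias: lookup by name and by number yields the same text.
by_name = match.group("month")
by_number = match.group(name_table["month"])
unnamed_day = match.group(3)
```

The unnamed third group still receives the number 3, because numbering ignores
whether earlier groups carry names.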
</P>
<P>
By default, a name must be unique within a pattern, except that duplicate names
are permitted for groups with the same number, for example:
<pre>
  (?|(?<AA>aa)|(?<AA>bb))
</pre>
The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
option at compile time, or by the use of (?J) within the pattern, as described
in the section entitled
<a href="#internaloptions">"Internal Option Setting"</a>
above.
</P>
<P>
Duplicate names can be useful for patterns where only one instance of the named
capture group can match. Suppose you want to match the name of a weekday,
either as a 3-letter abbreviation or as the full name, and in both cases you
want to extract the abbreviation. This pattern (ignoring the line breaks) does
the job:
<pre>
  (?J)
  (?<DN>Mon|Fri|Sun)(?:day)?|
  (?<DN>Tue)(?:sday)?|
  (?<DN>Wed)(?:nesday)?|
  (?<DN>Thu)(?:rsday)?|
  (?<DN>Sat)(?:urday)?
</pre>
There are five capture groups, but only one is ever set after a match. The
convenience functions for extracting the data by name return the substring for
the first (and in this example, the only) group of that name that matched.
This saves searching to find which numbered group it was. (An alternative way of
solving this problem is to use a "branch reset" group, as described in the
previous section.)
</P>
<P>
If you make a backreference to a non-unique named group from elsewhere in the
pattern, the groups to which the name refers are checked in the order in which
they appear in the overall pattern. The first one that is set is used for the
reference. For example, this pattern matches both "foofoo" and "barbar" but not
"foobar" or "barfoo":
<pre>
  (?J)(?:(?<n>foo)|(?<n>bar))\k<n>
</pre>
</P>
<P>
If you make a subroutine call to a non-unique named group, the one that
corresponds to the first occurrence of the name is used. In the absence of
duplicate numbers this is the one with the lowest number.
</P>
<P>
If you use a named reference in a condition
test (see the
<a href="#conditions">section about conditions</a>
below), either to check whether a capture group has matched, or to check for
recursion, all groups with the same name are tested. If the condition is true
for any one of them, the overall condition is true. This is the same behaviour
as testing by number.
For further details of the interfaces for handling named
capture groups, see the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<br><a name="SEC17" href="#TOC1">REPETITION</a><br>
<P>
Repetition is specified by quantifiers, which may follow any one of these
items:
<pre>
  a literal data character
  the dot metacharacter
  the \C escape sequence
  the \R escape sequence
  the \X escape sequence
  any escape sequence that matches a single character
  a character class
  a backreference
  a parenthesized group (including lookaround assertions)
  a subroutine call (recursive or otherwise)
</pre>
If a quantifier does not follow a repeatable item, an error occurs. The
general repetition quantifier specifies a minimum and maximum number of
permitted matches by giving two numbers in curly brackets (braces), separated
by a comma. The numbers must be less than 65536, and the first must be less
than or equal to the second. For example,
<pre>
  z{2,4}
</pre>
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
character.
If the second number is omitted, but the comma is present, there is
no upper limit; if the second number and the comma are both omitted, the
quantifier specifies an exact number of required matches. Thus
<pre>
  [aeiou]{3,}
</pre>
matches at least 3 successive vowels, but may match many more, whereas
<pre>
  \d{8}
</pre>
matches exactly 8 digits. If the first number is omitted, the lower limit is
taken as zero; in this case the upper limit must be present.
<pre>
  X{,4} is interpreted as X{0,4}
</pre>
This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
earlier versions such a sequence was not interpreted as a quantifier. Other
regular expression engines may behave either way.
</P>
<P>
If the characters that follow an opening brace do not match the syntax of a
quantifier, the brace is taken as a literal character. In particular, this
means that {,} is a literal string of three characters.
</P>
<P>
Note that not every opening brace is potentially the start of a quantifier
because braces are used in other items such as \N{U+345} or \k{name}.
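The three basic forms of the general quantifier can be tried out with a short
sketch. It uses Python's re module rather than PCRE2 (an assumption made so
that it runs without the PCRE2 library), and it is limited to {n,m}, {n,} and
{n}, which behave the same way in both engines. The treatment of {,} and of
other brace edge cases differs between engines and is deliberately not shown.
The subject strings are made up.

```python
import re

# z{2,4}: between two and four repetitions.
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
two_to_four = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# [aeiou]{3,}: at least three successive vowels, with no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits, no fewer and no more.
eight_ok = re.fullmatch(r"\d{8}", "20240517") is not None
nine_ok = re.fullmatch(r"\d{8}", "202405170") is not None
```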
</P>
<P>
In UTF modes, quantifiers apply to characters rather than to individual code
units. Thus, for example, \x{100}{2} matches two characters, each of
which is represented by a two-byte sequence in a UTF-8 string. Similarly,
\X{3} matches three Unicode extended grapheme clusters, each of which may be
several code units long (and they may be of different lengths).
</P>
<P>
The quantifier {0} is permitted, causing the expression to behave as if the
previous item and the quantifier were not present. This may be useful for
capture groups that are referenced as
<a href="#groupsassubroutines">subroutines</a>
from elsewhere in the pattern (but see also the section entitled
<a href="#subdefine">"Defining capture groups for use by reference only"</a>
below). Except for parenthesized groups, items that have a {0} quantifier are
omitted from the compiled pattern.
</P>
<P>
For convenience, the three most common quantifiers have single-character
abbreviations:
<pre>
  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}
</pre>
It is possible to construct infinite loops by following a group that can match
no characters with a quantifier that has no upper limit, for example:
<pre>
  (a?)*
</pre>
Earlier versions of Perl and PCRE1 used to give an error at compile time for
such patterns. However, because there are cases where this can be useful, such
patterns are now accepted, but whenever an iteration of such a group matches no
characters, matching moves on to the next item in the pattern instead of
repeatedly matching an empty string. This does not prevent backtracking into
any of the iterations if a subsequent item fails to match.
</P>
<P>
By default, quantifiers are "greedy", that is, they match as much as possible
(up to the maximum number of permitted repetitions), without causing the rest
of the pattern to fail. The classic example of where this gives problems is in
trying to match comments in C programs. These appear between /* and */ and
within the comment, individual * and / characters may appear.
An attempt to
match C comments by applying the pattern
<pre>
  /\*.*\*/
</pre>
to the string
<pre>
  /* first comment */  not comment  /* second comment */
</pre>
fails, because it matches the entire string owing to the greediness of the .*
item. However, if a quantifier is followed by a question mark, it ceases to be
greedy, and instead matches the minimum number of times possible, so the
pattern
<pre>
  /\*.*?\*/
</pre>
does the right thing with C comments. The meaning of the various quantifiers is
not otherwise changed, just the preferred number of matches. Do not confuse
this use of question mark with its use as a quantifier in its own right.
Because it has two uses, it can sometimes appear doubled, as in
<pre>
  \d??\d
</pre>
which matches one digit by preference, but can match two if that is the only
way the rest of the pattern matches.
</P>
<P>
If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
the quantifiers are not greedy by default, but individual ones can be made
greedy by following them with a question mark. In other words, it inverts the
default behaviour.
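The contrast between greedy and minimizing quantifiers in the C comment example
can be reproduced with a small sketch. It is written for Python's re module,
not PCRE2 (an assumption made so that it runs without the PCRE2 library); the
*, *? and ?? quantifiers used here behave the same way in both engines. Python
has no counterpart of PCRE2_UNGREEDY, so that option is not shown.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject, so there is a single match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ it can reach, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```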
</P>
<P>
When a parenthesized group is quantified with a minimum repeat count that
is greater than 1 or with a limited maximum, more memory is required for the
compiled pattern, in proportion to the size of the minimum or maximum.
</P>
<P>
If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
implicitly anchored, because whatever follows will be tried against every
character position in the subject string, so there is no point in retrying the
overall match at any position after the first. PCRE2 normally treats such a
pattern as though it were preceded by \A.
</P>
<P>
In cases where it is known that the subject string contains no newlines, it is
worth setting PCRE2_DOTALL in order to obtain this optimization, or
alternatively, using ^ to indicate anchoring explicitly.
</P>
<P>
However, there are some cases where the optimization cannot be used. When .*
is inside capturing parentheses that are the subject of a backreference
elsewhere in the pattern, a match at the start may fail where a later one
succeeds.
Consider, for example:
<pre>
  (.*)abc\1
</pre>
If the subject is "xyz123abc123" the match point is the fourth character. For
this reason, such a pattern is not implicitly anchored.
</P>
<P>
Another case where implicit anchoring is not applied is when the leading .* is
inside an atomic group. Once again, a match at the start may fail where a later
one succeeds. Consider this pattern:
<pre>
  (?>.*?a)b
</pre>
It matches "ab" in the subject "aab". The use of the backtracking control verbs
(*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
</P>
<P>
When a capture group is repeated, the value captured is the substring that
matched the final iteration. For example, after
<pre>
  (tweedle[dume]{3}\s*)+
</pre>
has matched "tweedledum tweedledee" the value of the captured substring is
"tweedledee". However, if there are nested capture groups, the corresponding
captured values may have been set in previous iterations. For example, after
<pre>
  (a|(b))+
</pre>
matches "aba" the value of the second captured substring is "b".
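The capture values described in this section can be checked with a short
sketch. It uses Python's re module rather than PCRE2 (an assumption made so
that it runs without the PCRE2 library); for these three patterns Python's
backtracking matcher reports the same captures and the same match position as
the text describes.

```python
import re

# A repeated group keeps the substring matched by its final iteration.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee").group(1)

# The nested group was set in the second iteration and remains set afterwards.
outer, inner = re.fullmatch(r"(a|(b))+", "aba").groups()

# (.*)abc\1 cannot match at offset 0; the match starts at offset 3, which is
# the fourth character of the subject.
start = re.search(r"(.*)abc\1", "xyz123abc123").start()
```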
<a name="atomicgroup"></a></P>
<br><a name="SEC18" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
<P>
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
repetition, failure of what follows normally causes the repeated item to be
re-evaluated to see if a different number of repeats allows the rest of the
pattern to match. Sometimes it is useful to prevent this, either to change the
nature of the match, or to cause it to fail earlier than it otherwise might,
when the author of the pattern knows there is no point in carrying on.
</P>
<P>
Consider, for example, the pattern \d+foo when applied to the subject line
<pre>
  123456bar
</pre>
After matching all 6 digits and then failing to match "foo", the normal
action of the matcher is to try again with only 5 digits matching the \d+
item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
(a term taken from Jeffrey Friedl's book) provides the means for specifying
that once a group has matched, it is not to be re-evaluated in this way.
</P>
<P>
If we use atomic grouping for the previous example, the matcher gives up
immediately on failing to match "foo" the first time.
The notation is a kind of
special parenthesis, starting with (?> as in this example:
<pre>
  (?>\d+)foo
</pre>
Perl 5.28 introduced an experimental alphabetic form starting with (* which may
be easier to remember:
<pre>
  (*atomic:\d+)foo
</pre>
This kind of parenthesized group "locks up" the part of the pattern it contains
once it has matched, and a failure further into the pattern is prevented from
backtracking into it. Backtracking past it to previous items, however, works as
normal.
</P>
<P>
An alternative description is that a group of this type matches exactly the
string of characters that an identical standalone pattern would match, if
anchored at the current point in the subject string.
</P>
<P>
Atomic groups are not capture groups. Simple cases such as the above example
can be thought of as a maximizing repeat that must swallow everything it can.
So, while both \d+ and \d+? are prepared to adjust the number of digits they
match in order to make the rest of the pattern match, (?>\d+) can only match
an entire sequence of digits.
</P>
<P>
Atomic groups in general can of course contain arbitrarily complicated
expressions, and can be nested.
However, when the contents of an atomic
group are just a single repeated item, as in the example above, a simpler
notation, called a "possessive quantifier", can be used. This consists of an
additional + character following a quantifier. Using this notation, the
previous example can be rewritten as
<pre>
  \d++foo
</pre>
Note that a possessive quantifier can be used with an entire group, for
example:
<pre>
  (abc|xyz){2,3}+
</pre>
Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
option is ignored. They are a convenient notation for the simpler forms of
atomic group. However, there is no difference in the meaning of a possessive
quantifier and the equivalent atomic group, though there may be a performance
difference; possessive quantifiers should be slightly faster.
</P>
<P>
The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
Jeffrey Friedl originated the idea (and the name) in the first edition of his
book. Mike McCloskey liked it, so implemented it when he built Sun's Java
package, and PCRE1 copied it from there. It found its way into Perl at release
5.10.
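The way an atomic group changes the nature of a match, and not only its speed,
can be sketched as follows. The sketch is written for Python's re module, not
PCRE2, and it does not use the (?>...) or ++ syntax at all, because older
Python versions lack them. Instead it relies on a widely used emulation, which
is an assumption of this example and not something PCRE2 needs: a lookahead
captures the longest run of digits and a backreference then consumes exactly
that run. The group name "run" and the subject string are hypothetical.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives back
# the final "5" so that the literal 5 in the pattern can match.
backtracking = re.search(r"\d+5", subject)

# Emulated atomic group: the lookahead captures the longest run of digits and
# the backreference consumes exactly that run. The matcher cannot re-enter the
# lookahead to try a shorter run, so the literal 5 never finds a digit left.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)5", subject)

backtracking_text = backtracking.group() if backtracking else None
```

In PCRE2 the same effect would be written directly as (?>\d+)5 or \d++5.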
</P>
<P>
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or by starting
the pattern with (*NO_AUTO_POSSESS).
</P>
<P>
When a pattern contains an unlimited repeat inside a group that can itself be
repeated an unlimited number of times, the use of an atomic group is the only
way to avoid some failing matches taking a very long time indeed. The pattern
<pre>
  (\D+|<\d+>)*[!?]
</pre>
matches an unlimited number of substrings that either consist of non-digits, or
digits enclosed in <>, followed by either ! or ?. When it matches, it runs
quickly. However, if it is applied to
<pre>
  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
</pre>
it takes a long time before reporting failure. This is because the string can
be divided between the internal \D+ repeat and the external * repeat in a
large number of ways, and all have to be tried. (The example uses [!?] rather
than a single character at the end, because both PCRE2 and Perl have an
optimization that allows for fast failure when a single character is used.
They
remember the last single character that is required for a match, and fail early
if it is not present in the string.) If the pattern is changed so that it uses
an atomic group, like this:
<pre>
  ((?>\D+)|<\d+>)*[!?]
</pre>
sequences of non-digits cannot be broken, and failure happens quickly.
<a name="backreferences"></a></P>
<br><a name="SEC19" href="#TOC1">BACKREFERENCES</a><br>
<P>
Outside a character class, a backslash followed by a digit greater than 0 (and
possibly further digits) is a backreference to a capture group earlier (that
is, to its left) in the pattern, provided there have been that many previous
capture groups.
</P>
<P>
However, if the decimal number following the backslash is less than 8, it is
always taken as a backreference, and causes an error only if there are not that
many capture groups in the entire pattern. In other words, the group that is
referenced need not be to the left of the reference for numbers less than 8. A
"forward backreference" of this type can make sense when a repetition is
involved and the group to the right has participated in an earlier iteration.
</P>
<P>
It is not possible to have a numerical "forward backreference" to a group whose
number is 8 or more using this syntax because a sequence such as \50 is
interpreted as a character defined in octal. See the subsection entitled
"Non-printing characters"
<a href="#digitsafterbackslash">above</a>
for further details of the handling of digits following a backslash. Other
forms of backreferencing do not suffer from this restriction. In particular,
there is no problem when named capture groups are used (see below).
</P>
<P>
Another way of avoiding the ambiguity inherent in the use of digits following a
backslash is to use the \g escape sequence. This escape must be followed by a
signed or unsigned number, optionally enclosed in braces. These examples are
all identical:
<pre>
  (ring), \1
  (ring), \g1
  (ring), \g{1}
</pre>
An unsigned number specifies an absolute reference without the ambiguity that
is present in the older syntax. It is also useful when literal digits follow
the reference. A signed number is a relative reference.
Consider this example:
<pre>
  (abc(def)ghi)\g{-1}
</pre>
The sequence \g{-1} is a reference to the capture group whose number is one
less than the number of the next group to be started, so in this example (where
the next group would be numbered 3) it is equivalent to \2, and \g{-2} would
be equivalent to \1. Note that if this construct is inside a capture group,
that group is included in the count, so in this example \g{-2} also refers to
group 1:
<pre>
  (A)(\g{-2}B)
</pre>
The use of relative references can be helpful in long patterns, and also in
patterns that are created by joining together fragments that contain references
within themselves.
</P>
<P>
The sequence \g{+1} is a reference to the next capture group that is started
after this item, and \g{+2} refers to the one after that, and so on. This kind
of forward reference can be useful in patterns that repeat. Perl does not
support the use of + in this way.
</P>
<P>
A backreference matches whatever actually most recently matched the capture
group in the current subject string, rather than anything at all that matches
the group (see
<a href="#groupsassubroutines">"Groups as subroutines"</a>
below for a way of doing that).
So the pattern
<pre>
  (sens|respons)e and \1ibility
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If caseful matching is in force at the time of the
backreference, the case of letters is relevant. For example,
<pre>
  ((?i)rah)\s+\1
</pre>
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
capture group is matched caselessly.
</P>
<P>
There are several different ways of writing backreferences to named capture
groups. The .NET syntax is \k{name}, the Python syntax is (?P=name), and the
original Perl syntax is \k<name> or \k'name'. All of these are now supported
by both Perl and PCRE2. Perl 5.10's unified backreference syntax, in which \g
can be used for both numeric and named references, is also supported by PCRE2.
We could rewrite the above example in any of the following ways:
<pre>
  (?<p1>(?i)rah)\s+\k<p1>
  (?'p1'(?i)rah)\s+\k{p1}
  (?P<p1>(?i)rah)\s+(?P=p1)
  (?<p1>(?i)rah)\s+\g{p1}
</pre>
A capture group that is referenced by name may appear in the pattern before or
after the reference.
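As a rough illustration of the two rules above (a backreference repeats the
text that was actually captured, and it is compared casefully unless caseless
matching is in force at that point), here is a minimal sketch that uses
Python's re module rather than PCRE2. It assumes that Python treats these
particular constructs the same way as PCRE2. The scoped form (?i:rah) is a
substitution made for Python, which in recent versions does not accept (?i) in
the middle of a pattern, and the helper name matches() is invented for the
example.

```python
import re

# Numbered backreference: \1 must repeat the text that group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")

# Named group with the Python-style named backreference (?P=p1). The scoped
# flag (?i:...) makes only the group caseless; the backreference is caseful.
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def matches(pattern, subject):
    """Hypothetical helper: True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None
```

Under these assumptions, matches(rah, "RAH RAH") is true and
matches(rah, "RAH rah") is false, because the backreference has to repeat
"RAH" exactly.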
</P>
<P>
There may be more than one backreference to the same group. If a group has not
actually been used in a particular match, backreferences to it always fail by
default. For example, the pattern
<pre>
  (a|(bc))\2
</pre>
always fails if it starts to match "a" rather than "bc". However, if the
PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
unset value matches an empty string.
</P>
<P>
Because there may be many capture groups in a pattern, all digits following a
backslash are taken as part of a potential backreference number. If the pattern
continues with a digit character, some delimiter must be used to terminate the
backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this
can be white space. Otherwise, the \g{} syntax or an empty comment (see
<a href="#comments">"Comments"</a>
below) can be used.
</P>
<br><b>
Recursive backreferences
</b><br>
<P>
A backreference that occurs inside the group to which it refers fails when the
group is first used, so, for example, (a\1) never matches. However, such
references can be useful inside repeated groups.
For example, the pattern
<pre>
  (a|b\1)+
</pre>
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
the group, the backreference matches the character string corresponding to the
previous iteration. In order for this to work, the pattern must be such that
the first iteration does not need to match the backreference. This can be done
using alternation, as in the example above, or by a quantifier with a minimum
of zero.
</P>
<P>
In versions of PCRE2 earlier than 10.25, backreferences of this type used to
cause the group that they reference to be treated as an
<a href="#atomicgroup">atomic group.</a>
This restriction no longer applies, and backtracking into such groups can occur
as normal.
<a name="bigassertions"></a></P>
<br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
<P>
An assertion is a test on the characters following or preceding the current
matching point that does not consume any characters. The simple assertions
coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
<a href="#smallassertions">above.</a>
</P>
<P>
More complicated assertions are coded as parenthesized groups.
There are two
kinds: those that look ahead of the current position in the subject string, and
those that look behind it, and in each case an assertion may be positive (must
match for the assertion to be true) or negative (must not match for the
assertion to be true). An assertion group is matched in the normal way,
and if it is true, matching continues after it, but with the matching position
in the subject string reset to what it was before the assertion was processed.
</P>
<P>
The Perl-compatible lookaround assertions are atomic. If an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic assertions can be
useful. PCRE2 has some support for these, described in the section entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
below, but they are not Perl-compatible.
</P>
<P>
A lookaround assertion may appear as the condition in a
<a href="#conditions">conditional group</a>
(see below). In this case, the result of matching the assertion determines
which branch of the condition is followed.
</P>
<P>
Assertion groups are not capture groups.
If an assertion contains capture
groups within it, these are counted for the purposes of numbering the capture
groups in the whole pattern. Within each branch of an assertion, locally
captured substrings may be referenced in the usual way. For example, a sequence
such as (.)\g{-1} can be used to check that two adjacent characters are the
same.
</P>
<P>
When a branch within an assertion fails to match, any substrings that were
captured are discarded (as happens with any pattern branch that fails to
match). A negative assertion is true only when all its branches fail to match;
this means that no captured substrings are ever retained after a successful
negative assertion. When an assertion contains a matching branch, what happens
depends on the type of assertion.
</P>
<P>
For a positive assertion, internally captured substrings in the successful
branch are retained, and matching continues with the next pattern item after
the assertion. For a negative assertion, a matching branch means that the
assertion is not true. If such an assertion is being used as a condition in a
<a href="#conditions">conditional group</a>
(see below), captured substrings are retained, because matching continues with
the "no" branch of the condition.
For other failing negative assertions,
control passes to the previous backtracking point, thus discarding any captured
strings within the assertion.
</P>
<P>
Most assertion groups may be repeated; though it makes no sense to assert the
same thing several times, the side effect of capturing in positive assertions
may occasionally be useful. However, an assertion that forms the condition for
a conditional group may not be quantified. PCRE2 used to restrict the
repetition of assertions, but from release 10.35 the only restriction is that
an unlimited maximum repetition is changed to be one more than the minimum. For
example, {3,} is treated as {3,4}.
</P>
<br><b>
Alphabetic assertion names
</b><br>
<P>
Traditionally, symbolic sequences such as (?= and (?<= have been used to
specify lookaround assertions. Perl 5.28 introduced some experimental
alphabetic alternatives which might be easier to remember. They all start with
(* instead of (? and must be written using lower case letters. PCRE2 supports
the following synonyms:
<pre>
  (*positive_lookahead:    or (*pla: is the same as (?=
  (*negative_lookahead:    or (*nla: is the same as (?!
  (*positive_lookbehind:   or (*plb: is the same as (?<=
  (*negative_lookbehind:   or (*nlb: is the same as (?<!
</pre>
For example, (*pla:foo) is the same assertion as (?=foo). In the following
sections, the various assertions are described using the original symbolic
forms.
</P>
<br><b>
Lookahead assertions
</b><br>
<P>
Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,
<pre>
  \w+(?=;)
</pre>
matches a word followed by a semicolon, but does not include the semicolon in
the match, and
<pre>
  foo(?!bar)
</pre>
matches any occurrence of "foo" that is not followed by "bar". Note that the
apparently similar pattern
<pre>
  (?!foo)bar
</pre>
does not find an occurrence of "bar" that is preceded by something other than
"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
(?!foo) is always true when the next three characters are "bar". A
lookbehind assertion is needed to achieve the other effect.
</P>
<P>
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!)
because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
<a name="lookbehind"></a></P>
<br><b>
Lookbehind assertions
</b><br>
<P>
Lookbehind assertions start with (?<= for positive assertions and (?<! for
negative assertions. For example,
<pre>
  (?<!foo)bar
</pre>
does find an occurrence of "bar" that is not preceded by "foo". The contents of
a lookbehind assertion are restricted such that there must be a known maximum
to the lengths of all the strings it matches. There are two cases:
</P>
<P>
If every top-level alternative matches a fixed length, for example
<pre>
  (?<=colour|color)
</pre>
there is a limit of 65535 characters to the lengths, which do not have to be
the same, as this example demonstrates. This is the only kind of lookbehind
supported by PCRE2 versions earlier than 10.43 and by the alternative matching
function <b>pcre2_dfa_match()</b>.
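The simple lookahead and lookbehind examples above can be tried out with
Python's re module, used here only as a stand-in for PCRE2 on the assumption
that it treats these basic assertions the same way. The variable names are
invented for the sketch. Python's lookbehind is more restrictive than the one
described here (every alternative must have the same width), so the
(?<=colour|color) example is deliberately left out.

```python
import re

# (?!foo)bar: the lookahead inspects the same three characters that "bar"
# then consumes, so it is true whenever those characters are "bar".
lookahead = re.compile(r"(?!foo)bar")

# (?<!foo)bar: the lookbehind inspects the three characters before "bar".
lookbehind = re.compile(r"(?<!foo)bar")

subject = "foobar"
ahead_match = lookahead.search(subject)    # matches "bar" at offsets 3..6
behind_match = lookbehind.search(subject)  # None: "bar" is preceded by "foo"

# foo(?!bar) accepts "foo" only when "bar" does not follow it.
foo_not_bar = re.compile(r"foo(?!bar)")
foobaz_match = foo_not_bar.search("foobar foobaz")  # the "foo" of "foobaz"

# \w+(?=;) leaves the semicolon out of the matched text.
word_before_semicolon = re.search(r"\w+(?=;)", "total;").group(0)
```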
</P>
<P>
In PCRE2 10.43 and later, <b>pcre2_match()</b> supports lookbehind assertions in
which one or more top-level alternatives can match more than one string length,
for example
<pre>
  (?<=colou?r)
</pre>
The maximum matching length for any branch of the lookbehind is limited to a
value set by the calling program (default 255 characters). Unlimited repetition
(for example \d*) is not supported. In some cases, the escape sequence \K
<a href="#resetmatchstart">(see above)</a>
can be used instead of a lookbehind assertion at the start of a pattern to get
round the length limit restriction.
</P>
<P>
In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which matches a
single code unit even in a UTF mode) to appear in lookbehind assertions,
because it makes it impossible to calculate the length of the lookbehind. The
\X and \R escapes, which can match different numbers of code units, are never
permitted in lookbehinds.
</P>
<P>
<a href="#groupsassubroutines">"Subroutine"</a>
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
as the called capture group matches a limited-length string.
However,
<a href="#recursion">recursion,</a>
that is, a "subroutine" call into a group that is already active,
is not supported.
</P>
<P>
PCRE2 supports backreferences in lookbehinds, but only if certain conditions
are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no
use of (?| in the pattern (it creates duplicate group numbers), and if the
backreference is by name, the name must be unique. Of course, the referenced
group must itself match a limited-length substring. The following pattern
matches words containing at least two characters that begin and end with the
same character:
<pre>
  \b(\w)\w++(?<=\1)
</PRE>
</P>
<P>
Possessive quantifiers can be used in conjunction with lookbehind assertions to
specify efficient matching at the end of subject strings. Consider a simple
pattern such as
<pre>
  abcd$
</pre>
when applied to a long string that does not match. Because matching proceeds
from left to right, PCRE2 will look for each "a" in the subject and then see if
what follows matches the rest of the pattern.
If the pattern is specified as
<pre>
  ^.*abcd$
</pre>
the initial .* matches the entire string at first, but when this fails (because
there is no following "a"), it backtracks to match all but the last character,
then all but the last two characters, and so on. Once again the search for "a"
covers the entire string, from right to left, so we are no better off. However,
if the pattern is written as
<pre>
  ^.*+(?<=abcd)
</pre>
there can be no backtracking for the .*+ item because of the possessive
quantifier; it can match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If it fails, the
match fails immediately. For long strings, this approach makes a significant
difference to the processing time.
</P>
<br><b>
Using multiple assertions
</b><br>
<P>
Several assertions (of any sort) may occur in succession. For example,
<pre>
  (?<=\d{3})(?<!999)foo
</pre>
matches "foo" preceded by three digits that are not "999". Notice that each of
the assertions is applied independently at the same point in the subject
string.
First there is a check that the previous three characters are all
digits, and then there is a check that the same three characters are not "999".
This pattern does <i>not</i> match "foo" preceded by six characters, the first
three of which are digits and the last three of which are not "999". For
example, it doesn't match "123abcfoo". A pattern to do that is
<pre>
  (?<=\d{3}...)(?<!999)foo
</pre>
This time the first assertion looks at the preceding six characters, checking
that the first three are digits, and then the second assertion checks that the
preceding three characters are not "999".
</P>
<P>
Assertions can be nested in any combination. For example,
<pre>
  (?<=(?<!foo)bar)baz
</pre>
matches an occurrence of "baz" that is preceded by "bar" which in turn is not
preceded by "foo", while
<pre>
  (?<=\d{3}(?!999)...)foo
</pre>
is another pattern that matches "foo" preceded by three digits and any three
characters that are not "999".
<a name="nonatomicassertions"></a></P>
<br><a name="SEC21" href="#TOC1">NON-ATOMIC ASSERTIONS</a><br>
<P>
Traditional lookaround assertions are atomic.
That is, if an assertion is true,
but there is a subsequent matching failure, there is no backtracking into the
assertion. However, there are some cases where non-atomic positive assertions
can be useful. PCRE2 provides these using the following syntax:
<pre>
  (*non_atomic_positive_lookahead:   or (*napla: or (?*
  (*non_atomic_positive_lookbehind:  or (*naplb: or (?<*
</pre>
Consider the problem of finding the right-most word in a string that also
appears earlier in the string, that is, it must appear at least twice in total.
This pattern returns the required result as captured substring 1:
<pre>
  ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2}
</pre>
For a subject such as "word1 word2 word3 word2 word3 word4" the result is
"word3". How does it work? At the start, ^(?x) anchors the pattern and sets the
"x" option, which causes white space (introduced for readability) to be
ignored. Inside the assertion, the greedy .* at first consumes the entire
string, but then has to backtrack until the rest of the assertion can match a
word, which is captured by group 1. In other words, when the assertion first
succeeds, it captures the right-most word in the string.
</P>
<P>
The current matching point is then reset to the start of the subject, and the
rest of the pattern match checks for two occurrences of the captured word,
using an ungreedy .*? to scan from the left. If this succeeds, we are done, but
if the last word in the string does not occur twice, this part of the pattern
fails. If a traditional atomic lookahead (?= or (*pla: had been used, the
assertion could not be re-entered, and the whole match would fail. The pattern
would succeed only if the very last word in the subject was found twice.
</P>
<P>
Using a non-atomic lookahead, however, means that when the last word does not
occur twice in the string, the lookahead can backtrack and find the second-last
word, and so on, until either the match succeeds, or all words have been
tested.
</P>
<P>
Two conditions must be met for a non-atomic assertion to be useful: the
contents of one or more capturing groups must change after a backtrack into the
assertion, and there must be a backreference to a changed group later in the
pattern. If this is not the case, the rest of the pattern match fails exactly
as before because nothing has changed, so using a non-atomic assertion just
wastes resources.
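The right-to-left search described above can be mimicked in plain Python to
make the order of candidates explicit. This is only a sketch of the result the
pattern is meant to produce, not of how PCRE2 works internally: Python's re
module has no non-atomic assertions, so an ordinary loop stands in for the
backtracking into the assertion. The function name is invented for the
example, and a single-line subject is assumed.

```python
import re


def rightmost_repeated_word(subject):
    """Return the right-most word that occurs at least twice, or None."""
    # Candidate words in the order the assertion would capture them when it
    # is re-entered: last word first, then the second-last, and so on.
    words = re.findall(r"\w+", subject)
    for candidate in reversed(words):
        # Counterpart of (?> .*? \b\1\b ){2}: two whole-word occurrences.
        occurrences = re.findall(r"\b" + re.escape(candidate) + r"\b", subject)
        if len(occurrences) >= 2:
            return candidate
    return None
```

With the subject used above, "word4" is tried first and rejected because it
occurs only once; "word3" is tried next and accepted.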
</P>
<P>
There is one exception to backtracking into a non-atomic assertion. If an
(*ACCEPT) control verb is triggered, the assertion succeeds atomically. That
is, a subsequent match failure cannot backtrack into the assertion.
</P>
<P>
Non-atomic assertions are not supported by the alternative matching function
<b>pcre2_dfa_match()</b>. They are supported by JIT, but only if they do not
contain any control verbs such as (*ACCEPT). (This may change in the future.)
Note that assertions that appear as conditions for
<a href="#conditions">conditional groups</a>
(see below) must be atomic.
</P>
<br><a name="SEC22" href="#TOC1">SCRIPT RUNS</a><br>
<P>
In concept, a script run is a sequence of characters that are all from the same
Unicode script such as Latin or Greek. However, because some scripts are
commonly used together, and because some diacritical and other marks are used
with multiple scripts, it is not that simple. There is a full description of
the rules that PCRE2 uses in the section entitled
<a href="pcre2unicode.html#scriptruns">"Script Runs"</a>
in the
<a href="pcre2unicode.html"><b>pcre2unicode</b></a>
documentation.
</P>
<P>
If part of a pattern is enclosed between (*script_run: or (*sr: and a closing
parenthesis, it fails if the sequence of characters that it matches is not a
script run. After a failure, normal backtracking occurs. Script runs can be
used to detect spoofing attacks using characters that look the same, but are
from different scripts. The string "paypal.com" is an infamous example, where
the letters could be a mixture of Latin and Cyrillic. This pattern ensures that
the matched characters in a sequence of non-spaces that follow white space are
a script run:
<pre>
  \s+(*sr:\S+)
</pre>
To be sure that they are all from the Latin script (for example), a lookahead
can be used:
<pre>
  \s+(?=\p{Latin})(*sr:\S+)
</pre>
This works as long as the first character is expected to be a character in that
script, and not (for example) punctuation, which is allowed with any script. If
this is not the case, a more creative lookahead is needed. For example, if
digits, underscore, and dots are permitted at the start:
<pre>
  \s+(?=[0-9_.]*\p{Latin})(*sr:\S+)

</PRE>
</P>
<P>
In many cases, backtracking into a script run pattern fragment is not
desirable.
The script run can employ an atomic group to prevent this. Because
this is a common requirement, a shorthand notation is provided by
(*atomic_script_run: or (*asr:
<pre>
  (*asr:...) is the same as (*sr:(?>...))
</pre>
Note that the atomic group is inside the script run. Putting it outside would
not prevent backtracking into the script run pattern.
</P>
<P>
Support for script runs is not available if PCRE2 is compiled without Unicode
support. A compile-time error is given if any of the above constructs is
encountered. Script runs are not supported by the alternate matching function,
<b>pcre2_dfa_match()</b> because they use the same mechanism as capturing
parentheses.
</P>
<P>
<b>Warning:</b> The (*ACCEPT) control verb
<a href="#acceptverb">(see below)</a>
should not be used within a script run group, because it causes an immediate
exit from the group, bypassing the script run checking.
<a name="conditions"></a></P>
<br><a name="SEC23" href="#TOC1">CONDITIONAL GROUPS</a><br>
<P>
It is possible to cause the matching process to obey a pattern fragment
conditionally or to choose between two alternative fragments, depending on
the result of an assertion, or whether a specific capture group has
already been matched.
The two possible forms of conditional group are:
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
</pre>
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
string (it always matches). If there are more than two alternatives in the
group, a compile-time error occurs. Each of the two alternatives may itself
contain nested groups of any form, including conditional groups; the
restriction to two alternatives applies only at the level of the condition
itself. This pattern fragment is an example where the alternatives are complex:
<pre>
  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )

</PRE>
</P>
<P>
There are five kinds of condition: references to capture groups, references to
recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
</P>
<br><b>
Checking for a used capture group by number
</b><br>
<P>
If the text between the parentheses consists of a sequence of digits, the
condition is true if a capture group of that number has previously matched.
If
there is more than one capture group with the same number (see the earlier
<a href="#dupgroupnumber">section about duplicate group numbers</a>),
the condition is true if any of them have matched. An alternative notation,
which is a PCRE2 extension, not supported by Perl, is to precede the digits
with a plus or minus sign. In this case, the group number is relative rather
than absolute. The most recently opened capture group (which could be enclosing
this condition) can be referenced by (?(-1), the next most recent by (?(-2),
and so on. Inside loops it can also make sense to refer to subsequent groups.
The next capture group to be opened can be referenced as (?(+1), and so on. The
value zero in any of these forms is not used; it provokes a compile-time error.
</P>
<P>
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
three parts for ease of discussion:
<pre>
  ( \( )? [^()]+ (?(1) \) )
</pre>
The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The second part
matches one or more characters that are not parentheses. The third part is a
conditional group that tests whether or not the first capture group
matched.
If it did, that is, if the subject started with an opening parenthesis,
the condition is true, and so the yes-pattern is executed and a closing
parenthesis is required. Otherwise, since no-pattern is not present, the
conditional group matches nothing. In other words, this pattern matches a
sequence of non-parentheses, optionally enclosed in parentheses.
</P>
<P>
If you were embedding this pattern in a larger one, you could use a relative
reference:
<pre>
  ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
</pre>
This makes the fragment independent of the parentheses in the larger pattern.
</P>
<br><b>
Checking for a used capture group by name
</b><br>
<P>
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
capture group by name. For compatibility with earlier versions of PCRE1, which
had this facility before Perl, the syntax (?(name)...) is also recognized.
Note, however, that undelimited names consisting of the letter R followed by
digits are ambiguous (see the following section). Rewriting the above example
to use a named group gives this:
<pre>
  (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
</pre>
If the name used in a condition of this kind is a duplicate, the test is
applied to all groups of the same name, and is true if any one of them has
matched.
</P>
<br><b>
Checking for pattern recursion
</b><br>
<P>
"Recursion" in this sense refers to any subroutine-like call from one part of
the pattern to another, whether or not it is actually recursive. See the
sections entitled
<a href="#recursion">"Recursive patterns"</a>
and
<a href="#groupsassubroutines">"Groups as subroutines"</a>
below for details of recursion and subroutine calls.
</P>
<P>
If a condition is the string (R), and there is no capture group with the name
R, the condition is true if matching is currently in a recursion or subroutine
call to the whole pattern or any capture group. If digits follow the letter R,
and there is no group with that name, the condition is true if the most recent
call is into a group with the given number, which must exist somewhere in the
overall pattern.
This is a contrived example that is equivalent to a+b:
<pre>
  ((?(R1)a+|(?1)b))
</pre>
However, in both cases, if there is a capture group with a matching name, the
condition tests for its being set, as described in the section above, instead
of testing for recursion. For example, creating a group with the name R1 by
adding (?<R1>) to the above pattern completely changes its meaning.
</P>
<P>
If a name preceded by ampersand follows the letter R, for example:
<pre>
  (?(R&name)...)
</pre>
the condition is true if the most recent recursion is into a group of that name
(which must exist within the pattern).
</P>
<P>
This condition does not check the entire recursion stack. It tests only the
current level. If the name used in a condition of this kind is a duplicate, the
test is applied to all groups of the same name, and is true if any one of
them is the most recent recursion.
</P>
<P>
At "top level", all these recursion test conditions are false.
<a name="subdefine"></a></P>
<br><b>
Defining capture groups for use by reference only
</b><br>
<P>
If the condition is the string (DEFINE), the condition is always false, even if
there is a group with the name DEFINE. In this case, there may be only one
alternative in the rest of the conditional group. It is always skipped if
control reaches this point in the pattern; the idea of DEFINE is that it can be
used to define subroutines that can be referenced from elsewhere. (The use of
<a href="#groupsassubroutines">subroutines</a>
is described below.) For example, a pattern to match an IPv4 address such as
"192.168.23.245" could be written like this (ignore white space and line
breaks):
<pre>
  (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b
</pre>
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
pattern is skipped because DEFINE acts like a false condition. The rest of the
pattern uses references to the named group to match the four dot-separated
components of an IPv4 address, insisting on a word boundary at each end.
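</P>
<P>
The idea of defining a fragment once and reusing it can be tried without PCRE2.
Python's standard re module has no DEFINE condition and no (?&name) calls, so
the sketch below is only a rough analogue, not PCRE2 syntax: the "byte"
subpattern is held in a string and spliced into the main pattern at the places
where the PCRE2 version calls (?&byte). The names BYTE, IPV4 and find_ipv4 are
illustrative and do not come from PCRE2.

```python
import re

# Rough analogue of the DEFINE example, using Python's re rather than PCRE2.
# BYTE stands in for the named group "byte": one IPv4 component (0-255).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Splice the fragment in where the PCRE2 pattern would call (?&byte).
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None
```

As with DEFINE, the fragment is written once and takes no part in matching
until something refers to it.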
</P>
<br><b>
Checking the PCRE2 version
</b><br>
<P>
Programs that link with a PCRE2 library can check the version by calling
<b>pcre2_config()</b> with appropriate arguments. Users of applications that do
not have access to the underlying code cannot do this. A special "condition"
called VERSION exists to allow such users to discover which version of PCRE2
they are dealing with by using this condition to match a string such as
"yesno". VERSION must be followed either by "=" or ">=" and a version number.
For example:
<pre>
  (?(VERSION>=10.4)yes|no)
</pre>
This pattern matches "yes" if the PCRE2 version is greater than or equal to
10.4, or "no" otherwise. The fractional part of the version number may not
contain more than two digits.
</P>
<br><b>
Assertion conditions
</b><br>
<P>
If the condition is not in any of the above formats, it must be a parenthesized
assertion. This may be a positive or negative lookahead or lookbehind
assertion.
However, it must be a traditional atomic assertion, not one of the
<a href="#nonatomicassertions">non-atomic assertions.</a>
</P>
<P>
Consider this pattern, again containing non-significant white space, and with
the two alternatives on the second line:
<pre>
  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
</pre>
The condition is a positive lookahead assertion that matches an optional
sequence of non-letters followed by a letter. In other words, it tests for the
presence of at least one letter in the subject. If a letter is found, the
subject is matched against the first alternative; otherwise it is matched
against the second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
</P>
<P>
When an assertion that is a condition contains capture groups, any
capturing that occurs in a matching branch is retained afterwards, for both
positive and negative assertions, because matching always continues after the
assertion, whether it succeeds or fails. (Compare non-conditional assertions,
for which captures are retained only for positive assertions that succeed.)
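</P>
<P>
To make the control flow of the date example concrete, here is a sketch in
Python. Python's standard re module does not accept an assertion as a
condition, so this uses a substitute technique: each alternative is guarded by
the lookahead or by its negation. Unlike a real assertion condition, this
evaluates the lookahead up to twice. The names DATE and is_date are
illustrative.

```python
import re

# Emulation of (?(?=[^a-z]*[a-z]) yes | no) for an engine without assertion
# conditions: guard "yes" with the lookahead and "no" with its negation.
DATE = re.compile(
    r"""(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
          | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2}      # no letter lies ahead
        )""",
    re.VERBOSE,
)


def is_date(s):
    """Return True if s is exactly of the form dd-aaa-dd or dd-dd-dd."""
    return DATE.fullmatch(s) is not None
```

A subject such as "12-34-5x" is rejected: the lookahead sees a letter, so only
the first alternative is eligible, and it fails.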
<a name="comments"></a></P>
<br><a name="SEC24" href="#TOC1">COMMENTS</a><br>
<P>
There are two ways of including comments in patterns that are processed by
PCRE2. In both cases, the start of the comment must not be in a character
class, nor in the middle of any other sequence of related characters such as
(?: or a group name or number. The characters that make up a comment play
no part in the pattern matching.
</P>
<P>
The sequence (?# marks the start of a comment that continues up to the next
closing parenthesis. Nested parentheses are not permitted. If the
PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
also introduces a comment, which in this case continues to immediately after
the next newline character or character sequence in the pattern. Which
characters are interpreted as newlines is controlled by an option passed to the
compiling function or by a special sequence at the start of the pattern, as
described in the section entitled
<a href="#newlines">"Newline conventions"</a>
above. Note that the end of this type of comment is a literal newline sequence
in the pattern; escape sequences that happen to represent a newline do not
count.
For example, consider this pattern when PCRE2_EXTENDED is set, and the
default newline convention (a single linefeed character) is in force:
<pre>
  abc #comment \n still comment
</pre>
On encountering the # character, <b>pcre2_compile()</b> skips along, looking for
a newline in the pattern. The sequence \n is still literal at this stage, so
it does not terminate the comment. Only an actual character with the code value
0x0a (the default newline) does so.
<a name="recursion"></a></P>
<br><a name="SEC25" href="#TOC1">RECURSIVE PATTERNS</a><br>
<P>
Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best that can
be done is to use a pattern that matches up to some fixed depth of nesting. It
is not possible to handle an arbitrary nesting depth.
</P>
<P>
For some time, Perl has provided a facility that allows regular expressions to
recurse (amongst other things). It does this by interpolating Perl code in the
expression at run time, and the code can refer to the expression itself.
A Perl
pattern using code interpolation to solve the parentheses problem can be
created like this:
<pre>
  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
</pre>
The (?p{...}) item interpolates Perl code at run time, and in this case refers
recursively to the pattern in which it appears.
</P>
<P>
Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual capture group recursion. After its introduction in PCRE1 and Python,
this kind of recursion was subsequently introduced into Perl at release 5.10.
</P>
<P>
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive subroutine call of the capture group of the
given number, provided that it occurs inside that group. (If not, it is a
<a href="#groupsassubroutines">non-recursive subroutine</a>
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
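</P>
<P>
Python's standard re module has no (?R) or (?n) recursion, which makes it a
convenient place to try the fixed-depth workaround mentioned at the start of
this section. In this minimal sketch (the helper name nested_parens is
hypothetical), each level wraps the previous pattern in another pair of
parentheses, and the result stops matching as soon as the subject nests more
deeply than the depth it was built for.

```python
import re


def nested_parens(depth):
    """Build a pattern for parenthesized text nested at most `depth` levels."""
    pattern = r"\([^()]*\)"  # depth 1: ( ... ) with no nesting inside
    for _ in range(depth - 1):
        # Each extra level allows the previous pattern between the brackets.
        pattern = rf"\((?:[^()]|{pattern})*\)"
    return re.compile(pattern)


two_levels = nested_parens(2)
```

The pattern text grows with every level, whereas a recursive call handles any
depth with a pattern of fixed size.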
</P>
<P>
This PCRE2 pattern solves the nested parentheses problem (assume the
PCRE2_EXTENDED option is set so that white space is ignored):
<pre>
  \( ( [^()]++ | (?R) )* \)
</pre>
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
Finally there is a closing parenthesis. Note the use of a possessive quantifier
to avoid backtracking into sequences of non-parentheses.
</P>
<P>
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
<pre>
  ( \( ( [^()]++ | (?1) )* \) )
</pre>
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern.
</P>
<P>
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
is made easier by the use of relative references. Instead of (?1) in the
pattern above you can write (?-2) to refer to the second most recently opened
parentheses preceding the recursion.
In other words, a negative number counts
capturing parentheses leftwards from the point at which it is encountered.
</P>
<P>
Be aware, however, that if
<a href="#dupgroupnumber">duplicate capture group numbers</a>
are in use, relative references refer to the earliest group with the
appropriate number. Consider, for example:
<pre>
  (?|(a)|(b)) (c) (?-2)
</pre>
The first two capture groups (a) and (b) are both numbered 1, and group (c)
is number 2. When the reference (?-2) is encountered, the second most recently
opened group has the number 1, but it is the first such group (the (a)
group) to which the recursion refers. This would be the same if an absolute
reference (?1) were used. In other words, relative references are just a
shorthand for computing a group number.
</P>
<P>
It is also possible to refer to subsequent capture groups, by writing
references such as (?+2). However, these cannot be recursive because the
reference is not inside the parentheses that are referenced. They are always
<a href="#groupsassubroutines">non-recursive subroutine</a>
calls, as described in the next section.
</P>
<P>
An alternative approach is to use named parentheses.
The Perl syntax for this
is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
rewrite the above example as follows:
<pre>
  (?<pn> \( ( [^()]++ | (?&pn) )* \) )
</pre>
If there is more than one group with the same name, the earliest one is
used.
</P>
<P>
The example pattern that we have been looking at contains nested unlimited
repeats, and so the use of a possessive quantifier for matching strings of
non-parentheses is important when applying the pattern to strings that do not
match. For example, when this pattern is applied to
<pre>
  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
</pre>
it yields "no match" quickly. However, if a possessive quantifier is not used,
the match runs for a very long time indeed because there are so many different
ways the + and * repeats can carve up the subject, and all have to be tested
before failure can be reported.
</P>
<P>
At the end of a match, the values of capturing parentheses are those from
the outermost level. If you want to obtain intermediate values, a callout
function can be used (see below and the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation).
If the pattern above is matched against
<pre>
  (ab(cd)ef)
</pre>
the value for the inner capturing parentheses (numbered 2) is "ef", which is
the last value taken on at the top level. If a capture group is not matched at
the top level, its final captured value is unset, even if it was (temporarily)
set at a deeper level during the matching process.
</P>
<P>
Do not confuse the (?R) item with the condition (R), which tests for recursion.
Consider this pattern, which matches text in angle brackets, allowing for
arbitrary nesting. Only digits are allowed in nested brackets (that is, when
recursing), whereas any characters are permitted at the outer level.
<pre>
  < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
</pre>
In this pattern, (?(R) is the start of a conditional group, with two different
alternatives for the recursive and non-recursive cases. The (?R) item is the
actual recursive call.
<a name="recursiondifference"></a></P>
<br><b>
Differences in recursion processing between PCRE2 and Perl
</b><br>
<P>
Some former differences between PCRE2 and Perl no longer exist.
</P>
<P>
Before release 10.30, recursion processing in PCRE2 differed from Perl in that
a recursive subroutine call was always treated as an atomic group. That is,
once it had matched some of the subject string, it was never re-entered, even
if it contained untried alternatives and there was a subsequent matching
failure. (Historical note: PCRE implemented recursion before Perl did.)
</P>
<P>
Starting with release 10.30, recursive subroutine calls are no longer treated
as atomic. That is, they can be re-entered to try unused alternatives if there
is a matching failure later in the pattern. This is now compatible with the way
Perl works. If you want a subroutine call to be atomic, you must explicitly
enclose it in an atomic group.
</P>
<P>
Supporting backtracking into recursions simplifies certain types of recursive
pattern. For example, this pattern matches palindromic strings:
<pre>
  ^((.)(?1)\2|.?)$
</pre>
The second branch in the group matches a single central character in the
palindrome when there is an odd number of characters, or nothing when there
is an even number of characters, but in order to work it has to be able to try
the second case when the rest of the pattern match fails.
If you want to match
typical palindromic phrases, the pattern has to ignore all non-word characters,
which can be done like this:
<pre>
  ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
</pre>
If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
avoid backtracking into sequences of non-word characters. Without this, PCRE2
takes a great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
</P>
<P>
Another way in which PCRE2 and Perl used to differ in their recursion
processing is in the handling of captured values. Formerly in Perl, when a
group was called recursively or as a subroutine (see the next section), it
had no access to any values that were captured outside the recursion, whereas
in PCRE2 these values can be referenced. Consider this pattern:
<pre>
  ^(.)(\1|a(?2))
</pre>
This pattern matches "bab". The first capturing parentheses match "b", then in
the second group, when the backreference \1 fails to match "b", the second
alternative matches "a" and then recurses. In the recursion, \1 does now match
"b" and so the whole match succeeds.
This match used to fail in Perl, but in
later versions (5.024 was tested) it works.
<a name="groupsassubroutines"></a></P>
<br><a name="SEC26" href="#TOC1">GROUPS AS SUBROUTINES</a><br>
<P>
If the syntax for a recursive group call (either by number or by name) is used
outside the parentheses to which it refers, it operates a bit like a subroutine
in a programming language. More accurately, PCRE2 treats the referenced group
as an independent subpattern which it tries to match at the current matching
position. The called group may be defined before or after the reference. A
numbered reference can be absolute or relative, as in these examples:
<pre>
  (...(absolute)...)...(?2)...
  (...(relative)...)...(?-1)...
  (...(?+1)...(relative)...
</pre>
An earlier example pointed out that the pattern
<pre>
  (sens|respons)e and \1ibility
</pre>
matches "sense and sensibility" and "response and responsibility", but not
"sense and responsibility". If instead the pattern
<pre>
  (sens|respons)e and (?1)ibility
</pre>
is used, it does match "sense and responsibility" as well as the other two
strings. Another example is given in the discussion of DEFINE above.
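</P>
<P>
The difference between a backreference and a subroutine call can be tried in
Python's standard re module, which supports backreferences but has no (?1)
calls. In this sketch, which is an emulation and not PCRE2 syntax, the effect
of (?1) is imitated by repeating the text of the group's pattern. The variable
names are illustrative.

```python
import re

STEM = r"(?:sens|respons)"  # text of group 1, reused to imitate (?1)

# Backreference: the second occurrence must repeat the captured text.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Imitated subroutine call: the second occurrence re-runs the pattern,
# so it may take a different alternative from the first.
subroutine = re.compile(rf"(sens|respons)e and {STEM}ibility")
```

Only the second pattern accepts "sense and responsibility", because the reused
pattern is free to match "respons" after the group has matched "sens".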
</P>
<P>
Like recursions, subroutine calls used to be treated as atomic, but this
changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
occur. However, any capturing parentheses that are set during the subroutine
call revert to their previous values afterwards.
</P>
<P>
Processing options such as case-independence are fixed when a group is
defined, so if it is used as a subroutine, such options cannot be changed for
different calls. For example, consider this pattern:
<pre>
  (abc)(?i:(?-1))
</pre>
It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called group.
</P>
<P>
The behaviour of
<a href="#backtrackcontrol">backtracking control verbs</a>
in groups when called as subroutines is described in the section entitled
<a href="#btsub">"Backtracking verbs in subroutines"</a>
below.
<a name="onigurumasubroutines"></a></P>
<br><a name="SEC27" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
<P>
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
a number enclosed either in angle brackets or single quotes is an alternative
syntax for calling a group as a subroutine, possibly recursively. Here are two
of the examples used above, rewritten using this syntax:
<pre>
  (?&#60;pn&#62; \( ( (?&#62;[^()]+) | \g&#60;pn&#62; )* \) )
  (sens|respons)e and \g'1'ibility
</pre>
PCRE2 supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
<pre>
  (abc)(?i:\g&#60;-1&#62;)
</pre>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
synonymous. The former is a backreference; the latter is a subroutine call.
</P>
<br><a name="SEC28" href="#TOC1">CALLOUTS</a><br>
<P>
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
code to be obeyed in the middle of matching a regular expression. This makes it
possible, amongst other things, to extract different substrings that match the
same pair of parentheses when there is a repetition.
</P>
<P>
PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
code. The feature is called "callout". The caller of PCRE2 provides an external
function by putting its entry point in a match context using the function
<b>pcre2_set_callout()</b>, and then passing that context to <b>pcre2_match()</b>
or <b>pcre2_dfa_match()</b>. If no match context is passed, or if the callout
entry point is set to NULL, callouts are disabled.
</P>
<P>
Within a regular expression, (?C&#60;arg&#62;) indicates a point at which the external
function is to be called. There are two kinds of callout: those with a
numerical argument and those with a string argument. (?C) on its own with no
argument is treated as (?C0). A numerical argument allows the application to
distinguish between different callouts. String arguments were added for release
10.20 to make it possible for script languages that use PCRE2 to embed short
scripts within patterns in a similar way to Perl.
</P>
<P>
During matching, when PCRE2 reaches a callout point, the external function is
called. It is provided with the number or string argument of the callout, the
position in the pattern, and one item of data that is also set in the match
block. The callout function may cause matching to proceed, to backtrack, or to
fail.
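</P>
<P>
The control flow just described can be modelled with a small Python sketch. It
is a toy matcher for literal text only and does not use the PCRE2 API; the
item list, the function names, and the callout signature are all assumptions
made for illustration. The return-value convention (zero to proceed, a
positive value to fail at the current point, a negative value to abandon the
match) follows the one given in the <b>pcre2callout</b> documentation.

```python
def search(items, subject, callout):
    """Toy unanchored search for a sequence of literals and callout points.

    items   -- list of str (literal text) and int (callout number)
    callout -- function(number, pattern_index, subject_pos) returning an int
    Returns (start, end) on success, None if nothing matches, or the
    negative value returned by a callout that abandoned the match.
    """
    for start in range(len(subject) + 1):      # try each starting position
        pos = start
        matched = True
        for index, item in enumerate(items):
            if isinstance(item, int):          # a callout point such as (?C2)
                rc = callout(item, index, pos)
                if rc < 0:                     # abandon the whole match
                    return rc
                if rc > 0:                     # fail here; with literals only,
                    matched = False            # the next start is all that is left
                    break
            elif subject.startswith(item, pos):
                pos += len(item)
            else:
                matched = False
                break
        if matched:
            return (start, pos)
    return None


# (?C1)abc(?C2)def modelled as a list of items.
PATTERN = [1, "abc", 2, "def"]


def veto_first(number, pattern_index, subject_pos):
    """Reject callout 2 when it is reached before subject offset 6."""
    return 1 if number == 2 and subject_pos < 6 else 0


if __name__ == "__main__":
    print(search(PATTERN, "abcdef abcdef", lambda n, i, p: 0))
    print(search(PATTERN, "abcdef abcdef", veto_first))
```

A real callout function receives a richer data block than the three arguments
used here, and PCRE2's optimizations may skip some callouts that this toy
model always makes.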
</P>
<P>
By default, PCRE2 implements a number of optimizations at matching time, and
one side-effect is that sometimes callouts are skipped. If you need all
possible callouts to happen, you need to set options that disable the relevant
optimizations. More details, including a complete description of the
programming interface to the callout function, are given in the
<a href="pcre2callout.html"><b>pcre2callout</b></a>
documentation.
</P>
<br><b>
Callouts with numerical arguments
</b><br>
<P>
If you just want to have a means of identifying different callout points, put a
number less than 256 after the letter C. For example, this pattern has two
callout points:
<pre>
  (?C1)abc(?C2)def
</pre>
If the PCRE2_AUTO_CALLOUT flag is passed to <b>pcre2_compile()</b>, numerical
callouts are automatically installed before each item in the pattern. They are
all numbered 255. If there is a conditional group in the pattern whose
condition is an assertion, an additional callout is inserted just before the
condition.
An explicit callout may also be set at this position, as in this
example:
<pre>
  (?(?C9)(?=a)abc|def)
</pre>
Note that this applies only to assertion conditions, not to other types of
condition.
</P>
<br><b>
Callouts with string arguments
</b><br>
<P>
A delimited string may be used instead of a number as a callout argument. The
starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
the same as the start, except for {, where the ending delimiter is }. If the
ending delimiter is needed within the string, it must be doubled. For
example:
<pre>
  (?C'ab ''c'' d')xyz(?C{any text})pqr
</pre>
The doubling is removed before the string is passed to the callout function.
<a name="backtrackcontrol"></a></P>
<br><a name="SEC29" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
There are a number of special "Backtracking Control Verbs" (to use Perl's
terminology) that modify the behaviour of backtracking during matching. They
are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
and may behave differently depending on whether or not a name argument is
present. The names are not required to be unique within the pattern.
</P>
<P>
By default, for compatibility with Perl, a name is any sequence of characters
that does not include a closing parenthesis. The name is not processed in
any way, and it is not possible to include a closing parenthesis in the name.
This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
is no longer Perl-compatible.
</P>
<P>
When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
and only an unescaped closing parenthesis terminates the name. However, the
only backslash items that are permitted are \Q, \E, and sequences such as
\x{100} that define character code points. Character type escapes such as \d
are faulted.
</P>
<P>
A closing parenthesis can be included in a name either as \) or between \Q
and \E. In addition to backslash processing, if the PCRE2_EXTENDED or
PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
skipped, and #-comments are recognized, exactly as in the rest of the pattern.
PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
PCRE2_ALT_VERBNAMES is also set.
</P>
<P>
The maximum length of a name is 255 in the 8-bit library and 65535 in the
16-bit and 32-bit libraries.
If the name is empty, that is, if the closing
parenthesis immediately follows the colon, the effect is as if the colon were
not there. Any number of these verbs may occur in a pattern. Except for
(*ACCEPT), they may not be quantified.
</P>
<P>
Since these verbs are specifically related to backtracking, most of them can be
used only when the pattern is to be matched using the traditional matching
function, because that uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, the backtracking
control verbs cause an error if encountered by the DFA matching function.
</P>
<P>
The behaviour of these verbs in
<a href="#btrepeat">repeated groups,</a>
<a href="#btassert">assertions,</a>
and in
<a href="#btsub">capture groups called as subroutines</a>
(whether or not recursively) is documented below.
<a name="nooptimize"></a></P>
<br><b>
Optimizations that affect backtracking verbs
</b><br>
<P>
PCRE2 contains some optimizations that are used to speed up matching by running
some checks at the start of each match attempt. For example, it may know the
minimum length of a matching subject, or that a particular character must be
present.
When one of these optimizations bypasses the running of a match, any
included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
when calling <b>pcre2_compile()</b>, or by starting the pattern with
(*NO_START_OPT). There is more discussion of this option in the section
entitled
<a href="pcre2api.html#compiling">"Compiling a pattern"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation.
</P>
<P>
Experiments with Perl suggest that it too has similar optimizations and that,
as with PCRE2, turning them off can change the result of a match.
<a name="acceptverb"></a></P>
<br><b>
Verbs that act immediately
</b><br>
<P>
The following verbs act as soon as they are encountered.
<pre>
  (*ACCEPT) or (*ACCEPT:NAME)
</pre>
This verb causes the match to end successfully, skipping the remainder of the
pattern. However, when it is inside a capture group that is called as a
subroutine, only that group is ended successfully. Matching then continues
at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
assertion succeeds; in a negative assertion, the assertion fails.
</P>
<P>
If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
example:
<pre>
  A((?:A|B(*ACCEPT)|C)D)
</pre>
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
</P>
<P>
(*ACCEPT) is the only backtracking verb that is allowed to be quantified
because an ungreedy quantification with a minimum of zero acts only when a
backtrack happens. Consider, for example,
<pre>
  (A(*ACCEPT)??B)C
</pre>
where A, B, and C may be complex expressions. After matching "A", the matcher
processes "BC"; if that fails, causing a backtrack, (*ACCEPT) is triggered and
the match succeeds. In both cases, all but C is captured. Whereas (*COMMIT)
(see below) means "fail on backtrack", a repeated (*ACCEPT) of this type means
"succeed on backtrack".
</P>
<P>
<b>Warning:</b> (*ACCEPT) should not be used within a script run group, because
it causes an immediate exit from the group, bypassing the script run checking.
<pre>
  (*FAIL) or (*FAIL:NAME)
</pre>
This verb causes a matching failure, forcing backtracking to occur. It may be
abbreviated to (*F). It is equivalent to (?!)
but easier to read. The Perl
documentation notes that it is probably useful only when combined with (?{}) or
(??{}). Those are, of course, Perl features that are not present in PCRE2. The
nearest equivalent is the callout feature, as for example in this pattern:
<pre>
  a+(?C)(*FAIL)
</pre>
A match with the string "aaaa" always fails, but the callout is taken before
each backtrack happens (in this example, 10 times).
</P>
<P>
(*ACCEPT:NAME) and (*FAIL:NAME) behave the same as (*MARK:NAME)(*ACCEPT) and
(*MARK:NAME)(*FAIL), respectively, that is, a (*MARK) is recorded just before
the verb acts.
</P>
<br><b>
Recording which path was taken
</b><br>
<P>
There is one verb whose main purpose is to track how a match was arrived at,
though it also has a secondary use in conjunction with advancing the match
starting point (see (*SKIP) below).
<pre>
  (*MARK:NAME) or (*:NAME)
</pre>
A name is always required with this verb. For all the other backtracking
control verbs, a NAME argument is optional.
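</P>
<P>
Returning briefly to the a+(?C)(*FAIL) example: the figure of 10 callouts for
the subject "aaaa" can be checked with a short Python sketch. It models a plain
backtracking matcher with no start-of-match optimizations and is not PCRE2
itself; the function name count_callouts is an assumption made for this
illustration.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL), unanchored.

    Models a plain backtracking matcher without start-of-match optimizations.
    """
    calls = 0
    for start in range(len(subject) + 1):       # every starting position
        run = 0                                  # length of the run of "a"
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one "a" at a
        # time; each choice reaches (?C) before (*FAIL) forces the backtrack.
        for _length in range(run, 0, -1):
            calls += 1
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))
```

The total is 4 + 3 + 2 + 1: four callouts for the attempt starting at the first
character, three for the second, and so on.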
</P>
<P>
When a match succeeds, the last-encountered mark name on the
matching path is passed back to the caller as described in the section entitled
<a href="pcre2api.html#matchotherdata">"Other information about the match"</a>
in the
<a href="pcre2api.html"><b>pcre2api</b></a>
documentation. This applies to all instances of (*MARK) and other verbs,
including those inside assertions and atomic groups. However, there are
differences in those cases when (*MARK) is used in conjunction with (*SKIP) as
described below.
</P>
<P>
The mark name that was last encountered on the matching path is passed back. A
verb without a NAME argument is ignored for this purpose. Here is an example of
<b>pcre2test</b> output, where the "mark" modifier requests the retrieval and
outputting of (*MARK) data:
<pre>
    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
  data> XY
   0: XY
  MK: A
  data> XZ
   0: XZ
  MK: B
</pre>
The (*MARK) name is tagged with "MK:" in this output, and in this example it
indicates which of the two alternatives matched. This is a more efficient way
of obtaining this information than putting each alternative in its own
capturing parentheses.
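</P>
<P>
For comparison, the capturing-parentheses approach mentioned above might look
like this in Python, whose standard <b>re</b> module has no (*MARK). The sketch
puts each alternative in its own group and uses the match object's lastindex
attribute; the names PATTERN, NAMES, and which_alternative are assumptions, and
the labels "A" and "B" simply mirror the mark names in the example.

```python
import re

# Each alternative sits in its own capturing group.
PATTERN = re.compile(r"(XY)|(XZ)")

# Labels indexed by group number (index 0 is unused).
NAMES = (None, "A", "B")


def which_alternative(subject):
    """Return the label of the alternative that matched, or None."""
    match = PATTERN.search(subject)
    return NAMES[match.lastindex] if match else None


if __name__ == "__main__":
    for text in ("XY", "XZ", "XP"):
        print(text, which_alternative(text))
```

Unlike (*MARK), this approach costs one capture group per alternative and
reports nothing after a failed match.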
</P>
<P>
If a verb with a name is encountered in a positive assertion that is true, the
name is recorded and passed back if it is the last one encountered. This does
not happen for negative assertions or failing positive assertions.
</P>
<P>
After a partial match or a failed match, the last encountered name in the
entire match process is returned. For example:
<pre>
    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
  data> XP
  No match, mark = B
</pre>
Note that in this unanchored example the mark is retained from the match
attempt that started at the letter "X" in the subject. Subsequent match
attempts starting at "P" and then with an empty string do not get as far as the
(*MARK) item, but nevertheless do not reset it.
</P>
<P>
If you are interested in (*MARK) values after failed matches, you should
probably set the PCRE2_NO_START_OPTIMIZE option
<a href="#nooptimize">(see above)</a>
to ensure that the match is always attempted.
</P>
<br><b>
Verbs that act after backtracking
</b><br>
<P>
The following verbs do nothing when they are encountered.
Matching continues
with what follows, but if there is a subsequent match failure, causing a
backtrack to the verb, a failure is forced. That is, backtracking cannot pass
to the left of the verb. However, when one of these verbs appears inside an
atomic group or in a lookaround assertion that is true, its effect is confined
to that group, because once the group has been matched, there is never any
backtracking into it. Backtracking from beyond an assertion or an atomic group
ignores the entire group, and seeks a preceding backtracking point.
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
reaches them. The behaviour described below is what happens when the verb is
not in a subroutine or an assertion. Subsequent sections cover these special
cases.
<pre>
  (*COMMIT) or (*COMMIT:NAME)
</pre>
This verb causes the whole match to fail outright if there is a later matching
failure that causes backtracking to reach it. Even if the pattern is
unanchored, no further attempts to find a match by advancing the starting point
take place. If (*COMMIT) is the only backtracking verb that is encountered,
once it has been passed, <b>pcre2_match()</b> is committed to finding a match at
the current starting point, or not at all.
For example:
<pre>
  a+(*COMMIT)b
</pre>
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
dynamic anchor, or "I've started, so I must finish."
</P>
<P>
The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names that are set with
(*MARK), ignoring those set by any of the other backtracking verbs.
</P>
<P>
If there is more than one backtracking verb in a pattern, a different one that
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
match does not always guarantee that a match must be at this starting point.
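</P>
<P>
The a+(*COMMIT)b example above can be traced with a toy model in Python. This
sketch hard-codes the pattern and models a plain backtracking matcher without
start-of-match optimizations; it is not PCRE2, and the function name search
and its commit parameter are assumptions made for illustration.

```python
def search(subject, commit=False):
    """Toy model of an unanchored search for a+b or a+(*COMMIT)b.

    Returns (start, end) of the first match, or None if there is none.
    """
    for start in range(len(subject) + 1):        # advance the starting point
        run = 0                                  # length of the run of "a"
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        for length in range(run, 0, -1):         # greedy a+, longest first
            end = start + length
            if subject.startswith("b", end):
                return (start, end + 1)
            if commit:
                # The failure to match "b" backtracks onto (*COMMIT):
                # the whole match fails and no other start is tried.
                return None
    return None


if __name__ == "__main__":
    print(search("xxaab", commit=True))
    print(search("aacaab", commit=True))
    print(search("aacaab", commit=False))
```

With "xxaab" the verb is never backtracked onto, because a+ fails before
reaching it at the first two starting positions and the attempt at the third
succeeds. With "aacaab" the first attempt passes the verb and then fails, so
the later "aab" is never tried.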
</P>
<P>
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE2's start-of-match optimizations are turned off, as shown in this
output from <b>pcre2test</b>:
<pre>
    re> /(*COMMIT)abc/
  data> xyzabc
   0: abc
  data>
    re> /(*COMMIT)abc/no_start_optimize
  data> xyzabc
  No match
</pre>
For the first pattern, PCRE2 knows that any match must start with "a", so the
optimization skips along the subject to "a" before applying the pattern to the
first set of data. The match attempt then succeeds. The second pattern disables
the optimization that skips along to the first character. The pattern is now
applied starting at "x", and so the (*COMMIT) causes the match to fail without
trying any other starting points.
<pre>
  (*PRUNE) or (*PRUNE:NAME)
</pre>
This verb causes the match to fail at the current starting position in the
subject if there is a later matching failure that causes backtracking to reach
it. If the pattern is unanchored, the normal "bumpalong" advance to the next
starting character then happens.
Backtracking can occur as usual to the left of
(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
if there is no match to the right, backtracking cannot cross (*PRUNE). In
simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
possessive quantifier, but there are some uses of (*PRUNE) that cannot be
expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
as (*COMMIT).
</P>
<P>
The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
<pre>
  (*SKIP)
</pre>
This verb, when given without a name, is like (*PRUNE), except that if the
pattern is unanchored, the "bumpalong" advance is not to the next character,
but to the position in the subject where (*SKIP) was encountered. (*SKIP)
signifies that whatever text was matched leading up to it cannot be part of a
successful match if there is a later mismatch.
Consider:
<pre>
  a+(*SKIP)b
</pre>
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
</P>
<P>
If (*SKIP) is used to specify a new starting position that is the same as the
starting position of the current match, or (by being inside a lookbehind)
earlier, the position specified by (*SKIP) is ignored, and instead the normal
"bumpalong" occurs.
<pre>
  (*SKIP:NAME)
</pre>
When (*SKIP) has an associated name, its behaviour is modified. When such a
(*SKIP) is triggered, the previous path through the pattern is searched for the
most recent (*MARK) that has the same name. If one is found, the "bumpalong"
advance is to the subject position that corresponds to that (*MARK) instead of
to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
the (*SKIP) is ignored.
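</P>
<P>
The difference between a+(*SKIP)b and the possessive a++b on the subject
"aaaac" can be seen by listing the starting positions that each would try.
The following Python sketch is a toy model of the unnamed (*SKIP) only, with
the pattern hard-coded and no start-of-match optimizations; it is not PCRE2,
and the function name start_positions and its skip parameter are assumptions
made for illustration.

```python
def start_positions(subject, skip):
    """List the starting offsets tried for a+(*SKIP)b or a++b.

    skip=True models a+(*SKIP)b; skip=False models the possessive a++b.
    The list ends at the offset where a match is found, if there is one.
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        run = 0                                  # length of the run of "a"
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        if run and subject.startswith("b", start + run):
            return tried                         # matched at this start
        next_start = start + 1                   # normal bumpalong
        if skip and run > 0:
            # (*SKIP) was reached at start + run, which lies beyond the
            # current start, so the next attempt begins there.
            next_start = start + run
        start = next_start
    return tried


if __name__ == "__main__":
    print(start_positions("aaaac", skip=True))
    print(start_positions("aaaac", skip=False))
```

In this model the (*SKIP) version jumps from offset 0 straight to the "c" at
offset 4, whereas the possessive version retries at every offset in between.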
</P>
<P>
The search for a (*MARK) name uses the normal backtracking mechanism, which
means that it does not see (*MARK) settings that are inside atomic groups or
assertions, because they are never re-entered by backtracking. Compare the
following <b>pcre2test</b> examples:
<pre>
    re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
  data> abc
   0: a
   1: a
  data>
    re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
  data> abc
   0: b
   1: b
</pre>
In the first example, the (*MARK) setting is in an atomic group, so it is not
seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
the second branch of the pattern to be tried at the first character position.
In the second example, the (*MARK) setting is not in an atomic group. This
allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
matching attempt to start at the second character. This time, the (*MARK) is
never seen because "a" does not match "b", so the matcher immediately jumps to
the second branch of the pattern.
</P>
<P>
Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
names that are set by other backtracking verbs.
<pre>
  (*THEN) or (*THEN:NAME)
</pre>
This verb causes a skip to the next innermost alternative when backtracking
reaches it. That is, it cancels any further backtracking within the current
alternative. Its name comes from the observation that it can be used for a
pattern-based if-then-else block:
<pre>
  ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
</pre>
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. If that
succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
more alternatives, so there is a backtrack to whatever came before the entire
group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
<P>
The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
like (*MARK:NAME) in that the name is remembered for passing back to the
caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
ignoring those set by other backtracking verbs.
</P>
<P>
A group that does not contain a | character is just a part of the enclosing
alternative; it is not a nested alternation with only one alternative.
The effect of (*THEN) extends beyond such a group to the enclosing alternative.
Consider this pattern, where A, B, etc. are complex pattern fragments that do
not contain any | characters at this level:
<pre>
  A (B(*THEN)C) | D
</pre>
If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
However, if the group containing (*THEN) is given an alternative, it
behaves differently:
<pre>
  A (B(*THEN)C | (*FAIL)) | D
</pre>
The effect of (*THEN) is now confined to the inner group. After a failure in C,
matching moves to (*FAIL), which causes the whole group to fail because there
are no more alternatives to try. In this case, matching does backtrack into A.
</P>
<P>
Note that a conditional group is not considered as having two alternatives,
because only one is ever used. In other words, the | character in a conditional
group has a different meaning. Ignoring white space, consider:
<pre>
  ^.*? (?(?=a) a | b(*THEN)c )
</pre>
If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
it initially matches zero characters. The condition (?=a) then fails, the
character "b" is matched, but "c" is not.
At this point, matching does not
backtrack to .*? as might perhaps be expected from the presence of the |
character. The conditional group is part of the single alternative that
comprises the whole pattern, and so the match fails. (If there was a backtrack
into .*?, allowing it to match "b", the match would succeed.)
</P>
<P>
The verbs just described provide four different "strengths" of control when
subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
next alternative. (*PRUNE) comes next, failing the match at the current
starting position, but allowing an advance to the next character (for an
unanchored pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the entire match to
fail.
</P>
<br><b>
More than one backtracking verb
</b><br>
<P>
If more than one backtracking verb is present in a pattern, the one that is
backtracked onto first acts. For example, consider this pattern, where A, B,
etc. are complex pattern fragments:
<pre>
  (A(*COMMIT)B(*THEN)C|ABD)
</pre>
If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
fail.
However, if A and B match, but C fails, the backtrack to (*THEN) causes
the next alternative (ABD) to be tried. This behaviour is consistent, but is
not always the same as Perl's. It means that if two or more backtracking verbs
appear in succession, all but the last of them have no effect. Consider this
example:
<pre>
  ...(*COMMIT)(*PRUNE)...
</pre>
If there is a matching failure to the right, backtracking onto (*PRUNE) causes
it to be triggered, and its action is taken. There can never be a backtrack
onto (*COMMIT).
<a name="btrepeat"></a></P>
<br><b>
Backtracking verbs in repeated groups
</b><br>
<P>
PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
repeated groups. For example, consider:
<pre>
  /(a(*COMMIT)b)+ac/
</pre>
If the subject is "abac", Perl matches unless its optimizations are disabled,
but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
acts.
<a name="btassert"></a></P>
<br><b>
Backtracking verbs in assertions
</b><br>
<P>
(*FAIL) in any assertion has its normal effect: it forces an immediate
backtrack.
The behaviour of the other backtracking verbs depends on whether the
assertion is standalone or is acting as the condition in a conditional
group.
</P>
<P>
(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
without any further processing; captured strings and a mark name (if set) are
retained. In a standalone negative assertion, (*ACCEPT) causes the assertion to
fail without any further processing; captured substrings and any mark name are
discarded.
</P>
<P>
If the assertion is a condition, (*ACCEPT) causes the condition to be true for
a positive assertion and false for a negative one; captured substrings are
retained in both cases.
</P>
<P>
The remaining verbs act only when a later failure causes a backtrack to
reach them. This means that, for the Perl-compatible assertions, their effect
is confined to the assertion, because Perl lookaround assertions are atomic. A
backtrack that occurs after such an assertion is complete does not jump back
into the assertion. Note in particular that a (*MARK) name that is set in an
assertion is not "seen" by an instance of (*SKIP:NAME) later in the pattern.
</P>
<P>
PCRE2 now supports non-atomic positive assertions, as described in the section
entitled
<a href="#nonatomicassertions">"Non-atomic assertions"</a>
above. These assertions must be standalone (not used as conditions). They are
not Perl-compatible. For these assertions, a later backtrack does jump back
into the assertion, and therefore verbs such as (*COMMIT) can be triggered by
backtracks from later in the pattern.
</P>
<P>
The effect of (*THEN) is not allowed to escape beyond an assertion. If there
are no more branches to try, (*THEN) causes a positive assertion to be false,
and a negative assertion to be true.
</P>
<P>
The other backtracking verbs are not treated specially if they appear in a
standalone positive assertion. In a conditional positive assertion,
backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
causes the condition to be false. However, for both standalone and conditional
negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
the assertion to be true, without considering any further alternative branches.
<a name="btsub"></a></P>
<br><b>
Backtracking verbs in subroutines
</b><br>
<P>
These behaviours occur whether or not the group is called recursively.
</P>
<P>
(*ACCEPT) in a group called as a subroutine causes the subroutine match to
succeed without any further processing. Matching then continues after the
subroutine call. Perl documents this behaviour. Perl's treatment of the other
verbs in subroutines is different in some cases.
</P>
<P>
(*FAIL) in a group called as a subroutine has its normal effect: it forces
an immediate backtrack.
</P>
<P>
(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
triggered by being backtracked to in a group called as a subroutine. There is
then a backtrack at the outer level.
</P>
<P>
(*THEN), when triggered, skips to the next alternative in the innermost
enclosing group that has alternatives (its normal behaviour). However, if there
is no such group within the subroutine's group, the subroutine match fails and
there is a backtrack at the outer level.
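The normal behaviour of (*THEN) mentioned here can be illustrated with a small
toy model. The following Python sketch is not PCRE2 and does not use its API.
The names lit, seq, alt, match, search, and ThenTriggered are invented for this
illustration, and it is assumed that a group without a | character is written
as seq rather than alt. The sketch models only literals, sequences,
alternations, (*THEN), and (*FAIL), and shows a backtrack onto (*THEN) skipping
to the next alternative of the innermost enclosing alternation:

```python
class ThenTriggered(Exception):
    """Raised when backtracking reaches (*THEN); records the owning alternation."""
    def __init__(self, scope):
        super().__init__()
        self.scope = scope

THEN = ("then", None)   # hypothetical node standing for (*THEN)
FAIL = ("fail", None)   # hypothetical node standing for (*FAIL)

def lit(text):
    return ("lit", text)

def seq(*items):
    # A sequence; also used for a group that contains no | character.
    return ("seq", items)

def alt(*branches):
    # A group that has alternatives.
    return ("alt", branches)

def match(node, subject, pos, scope, k):
    """Match node at pos, then call continuation k(new_pos).

    scope identifies the innermost enclosing alternation (None at top level).
    A False return value is an ordinary backtrack.
    """
    kind, arg = node
    if kind == "fail":
        return False
    if kind == "then":
        if k(pos):
            return True
        # The rest of the pattern failed, so (*THEN) has been backtracked onto.
        raise ThenTriggered(scope)
    if kind == "lit":
        return subject.startswith(arg, pos) and k(pos + len(arg))
    if kind == "seq":
        if not arg:
            return k(pos)
        rest = ("seq", arg[1:])
        return match(arg[0], subject, pos, scope,
                     lambda p: match(rest, subject, p, scope, k))
    # kind == "alt": each activation gets its own identity token.
    token = object()
    for branch in arg:
        try:
            if match(branch, subject, pos, token, k):
                return True
        except ThenTriggered as trig:
            if trig.scope is not token:
                raise   # belongs to an outer alternation
            # Otherwise fall through to the next alternative.
    return False

def search(pattern, subject):
    """Try each starting position; return the matched text or None."""
    for start in range(len(subject) + 1):
        ends = []
        try:
            if match(pattern, subject, start, None,
                     lambda p: ends.append(p) or True):
                return subject[start:ends[0]]
        except ThenTriggered:
            pass   # not inside an alternation: acts like (*PRUNE)
    return None

A = alt(lit("a"), lit("ab"))   # a fragment that can be backtracked into
plain = alt(seq(A, seq(lit("b"), lit("c"))), lit("d"))           # A(BC)|D
outer = alt(seq(A, seq(lit("b"), THEN, lit("c"))), lit("d"))     # A(B(*THEN)C)|D
inner = alt(seq(A, alt(seq(lit("b"), THEN, lit("c")), FAIL)), lit("d"))

print(search(plain, "abbc"))   # prints abbc
print(search(outer, "abbc"))   # prints None
print(search(inner, "abbc"))   # prints abbc
```

In this model, outer fails on "abbc" because the failure of "c" triggers
(*THEN), which abandons the first top-level alternative without backtracking
into A. In inner, the (*THEN) is confined to the nested alternation, whose
(*FAIL) alternative fails normally, so there is a backtrack into A and the
match succeeds.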
</P>
<br><a name="SEC30" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcre2api</b>(3), <b>pcre2callout</b>(3), <b>pcre2matching</b>(3),
<b>pcre2syntax</b>(3), <b>pcre2</b>(3).
</P>
<br><a name="SEC31" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 June 2024
<br>
Copyright © 1997-2024 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>