PCRE2TEST(1)                General Commands Manual               PCRE2TEST(1)


NAME
       pcre2test - a program for testing Perl-compatible regular expressions.


SYNOPSIS

       pcre2test [options] [input file [output file]]

       pcre2test is a test program for the PCRE2 regular expression
       libraries, but it can also be used for experimenting with regular
       expressions. This document describes the features of the test
       program; for details of the regular expressions themselves, see the
       pcre2pattern documentation. For details of the PCRE2 library function
       calls and their options, see the pcre2api documentation.

       The input for pcre2test is a sequence of regular expression patterns
       and subject strings to be matched. There are also command lines for
       setting defaults and controlling some special actions. The output
       shows the result of each match attempt. Modifiers on external or
       internal command lines, the patterns, and the subject lines specify
       PCRE2 function options, control how the subject is processed, and what
       output is produced.

       There are many obscure modifiers, some of which are specifically
       designed for use in conjunction with the test script and data files
       that are distributed as part of PCRE2. All the modifiers are
       documented here, some without much justification, but many of them are
       unlikely to be of use except when testing the libraries.


PCRE2's 8-BIT, 16-BIT AND 32-BIT LIBRARIES

       Different versions of the PCRE2 library can be built to support
       character strings that are encoded in 8-bit, 16-bit, or 32-bit code
       units. One, two, or all three of these libraries may be simultaneously
       installed. The pcre2test program can be used to test all the
       libraries. However, its own input and output are always in 8-bit
       format. When testing the 16-bit or 32-bit libraries, patterns and
       subject strings are converted to 16-bit or 32-bit format before being
       passed to the library functions.
       Results are converted back to 8-bit code units for output.

       In the rest of this document, the names of library functions and
       structures are given in generic form, for example, pcre2_compile().
       The actual names used in the libraries have a suffix _8, _16, or _32,
       as appropriate.


INPUT ENCODING

       Input to pcre2test is processed line by line, either by calling the C
       library's fgets() function, or via the libreadline or libedit library.
       In some Windows environments character 26 (hex 1A) causes an immediate
       end of file, and no further data is read, so this character should be
       avoided unless you really want that action.

       The input is processed using C's string functions, so must not contain
       binary zeros, even though in Unix-like environments, fgets() treats
       any bytes other than newline as data characters. An error is generated
       if a binary zero is encountered. By default subject lines are
       processed for backslash escapes, which makes it possible to include
       any data value in strings that are passed to the library for matching.
       For patterns, there is a facility for specifying some or all of the
       8-bit input characters as hexadecimal pairs, which makes it possible
       to include binary zeros.

   Input for the 16-bit and 32-bit libraries

       When testing the 16-bit or 32-bit libraries, there is a need to be
       able to generate character code points greater than 255 in the strings
       that are passed to the library. For subject lines, backslash escapes
       can be used. In addition, when the utf modifier (see "Setting
       compilation options" below) is set, the pattern and any following
       subject lines are interpreted as UTF-8 strings and translated to
       UTF-16 or UTF-32 as appropriate.

       For non-UTF testing of wide characters, the utf8_input modifier can be
       used. This is mutually exclusive with utf, and is allowed only in
       16-bit or 32-bit mode.
       It causes the pattern and following subject lines to be treated as
       UTF-8 according to the original definition (RFC 2279), which allows
       for character values up to 0x7fffffff. Each character is placed in one
       16-bit or 32-bit code unit (in the 16-bit case, values greater than
       0xffff cause an error to occur).

       UTF-8 (in its original definition) is not capable of encoding values
       greater than 0x7fffffff, but such values can be handled by the 32-bit
       library. When testing this library in non-UTF mode with utf8_input
       set, if any character is preceded by the byte 0xff (which is an
       invalid byte in UTF-8) 0x80000000 is added to the character's value.
       This is the only way of passing such code points in a pattern string.
       For subject strings, using an escape sequence is preferable.


COMMAND LINE OPTIONS

       -8        If the 8-bit library has been built, this option causes it to
                 be used (this is the default). If the 8-bit library has not
                 been built, this option causes an error.

       -16       If the 16-bit library has been built, this option causes it
                 to be used. If the 8-bit library has not been built, this is
                 the default. If the 16-bit library has not been built, this
                 option causes an error.

       -32       If the 32-bit library has been built, this option causes it
                 to be used. If no other library has been built, this is the
                 default. If the 32-bit library has not been built, this
                 option causes an error.

       -ac       Behave as if each pattern has the auto_callout modifier, that
                 is, insert automatic callouts into every pattern that is
                 compiled.

       -AC       As for -ac, but in addition behave as if each subject line
                 has the callout_extra modifier, that is, show additional
                 information from callouts.

       -b        Behave as if each pattern has the fullbincode modifier; the
                 full internal binary form of the pattern is output after
                 compilation.

       -C        Output the version number of the PCRE2 library, and all
                 available information about the optional features that are
                 included, and then exit with zero exit code. All other
                 options are ignored. If both -C and -LM are present,
                 whichever is first is recognized.

       -C option Output information about a specific build-time option, then
                 exit. This functionality is intended for use in scripts such
                 as RunTest. The following options output the value and set
                 the exit code as indicated:

                   ebcdic-nl  the code for LF (= NL) in an EBCDIC environment:
                                0x15 or 0x25
                                0 if used in an ASCII environment
                                exit code is always 0
                   linksize   the configured internal link size (2, 3, or 4)
                                exit code is set to the link size
                   newline    the default newline setting:
                                CR, LF, CRLF, ANYCRLF, ANY, or NUL
                                exit code is always 0
                   bsr        the default setting for what \R matches:
                                ANYCRLF or ANY
                                exit code is always 0

                 The following options output 1 for true or 0 for false, and
                 set the exit code to the same value:

                   backslash-C  \C is supported (not locked out)
                   ebcdic       compiled for an EBCDIC environment
                   jit          just-in-time support is available
                   pcre2-16     the 16-bit library was built
                   pcre2-32     the 32-bit library was built
                   pcre2-8      the 8-bit library was built
                   unicode      Unicode support is available

                 If an unknown option is given, an error message is output;
                 the exit code is 0.

       -d        Behave as if each pattern has the debug modifier; the
                 internal form and information about the compiled pattern is
                 output after compilation; -d is equivalent to -b -i.

       -dfa      Behave as if each subject line has the dfa modifier; matching
                 is done using the pcre2_dfa_match() function instead of the
                 default pcre2_match().

       -error number[,number,...]
                 Call pcre2_get_error_message() for each of the error numbers
                 in the comma-separated list, display the resulting messages
                 on the standard output, then exit with zero exit code. The
                 numbers may be positive or negative. This is a convenience
                 facility for PCRE2 maintainers.

       -help     Output a brief summary of these options and then exit.

       -i        Behave as if each pattern has the info modifier; information
                 about the compiled pattern is given after compilation.

       -jit      Behave as if each pattern line has the jit modifier; after
                 successful compilation, each pattern is passed to the
                 just-in-time compiler, if available.

       -jitfast  Behave as if each pattern line has the jitfast modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and each subject line is
                 passed directly to the JIT matcher via its "fast path".

       -jitverify
                 Behave as if each pattern line has the jitverify modifier;
                 after successful compilation, each pattern is passed to the
                 just-in-time compiler, if available, and the use of JIT for
                 matching is verified.

       -LM       List modifiers: write a list of available pattern and subject
                 modifiers to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -LP       List properties: write a list of recognized Unicode
                 properties to the standard output, then exit with zero exit
                 code. All other options are ignored. If both -C and any -Lx
                 options are present, whichever is first is recognized.

       -LS       List scripts: write a list of recognized Unicode script names
                 to the standard output, then exit with zero exit code. All
                 other options are ignored. If both -C and any -Lx options are
                 present, whichever is first is recognized.

       -pattern modifier-list
                 Behave as if each pattern line contains the given modifiers.

       -q        Do not output the version number of pcre2test at the start of
                 execution.

       -S size   On Unix-like systems, set the size of the run-time stack to
                 size mebibytes (units of 1024*1024 bytes).

       -subject modifier-list
                 Behave as if each subject line contains the given modifiers.

       -t        Run each compile and match many times with a timer, and
                 output the resulting times per compile or match. When JIT is
                 used, separate times are given for the initial compile and
                 the JIT compile. You can control the number of iterations
                 that are used for timing by following -t with a number (as a
                 separate item on the command line). For example, "-t 1000"
                 iterates 1000 times. The default is to iterate 500,000 times.

       -tm       This is like -t except that it times only the matching phase,
                 not the compile phase.

       -T -TM    These behave like -t and -tm, but in addition, at the end of
                 a run, the total times for all compiles and matches are
                 output.

       -version  Output the PCRE2 version number and then exit.


DESCRIPTION

       If pcre2test is given two filename arguments, it reads from the first
       and writes to the second. If the first name is "-", input is taken
       from the standard input. If pcre2test is given only one argument, it
       reads from that file and writes to stdout. Otherwise, it reads from
       stdin and writes to stdout.

       When pcre2test is built, a configuration option can specify that it
       should be linked with the libreadline or libedit library. When this is
       done, if the input is from a terminal, it is read using the readline()
       function. This provides line-editing and history facilities. The
       output from the -help option states whether or not readline() will be
       used.

       The program handles any number of tests, each of which consists of a
       set of input lines.
       Each set starts with a regular expression pattern, followed by any
       number of subject lines to be matched against that pattern. In between
       sets of test data, command lines that begin with # may appear. This
       file format, with some restrictions, can also be processed by the
       perltest.sh script that is distributed with PCRE2 as a means of
       checking that the behaviour of PCRE2 and Perl is the same. For a
       specification of perltest.sh, see the comments near its beginning. See
       also the #perltest command below.

       When the input is a terminal, pcre2test prompts for each line of
       input, using "re>" to prompt for regular expression patterns, and
       "data>" to prompt for subject lines. Command lines starting with # can
       be entered only in response to the "re>" prompt.

       Each subject line is matched separately and independently. If you want
       to do multi-line matches, you have to use the \n escape sequence (or
       \r or \r\n, etc., depending on the newline setting) in a single line
       of input to encode the newline sequences. There is no limit on the
       length of subject lines; the input buffer is automatically extended if
       it is too small. There are replication features that make it possible
       to generate long repetitive pattern or subject lines without having to
       supply them explicitly.

       An empty line or the end of the file signals the end of the subject
       lines for a test, at which point a new pattern or command line is
       expected if there is still input to be read.


COMMAND LINES

       In between sets of test data, a line that begins with # is interpreted
       as a command line. If the first character is followed by white space
       or an exclamation mark, the line is treated as a comment, and ignored.
       Otherwise, the following commands are recognized:

         #forbid_utf

       Subsequent patterns automatically have the PCRE2_NEVER_UTF and
       PCRE2_NEVER_UCP options set, which locks out the use of the PCRE2_UTF
       and PCRE2_UCP options and the use of (*UTF) and (*UCP) at the start of
       patterns. This command also forces an error if a subsequent pattern
       contains any occurrences of \P, \p, or \X, which are still supported
       when PCRE2_UTF is not set, but which require Unicode property support
       to be included in the library.

       This is a trigger guard that is used in test files to ensure that UTF
       or Unicode property tests are not accidentally added to files that are
       used when Unicode support is not included in the library. Setting
       PCRE2_NEVER_UTF and PCRE2_NEVER_UCP as a default can also be obtained
       by the use of #pattern; the difference is that #forbid_utf cannot be
       unset, and the automatic options are not displayed in pattern
       information, to avoid cluttering up test output.

         #load <filename>

       This command is used to load a set of precompiled patterns from a
       file, as described in the section entitled "Saving and restoring
       compiled patterns" below.

         #loadtables <filename>

       This command is used to load a set of binary character tables that can
       be accessed by the tables=3 qualifier. Such tables can be created by
       the pcre2_dftables program with the -b option.

         #newline_default [<newline-list>]

       When PCRE2 is built, a default newline convention can be specified.
       This determines which characters and/or character pairs are recognized
       as indicating a newline in a pattern or subject string. The default
       can be overridden when a pattern is compiled. The standard test files
       contain tests of various newline conventions, but the majority of the
       tests expect a single linefeed to be recognized as a newline by
       default.
       Without special action the tests would fail when PCRE2 is compiled
       with either CR or CRLF as the default newline.

       The #newline_default command specifies a list of newline types that
       are acceptable as the default. The types must be one of CR, LF, CRLF,
       ANYCRLF, ANY, or NUL (in upper or lower case), for example:

         #newline_default LF Any anyCRLF

       If the default newline is in the list, this command has no effect.
       Otherwise, except when testing the POSIX API, a newline modifier that
       specifies the first newline convention in the list (LF in the above
       example) is added to any pattern that does not already have a newline
       modifier. If the newline list is empty, the feature is turned off.
       This command is present in a number of the standard test input files.

       When the POSIX API is being tested there is no way to override the
       default newline convention, though it is possible to set the newline
       convention from within the pattern. A warning is given if the posix or
       posix_nosub modifier is used when #newline_default would set a default
       for the non-POSIX API.

         #pattern <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent patterns. Modifiers on a pattern can change these settings.

         #perltest

       This line is used in test files that can also be processed by
       perltest.sh to confirm that Perl gives the same results as PCRE2.
       Subsequent tests are checked for the use of pcre2test features that
       are incompatible with the perltest.sh script.

       Patterns must use '/' as their delimiter, and only certain modifiers
       are supported. Comment lines, #pattern commands, and #subject commands
       that set or unset "mark" are recognized and acted on. The #perltest,
       #forbid_utf, and #newline_default commands, which are needed in the
       relevant pcre2test files, are silently ignored.
       All other command lines are ignored, but give a warning message. The
       #perltest command helps detect tests that are accidentally put in the
       wrong file or use the wrong delimiter. For more details of the
       perltest.sh script see the comments it contains.

         #pop [<modifiers>]
         #popcopy [<modifiers>]

       These commands are used to manipulate the stack of compiled patterns,
       as described in the section entitled "Saving and restoring compiled
       patterns" below.

         #save <filename>

       This command is used to save a set of compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below.

         #subject <modifier-list>

       This command sets a default modifier list that applies to all
       subsequent subject lines. Modifiers on a subject line can change these
       settings.


MODIFIER SYNTAX

       Modifier lists are used with both pattern and subject lines. Items in
       a list are separated by commas followed by optional white space.
       Trailing whitespace in a modifier list is ignored. Some modifiers may
       be given for both patterns and subject lines, whereas others are valid
       only for one or the other. Each modifier has a long name, for example
       "anchored", and some of them must be followed by an equals sign and a
       value, for example, "offset=12". Values cannot contain comma
       characters, but may contain spaces. Modifiers that do not take values
       may be preceded by a minus sign to turn off a previous setting.

       A few of the more common modifiers can also be specified as single
       letters, for example "i" for "caseless". In documentation, following
       the Perl convention, these are written with a slash ("the /i
       modifier") for clarity. Abbreviated modifiers must all be concatenated
       in the first item of a modifier list.
       If the first item is not recognized as a long modifier name, it is
       interpreted as a sequence of these abbreviations. For example:

         /abc/ig,newline=cr,jit=3

       This is a pattern line whose modifier list starts with two one-letter
       modifiers (/i and /g). The lower-case abbreviated modifiers are the
       same as used in Perl.


PATTERN SYNTAX

       A pattern line must start with one of the following characters (common
       symbols, excluding pattern meta-characters):

         / ! " ' ` - = _ : ; , % & @ ~

       This is interpreted as the pattern's delimiter. A regular expression
       may be continued over several input lines, in which case the newline
       characters are included within it. It is possible to include the
       delimiter as a literal within the pattern by escaping it with a
       backslash, for example

         /abc\/def/

       If you do this, the escape and the delimiter form part of the pattern,
       but since the delimiters are all non-alphanumeric, the inclusion of
       the backslash does not affect the pattern's interpretation. Note,
       however, that this trick does not work within \Q...\E literal
       bracketing because the backslash will itself be interpreted as a
       literal. If the terminating delimiter is immediately followed by a
       backslash, for example,

         /abc/\

       a backslash is added to the end of the pattern. This is done to
       provide a way of testing the error condition that arises if a pattern
       finishes with a backslash, because

         /abc\/

       is interpreted as the first line of a pattern that starts with "abc/",
       causing pcre2test to read the next line as a continuation of the
       regular expression.

       A pattern can be followed by a modifier list (details below).
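
       The delimiter rules in this section can be modelled with a short
       Python sketch. It only illustrates the rules as stated here and is not
       pcre2test's actual parser: the function name split_pattern_line is
       hypothetical, the sketch looks at a single input line, and it returns
       None in the case where pcre2test would go on to read a continuation
       line.

```python
def split_pattern_line(line):
    # Hypothetical helper: split one pattern line into
    # (pattern, modifier_text), or return None if the line has no
    # terminating delimiter (pcre2test would read a continuation line).
    delim = line[0]
    i = 1
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            i += 2          # escaped character: both stay in the pattern
            continue
        if line[i] == delim:
            break
        i += 1
    else:
        return None         # ran off the end without a closing delimiter
    pattern, rest = line[1:i], line[i + 1:]
    if rest.startswith("\\"):
        # A backslash directly after the closing delimiter is added to
        # the end of the pattern.
        pattern, rest = pattern + "\\", rest[1:]
    return pattern, rest


print(split_pattern_line(r"/abc\/def/"))
print(split_pattern_line("/abc/\\"))
print(split_pattern_line("/abc\\/"))
print(split_pattern_line("/abc/ig,newline=cr,jit=3"))
```

       The four calls correspond to the examples above: an escaped delimiter
       kept inside the pattern, a trailing backslash appended after the
       closing delimiter, an unterminated first line, and a pattern followed
       by a modifier list.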


SUBJECT LINE SYNTAX

       Before each subject line is passed to pcre2_match(),
       pcre2_dfa_match(), or pcre2_jit_match(), leading and trailing white
       space is removed, and the line is scanned for backslash escapes,
       unless the subject_literal modifier was set for the pattern. The
       following provide a means of encoding non-printing characters in a
       visible way:

         \a         alarm (BEL, \x07)
         \b         backspace (\x08)
         \e         escape (\x1b)
         \f         form feed (\x0c)
         \n         newline (\x0a)
         \r         carriage return (\x0d)
         \t         tab (\x09)
         \v         vertical tab (\x0b)
         \nnn       octal character (up to 3 octal digits); always
                      a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
         \o{dd...}  octal character (any number of octal digits)
         \xhh       hexadecimal byte (up to 2 hex digits)
         \x{hh...}  hexadecimal character (any number of hex digits)

       The use of \x{hh...} is not dependent on the use of the utf modifier
       on the pattern. It is recognized always. There may be any number of
       hexadecimal digits inside the braces; invalid values provoke error
       messages.

       Note that \xhh specifies one byte rather than one character in UTF-8
       mode; this makes it possible to construct invalid UTF-8 sequences for
       testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
       character in UTF-8 mode, generating more than one byte if the value is
       greater than 127. When testing the 8-bit library not in UTF-8 mode,
       \x{hh} generates one byte for values less than 256, and causes an
       error for greater values.

       In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes
       it possible to construct invalid UTF-16 sequences for testing
       purposes.

       In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
       makes it possible to construct invalid UTF-32 sequences for testing
       purposes.
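
       The difference between \xhh and \x{hh...} for the 8-bit library can be
       illustrated with a small Python sketch. The function names xhh and
       x_braced are hypothetical and the code only mirrors the rules stated
       above; it is not taken from pcre2test. One assumption to note: Python
       itself refuses to encode surrogates and values above 0x10ffff, which
       this sketch does not try to model.

```python
def xhh(hex_digits):
    # \xhh form: always exactly one byte, even in UTF-8 mode.
    return bytes([int(hex_digits, 16)])


def x_braced(hex_digits, utf):
    # \x{hh...} form, as described for the 8-bit library.
    value = int(hex_digits, 16)
    if utf:
        # UTF-8 mode: the code point is encoded, so values above 127
        # produce more than one byte.
        return chr(value).encode("utf-8")
    if value < 256:
        return bytes([value])   # non-UTF mode: a single byte
    raise ValueError("value too large for non-UTF 8-bit mode")


print(xhh("e9"))                  # one raw byte
print(x_braced("e9", utf=True))   # two-byte UTF-8 encoding of U+00E9
print(x_braced("e9", utf=False))  # one byte again
```

       Because xhh() emits a lone byte regardless of mode, a value such as
       e9 on its own is an invalid UTF-8 sequence, which is exactly the kind
       of test input the \xhh form is meant to allow.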

       There is a special backslash sequence that specifies replication of
       one or more characters:

         \[<characters>]{<count>}

       This makes it possible to test long strings without having to provide
       them as part of the file. For example:

         \[abc]{4}

       is converted to "abcabcabcabc". This feature does not support nesting.
       To include a closing square bracket in the characters, code it as
       \x5D.

       A backslash followed by an equals sign marks the end of the subject
       string and the start of a modifier list. For example:

         abc\=notbol,notempty

       If the subject string is empty and \= is followed by whitespace, the
       line is treated as a comment line, and is not used for matching. For
       example:

         \= This is a comment.
         abc\= This is an invalid modifier list.

       A backslash followed by any other non-alphanumeric character just
       escapes that character. A backslash followed by anything else causes
       an error. However, if the very last character in the line is a
       backslash (and there is no modifier list), it is ignored. This gives a
       way of passing an empty line as data, since a real empty line
       terminates the data input.

       If the subject_literal modifier is set for a pattern, all subject
       lines that follow are treated as literals, with no special treatment
       of backslashes. No replication is possible, and any subject modifiers
       must be set as defaults by a #subject command.


PATTERN MODIFIERS

       There are several types of modifier that can appear in pattern lines.
       Except where noted below, they may also be used in #pattern commands.
       A pattern's modifier list can add to or override default modifiers
       that were set by a previous #pattern command.

   Setting compilation options

       The following modifiers set options for pcre2_compile().
       Most of them set bits in the options argument of that function, but
       those whose names start with PCRE2_EXTRA are additional options that
       are set in the compile context. Some of these options have
       single-letter abbreviations. There is special handling for /x: if a
       second x is present, PCRE2_EXTENDED is converted into
       PCRE2_EXTENDED_MORE as in Perl. A third appearance adds PCRE2_EXTENDED
       as well, though this makes no difference to the way pcre2_compile()
       behaves. See pcre2api for a description of the effects of these
       options.

             allow_empty_class         set PCRE2_ALLOW_EMPTY_CLASS
             allow_lookaround_bsk      set PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK
             allow_surrogate_escapes   set PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
             alt_bsux                  set PCRE2_ALT_BSUX
             alt_circumflex            set PCRE2_ALT_CIRCUMFLEX
             alt_verbnames             set PCRE2_ALT_VERBNAMES
             anchored                  set PCRE2_ANCHORED
         /a  ascii_all                 set all ASCII options
             ascii_bsd                 set PCRE2_EXTRA_ASCII_BSD
             ascii_bss                 set PCRE2_EXTRA_ASCII_BSS
             ascii_bsw                 set PCRE2_EXTRA_ASCII_BSW
             ascii_digit               set PCRE2_EXTRA_ASCII_DIGIT
             ascii_posix               set PCRE2_EXTRA_ASCII_POSIX
             auto_callout              set PCRE2_AUTO_CALLOUT
             bad_escape_is_literal     set PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
         /i  caseless                  set PCRE2_CASELESS
         /r  caseless_restrict         set PCRE2_EXTRA_CASELESS_RESTRICT
             dollar_endonly            set PCRE2_DOLLAR_ENDONLY
         /s  dotall                    set PCRE2_DOTALL
             dupnames                  set PCRE2_DUPNAMES
             endanchored               set PCRE2_ENDANCHORED
             escaped_cr_is_lf          set PCRE2_EXTRA_ESCAPED_CR_IS_LF
         /x  extended                  set PCRE2_EXTENDED
        /xx  extended_more             set PCRE2_EXTENDED_MORE
             extra_alt_bsux            set PCRE2_EXTRA_ALT_BSUX
             firstline                 set PCRE2_FIRSTLINE
             literal                   set PCRE2_LITERAL
             match_line                set PCRE2_EXTRA_MATCH_LINE
             match_invalid_utf         set PCRE2_MATCH_INVALID_UTF
             match_unset_backref       set PCRE2_MATCH_UNSET_BACKREF
             match_word                set PCRE2_EXTRA_MATCH_WORD
         /m  multiline                 set PCRE2_MULTILINE
             never_backslash_c         set PCRE2_NEVER_BACKSLASH_C
             never_ucp                 set PCRE2_NEVER_UCP
             never_utf                 set PCRE2_NEVER_UTF
         /n  no_auto_capture           set PCRE2_NO_AUTO_CAPTURE
             no_auto_possess           set PCRE2_NO_AUTO_POSSESS
             no_dotstar_anchor         set PCRE2_NO_DOTSTAR_ANCHOR
             no_start_optimize         set PCRE2_NO_START_OPTIMIZE
             no_utf_check              set PCRE2_NO_UTF_CHECK
             ucp                       set PCRE2_UCP
             ungreedy                  set PCRE2_UNGREEDY
             use_offset_limit          set PCRE2_USE_OFFSET_LIMIT
             utf                       set PCRE2_UTF

       As well as turning on the PCRE2_UTF option, the utf modifier causes
       all non-printing characters in output strings to be printed using the
       \x{hh...} notation. Otherwise, those less than 0x100 are output in hex
       without the curly brackets. Setting utf in 16-bit or 32-bit mode also
       causes pattern and subject strings to be translated to UTF-16 or
       UTF-32, respectively, before being passed to library functions.

   Setting compilation controls

       The following modifiers affect the compilation process or request
       information about the pattern. There are single-letter abbreviations
       for some that are heavily used in the test files.

             bsr=[anycrlf|unicode]     specify \R handling
         /B  bincode                   show binary code without lengths
             callout_info              show callout information
             convert=<options>         request foreign pattern conversion
             convert_glob_escape=c     set glob escape character
             convert_glob_separator=c  set glob separator character
             convert_length            set convert buffer length
             debug                     same as info,fullbincode
             framesize                 show matching frame size
             fullbincode               show binary code with lengths
         /I  info                      show info about compiled pattern
             hex                       unquoted characters are hexadecimal
             jit[=<number>]            use JIT
             jitfast                   use JIT fast path
             jitverify                 verify JIT use
             locale=<name>             use this locale
             max_pattern_compiled      ) set maximum compiled pattern
               _length=<n>             )   length (bytes)
             max_pattern_length=<n>    set maximum pattern length (code units)
             max_varlookbehind=<n>     set maximum variable lookbehind length
             memory                    show memory used
             newline=<type>            set newline type
             null_context              compile with a NULL context
             null_pattern              pass pattern as NULL
             parens_nest_limit=<n>     set maximum parentheses depth
             posix                     use the POSIX API
             posix_nosub               use the POSIX API with REG_NOSUB
             push                      push compiled pattern onto the stack
             pushcopy                  push a copy onto the stack
             stackguard=<number>       test the stackguard feature
             subject_literal           treat all subject lines as literal
             tables=[0|1|2|3]          select internal tables
             use_length                do not zero-terminate the pattern
             utf8_input                treat input as UTF-8

       The effects of these modifiers are described in the following
       sections.

   Newline and \R handling

       The bsr modifier specifies what \R in a pattern should match. If it is
       set to "anycrlf", \R matches CR, LF, or CRLF only. If it is set to
       "unicode", \R matches any Unicode newline sequence. The default can be
       specified when PCRE2 is built; if it is not, the default is set to
       Unicode.
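
       As a rough illustration of the two bsr settings, the following Python
       sketch models what \R accepts in each case. It is an approximation
       built on Python's re module rather than on PCRE2, and the helper name
       matches_R is hypothetical. The set of Unicode newline sequences used
       here (CR, LF, CRLF, VT, FF, NEL, LS, PS) is the one given in the
       pcre2pattern documentation.

```python
import re

# bsr=anycrlf: CR, LF, or CRLF only.
ANYCRLF = re.compile("\r\n|\r|\n")

# bsr=unicode: CRLF, or any single Unicode newline character
# (LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR).
UNICODE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def matches_R(text, bsr):
    # Hypothetical helper: does the whole of text count as one \R match
    # under the given bsr setting ("anycrlf" or "unicode")?
    rx = ANYCRLF if bsr == "anycrlf" else UNICODE
    return rx.fullmatch(text) is not None


print(matches_R("\r\n", "anycrlf"))
print(matches_R("\u2028", "anycrlf"))
print(matches_R("\u2028", "unicode"))
```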

       The newline modifier specifies which characters are to be interpreted
       as newlines, both in the pattern and in subject lines. The type must
       be one of CR, LF, CRLF, ANYCRLF, ANY, or NUL (in upper or lower case).

   Information about a pattern

       The debug modifier is a shorthand for info,fullbincode, requesting all
       available information.

       The bincode modifier causes a representation of the compiled code to
       be output after compilation. This information does not contain length
       and offset values, which ensures that the same output is generated for
       different internal link sizes and different code unit widths. By using
       bincode, the same regression tests can be used in different
       environments.

       The fullbincode modifier, by contrast, does include length and offset
       values. This is used in a few special tests that run only for specific
       code unit widths and link sizes, and is also useful for one-off tests.

       The info modifier requests information about the compiled pattern
       (whether it is anchored, has a fixed first character, and so on). The
       information is obtained from the pcre2_pattern_info() function. Here
       are some typical examples:

           re> /(?i)(^a|^b)/m,info
         Capture group count = 1
         Compile options: multiline
         Overall options: caseless multiline
         First code unit at start or follows newline
         Subject length lower bound = 1

           re> /(?i)abc/info
         Capture group count = 0
         Compile options: <none>
         Overall options: caseless
         First code unit = 'a' (caseless)
         Last code unit = 'c' (caseless)
         Subject length lower bound = 3

       "Compile options" are those specified by modifiers; "overall options"
       have added options that are taken or deduced from the pattern. If both
       sets of options are the same, just a single "options" line is output;
       if there are no options, the line is omitted. "First code unit" is
       where any match must start; if there is more than one they are listed
       as "starting code units". "Last code unit" is the last literal code
       unit that must be present in any match. This is not necessarily the
       last character. These lines are omitted if no starting or ending code
       units are recorded. The subject length line is omitted when
       no_start_optimize is set because the minimum length is not calculated
       when it can never be used.

       The framesize modifier shows the size, in bytes, of each storage frame
       used by pcre2_match() for handling backtracking. The size depends on
       the number of capturing parentheses in the pattern. A vector of these
       frames is used at matching time; its overall size is shown when the
       heapframes_size subject modifier is set.

       The callout_info modifier requests information about all the callouts
       in the pattern. A list of them is output at the end of any other
       information that is requested. For each callout, either its number or
       string is given, followed by the item that follows it in the pattern.

   Passing a NULL context

       Normally, pcre2test passes a context block to pcre2_compile(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that pcre2_compile() behaves correctly in this case (it uses
       default values).

   Passing a NULL pattern

       The null_pattern modifier is for testing the behaviour of
       pcre2_compile() when the pattern argument is NULL. The length value
       passed is the default PCRE2_ZERO_TERMINATED unless use_length is set.
       Any length other than zero causes an error.

   Specifying pattern characters in hexadecimal

       The hex modifier specifies that the characters of the pattern, except
       for substrings enclosed in single or double quotes, are to be
       interpreted as pairs of hexadecimal digits.
       This feature is provided as a way of creating patterns that contain
       binary zeros and other non-printing characters. White space is
       permitted between pairs of digits. For example, this pattern contains
       three characters:

         /ab 32 59/hex

       Parts of such a pattern are taken literally if quoted. This pattern
       contains nine characters, only two of which are specified in
       hexadecimal:

         /ab "literal" 32/hex

       Either single or double quotes may be used. There is no way of
       including the delimiter within a substring. The hex and expand
       modifiers are mutually exclusive.

   Specifying the pattern's length

       By default, patterns are passed to the compiling functions as
       zero-terminated strings but can be passed by length instead of being
       zero-terminated. The use_length modifier causes this to happen. Using a
       length happens automatically (whether or not use_length is set) when
       hex is set, because patterns specified in hexadecimal may contain
       binary zeros.

       If hex or use_length is used with the POSIX wrapper API (see "Using the
       POSIX wrapper API" below), the REG_PEND extension is used to pass the
       pattern's length.

   Specifying a maximum for variable lookbehinds

       Variable lookbehind assertions are supported only if, for each one,
       there is a maximum length (in characters) that it can match. There is a
       limit on this, whose default can be set at build time, with an ultimate
       default of 255. The max_varlookbehind modifier uses the
       pcre2_set_max_varlookbehind() function to change the limit. Lookbehinds
       whose branches each match a fixed length are limited to 65535
       characters per branch.

   Specifying wide characters in 16-bit and 32-bit modes

       In 16-bit and 32-bit modes, all input is automatically treated as UTF-8
       and translated to UTF-16 or UTF-32 when the utf modifier is set.
       For testing the 16-bit and 32-bit libraries in non-UTF mode, the
       utf8_input modifier can be used. It is mutually exclusive with utf.
       Input lines are interpreted as UTF-8 as a means of specifying wide
       characters. More details are given in "Input encoding" above.

   Generating long repetitive patterns

       Some tests use long patterns that are very repetitive. Instead of
       creating a very long input line for such a pattern, you can use a
       special repetition feature, similar to the one described for subject
       lines above. If the expand modifier is present on a pattern, parts of
       the pattern that have the form

         \[<characters>]{<count>}

       are expanded before the pattern is passed to pcre2_compile(). For
       example, \[AB]{6000} is expanded to "ABAB..." 6000 times. This
       construction cannot be nested. An initial "\[" sequence is recognized
       only if "]{" followed by decimal digits and "}" is found later in the
       pattern. If not, the characters remain in the pattern unaltered. The
       expand and hex modifiers are mutually exclusive.

       If part of an expanded pattern looks like an expansion, but is really
       part of the actual pattern, unwanted expansion can be avoided by giving
       two values in the quantifier. For example, \[AB]{6000,6000} is not
       recognized as an expansion item.

       If the info modifier is set on an expanded pattern, the result of the
       expansion is included in the information that is output.

   JIT compilation

       Just-in-time (JIT) compiling is a heavyweight optimization that can
       greatly speed up pattern matching. See the pcre2jit documentation for
       details. JIT compiling happens, optionally, after a pattern has been
       successfully compiled into an internal form. The JIT compiler converts
       this to optimized machine code.
       It needs to know whether the match-time options PCRE2_PARTIAL_HARD and
       PCRE2_PARTIAL_SOFT are going to be used, because different code is
       generated for the different cases. See the partial modifier in "Subject
       Modifiers" below for details of how these options are specified for
       each match attempt.

       JIT compilation is requested by the jit pattern modifier, which may
       optionally be followed by an equals sign and a number in the range 0 to
       7. The three bits that make up the number specify which of the three
       JIT operating modes are to be compiled:

         1  compile JIT code for non-partial matching
         2  compile JIT code for soft partial matching
         4  compile JIT code for hard partial matching

       The possible values for the jit modifier are therefore:

         0  disable JIT
         1  normal matching only
         2  soft partial matching only
         3  normal and soft partial matching
         4  hard partial matching only
         6  soft and hard partial matching only
         7  all three modes

       If no number is given, 7 is assumed. The phrase "partial matching"
       means a call to pcre2_match() with either the PCRE2_PARTIAL_SOFT or the
       PCRE2_PARTIAL_HARD option set. Note that such a call may return a
       complete match; the options enable the possibility of a partial match,
       but do not require it. Note also that if you request JIT compilation
       only for partial matching (for example, jit=2) but do not set the
       partial modifier on a subject line, that match will not use JIT code
       because none was compiled for non-partial matching.

       If JIT compilation is successful, the compiled JIT code will
       automatically be used when an appropriate type of match is run, except
       when incompatible run-time options are specified. For more details, see
       the pcre2jit documentation. See also the jitstack modifier below for a
       way of setting the size of the JIT stack.
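
       The bit arithmetic behind the jit values can be made concrete with a
       small model. The following Python sketch is purely illustrative: it
       does not call PCRE2, and the names JIT_MODES, jit_modes, and
       jit_code_applies are invented for this example. It decodes a jit value
       into the modes it requests, and reports whether JIT code would exist
       for a given kind of match, following the rules described above.

```python
# Illustrative model only; it does not call PCRE2. All names are hypothetical.
JIT_MODES = {
    1: "normal matching",
    2: "soft partial matching",
    4: "hard partial matching",
}


def jit_modes(value=7):
    """Return the JIT modes requested by jit=<value> (7 if no number is given)."""
    if not 0 <= value <= 7:
        raise ValueError("jit value must be in the range 0 to 7")
    return [name for bit, name in JIT_MODES.items() if value & bit]


def jit_code_applies(value, partial=None):
    """Report whether JIT code was compiled for this kind of match.

    partial is None for a non-partial match, or "soft" / "hard".
    """
    needed_bit = {None: 1, "soft": 2, "hard": 4}[partial]
    return bool(value & needed_bit)


if __name__ == "__main__":
    for n in range(8):
        print(n, ", ".join(jit_modes(n)) or "JIT disabled")
    # jit=2 compiles only soft-partial code, so a plain match cannot use it.
    print(jit_code_applies(2))          # False
    print(jit_code_applies(2, "soft"))  # True
```

       The second function mirrors the jit=2 case mentioned above: a subject
       line without the partial modifier finds no suitable JIT code, so the
       match would fall back to the interpreter.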

       If the jitfast modifier is specified, matching is done using the JIT
       "fast path" interface, pcre2_jit_match(), which skips some of the
       sanity checks that are done by pcre2_match(), and of course does not
       work when JIT is not supported. If jitfast is specified without jit,
       jit=7 is assumed.

       If the jitverify modifier is specified, information about the compiled
       pattern shows whether JIT compilation was or was not successful. If
       jitverify is specified without jit, jit=7 is assumed. If JIT
       compilation is successful when jitverify is set, the text "(JIT)" is
       added to the first output line after a match or non match when
       JIT-compiled code was actually used in the match.

   Setting a locale

       The locale modifier must specify the name of a locale, for example:

         /pattern/locale=fr_FR

       The given locale is set, pcre2_maketables() is called to build a set of
       character tables for the locale, and this is then passed to
       pcre2_compile() when compiling the regular expression. The same tables
       are used when matching the following subject lines. The locale modifier
       applies only to the pattern on which it appears, but can be given in a
       #pattern command if a default is needed. Setting a locale and alternate
       character tables are mutually exclusive.

   Showing pattern memory

       The memory modifier causes the size in bytes of the memory used to hold
       the compiled pattern to be output. This does not include the size of
       the pcre2_code block; it is just the actual compiled data. If the
       pattern is subsequently passed to the JIT compiler, the size of the JIT
       compiled code is also output. Here is an example:

           re> /a(b)c/jit,memory
         Memory allocation (code space): 21
         Memory allocation (JIT code): 1910


   Limiting nested parentheses

       The parens_nest_limit modifier sets a limit on the depth of nested
       parentheses in a pattern.
       Breaching the limit causes a compilation error. The default for the
       library is set when PCRE2 is built, but pcre2test sets its own default
       of 220, which is required for running the standard test suite.

   Limiting the pattern length

       The max_pattern_length modifier sets a limit, in code units, to the
       length of pattern that pcre2_compile() will accept. Breaching the limit
       causes a compilation error. The default is the largest number a
       PCRE2_SIZE variable can hold (essentially unlimited).

   Limiting the size of a compiled pattern

       The max_pattern_compiled_length modifier sets a limit, in bytes, to the
       amount of memory used by a compiled pattern. Breaching the limit causes
       a compilation error. The default is the largest number a PCRE2_SIZE
       variable can hold (essentially unlimited).

   Using the POSIX wrapper API

       The posix and posix_nosub modifiers cause pcre2test to call PCRE2 via
       the POSIX wrapper API rather than its native API. When posix_nosub is
       used, the POSIX option REG_NOSUB is passed to regcomp(). The POSIX
       wrapper supports only the 8-bit library. Note that it does not imply
       POSIX matching semantics; for more detail see the pcre2posix
       documentation. The following pattern modifiers set options for the
       regcomp() function:

         caseless           REG_ICASE
         multiline          REG_NEWLINE
         dotall             REG_DOTALL     )
         ungreedy           REG_UNGREEDY   ) These options are not part of
         ucp                REG_UCP        )   the POSIX standard
         utf                REG_UTF8       )

       The regerror_buffsize modifier specifies a size for the error buffer
       that is passed to regerror() in the event of a compilation error. For
       example:

         /abc/posix,regerror_buffsize=20

       This provides a means of testing the behaviour of regerror() when the
       buffer is too small for the error message. If this modifier has not
       been set, a large buffer is used.

       The aftertext and allaftertext subject modifiers work as described
       below. All other modifiers are either ignored, with a warning message,
       or cause an error.

       The pattern is passed to regcomp() as a zero-terminated string by
       default, but if the use_length or hex modifiers are set, the REG_PEND
       extension is used to pass it by length.

   Testing the stack guard feature

       The stackguard modifier is used to test the use of
       pcre2_set_compile_recursion_guard(), a function that is provided to
       enable stack availability to be checked during compilation (see the
       pcre2api documentation for details). If the number specified by the
       modifier is greater than zero, pcre2_set_compile_recursion_guard() is
       called to set up a callback from pcre2_compile() to a local function.
       The argument it receives is the current nesting parenthesis depth; if
       this is greater than the value given by the modifier, non-zero is
       returned, causing the compilation to be aborted.

   Using alternative character tables

       The value specified for the tables modifier must be one of the digits
       0, 1, 2, or 3. It causes a specific set of built-in character tables to
       be passed to pcre2_compile(). This is used in the PCRE2 tests to check
       behaviour with different character tables. The digit specifies the
       tables as follows:

         0   do not pass any special character tables
         1   the default ASCII tables, as distributed in
               pcre2_chartables.c.dist
         2   a set of tables defining ISO 8859 characters
         3   a set of tables loaded by the #loadtables command

       In tables 2, some characters whose codes are greater than 128 are
       identified as letters, digits, spaces, etc. Tables 3 can be used only
       after a #loadtables command has loaded them from a binary file. Setting
       alternate character tables and a locale are mutually exclusive.

   Setting certain match controls

       The following modifiers are really subject modifiers, and are described
       under "Subject Modifiers" below. However, they may be included in a
       pattern's modifier list, in which case they are applied to every
       subject line that is processed with that pattern. These modifiers do
       not affect the compilation process.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text
           altglobal                   alternative global matching
       /g  global                      global matching
           heapframes_size             show match data heapframes size
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           replace=<string>            specify a replacement string
           startchar                   show starting character when relevant
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution <n>
           substitute_stop=<n>         skip substitution <n> and following
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY

       These modifiers may not appear in a #pattern command. If you want them
       as defaults, set them in a #subject command.

   Specifying literal subject lines

       If the subject_literal modifier is present on a pattern, all the
       subject lines that it matches are taken as literal strings, with no
       interpretation of backslashes. It is not possible to set subject
       modifiers on such lines, but any that are set as defaults by a #subject
       command are recognized.

   Saving a compiled pattern

       When a pattern with the push modifier is successfully compiled, it is
       pushed onto a stack of compiled patterns, and pcre2test expects the
       next line to contain a new pattern (or a command) instead of a subject
       line. This facility is used when saving compiled patterns to a file, as
       described in the section entitled "Saving and restoring compiled
       patterns" below. If pushcopy is used instead of push, a copy of the
       compiled pattern is stacked, leaving the original as current, ready to
       match the following input lines. This provides a way of testing the
       pcre2_code_copy() function. The push and pushcopy modifiers are
       incompatible with compilation modifiers such as global that act at
       match time. Any that are specified are ignored (for the stacked copy),
       with a warning message, except for replace, which causes an error. Note
       that jitverify, which is allowed, does not carry through to any
       subsequent matching that uses a stacked pattern.

   Testing foreign pattern conversion

       The experimental foreign pattern conversion functions in PCRE2 can be
       tested by setting the convert modifier. Its argument is a
       colon-separated list of options, which set the equivalent option for
       the pcre2_pattern_convert() function:

         glob                    PCRE2_CONVERT_GLOB
         glob_no_starstar        PCRE2_CONVERT_GLOB_NO_STARSTAR
         glob_no_wild_separator  PCRE2_CONVERT_GLOB_NO_WILD_SEPARATOR
         posix_basic             PCRE2_CONVERT_POSIX_BASIC
         posix_extended          PCRE2_CONVERT_POSIX_EXTENDED
         unset                   Unset all options

       The "unset" value is useful for turning off a default that has been set
       by a #pattern command. When one of these options is set, the input
       pattern is passed to pcre2_pattern_convert(). If the conversion is
       successful, the result is reflected in the output and then passed to
       pcre2_compile().
       The normal utf and no_utf_check options, if set, cause the
       PCRE2_CONVERT_UTF and PCRE2_CONVERT_NO_UTF_CHECK options to be passed
       to pcre2_pattern_convert().

       By default, the conversion function is allowed to allocate a buffer for
       its output. However, if the convert_length modifier is set to a value
       greater than zero, pcre2test passes a buffer of the given length. This
       makes it possible to test the length check.

       The convert_glob_escape and convert_glob_separator modifiers can be
       used to specify the escape and separator characters for glob
       processing, overriding the defaults, which are operating-system
       dependent.


SUBJECT MODIFIERS

       The modifiers that can appear in subject lines and the #subject command
       are of two types.

   Setting match options

       The following modifiers set options for pcre2_match() or
       pcre2_dfa_match(). See pcre2api for a description of their effects.

           anchored                   set PCRE2_ANCHORED
           endanchored                set PCRE2_ENDANCHORED
           dfa_restart                set PCRE2_DFA_RESTART
           dfa_shortest               set PCRE2_DFA_SHORTEST
           disable_recurseloop_check  set PCRE2_DISABLE_RECURSELOOP_CHECK
           no_jit                     set PCRE2_NO_JIT
           no_utf_check               set PCRE2_NO_UTF_CHECK
           notbol                     set PCRE2_NOTBOL
           notempty                   set PCRE2_NOTEMPTY
           notempty_atstart           set PCRE2_NOTEMPTY_ATSTART
           noteol                     set PCRE2_NOTEOL
           partial_hard (or ph)       set PCRE2_PARTIAL_HARD
           partial_soft (or ps)       set PCRE2_PARTIAL_SOFT

       The partial matching modifiers are provided with abbreviations because
       they appear frequently in tests.

       If the posix or posix_nosub modifier was present on the pattern,
       causing the POSIX wrapper API to be used, the only option-setting
       modifiers that have any effect are notbol, notempty, and noteol,
       causing REG_NOTBOL, REG_NOTEMPTY, and REG_NOTEOL, respectively, to be
       passed to regexec().
       The other modifiers are ignored, with a warning message.

       There is one additional modifier that can be used with the POSIX
       wrapper. It is ignored (with a warning) if used for non-POSIX matching.

           posix_startend=<n>[:<m>]

       This causes the subject string to be passed to regexec() using the
       REG_STARTEND option, which uses offsets to specify which part of the
       string is searched. If only one number is given, the end offset is
       passed as the end of the subject string. For more detail of
       REG_STARTEND, see the pcre2posix documentation. If the subject string
       contains binary zeros (coded as escapes such as \x{00} because
       pcre2test does not support actual binary zeros in its input), you must
       use posix_startend to specify its length.

   Setting match controls

       The following modifiers affect the matching process or request
       additional information. Some of them may also be specified on a pattern
       line (see above), in which case they apply to every subject line that
       is matched against that pattern, but can be overridden by modifiers on
       the subject.

           aftertext                   show text after match
           allaftertext                show text after captures
           allcaptures                 show all captures
           allvector                   show the entire ovector
           allusedtext                 show all consulted text (non-JIT only)
           altglobal                   alternative global matching
           callout_capture             show captures at callout time
           callout_data=<n>            set a value to pass via callouts
           callout_error=<n>[:<m>]     control callout error
           callout_extra               show extra callout information
           callout_fail=<n>[:<m>]      control callout failure
           callout_no_where            do not show position of a callout
           callout_none                do not supply a callout function
           copy=<number or name>       copy captured substring
           depth_limit=<n>             set a depth limit
           dfa                         use pcre2_dfa_match()
           find_limits                 find heap, match and depth limits
           find_limits_noheap          find match and depth limits
           get=<number or name>        extract captured substring
           getall                      extract all captured substrings
       /g  global                      global matching
           heapframes_size             show match data heapframes size
           heap_limit=<n>              set a limit on heap memory (Kbytes)
           jitstack=<n>                set size of JIT stack
           mark                        show mark values
           match_limit=<n>             set a match limit
           memory                      show heap memory usage
           null_context                match with a NULL context
           null_replacement            substitute with NULL replacement
           null_subject                match with NULL subject
           offset=<n>                  set starting offset
           offset_limit=<n>            set offset limit
           ovector=<n>                 set size of output vector
           recursion_limit=<n>         obsolete synonym for depth_limit
           replace=<string>            specify a replacement string
           startchar                   show startchar when relevant
           startoffset=<n>             same as offset=<n>
           substitute_callout          use substitution callouts
           substitute_extended         use PCRE2_SUBSTITUTE_EXTENDED
           substitute_literal          use PCRE2_SUBSTITUTE_LITERAL
           substitute_matched          use PCRE2_SUBSTITUTE_MATCHED
           substitute_overflow_length  use PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
           substitute_replacement_only use PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
           substitute_skip=<n>         skip substitution number n
           substitute_stop=<n>         skip substitution number n and greater
           substitute_unknown_unset    use PCRE2_SUBSTITUTE_UNKNOWN_UNSET
           substitute_unset_empty      use PCRE2_SUBSTITUTE_UNSET_EMPTY
           zero_terminate              pass the subject as zero-terminated

       The effects of these modifiers are described in the following sections.
       When matching via the POSIX wrapper API, the aftertext, allaftertext,
       and ovector subject modifiers work as described below. All other
       modifiers are either ignored, with a warning message, or cause an
       error.

   Showing more text

       The aftertext modifier requests that as well as outputting the part of
       the subject string that matched the entire pattern, pcre2test should in
       addition output the remainder of the subject string. This is useful for
       tests where the subject contains multiple copies of the same substring.
       The allaftertext modifier requests the same action for captured
       substrings as well as the main matched substring. In each case the
       remainder is output on the following line with a plus character
       following the capture number.

       The allusedtext modifier requests that all the text that was consulted
       during a successful pattern match by the interpreter should be shown,
       for both full and partial matches. This feature is not supported for
       JIT matching, and if requested with JIT it is ignored (with a warning
       message). Setting this modifier affects the output if there is a
       lookbehind at the start of a match, or, for a complete match, a
       lookahead at the end, or if \K is used in the pattern. Characters that
       precede or follow the start and end of the actual match are indicated
       in the output by '<' or '>' characters underneath them.
       Here is an example:

           re> /(?<=pqr)abc(?=xyz)/
         data> 123pqrabcxyz456\=allusedtext
          0: pqrabcxyz
             <<<   >>>
         data> 123pqrabcxy\=ph,allusedtext
         Partial match: pqrabcxy
                        <<<

       The first, complete match shows that the matched string is "abc", with
       the preceding and following strings "pqr" and "xyz" having been
       consulted during the match (when processing the assertions). The
       partial match can indicate only the preceding string.

       The startchar modifier requests that the starting character for the
       match be indicated, if it is different to the start of the matched
       string. The only time when this occurs is when \K has been processed as
       part of the match. In this situation, the output for the matched string
       is displayed from the starting character instead of from the match
       point, with circumflex characters under the earlier characters. For
       example:

           re> /abc\Kxyz/
         data> abcxyz\=startchar
          0: abcxyz
             ^^^

       Unlike allusedtext, the startchar modifier can be used with JIT.
       However, these two modifiers are mutually exclusive.

   Showing the value of all capture groups

       The allcaptures modifier requests that the values of all potential
       captured parentheses be output after a match. By default, only those up
       to the highest one actually used in the match are output (corresponding
       to the return code from pcre2_match()). Groups that did not take part
       in the match are output as "<unset>". This modifier is not relevant for
       DFA matching (which does no capturing) and does not apply when replace
       is specified; it is ignored, with a warning message, if present.

   Showing the entire ovector, for all outcomes

       The allvector modifier requests that the entire ovector be shown,
       whatever the outcome of the match.
       Compare allcaptures, which shows only up to the maximum number of
       capture groups for the pattern, and then only for a successful complete
       non-DFA match. This modifier, which acts after any match result, and
       also for DFA matching, provides a means of checking that there are no
       unexpected modifications to ovector fields. Before each match attempt,
       the ovector is filled with a special value, and if this is found in
       both elements of a capturing pair, "<unchanged>" is output. After a
       successful match, this applies to all groups after the maximum capture
       group for the pattern. In other cases it applies to the entire ovector.
       After a partial match, the first two elements are the only ones that
       should be set. After a DFA match, the amount of ovector that is used
       depends on the number of matches that were found.

   Testing pattern callouts

       A callout function is supplied when pcre2test calls the library
       matching functions, unless callout_none is specified. Its behaviour can
       be controlled by various modifiers listed above whose names begin with
       callout_. Details are given in the section entitled "Callouts" below.
       Testing callouts from pcre2_substitute() is described separately in
       "Testing the substitution function" below.

   Finding all matches in a string

       Searching for all possible matches within a subject can be requested by
       the global or altglobal modifier. After finding a match, the matching
       function is called again to search the remainder of the subject. The
       difference between global and altglobal is that the former uses the
       start_offset argument to pcre2_match() or pcre2_dfa_match() to start
       searching at a new point within the entire string (which is what Perl
       does), whereas the latter passes over a shortened subject.
       This makes a difference to the matching process if the pattern begins
       with a lookbehind assertion (including \b or \B).

       If an empty string is matched, the next match is done with the
       PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED flags set, in order to search
       for another, non-empty, match at the same point in the subject. If this
       match fails, the start offset is advanced, and the normal match is
       retried. This imitates the way Perl handles such cases when using the
       /g modifier or the split() function. Normally, the start offset is
       advanced by one character, but if the newline convention recognizes
       CRLF as a newline, and the current character is CR followed by LF, an
       advance of two characters occurs.

   Testing substring extraction functions

       The copy and get modifiers can be used to test the
       pcre2_substring_copy_xxx() and pcre2_substring_get_xxx() functions.
       They can be given more than once, and each can specify a capture group
       name or number, for example:

          abcd\=copy=1,copy=3,get=G1

       If the #subject command is used to set default copy and/or get lists,
       these can be unset by specifying a negative number to cancel all
       numbered groups and an empty name to cancel all named groups.

       The getall modifier tests pcre2_substring_list_get(), which extracts
       all captured substrings.

       If the subject line is successfully matched, the substrings extracted
       by the convenience functions are output with C, G, or L after the
       string number instead of a colon. This is in addition to the normal
       full list. The string length (that is, the return from the extraction
       function) is given in parentheses after each substring, followed by the
       name when the extraction was by name.
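
       Returning to the empty-match handling described under "Finding all
       matches in a string" above, the rule for advancing the start offset can
       be sketched in a few lines of Python. This is an illustrative model
       only: it does not call PCRE2, it works on characters of a Python string
       rather than code units, and the function name next_start_offset and the
       crlf_is_newline flag are invented for this example.

```python
# Illustrative model only; it does not call PCRE2. The function name and
# the crlf_is_newline flag are hypothetical.
def next_start_offset(subject, offset, crlf_is_newline):
    """Return the offset at which the normal match is retried after the
    anchored, non-empty retry at `offset` has failed."""
    if crlf_is_newline and subject[offset:offset + 2] == "\r\n":
        return offset + 2  # step over CR and LF together
    return offset + 1      # the usual one-character advance


if __name__ == "__main__":
    print(next_start_offset("ab\r\ncd", 2, crlf_is_newline=True))   # 4
    print(next_start_offset("ab\r\ncd", 2, crlf_is_newline=False))  # 3
    print(next_start_offset("ab\r\ncd", 0, crlf_is_newline=True))   # 1
```

       The two-character step applies only when the newline convention treats
       CRLF as a newline; under any other convention a CR is advanced over
       like any other character.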

   Testing the substitution function

       If the replace modifier is set, the pcre2_substitute() function is
       called instead of one of the matching functions (or after one call of
       pcre2_match() in the case of PCRE2_SUBSTITUTE_MATCHED). Note that
       replacement strings cannot contain commas, because a comma signifies
       the end of a modifier. This is not thought to be an issue in a test
       program.

       Specifying a completely empty replacement string disables this
       modifier. However, it is possible to specify an empty replacement by
       providing a buffer length, as described below, for an otherwise empty
       replacement.

       Unlike subject strings, pcre2test does not process replacement strings
       for escape sequences. In UTF mode, a replacement string is checked to
       see if it is a valid UTF-8 string. If so, it is correctly converted to
       a UTF string of the appropriate code unit width. If it is not a valid
       UTF-8 string, the individual code units are copied directly. This
       provides a means of passing an invalid UTF-8 string for testing
       purposes.

       The following modifiers set options (in addition to the normal match
       options) for pcre2_substitute():

         global                      PCRE2_SUBSTITUTE_GLOBAL
         substitute_extended         PCRE2_SUBSTITUTE_EXTENDED
         substitute_literal          PCRE2_SUBSTITUTE_LITERAL
         substitute_matched          PCRE2_SUBSTITUTE_MATCHED
         substitute_overflow_length  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH
         substitute_replacement_only PCRE2_SUBSTITUTE_REPLACEMENT_ONLY
         substitute_unknown_unset    PCRE2_SUBSTITUTE_UNKNOWN_UNSET
         substitute_unset_empty      PCRE2_SUBSTITUTE_UNSET_EMPTY

       See the pcre2api documentation for details of these options.

       After a successful substitution, the modified string is output,
       preceded by the number of replacements. This may be zero if there were
       no matches.
       Here is a simple example of a substitution test:

         /abc/replace=xxx
             =abc=abc=
          1: =xxx=abc=
             =abc=abc=\=global
          2: =xxx=xxx=

       Subject and replacement strings should be kept relatively short (fewer
       than 256 characters) for substitution tests, as fixed-size buffers are
       used. To make it easy to test for buffer overflow, if the replacement
       string starts with a number in square brackets, that number is passed
       to pcre2_substitute() as the size of the output buffer, with the
       replacement string starting at the next character. Here is an example
       that tests the edge case:

         /abc/
             123abc123\=replace=[10]XYZ
          1: 123XYZ123
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory

       The default action of pcre2_substitute() is to return
       PCRE2_ERROR_NOMEMORY when the output buffer is too small. However, if
       the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set (by using the
       substitute_overflow_length modifier), pcre2_substitute() continues to
       go through the motions of matching and substituting (but not doing any
       callouts), in order to compute the size of buffer that is required.
       When this happens, pcre2test shows the required buffer length (which
       includes space for the trailing zero) as part of the error message. For
       example:

         /abc/substitute_overflow_length
             123abc123\=replace=[9]XYZ
         Failed: error -47: no more memory: 10 code units are needed

       A replacement string is ignored with POSIX and DFA matching. Specifying
       partial matching provokes an error return ("bad option value") from
       pcre2_substitute().

   Testing substitute callouts

       If the substitute_callout modifier is set, a substitution callout
       function is set up. The null_context modifier must not be set, because
       the address of the callout function is passed in a match context.
       When the callout function is called (after each substitution), details
       of the input and output strings are output. For example:

         /abc/g,replace=<$0>,substitute_callout
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc>"
          2(1) Old 6 9 "abc" New 8 13 "<abc>"
          2: <abc>def<abc>pqr

       The first number on each callout line is the count of matches. The
       parenthesized number is the number of pairs that are set in the ovector
       (that is, one more than the number of capturing groups that were set).
       Then come the offsets of the old substring, its contents, and the same
       for the replacement.

       By default, the substitution callout function returns zero, which ac-
       cepts the replacement and causes matching to continue if /g was used.
       Two further modifiers can be used to test other return values. If
       substitute_skip is set to a value greater than zero, the callout
       function returns +1 for the match of that number, and similarly
       substitute_stop returns -1. These cause the replacement to be rejected,
       and -1 causes no further matching to take place. If either of them is
       set, substitute_callout is assumed. For example:

         /abc/g,replace=<$0>,substitute_skip=1
             abcdefabcpqr
          1(1) Old 0 3 "abc" New 0 5 "<abc> SKIPPED"
          2(1) Old 6 9 "abc" New 6 11 "<abc>"
          2: abcdef<abc>pqr
             abcdefabcpqr\=substitute_stop=1
          1(1) Old 0 3 "abc" New 0 5 "<abc> STOPPED"
          1: abcdefabcpqr

       If both are set for the same number, stop takes precedence. Only a sin-
       gle skip or stop is supported, which is sufficient for testing that the
       feature works.

   Setting the JIT stack size

       The jitstack modifier provides a way of setting the maximum stack size
       that is used by the just-in-time optimization code. It is ignored if
       JIT optimization is not being used. The value is a number of kibibytes
       (units of 1024 bytes).
       Setting zero reverts to the default of 32KiB.
       Providing a stack that is larger than the default is necessary only for
       very complicated patterns. If jitstack is set non-zero on a subject
       line it overrides any value that was set on the pattern.

   Setting heap, match, and depth limits

       The heap_limit, match_limit, and depth_limit modifiers set the appro-
       priate limits in the match context. These values are ignored when the
       find_limits or find_limits_noheap modifier is specified.

   Finding minimum limits

       If the find_limits modifier is present on a subject line, pcre2test
       calls the relevant matching function several times, setting different
       values in the match context via pcre2_set_heap_limit(),
       pcre2_set_match_limit(), or pcre2_set_depth_limit() until it finds the
       smallest value for each parameter that allows the match to complete
       without a "limit exceeded" error. The match itself may succeed or fail.
       An alternative modifier, find_limits_noheap, omits the heap limit. This
       is used in the standard tests, because the minimum heap limit varies
       between systems. If JIT is being used, only the match limit is rele-
       vant, and the other two are automatically omitted.

       When using this modifier, the pattern should not contain any limit set-
       tings such as (*LIMIT_MATCH=...) within it. If such a setting is
       present and is lower than the minimum matching value, the minimum value
       cannot be found because pcre2_set_match_limit() etc. are only able to
       reduce the value of an in-pattern limit; they cannot increase it.

       For non-DFA matching, the minimum depth_limit number is a measure of
       how much nested backtracking happens (that is, how deeply the pattern's
       tree is searched).
       In the case of DFA matching, depth_limit controls
       the depth of recursive calls of the internal function that is used for
       handling pattern recursion, lookaround assertions, and atomic groups.

       For non-DFA matching, the match_limit number is a measure of the amount
       of backtracking that takes place, and learning the minimum value can be
       instructive. For most simple matches, the number is quite small, but
       for patterns with very large numbers of matching possibilities, it can
       become large very quickly with increasing length of subject string. In
       the case of DFA matching, match_limit controls the total number of
       calls, both recursive and non-recursive, to the internal matching func-
       tion, thus controlling the overall amount of computing resource that is
       used.

       For both kinds of matching, the heap_limit number, which is in
       kibibytes (units of 1024 bytes), limits the amount of heap memory used
       for matching.

   Showing MARK names

       The mark modifier causes the names from backtracking control verbs that
       are returned from calls to pcre2_match() to be displayed. If a mark is
       returned for a match, non-match, or partial match, pcre2test shows it.
       For a match, it is on a line by itself, tagged with "MK:". Otherwise,
       it is added to the non-match message.

   Showing memory usage

       The memory modifier causes pcre2test to log the sizes of all heap mem-
       ory allocation and freeing calls that occur during a call to
       pcre2_match() or pcre2_dfa_match(). In the latter case, heap memory is
       used only when a match requires more internal workspace than the de-
       fault allocation on the stack, so in many cases there will be no out-
       put. No heap memory is allocated during matching with JIT.
       For this
       modifier to work, the null_context modifier must not be set on both the
       pattern and the subject, though it can be set on one or the other.

   Showing the heap frame overall vector size

       The heapframes_size modifier is relevant for matches using
       pcre2_match() without JIT. After a match has run (whether successful or
       not) the size, in bytes, of the allocated heap frames vector that is
       left attached to the match data block is shown. If the matching action
       involved several calls to pcre2_match() (for example, global matching
       or for timing) only the final value is shown.

       This modifier is ignored, with a warning, for POSIX or DFA matching.
       JIT matching does not use the heap frames vector, so the size is always
       zero, unless there was a previous non-JIT match. Note that specifying a
       size of zero for the output vector (see below) causes pcre2test to free
       its match data block (and associated heap frames vector) and allocate a
       new one.

   Setting a starting offset

       The offset modifier sets an offset in the subject string at which
       matching starts. Its value is a number of code units, not characters.

   Setting an offset limit

       The offset_limit modifier sets a limit for unanchored matches. If a
       match cannot be found starting at or before this offset in the subject,
       a "no match" return is given. The data value is a number of code units,
       not characters. When this modifier is used, the use_offset_limit modi-
       fier must have been set for the pattern; if not, an error is generated.

   Setting the size of the output vector

       The ovector modifier applies only to the subject line in which it ap-
       pears, though of course it can also be used to set a default in a
       #subject command. It specifies the number of pairs of offsets that are
       available for storing matching information. The default is 15.
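
       As a rough illustration of what a vector of offset pairs holds, the
       sketch below uses Python's own re module rather than PCRE2, so the
       offsets are in characters, not code units, and the regex dialect is
       not identical. The helper name offset_pairs is hypothetical and is
       not part of pcre2test or the PCRE2 API. Pair 0 describes the whole
       match and pair n describes capture group n; the pairs argument only
       mimics limiting how many pairs are available.

```python
import re

def offset_pairs(pattern, subject, pairs=15):
    """Return at most `pairs` (start, end) offsets for the first match.

    Pair 0 describes the whole match; pair n describes capture group n.
    Python reports an unset group as (-1, -1).
    """
    match = re.search(pattern, subject)
    if match is None:
        return None
    every_pair = [match.span(i) for i in range(match.re.groups + 1)]
    return every_pair[:pairs]

print(offset_pairs(r"^abc(\d+)", "abc123"))           # whole match plus group 1
print(offset_pairs(r"^abc(\d+)", "abc123", pairs=1))  # only the whole match fits
```

       With room for a single pair, only the offsets of the whole match are
       kept, which is the situation a small ovector value is meant to
       exercise.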

       A value of zero is useful when testing the POSIX API because it causes
       regexec() to be called with a NULL capture vector. When not testing the
       POSIX API, a value of zero is used to cause
       pcre2_match_data_create_from_pattern() to be called, in order to create
       a new match block of exactly the right size for the pattern. (It is not
       possible to create a match block with a zero-length ovector; there is
       always at least one pair of offsets.) The old match data block is freed.

   Passing the subject as zero-terminated

       By default, the subject string is passed to a native API matching func-
       tion with its correct length. In order to test the facility for passing
       a zero-terminated string, the zero_terminate modifier is provided. It
       causes the length to be passed as PCRE2_ZERO_TERMINATED. When matching
       via the POSIX interface, this modifier is ignored, with a warning.

       When testing pcre2_substitute(), this modifier also has the effect of
       passing the replacement string as zero-terminated.

   Passing a NULL context, subject, or replacement

       Normally, pcre2test passes a context block to pcre2_match(),
       pcre2_dfa_match(), pcre2_jit_match() or pcre2_substitute(). If the
       null_context modifier is set, however, NULL is passed. This is for
       testing that the matching and substitution functions behave correctly
       in this case (they use default values). This modifier cannot be used
       with the find_limits, find_limits_noheap, or substitute_callout modi-
       fiers.

       Similarly, for testing purposes, if the null_subject or
       null_replacement modifier is set, the subject or replacement string
       pointers are passed as NULL, respectively, to the relevant functions.


THE ALTERNATIVE MATCHING FUNCTION

       By default, pcre2test uses the standard PCRE2 matching function,
       pcre2_match() to match each subject line.
       PCRE2 also supports an alter-
       native matching function, pcre2_dfa_match(), which operates in a dif-
       ferent way, and has some restrictions. The differences between the two
       functions are described in the pcre2matching documentation.

       If the dfa modifier is set, the alternative matching function is used.
       This function finds all possible matches at a given point in the sub-
       ject. If, however, the dfa_shortest modifier is set, processing stops
       after the first match is found. This is always the shortest possible
       match.


DEFAULT OUTPUT FROM pcre2test

       This section describes the output when the normal matching function,
       pcre2_match(), is being used.

       When a match succeeds, pcre2test outputs the list of captured sub-
       strings, starting with number 0 for the string that matched the whole
       pattern. Otherwise, it outputs "No match" when the return is
       PCRE2_ERROR_NOMATCH, or "Partial match:" followed by the partially
       matching substring when the return is PCRE2_ERROR_PARTIAL. (Note that
       this is the entire substring that was inspected during the partial
       match; it may include characters before the actual match start if a
       lookbehind assertion, \K, \b, or \B was involved.)

       For any other return, pcre2test outputs the PCRE2 negative error number
       and a short descriptive phrase. If the error is a failed UTF string
       check, the code unit offset of the start of the failing character is
       also output. Here is an example of an interactive pcre2test run.

         $ pcre2test
         PCRE2 version 10.22 2016-07-29

           re> /^abc(\d+)/
         data> abc123
          0: abc123
          1: 123
         data> xyz
         No match

       Unset capturing substrings that are not followed by one that is set are
       not shown by pcre2test unless the allcaptures modifier is specified.
       In
       the following example, there are two capturing substrings, but when the
       first data line is matched, the second, unset substring is not shown.
       An "internal" unset substring is shown as "<unset>", as for the second
       data line.

           re> /(a)|(b)/
         data> a
          0: a
          1: a
         data> b
          0: b
          1: <unset>
          2: b

       If the strings contain any non-printing characters, they are output as
       \xhh escapes if the value is less than 256 and UTF mode is not set.
       Otherwise they are output as \x{hh...} escapes. See below for the defi-
       nition of non-printing characters. If the aftertext modifier is set,
       the output for substring 0 is followed by the rest of the subject
       string, identified by "0+" like this:

           re> /cat/aftertext
         data> cataract
          0: cat
          0+ aract

       If global matching is requested, the results of successive matching at-
       tempts are output in sequence, like this:

           re> /\Bi(\w\w)/g
         data> Mississippi
          0: iss
          1: ss
          0: iss
          1: ss
          0: ipp
          1: pp

       "No match" is output only if the first match attempt fails. Here is an
       example of a failure message (the offset 4 that is specified by the
       offset modifier is past the end of the subject string):

           re> /xyz/
         data> xyz\=offset=4
         Error -24 (bad offset value)

       Note that whereas patterns can be continued over several lines (a plain
       ">" prompt is used for continuations), subject lines may not. However
       newlines can be included in a subject by means of the \n escape (or \r,
       \r\n, etc., depending on the newline sequence setting).


OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

       When the alternative matching function, pcre2_dfa_match(), is used, the
       output consists of a list of all the matches that start at the first
       point in the subject where there is at least one match.
       For example:

           re> /(tang|tangerine|tan)/
         data> yellow tangerine\=dfa
          0: tangerine
          1: tang
          2: tan

       Using the normal matching function on this data finds only "tang". The
       longest matching string is always given first (and numbered zero). Af-
       ter a PCRE2_ERROR_PARTIAL return, the output is "Partial match:", fol-
       lowed by the partially matching substring. Note that this is the entire
       substring that was inspected during the partial match; it may include
       characters before the actual match start if a lookbehind assertion, \b,
       or \B was involved. (\K is not supported for DFA matching.)

       If global matching is requested, the search for further matches resumes
       at the end of the longest match. For example:

           re> /(tang|tangerine|tan)/g
         data> yellow tangerine and tangy sultana\=dfa
          0: tangerine
          1: tang
          2: tan
          0: tang
          1: tan
          0: tan

       The alternative matching function does not support substring capture,
       so the modifiers that are concerned with captured substrings are not
       relevant.


RESTARTING AFTER A PARTIAL MATCH

       When the alternative matching function has given the
       PCRE2_ERROR_PARTIAL return, indicating that the subject partially
       matched the pattern, you can restart the match with additional subject
       data by means of the dfa_restart modifier. For example:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=ps,dfa
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       For further information about partial matching, see the pcre2partial
       documentation.


CALLOUTS

       If the pattern contains any callout requests, pcre2test's callout func-
       tion is called during matching unless callout_none is specified.
       This
       works with both matching functions, and with JIT, though there are some
       differences in behaviour. The output for callouts with numerical argu-
       ments and those with string arguments is slightly different.

   Callouts with numerical arguments

       By default, the callout function displays the callout number, the start
       and current positions in the subject text at the callout time, and the
       next pattern item to be tested. For example:

         --->pqrabcdef
           0    ^  ^     \d

       This output indicates that callout number 0 occurred for a match at-
       tempt starting at the fourth character of the subject string, when the
       pointer was at the seventh character, and when the next pattern item
       was \d. Just one circumflex is output if the start and current posi-
       tions are the same, or if the current position precedes the start posi-
       tion, which can happen if the callout is in a lookbehind assertion.

       Callouts numbered 255 are assumed to be automatic callouts, inserted as
       a result of the auto_callout pattern modifier. In this case, instead of
       showing the callout number, the offset in the pattern, preceded by a
       plus, is output. For example:

           re> /\d?[A-E]\*/auto_callout
         data> E*
         --->E*
          +0 ^      \d?
          +3 ^      [A-E]
          +8 ^^     \*
         +10 ^ ^
          0: E*

       If a pattern contains (*MARK) items, an additional line is output when-
       ever a change of latest mark is passed to the callout function. For ex-
       ample:

           re> /a(*MARK:X)bc/auto_callout
         data> abc
         --->abc
          +0 ^       a
          +1 ^^      (*MARK:X)
         +10 ^^      b
         Latest Mark: X
         +11 ^ ^     c
         +12 ^  ^
          0: abc

       The mark changes between matching "a" and "b", but stays the same for
       the rest of the match, so nothing more is output. If, as a result of
       backtracking, the mark reverts to being unset, the text "<unset>" is
       output.
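
       The placement of the circumflexes can be sketched in a few lines of
       Python. This is only an approximation of the layout described above,
       not pcre2test's own code: the function name callout_line and the
       amount of padding before the pattern item are assumptions made for
       illustration. It places one circumflex under the start position and
       a second under the current position only when the current position
       lies beyond the start.

```python
def callout_line(label, start, current, next_item, subject_length):
    """Build the marker line printed under a '--->subject' echo line."""
    marks = [" "] * (subject_length + 1)  # one extra slot for end of subject
    marks[start] = "^"                    # start of the match attempt
    if current > start:                   # second caret only when past the start
        marks[current] = "^"
    # The label occupies the same four columns as the "--->" prefix.
    return f"{label:>3} " + "".join(marks) + "    " + next_item

subject = "pqrabcdef"
print("--->" + subject)
print(callout_line("0", 3, 6, r"\d", len(subject)))
```

       For an automatic callout the label would be the pattern offset with
       a leading plus, for example callout_line("+3", 0, 0, "[A-E]", 2).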

   Callouts with string arguments

       The output for a callout with a string argument is similar, except that
       instead of outputting a callout number before the position indicators,
       the callout string and its offset in the pattern string are output be-
       fore the reflection of the subject string, and the subject string is
       reflected for each callout. For example:

           re> /^ab(?C'first')cd(?C"second")ef/
         data> abcdefg
         Callout (7): 'first'
         --->abcdefg
             ^ ^         c
         Callout (20): "second"
         --->abcdefg
             ^   ^       e
          0: abcdef

   Callout modifiers

       The callout function in pcre2test returns zero (carry on matching) by
       default, but you can use a callout_fail modifier in a subject line to
       change this and other parameters of the callout (see below).

       If the callout_capture modifier is set, the current captured groups are
       output when a callout occurs. This is useful only for non-DFA matching,
       as pcre2_dfa_match() does not support capturing, so no captures are
       ever shown.

       The normal callout output, showing the callout number or pattern offset
       (as described above) is suppressed if the callout_no_where modifier is
       set.

       When using the interpretive matching function pcre2_match() without
       JIT, setting the callout_extra modifier causes additional output from
       pcre2test's callout function to be generated. For the first callout in
       a match attempt at a new starting position in the subject, "New match
       attempt" is output. If there has been a backtrack since the last call-
       out (or start of matching if this is the first callout), "Backtrack" is
       output, followed by "No other matching paths" if the backtrack ended
       the previous match attempt.
       For example:

           re> /(a+)b/auto_callout,no_start_optimize,no_auto_possess
         data> aac\=callout_extra
         New match attempt
         --->aac
          +0 ^       (
          +1 ^       a+
          +3 ^ ^     )
          +4 ^ ^     b
         Backtrack
         --->aac
          +3 ^^      )
          +4 ^^      b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0  ^      (
          +1  ^      a+
          +3  ^^     )
          +4  ^^     b
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0   ^     (
          +1   ^     a+
         Backtrack
         No other matching paths
         New match attempt
         --->aac
          +0    ^    (
          +1    ^    a+
         No match

       Notice that various optimizations must be turned off if you want all
       possible matching paths to be scanned. If no_start_optimize is not
       used, there is an immediate "no match", without any callouts, because
       the starting optimization fails to find "b" in the subject, which it
       knows must be present for any match. If no_auto_possess is not used,
       the "a+" item is turned into "a++", which reduces the number of back-
       tracks.

       The callout_extra modifier has no effect if used with the DFA matching
       function, or with JIT.

   Return values from callouts

       The default return from the callout function is zero, which allows
       matching to continue. The callout_fail modifier can be given one or two
       numbers. If there is only one number, 1 is returned instead of 0 (caus-
       ing matching to backtrack) when a callout of that number is reached. If
       two numbers (<n>:<m>) are given, 1 is returned when callout <n> is
       reached and there have been at least <m> callouts. The callout_error
       modifier is similar, except that PCRE2_ERROR_CALLOUT is returned, caus-
       ing the entire matching process to be aborted. If both these modifiers
       are set for the same callout number, callout_error takes precedence.
       Note that callouts with string arguments are always given the number
       zero.
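
       The rules in the preceding paragraph can be summarised as a small
       decision function. The Python sketch below is a model of the
       documented behaviour, not pcre2test's source: the name
       callout_return, the (n, m) tuple representation, the convention that
       the count includes the current callout, and the numeric value given
       for PCRE2_ERROR_CALLOUT are all assumptions made for illustration.

```python
PCRE2_ERROR_CALLOUT = -37  # assumed value; check pcre2.h for your release

def callout_return(number, count, fail=None, error=None):
    """Pick the value a callout would return under the rules above.

    number: the callout's number (zero for callouts with string arguments).
    count:  how many callouts have happened so far, this one included.
    fail, error: None, or an (n, m) pair mirroring callout_fail=<n>:<m>
                 and callout_error=<n>:<m>; use m=0 when only <n> is given.
    """
    def triggered(setting):
        if setting is None:
            return False
        n, m = setting
        return number == n and count >= m

    if triggered(error):   # callout_error takes precedence over callout_fail
        return PCRE2_ERROR_CALLOUT
    if triggered(fail):
        return 1           # makes matching backtrack
    return 0               # carry on matching

print(callout_return(1, 3, fail=(1, 3)))                # threshold reached
print(callout_return(1, 1, fail=(1, 0), error=(1, 0)))  # error wins
```

       Keeping both settings in the same shape makes the precedence rule a
       matter of check order only.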

       The callout_data modifier can be given an unsigned or a negative num-
       ber. This is set as the "user data" that is passed to the matching
       function, and passed back when the callout function is invoked. Any
       value other than zero is used as a return from pcre2test's callout
       function.

       Inserting callouts can be helpful when using pcre2test to check compli-
       cated regular expressions. For further information about callouts, see
       the pcre2callout documentation.


NON-PRINTING CHARACTERS

       When pcre2test is outputting text in the compiled version of a pattern,
       bytes other than 32-126 are always treated as non-printing characters
       and are therefore shown as hex escapes.

       When pcre2test is outputting text that is a matched part of a subject
       string, it behaves in the same way, unless a different locale has been
       set for the pattern (using the locale modifier). In this case, the
       isprint() function is used to distinguish printing and non-printing
       characters.


SAVING AND RESTORING COMPILED PATTERNS

       It is possible to save compiled patterns on disc or elsewhere, and re-
       load them later, subject to a number of restrictions. JIT data cannot
       be saved. The host on which the patterns are reloaded must be running
       the same version of PCRE2, with the same code unit width, and must also
       have the same endianness, pointer width and PCRE2_SIZE type. Before
       compiled patterns can be saved they must be serialized, that is, con-
       verted to a stream of bytes. A single byte stream may contain any num-
       ber of compiled patterns, but they must all use the same character ta-
       bles. A single copy of the tables is included in the byte stream (its
       size is 1088 bytes).

       The functions whose names begin with pcre2_serialize_ are used for se-
       rializing and de-serializing.
       They are described in the pcre2serialize
       documentation. In this section we describe the features of pcre2test
       that can be used to test these functions.

       Note that "serialization" in PCRE2 does not convert compiled patterns
       to an abstract format like Java or .NET. It just makes a reloadable
       byte code stream. Hence the restrictions on reloading mentioned above.

       In pcre2test, when a pattern with the push modifier is successfully
       compiled, it is pushed onto a stack of compiled patterns, and pcre2test
       expects the next line to contain a new pattern (or command) instead of
       a subject line. By contrast, the pushcopy modifier causes a copy of the
       compiled pattern to be stacked, leaving the original available for im-
       mediate matching. By using push and/or pushcopy, a number of patterns
       can be compiled and retained. These modifiers are incompatible with
       posix, and control modifiers that act at match time are ignored (with a
       message) for the stacked patterns. The jitverify modifier applies only
       at compile time.

       The command

         #save <filename>

       causes all the stacked patterns to be serialized and the result written
       to the named file. Afterwards, all the stacked patterns are freed. The
       command

         #load <filename>

       reads the data in the file, and then arranges for it to be de-serial-
       ized, with the resulting compiled patterns added to the pattern stack.
       The pattern on the top of the stack can be retrieved by the #pop com-
       mand, which must be followed by lines of subjects that are to be
       matched with the pattern, terminated as usual by an empty line or end
       of file. This command may be followed by a modifier list containing
       only control modifiers that act after a pattern has been compiled. In
       particular, hex, posix, posix_nosub, push, and pushcopy are not al-
       lowed, nor are any option-setting modifiers.
       The JIT modifiers are,
       however, permitted. Here is an example that saves and reloads two pat-
       terns.

         /abc/push
         /xyz/push
         #save tempfile
         #load tempfile
         #pop info
         xyz

         #pop jit,bincode
         abc

       If jitverify is used with #pop, it does not automatically imply jit,
       which is different behaviour from when it is used on a pattern.

       The #popcopy command is analogous to the pushcopy modifier in that it
       makes current a copy of the topmost stack pattern, leaving the original
       still on the stack.


SEE ALSO

       pcre2(3), pcre2api(3), pcre2callout(3), pcre2jit(3), pcre2matching(3),
       pcre2partial(3), pcre2pattern(3), pcre2serialize(3).


AUTHOR

       Philip Hazel
       Retired from University Computing Service
       Cambridge, England.


REVISION

       Last updated: 24 April 2024
       Copyright (c) 1997-2024 University of Cambridge.


PCRE 10.44                       24 April 2024                    PCRE2TEST(1)