<html>
<head>
<title>pcre2 specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2 man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">INTRODUCTION</a>
<li><a name="TOC2" href="#SEC2">SECURITY CONSIDERATIONS</a>
<li><a name="TOC3" href="#SEC3">USER DOCUMENTATION</a>
<li><a name="TOC4" href="#SEC4">AUTHOR</a>
<li><a name="TOC5" href="#SEC5">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">INTRODUCTION</a><br>
<P>
PCRE2 is the name used for a revised API for the PCRE library, which is a set
of functions, written in C, that implement regular expression pattern matching
using the same syntax and semantics as Perl, with just a few differences. After
nearly two decades, the limitations of the original API were making development
increasingly difficult.
The new API is more extensible, and it was simplified
by abolishing the separate "study" optimizing function; in PCRE2, patterns are
automatically optimized where possible. Since forking from PCRE1, the code has
been extensively refactored and new features introduced. The old library is now
obsolete and is no longer maintained.
</P>
<P>
As well as Perl-style regular expression patterns, some features that appeared
in Python and the original PCRE before they appeared in Perl are available
using the Python syntax. There is also some support for one or two .NET and
Oniguruma syntax items, and there are options for requesting some minor changes
that give better ECMAScript (aka JavaScript) compatibility.
</P>
<P>
The source code for PCRE2 can be compiled to support strings of 8-bit, 16-bit,
or 32-bit code units, which means that up to three separate libraries may be
installed, one for each code unit size. The size of code unit is not related to
the bit size of the underlying hardware. In a 64-bit environment that also
supports 32-bit applications, versions of PCRE2 that are compiled in both
64-bit and 32-bit modes may be needed.
</P>
<P>
The original work to extend PCRE to 16-bit and 32-bit code units was done by
Zoltan Herczeg and Christian Persch, respectively.
In all three cases, strings
can be interpreted either as one character per code unit, or as UTF-encoded
Unicode, with support for Unicode general category properties. Unicode support
is optional at build time (but is the default). However, processing strings as
UTF code units must be enabled explicitly at run time. The version of Unicode
in use can be discovered by running
<pre>
  pcre2test -C
</PRE>
</P>
<P>
The three libraries contain identical sets of functions, with names ending in
_8, _16, or _32, respectively (for example, <b>pcre2_compile_8()</b>). However,
by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or 32, a program that uses just
one code unit width can be written using generic names such as
<b>pcre2_compile()</b>, and the documentation is written assuming that this is
the case.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE2 contains an
alternative function that matches the same compiled patterns in a different
way. In certain circumstances, the alternative function has some advantages.
For a discussion of the two matching algorithms, see the
<a href="pcre2matching.html"><b>pcre2matching</b></a>
page.
</P>
<P>
Details of exactly which Perl regular expression features are and are not
supported by PCRE2 are given in separate documents. See the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
and
<a href="pcre2compat.html"><b>pcre2compat</b></a>
pages. There is a syntax summary in the
<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
page.
</P>
<P>
Some features of PCRE2 can be included, excluded, or changed when the library
is built. The
<a href="pcre2_config.html"><b>pcre2_config()</b></a>
function makes it possible for a client to discover which features are
available. The features themselves are described in the
<a href="pcre2build.html"><b>pcre2build</b></a>
page. Documentation about building PCRE2 for various operating systems can be
found in the
<a href="README.txt"><b>README</b></a>
and
<a href="NON-AUTOTOOLS-BUILD.txt"><b>NON-AUTOTOOLS-BUILD</b></a>
files in the source distribution.
</P>
<P>
The libraries contain a number of undocumented internal functions and data
tables that are used by more than one of the exported external functions, but
which are not intended for use by external callers. Their names all begin with
"_pcre2", which hopefully will not provoke any name clashes.
In some
environments, it is possible to control which external symbols are exported
when a shared library is built, and in these cases the undocumented symbols are
not exported.
</P>
<br><a name="SEC2" href="#TOC1">SECURITY CONSIDERATIONS</a><br>
<P>
If you are using PCRE2 in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern. For example, an
8-bit pattern that begins with "(*UTF)" turns on UTF-8 mode, which interprets
patterns and subjects as strings of UTF-8 code units instead of individual
8-bit characters. This causes both the pattern and any data against which it is
matched to be checked for UTF-8 validity. If the data string is very long, such
a check might consume enough resources to degrade your application's
performance.
</P>
<P>
One way of guarding against this possibility is to use the
<b>pcre2_pattern_info()</b> function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when calling
<b>pcre2_compile()</b>. This causes a compile-time error if the pattern contains
a UTF-setting sequence.
</P>
<P>
The use of Unicode properties for character types such as \d can also be
enabled from within the pattern, by specifying "(*UCP)". This feature can be
disallowed by setting the PCRE2_NEVER_UCP option.
</P>
<P>
If your application is one that supports UTF, be aware that validity checking
can take time. If the same data string is to be matched many times, you can use
the PCRE2_NO_UTF_CHECK option for the second and subsequent matches to avoid
running redundant checks.
</P>
<P>
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead to
problems, because it may leave the current matching point in the middle of a
multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C option can be used by an
application to lock out the use of \C, causing a compile-time error if it is
encountered. It is also possible to build PCRE2 with the use of \C permanently
disabled.
</P>
<P>
Another way that performance can be hit is by running a pattern that has a very
large search tree against a string that will never match. Nested unlimited
repeats in a pattern are a common example. PCRE2 provides some protection
against this: see the <b>pcre2_set_match_limit()</b> function in the
<a href="pcre2api.html"><b>pcre2api</b></a>
page.
There is a similar function called <b>pcre2_set_depth_limit()</b> that can
be used to restrict the amount of memory that is used.
</P>
<br><a name="SEC3" href="#TOC1">USER DOCUMENTATION</a><br>
<P>
The user documentation for PCRE2 comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
the descriptions of the <b>pcre2grep</b> and <b>pcre2test</b> programs are in
files called <b>pcre2grep.txt</b> and <b>pcre2test.txt</b>, respectively. The
remaining sections, except for the <b>pcre2demo</b> section (which is a program
listing), and the short pages for individual functions, are concatenated in
<b>pcre2.txt</b>, for ease of searching.
The sections are as follows:
<pre>
  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
  pcre2callout       details of the pattern callout feature
  pcre2compat        discussion of Perl compatibility
  pcre2convert       details of pattern conversion functions
  pcre2demo          a demonstration C program that uses PCRE2
  pcre2grep          description of the <b>pcre2grep</b> command (8-bit only)
  pcre2jit           discussion of just-in-time optimization support
  pcre2limits        details of size and other limits
  pcre2matching      discussion of the two matching algorithms
  pcre2partial       details of the partial matching facility
  pcre2pattern       syntax and semantics of supported regular expression patterns
  pcre2perform       discussion of performance issues
  pcre2posix         the POSIX-compatible C API for the 8-bit library
  pcre2sample        discussion of the pcre2demo program
  pcre2serialize     details of pattern serialization
  pcre2syntax        quick syntax reference
  pcre2test          description of the <b>pcre2test</b> command
  pcre2unicode       discussion of Unicode and UTF support
</pre>
In the "man" and HTML formats, there is also a short page for each C library
function, listing its arguments and results.
</P>
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<P>
Putting an actual email address here is a spam magnet. If you want to email me,
use my two names separated by a dot at gmail.com.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
Last updated: 27 August 2021
<br>
Copyright © 1997-2021 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>