1*22dc650dSSadaf Ebrahimi<html>
2*22dc650dSSadaf Ebrahimi<head>
3*22dc650dSSadaf Ebrahimi<title>pcre2unicode specification</title>
4*22dc650dSSadaf Ebrahimi</head>
5*22dc650dSSadaf Ebrahimi<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6*22dc650dSSadaf Ebrahimi<h1>pcre2unicode man page</h1>
7*22dc650dSSadaf Ebrahimi<p>
8*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>.
9*22dc650dSSadaf Ebrahimi</p>
10*22dc650dSSadaf Ebrahimi<p>
11*22dc650dSSadaf EbrahimiThis page is part of the PCRE2 HTML documentation. It was generated
12*22dc650dSSadaf Ebrahimiautomatically from the original man page. If there is any nonsense in it,
13*22dc650dSSadaf Ebrahimiplease consult the man page, in case the conversion went wrong.
14*22dc650dSSadaf Ebrahimi<br>
15*22dc650dSSadaf Ebrahimi<br><b>
16*22dc650dSSadaf EbrahimiUNICODE AND UTF SUPPORT
17*22dc650dSSadaf Ebrahimi</b><br>
18*22dc650dSSadaf Ebrahimi<P>
19*22dc650dSSadaf EbrahimiPCRE2 is normally built with Unicode support, though if you do not need it, you
20*22dc650dSSadaf Ebrahimican build it without, in which case the library will be smaller. With Unicode
21*22dc650dSSadaf Ebrahimisupport, PCRE2 has knowledge of Unicode character properties and can process
22*22dc650dSSadaf Ebrahimistrings of text in UTF-8, UTF-16, and UTF-32 format (depending on the code unit
23*22dc650dSSadaf Ebrahimiwidth), but this is not the default. Unless specifically requested, PCRE2
24*22dc650dSSadaf Ebrahimitreats each code unit in a string as one character.
25*22dc650dSSadaf Ebrahimi</P>
26*22dc650dSSadaf Ebrahimi<P>
27*22dc650dSSadaf EbrahimiThere are two ways of telling PCRE2 to switch to UTF mode, where characters may
28*22dc650dSSadaf Ebrahimiconsist of more than one code unit and the range of values is constrained. The
29*22dc650dSSadaf Ebrahimiprogram can call
30*22dc650dSSadaf Ebrahimi<a href="pcre2_compile.html"><b>pcre2_compile()</b></a>
31*22dc650dSSadaf Ebrahimiwith the PCRE2_UTF option, or the pattern may start with the sequence (*UTF).
32*22dc650dSSadaf EbrahimiHowever, the latter facility can be locked out by the PCRE2_NEVER_UTF option.
33*22dc650dSSadaf EbrahimiThat is, the programmer can prevent the supplier of the pattern from switching
34*22dc650dSSadaf Ebrahimito UTF mode.
35*22dc650dSSadaf Ebrahimi</P>
36*22dc650dSSadaf Ebrahimi<P>
37*22dc650dSSadaf EbrahimiNote that the PCRE2_MATCH_INVALID_UTF option (see
<a href="#matchinvalid">below</a>)
39*22dc650dSSadaf Ebrahimiforces PCRE2_UTF to be set.
40*22dc650dSSadaf Ebrahimi</P>
41*22dc650dSSadaf Ebrahimi<P>
42*22dc650dSSadaf EbrahimiIn UTF mode, both the pattern and any subject strings that are matched against
43*22dc650dSSadaf Ebrahimiit are treated as UTF strings instead of strings of individual one-code-unit
44*22dc650dSSadaf Ebrahimicharacters. There are also some other changes to the way characters are
45*22dc650dSSadaf Ebrahimihandled, as documented below.
46*22dc650dSSadaf Ebrahimi</P>
47*22dc650dSSadaf Ebrahimi<br><b>
48*22dc650dSSadaf EbrahimiUNICODE PROPERTY SUPPORT
49*22dc650dSSadaf Ebrahimi</b><br>
50*22dc650dSSadaf Ebrahimi<P>
51*22dc650dSSadaf EbrahimiWhen PCRE2 is built with Unicode support, the escape sequences \p{..},
52*22dc650dSSadaf Ebrahimi\P{..}, and \X can be used. This is not dependent on the PCRE2_UTF setting.
53*22dc650dSSadaf EbrahimiThe Unicode properties that can be tested are a subset of those that Perl
54*22dc650dSSadaf Ebrahimisupports. Currently they are limited to the general category properties such as
55*22dc650dSSadaf EbrahimiLu for an upper case letter or Nd for a decimal number, the derived properties
56*22dc650dSSadaf EbrahimiAny and LC (synonym L&), the Unicode script names such as Arabic or Han,
57*22dc650dSSadaf EbrahimiBidi_Class, Bidi_Control, and a few binary properties.
58*22dc650dSSadaf Ebrahimi</P>
59*22dc650dSSadaf Ebrahimi<P>
60*22dc650dSSadaf EbrahimiThe full lists are given in the
61*22dc650dSSadaf Ebrahimi<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
62*22dc650dSSadaf Ebrahimiand
63*22dc650dSSadaf Ebrahimi<a href="pcre2syntax.html"><b>pcre2syntax</b></a>
64*22dc650dSSadaf Ebrahimidocumentation. In general, only the short names for properties are supported.
65*22dc650dSSadaf EbrahimiFor example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is not
66*22dc650dSSadaf Ebrahimisupported. Furthermore, in Perl, many properties may optionally be prefixed by
67*22dc650dSSadaf Ebrahimi"Is", for compatibility with Perl 5.6. PCRE2 does not support this.
68*22dc650dSSadaf Ebrahimi</P>
69*22dc650dSSadaf Ebrahimi<br><b>
70*22dc650dSSadaf EbrahimiWIDE CHARACTERS AND UTF MODES
71*22dc650dSSadaf Ebrahimi</b><br>
72*22dc650dSSadaf Ebrahimi<P>
73*22dc650dSSadaf EbrahimiCode points less than 256 can be specified in patterns by either braced or
74*22dc650dSSadaf Ebrahimiunbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger
75*22dc650dSSadaf Ebrahimivalues have to use braced sequences. Unbraced octal code points up to \777 are
76*22dc650dSSadaf Ebrahimialso recognized; larger ones can be coded using \o{...}.
77*22dc650dSSadaf Ebrahimi</P>
78*22dc650dSSadaf Ebrahimi<P>
79*22dc650dSSadaf EbrahimiThe escape sequence \N{U+&#60;hex digits&#62;} is recognized as another way of
80*22dc650dSSadaf Ebrahimispecifying a Unicode character by code point in a UTF mode. It is not allowed
81*22dc650dSSadaf Ebrahimiin non-UTF mode.
82*22dc650dSSadaf Ebrahimi</P>
83*22dc650dSSadaf Ebrahimi<P>
84*22dc650dSSadaf EbrahimiIn UTF mode, repeat quantifiers apply to complete UTF characters, not to
85*22dc650dSSadaf Ebrahimiindividual code units.
86*22dc650dSSadaf Ebrahimi</P>
87*22dc650dSSadaf Ebrahimi<P>
88*22dc650dSSadaf EbrahimiIn UTF mode, the dot metacharacter matches one UTF character instead of a
89*22dc650dSSadaf Ebrahimisingle code unit.
90*22dc650dSSadaf Ebrahimi</P>
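<P>
The difference between counting code units and counting characters can be
sketched for the 8-bit case as follows. The helper name
<b>utf8_count_chars()</b> is hypothetical and is not part of PCRE2, and the
sketch assumes that the input is already valid UTF-8.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Counts UTF-8 characters in
   s[0..len) by counting every byte that is not a continuation byte
   (continuation bytes have the top two bits 10). Assumes valid UTF-8. */
size_t utf8_count_chars(const unsigned char *s, size_t len)
{
    size_t i, n = 0;
    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++)
        if ((s[i] & 0xc0) != 0x80) n++;
    return n;
}
```

<P>
For a six-byte subject holding "a", U+00E9 and U+20AC, this returns 3. In UTF
mode a pattern of three dots would consume the whole subject, whereas in
non-UTF mode it would consume only the first three code units.
</P>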
91*22dc650dSSadaf Ebrahimi<P>
92*22dc650dSSadaf EbrahimiIn UTF mode, capture group names are not restricted to ASCII, and may contain
93*22dc650dSSadaf Ebrahimiany Unicode letters and decimal digits, as well as underscore.
94*22dc650dSSadaf Ebrahimi</P>
95*22dc650dSSadaf Ebrahimi<P>
96*22dc650dSSadaf EbrahimiThe escape sequence \C can be used to match a single code unit in UTF mode,
97*22dc650dSSadaf Ebrahimibut its use can lead to some strange effects because it breaks up multi-unit
98*22dc650dSSadaf Ebrahimicharacters (see the description of \C in the
99*22dc650dSSadaf Ebrahimi<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
100*22dc650dSSadaf Ebrahimidocumentation). For this reason, there is a build-time option that disables
101*22dc650dSSadaf Ebrahimisupport for \C completely. There is also a less draconian compile-time option
102*22dc650dSSadaf Ebrahimifor locking out the use of \C when a pattern is compiled.
103*22dc650dSSadaf Ebrahimi</P>
104*22dc650dSSadaf Ebrahimi<P>
105*22dc650dSSadaf EbrahimiThe use of \C is not supported by the alternative matching function
106*22dc650dSSadaf Ebrahimi<b>pcre2_dfa_match()</b> when in UTF-8 or UTF-16 mode, that is, when a character
107*22dc650dSSadaf Ebrahimimay consist of more than one code unit. The use of \C in these modes provokes
108*22dc650dSSadaf Ebrahimia match-time error. Also, the JIT optimization does not support \C in these
109*22dc650dSSadaf Ebrahimimodes. If JIT optimization is requested for a UTF-8 or UTF-16 pattern that
110*22dc650dSSadaf Ebrahimicontains \C, it will not succeed, and so when <b>pcre2_match()</b> is called,
111*22dc650dSSadaf Ebrahimithe matching will be carried out by the interpretive function.
112*22dc650dSSadaf Ebrahimi</P>
113*22dc650dSSadaf Ebrahimi<P>
114*22dc650dSSadaf EbrahimiThe character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
115*22dc650dSSadaf Ebrahimicharacters of any code value, but, by default, the characters that PCRE2
116*22dc650dSSadaf Ebrahimirecognizes as digits, spaces, or word characters remain the same set as in
117*22dc650dSSadaf Ebrahiminon-UTF mode, all with code points less than 256. This remains true even when
118*22dc650dSSadaf EbrahimiPCRE2 is built to include Unicode support, because to do otherwise would slow
119*22dc650dSSadaf Ebrahimidown matching in many common cases. Note that this also applies to \b
120*22dc650dSSadaf Ebrahimiand \B, because they are defined in terms of \w and \W. If you want
121*22dc650dSSadaf Ebrahimito test for a wider sense of, say, "digit", you can use explicit Unicode
122*22dc650dSSadaf Ebrahimiproperty tests such as \p{Nd}. Alternatively, if you set the PCRE2_UCP option,
123*22dc650dSSadaf Ebrahimithe way that the character escapes work is changed so that Unicode properties
124*22dc650dSSadaf Ebrahimiare used to determine which characters match, though there are some options
125*22dc650dSSadaf Ebrahimithat suppress this for individual escapes. For details see the section on
126*22dc650dSSadaf Ebrahimi<a href="pcre2pattern.html#genericchartypes">generic character types</a>
127*22dc650dSSadaf Ebrahimiin the
128*22dc650dSSadaf Ebrahimi<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
129*22dc650dSSadaf Ebrahimidocumentation.
130*22dc650dSSadaf Ebrahimi</P>
131*22dc650dSSadaf Ebrahimi<P>
132*22dc650dSSadaf EbrahimiLike the escapes, characters that match the POSIX named character classes are
133*22dc650dSSadaf Ebrahimiall low-valued characters unless the PCRE2_UCP option is set, but there is an
134*22dc650dSSadaf Ebrahimioption to override this.
135*22dc650dSSadaf Ebrahimi</P>
136*22dc650dSSadaf Ebrahimi<P>
137*22dc650dSSadaf EbrahimiIn contrast to the character escapes and character classes, the special
138*22dc650dSSadaf Ebrahimihorizontal and vertical white space escapes (\h, \H, \v, and \V) do match
139*22dc650dSSadaf Ebrahimiall the appropriate Unicode characters, whether or not PCRE2_UCP is set.
140*22dc650dSSadaf Ebrahimi</P>
141*22dc650dSSadaf Ebrahimi<br><b>
142*22dc650dSSadaf EbrahimiUNICODE CASE-EQUIVALENCE
143*22dc650dSSadaf Ebrahimi</b><br>
144*22dc650dSSadaf Ebrahimi<P>
145*22dc650dSSadaf EbrahimiIf either PCRE2_UTF or PCRE2_UCP is set, upper/lower case processing makes use
146*22dc650dSSadaf Ebrahimiof Unicode properties except for characters whose code points are less than 128
147*22dc650dSSadaf Ebrahimiand that have at most two case-equivalent values. For these, a direct table
148*22dc650dSSadaf Ebrahimilookup is used for speed. A few Unicode characters such as Greek sigma have
149*22dc650dSSadaf Ebrahimimore than two code points that are case-equivalent, and these are treated
150*22dc650dSSadaf Ebrahimispecially. Setting PCRE2_UCP without PCRE2_UTF allows Unicode-style case
151*22dc650dSSadaf Ebrahimiprocessing for non-UTF character encodings such as UCS-2.
152*22dc650dSSadaf Ebrahimi</P>
153*22dc650dSSadaf Ebrahimi<P>
154*22dc650dSSadaf EbrahimiThere are two ASCII characters (S and K) that, in addition to their ASCII lower
155*22dc650dSSadaf Ebrahimicase equivalents, have a non-ASCII one as well (long S and Kelvin sign).
156*22dc650dSSadaf EbrahimiRecognition of these non-ASCII characters as case-equivalent to their ASCII
157*22dc650dSSadaf Ebrahimicounterparts can be disabled by setting the PCRE2_EXTRA_CASELESS_RESTRICT
158*22dc650dSSadaf Ebrahimioption. When this is set, all characters in a case equivalence must either be
159*22dc650dSSadaf EbrahimiASCII or non-ASCII; there can be no mixing.
160*22dc650dSSadaf Ebrahimi<a name="scriptruns"></a></P>
161*22dc650dSSadaf Ebrahimi<br><b>
162*22dc650dSSadaf EbrahimiSCRIPT RUNS
163*22dc650dSSadaf Ebrahimi</b><br>
164*22dc650dSSadaf Ebrahimi<P>
165*22dc650dSSadaf EbrahimiThe pattern constructs (*script_run:...) and (*atomic_script_run:...), with
166*22dc650dSSadaf Ebrahimisynonyms (*sr:...) and (*asr:...), verify that the string matched within the
167*22dc650dSSadaf Ebrahimiparentheses is a script run. In concept, a script run is a sequence of
168*22dc650dSSadaf Ebrahimicharacters that are all from the same Unicode script. However, because some
169*22dc650dSSadaf Ebrahimiscripts are commonly used together, and because some diacritical and other
170*22dc650dSSadaf Ebrahimimarks are used with multiple scripts, it is not that simple.
171*22dc650dSSadaf Ebrahimi</P>
172*22dc650dSSadaf Ebrahimi<P>
173*22dc650dSSadaf EbrahimiEvery Unicode character has a Script property, mostly with a value
174*22dc650dSSadaf Ebrahimicorresponding to the name of a script, such as Latin, Greek, or Cyrillic. There
175*22dc650dSSadaf Ebrahimiare also three special values:
176*22dc650dSSadaf Ebrahimi</P>
177*22dc650dSSadaf Ebrahimi<P>
178*22dc650dSSadaf Ebrahimi"Unknown" is used for code points that have not been assigned, and also for the
179*22dc650dSSadaf Ebrahimisurrogate code points. In the PCRE2 32-bit library, characters whose code
180*22dc650dSSadaf Ebrahimipoints are greater than the Unicode maximum (U+10FFFF), which are accessible
181*22dc650dSSadaf Ebrahimionly in non-UTF mode, are assigned the Unknown script.
182*22dc650dSSadaf Ebrahimi</P>
183*22dc650dSSadaf Ebrahimi<P>
184*22dc650dSSadaf Ebrahimi"Common" is used for characters that are used with many scripts. These include
185*22dc650dSSadaf Ebrahimipunctuation, emoji, mathematical, musical, and currency symbols, and the ASCII
186*22dc650dSSadaf Ebrahimidigits 0 to 9.
187*22dc650dSSadaf Ebrahimi</P>
188*22dc650dSSadaf Ebrahimi<P>
189*22dc650dSSadaf Ebrahimi"Inherited" is used for characters such as diacritical marks that modify a
190*22dc650dSSadaf Ebrahimiprevious character. These are considered to take on the script of the character
191*22dc650dSSadaf Ebrahimithat they modify.
192*22dc650dSSadaf Ebrahimi</P>
193*22dc650dSSadaf Ebrahimi<P>
194*22dc650dSSadaf EbrahimiSome Inherited characters are used with many scripts, but many of them are only
195*22dc650dSSadaf Ebrahiminormally used with a small number of scripts. For example, U+102E0 (Coptic
196*22dc650dSSadaf EbrahimiEpact thousands mark) is used only with Arabic and Coptic. In order to make it
197*22dc650dSSadaf Ebrahimipossible to check this, a Unicode property called Script Extension exists. Its
198*22dc650dSSadaf Ebrahimivalue is a list of scripts that apply to the character. For the majority of
199*22dc650dSSadaf Ebrahimicharacters, the list contains just one script, the same one as the Script
200*22dc650dSSadaf Ebrahimiproperty. However, for characters such as U+102E0 more than one Script is
201*22dc650dSSadaf Ebrahimilisted. There are also some Common characters that have a single, non-Common
202*22dc650dSSadaf Ebrahimiscript in their Script Extension list.
203*22dc650dSSadaf Ebrahimi</P>
204*22dc650dSSadaf Ebrahimi<P>
205*22dc650dSSadaf EbrahimiThe next section describes the basic rules for deciding whether a given string
206*22dc650dSSadaf Ebrahimiof characters is a script run. Note, however, that there are some special cases
207*22dc650dSSadaf Ebrahimiinvolving the Chinese Han script, and an additional constraint for decimal
208*22dc650dSSadaf Ebrahimidigits. These are covered in subsequent sections.
209*22dc650dSSadaf Ebrahimi</P>
210*22dc650dSSadaf Ebrahimi<br><b>
211*22dc650dSSadaf EbrahimiBasic script run rules
212*22dc650dSSadaf Ebrahimi</b><br>
213*22dc650dSSadaf Ebrahimi<P>
214*22dc650dSSadaf EbrahimiA string that is less than two characters long is a script run. This is the
215*22dc650dSSadaf Ebrahimionly case in which an Unknown character can be part of a script run. Longer
216*22dc650dSSadaf Ebrahimistrings are checked using only the Script Extensions property, not the basic
217*22dc650dSSadaf EbrahimiScript property.
218*22dc650dSSadaf Ebrahimi</P>
219*22dc650dSSadaf Ebrahimi<P>
220*22dc650dSSadaf EbrahimiIf a character's Script Extension property is the single value "Inherited", it
221*22dc650dSSadaf Ebrahimiis always accepted as part of a script run. This is also true for the property
222*22dc650dSSadaf Ebrahimi"Common", subject to the checking of decimal digits described below. All the
223*22dc650dSSadaf Ebrahimiremaining characters in a script run must have at least one script in common in
224*22dc650dSSadaf Ebrahimitheir Script Extension lists. In set-theoretic terminology, the intersection of
225*22dc650dSSadaf Ebrahimiall the sets of scripts must not be empty.
226*22dc650dSSadaf Ebrahimi</P>
227*22dc650dSSadaf Ebrahimi<P>
228*22dc650dSSadaf EbrahimiA simple example is an Internet name such as "google.com". The letters are all
229*22dc650dSSadaf Ebrahimiin the Latin script, and the dot is Common, so this string is a script run.
230*22dc650dSSadaf EbrahimiHowever, the Cyrillic letter "o" looks exactly the same as the Latin "o"; a
string that looks the same but uses Cyrillic "o"s is not a script run.
232*22dc650dSSadaf Ebrahimi</P>
233*22dc650dSSadaf Ebrahimi<P>
234*22dc650dSSadaf EbrahimiMore interesting examples involve characters with more than one script in their
235*22dc650dSSadaf EbrahimiScript Extension. Consider the following characters:
236*22dc650dSSadaf Ebrahimi<pre>
237*22dc650dSSadaf Ebrahimi  U+060C  Arabic comma
238*22dc650dSSadaf Ebrahimi  U+06D4  Arabic full stop
239*22dc650dSSadaf Ebrahimi</pre>
240*22dc650dSSadaf EbrahimiThe first has the Script Extension list Arabic, Hanifi Rohingya, Syriac, and
241*22dc650dSSadaf EbrahimiThaana; the second has just Arabic and Hanifi Rohingya. Both of them could
242*22dc650dSSadaf Ebrahimiappear in script runs of either Arabic or Hanifi Rohingya. The first could also
243*22dc650dSSadaf Ebrahimiappear in Syriac or Thaana script runs, but the second could not.
244*22dc650dSSadaf Ebrahimi</P>
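<P>
The intersection rule can be sketched with bit sets. Everything named here
(the SC_ flags and <b>is_basic_script_run()</b>) is hypothetical and covers
only a handful of scripts. Common and Inherited are modelled as "all bits
set" so that they never narrow the intersection. The Han special cases and
the decimal digit constraint are deliberately left out.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit flags for a few scripts; real Unicode data has many more. */
#define SC_ARABIC   0x01u
#define SC_ROHINGYA 0x02u
#define SC_SYRIAC   0x04u
#define SC_THAANA   0x08u
#define SC_LATIN    0x10u
#define SC_CYRILLIC 0x20u
#define SC_ANY      0xffffffffu  /* stands in for Common or Inherited */

/* ext[i] is the Script Extension set of character i, as a bit mask.
   Returns 1 if the basic rule accepts the string as a script run. */
int is_basic_script_run(const unsigned *ext, size_t n)
{
    unsigned common = SC_ANY;
    size_t i;
    assert(ext != NULL || n == 0);
    if (n < 2) return 1;             /* fewer than two characters: always a run */
    for (i = 0; i < n; i++)
        common &= ext[i];            /* running intersection of the sets */
    return common != 0;              /* the intersection must not be empty */
}
```

<P>
With these assumptions, a Syriac letter followed by U+060C is accepted, but a
Syriac letter followed by U+06D4 is not, because the second character's list
lacks Syriac.
</P>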
245*22dc650dSSadaf Ebrahimi<br><b>
246*22dc650dSSadaf EbrahimiThe Chinese Han script
247*22dc650dSSadaf Ebrahimi</b><br>
248*22dc650dSSadaf Ebrahimi<P>
249*22dc650dSSadaf EbrahimiThe Chinese Han script is commonly used in conjunction with other scripts for
250*22dc650dSSadaf Ebrahimiwriting certain languages. Japanese uses the Hiragana and Katakana scripts
251*22dc650dSSadaf Ebrahimitogether with Han; Korean uses Hangul and Han; Taiwanese Mandarin uses Bopomofo
252*22dc650dSSadaf Ebrahimiand Han. These three combinations are treated as special cases when checking
253*22dc650dSSadaf Ebrahimiscript runs and are, in effect, "virtual scripts". Thus, a script run may
254*22dc650dSSadaf Ebrahimicontain a mixture of Hiragana, Katakana, and Han, or a mixture of Hangul and
255*22dc650dSSadaf EbrahimiHan, or a mixture of Bopomofo and Han, but not, for example, a mixture of
256*22dc650dSSadaf EbrahimiHangul and Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical
257*22dc650dSSadaf EbrahimiStandard 39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/)
258*22dc650dSSadaf Ebrahimiin allowing such mixtures.
259*22dc650dSSadaf Ebrahimi</P>
260*22dc650dSSadaf Ebrahimi<br><b>
261*22dc650dSSadaf EbrahimiDecimal digits
262*22dc650dSSadaf Ebrahimi</b><br>
263*22dc650dSSadaf Ebrahimi<P>
264*22dc650dSSadaf EbrahimiUnicode contains many sets of 10 decimal digits in different scripts, and some
265*22dc650dSSadaf Ebrahimiscripts (including the Common script) contain more than one set. Some of these
decimal digits are visually indistinguishable from the common ASCII
267*22dc650dSSadaf Ebrahimidigits. In addition to the script checking described above, if a script run
268*22dc650dSSadaf Ebrahimicontains any decimal digits, they must all come from the same set of 10
269*22dc650dSSadaf Ebrahimiadjacent characters.
270*22dc650dSSadaf Ebrahimi</P>
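<P>
A minimal sketch of the digit constraint follows. It assumes a small,
illustrative table of "digit zero" code points (ASCII, Arabic-Indic, Extended
Arabic-Indic, Devanagari, and Fullwidth); a real implementation would consult
the full Unicode data. The function names are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First code point of a few decimal digit sets (illustrative subset only). */
static const uint32_t digit_zero[] = { 0x0030, 0x0660, 0x06F0, 0x0966, 0xFF10 };

/* Returns the digit-zero code point of the set containing cp, or 0 if cp
   is not a decimal digit known to this small table. */
static uint32_t digit_set_of(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof digit_zero / sizeof digit_zero[0]; i++)
        if (cp >= digit_zero[i] && cp <= digit_zero[i] + 9)
            return digit_zero[i];
    return 0;
}

/* Returns 1 if every decimal digit in cps[0..n) comes from the same set of
   10 adjacent code points, otherwise 0. Non-digits are ignored. */
int digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t seen = 0;
    size_t i;
    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        uint32_t set = digit_set_of(cps[i]);
        if (set == 0) continue;          /* not a digit */
        if (seen == 0) seen = set;       /* first digit fixes the set */
        else if (seen != set) return 0;  /* digit from a different set */
    }
    return 1;
}
```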
271*22dc650dSSadaf Ebrahimi<br><b>
272*22dc650dSSadaf EbrahimiVALIDITY OF UTF STRINGS
273*22dc650dSSadaf Ebrahimi</b><br>
274*22dc650dSSadaf Ebrahimi<P>
275*22dc650dSSadaf EbrahimiWhen the PCRE2_UTF option is set, the strings passed as patterns and subjects
276*22dc650dSSadaf Ebrahimiare (by default) checked for validity on entry to the relevant functions. If an
277*22dc650dSSadaf Ebrahimiinvalid UTF string is passed, a negative error code is returned. The code unit
278*22dc650dSSadaf Ebrahimioffset to the offending character can be extracted from the match data block by
279*22dc650dSSadaf Ebrahimicalling <b>pcre2_get_startchar()</b>, which is used for this purpose after a UTF
280*22dc650dSSadaf Ebrahimierror.
281*22dc650dSSadaf Ebrahimi</P>
282*22dc650dSSadaf Ebrahimi<P>
283*22dc650dSSadaf EbrahimiIn some situations, you may already know that your strings are valid, and
284*22dc650dSSadaf Ebrahimitherefore want to skip these checks in order to improve performance, for
285*22dc650dSSadaf Ebrahimiexample in the case of a long subject string that is being scanned repeatedly.
286*22dc650dSSadaf EbrahimiIf you set the PCRE2_NO_UTF_CHECK option at compile time or at match time,
287*22dc650dSSadaf EbrahimiPCRE2 assumes that the pattern or subject it is given (respectively) contains
288*22dc650dSSadaf Ebrahimionly valid UTF code unit sequences.
289*22dc650dSSadaf Ebrahimi</P>
290*22dc650dSSadaf Ebrahimi<P>
291*22dc650dSSadaf EbrahimiIf you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result
292*22dc650dSSadaf Ebrahimiis undefined and your program may crash or loop indefinitely or give incorrect
293*22dc650dSSadaf Ebrahimiresults. There is, however, one mode of matching that can handle invalid UTF
294*22dc650dSSadaf Ebrahimisubject strings. This is enabled by passing PCRE2_MATCH_INVALID_UTF to
<b>pcre2_compile()</b> and is discussed <a href="#matchinvalid">below</a>. The rest of
296*22dc650dSSadaf Ebrahimithis section covers the case when PCRE2_MATCH_INVALID_UTF is not set.
297*22dc650dSSadaf Ebrahimi</P>
298*22dc650dSSadaf Ebrahimi<P>
299*22dc650dSSadaf EbrahimiPassing PCRE2_NO_UTF_CHECK to <b>pcre2_compile()</b> just disables the UTF check
300*22dc650dSSadaf Ebrahimifor the pattern; it does not also apply to subject strings. If you want to
301*22dc650dSSadaf Ebrahimidisable the check for a subject string you must pass this same option to
302*22dc650dSSadaf Ebrahimi<b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>.
303*22dc650dSSadaf Ebrahimi</P>
304*22dc650dSSadaf Ebrahimi<P>
UTF-16 and UTF-32 strings can indicate their endianness by a special code known
306*22dc650dSSadaf Ebrahimias a byte-order mark (BOM). The PCRE2 functions do not handle this, expecting
307*22dc650dSSadaf Ebrahimistrings to be in host byte order.
308*22dc650dSSadaf Ebrahimi</P>
309*22dc650dSSadaf Ebrahimi<P>
310*22dc650dSSadaf EbrahimiUnless PCRE2_NO_UTF_CHECK is set, a UTF string is checked before any other
311*22dc650dSSadaf Ebrahimiprocessing takes place. In the case of <b>pcre2_match()</b> and
312*22dc650dSSadaf Ebrahimi<b>pcre2_dfa_match()</b> calls with a non-zero starting offset, the check is
313*22dc650dSSadaf Ebrahimiapplied only to that part of the subject that could be inspected during
314*22dc650dSSadaf Ebrahimimatching, and there is a check that the starting offset points to the first
315*22dc650dSSadaf Ebrahimicode unit of a character or to the end of the subject. If there are no
316*22dc650dSSadaf Ebrahimilookbehind assertions in the pattern, the check starts at the starting offset.
317*22dc650dSSadaf EbrahimiOtherwise, it starts at the length of the longest lookbehind before the
318*22dc650dSSadaf Ebrahimistarting offset, or at the start of the subject if there are not that many
319*22dc650dSSadaf Ebrahimicharacters before the starting offset. Note that the sequences \b and \B are
320*22dc650dSSadaf Ebrahimione-character lookbehinds.
321*22dc650dSSadaf Ebrahimi</P>
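<P>
The calculation of where the validity check begins can be sketched for the
8-bit case as below. The function <b>utf8_check_start()</b> and its
parameters are hypothetical names used for illustration. It steps back one
character at a time from the starting offset, skipping UTF-8 continuation
bytes, and stops early at the start of the subject.
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Returns the code unit offset
   at which a validity check would begin, given the starting offset of the
   match and the longest lookbehind in the pattern, measured in characters. */
size_t utf8_check_start(const unsigned char *subject, size_t start_offset,
                        size_t max_lookbehind)
{
    size_t pos = start_offset;
    assert(subject != NULL);
    while (max_lookbehind-- > 0 && pos > 0) {
        pos--;                                   /* step back one code unit */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80)
            pos--;                               /* skip continuation bytes */
    }
    return pos;
}
```

<P>
With no lookbehind the result is the starting offset itself. A pattern that
contains \b behaves as if it had a one-character lookbehind, so the check
begins one character earlier.
</P>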
322*22dc650dSSadaf Ebrahimi<P>
323*22dc650dSSadaf EbrahimiIn addition to checking the format of the string, there is a check to ensure
324*22dc650dSSadaf Ebrahimithat all code points lie in the range U+0 to U+10FFFF, excluding the surrogate
325*22dc650dSSadaf Ebrahimiarea. The so-called "non-character" code points are not excluded because
326*22dc650dSSadaf EbrahimiUnicode corrigendum #9 makes it clear that they should not be.
327*22dc650dSSadaf Ebrahimi</P>
328*22dc650dSSadaf Ebrahimi<P>
329*22dc650dSSadaf EbrahimiCharacters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
330*22dc650dSSadaf Ebrahimiwhere they are used in pairs to encode code points with values greater than
331*22dc650dSSadaf Ebrahimi0xFFFF. The code points that are encoded by UTF-16 pairs are available
332*22dc650dSSadaf Ebrahimiindependently in the UTF-8 and UTF-32 encodings. (In other words, the whole
333*22dc650dSSadaf Ebrahimisurrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
334*22dc650dSSadaf EbrahimiUTF-32.)
335*22dc650dSSadaf Ebrahimi</P>
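<P>
As a worked illustration of how UTF-16 uses the surrogate area, the following
sketch splits a code point greater than 0xFFFF into a surrogate pair using
the standard UTF-16 arithmetic. The function name is hypothetical.
</P>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encodes a code point in the range U+10000 to
   U+10FFFF as a UTF-16 high/low surrogate pair. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v;
    assert(cp > 0xFFFF && cp <= 0x10FFFF);
    v = cp - 0x10000;                          /* 20-bit value */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

<P>
For example, U+1F600 becomes the pair 0xD83D, 0xDE00. Because the values
0xD800 to 0xDFFF are consumed in this way, they cannot stand for characters
in their own right in any UTF encoding.
</P>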
336*22dc650dSSadaf Ebrahimi<P>
337*22dc650dSSadaf EbrahimiSetting PCRE2_NO_UTF_CHECK at compile time does not disable the error that is
338*22dc650dSSadaf Ebrahimigiven if an escape sequence for an invalid Unicode code point is encountered in
339*22dc650dSSadaf Ebrahimithe pattern. If you want to allow escape sequences such as \x{d800} (a
340*22dc650dSSadaf Ebrahimisurrogate code point) you can set the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra
341*22dc650dSSadaf Ebrahimioption. However, this is possible only in UTF-8 and UTF-32 modes, because these
342*22dc650dSSadaf Ebrahimivalues are not representable in UTF-16.
343*22dc650dSSadaf Ebrahimi<a name="utf8strings"></a></P>
344*22dc650dSSadaf Ebrahimi<br><b>
345*22dc650dSSadaf EbrahimiErrors in UTF-8 strings
346*22dc650dSSadaf Ebrahimi</b><br>
347*22dc650dSSadaf Ebrahimi<P>
348*22dc650dSSadaf EbrahimiThe following negative error codes are given for invalid UTF-8 strings:
349*22dc650dSSadaf Ebrahimi<pre>
350*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR1
351*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR2
352*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR3
353*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR4
354*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR5
355*22dc650dSSadaf Ebrahimi</pre>
356*22dc650dSSadaf EbrahimiThe string ends with a truncated UTF-8 character; the code specifies how many
357*22dc650dSSadaf Ebrahimibytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
358*22dc650dSSadaf Ebrahimino longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
359*22dc650dSSadaf Ebrahimiallows for up to 6 bytes, and this is checked first; hence the possibility of
360*22dc650dSSadaf Ebrahimi4 or 5 missing bytes.
361*22dc650dSSadaf Ebrahimi<pre>
362*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR6
363*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR7
364*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR8
365*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR9
366*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR10
367*22dc650dSSadaf Ebrahimi</pre>
368*22dc650dSSadaf EbrahimiThe two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
369*22dc650dSSadaf Ebrahimicharacter do not have the binary value 0b10 (that is, either the most
370*22dc650dSSadaf Ebrahimisignificant bit is 0, or the next bit is 1).
371*22dc650dSSadaf Ebrahimi<pre>
372*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR11
373*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR12
374*22dc650dSSadaf Ebrahimi</pre>
375*22dc650dSSadaf EbrahimiA character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
376*22dc650dSSadaf Ebrahimithese code points are excluded by RFC 3629.
377*22dc650dSSadaf Ebrahimi<pre>
378*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR13
379*22dc650dSSadaf Ebrahimi</pre>
380*22dc650dSSadaf EbrahimiA 4-byte character has a value greater than 0x10ffff; these code points are
381*22dc650dSSadaf Ebrahimiexcluded by RFC 3629.
382*22dc650dSSadaf Ebrahimi<pre>
383*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR14
384*22dc650dSSadaf Ebrahimi</pre>
A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
from UTF-8.
388*22dc650dSSadaf Ebrahimi<pre>
389*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR15
390*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR16
391*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR17
392*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR18
393*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR19
394*22dc650dSSadaf Ebrahimi</pre>
395*22dc650dSSadaf EbrahimiA 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
396*22dc650dSSadaf Ebrahimivalue that can be represented by fewer bytes, which is invalid. For example,
397*22dc650dSSadaf Ebrahimithe two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
398*22dc650dSSadaf Ebrahimione byte.
399*22dc650dSSadaf Ebrahimi<pre>
400*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR20
401*22dc650dSSadaf Ebrahimi</pre>
402*22dc650dSSadaf EbrahimiThe two most significant bits of the first byte of a character have the binary
403*22dc650dSSadaf Ebrahimivalue 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
404*22dc650dSSadaf Ebrahimibyte can only validly occur as the second or subsequent byte of a multi-byte
405*22dc650dSSadaf Ebrahimicharacter.
406*22dc650dSSadaf Ebrahimi<pre>
407*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF8_ERR21
408*22dc650dSSadaf Ebrahimi</pre>
409*22dc650dSSadaf EbrahimiThe first byte of a character has the value 0xfe or 0xff. These values can
410*22dc650dSSadaf Ebrahiminever occur in a valid UTF-8 string.
411*22dc650dSSadaf Ebrahimi<a name="utf16strings"></a></P>
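<P>
The categories above can be illustrated with a small stand-alone validator.
This is a sketch, not PCRE2's own checking code: the name <b>utf8_check()</b>
is hypothetical, the order in which the tests are applied is an assumption,
and the return value is simply the number n from the matching
PCRE2_ERROR_UTF8_ERRn description (0 means valid).
</P>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical validator. Returns 0 if s[0..len) is valid UTF-8, otherwise
   a number 1..21 following the ERRn descriptions. On failure, *erroffset
   is the offset of the first byte of the offending character. */
int utf8_check(const unsigned char *s, size_t len, size_t *erroffset)
{
    size_t i = 0;
    assert((s != NULL || len == 0) && erroffset != NULL);
    while (i < len) {
        size_t start = i, extra, k;
        unsigned c = s[i++], d;

        if (c < 0x80) continue;            /* one-byte (ASCII) character */
        *erroffset = start;
        if (c < 0xc0) return 20;           /* continuation byte as first byte */
        if (c >= 0xfe) return 21;          /* 0xfe and 0xff are never valid */

        /* Number of continuation bytes implied by the first byte (1 to 5). */
        extra = c < 0xe0 ? 1 : c < 0xf0 ? 2 : c < 0xf8 ? 3 : c < 0xfc ? 4 : 5;
        if (len - i < extra)
            return (int)(extra - (len - i));      /* 1..5 bytes missing */

        for (k = 0; k < extra; k++)               /* top bits must be 10 */
            if ((s[i + k] & 0xc0) != 0x80) return 6 + (int)k;

        d = s[i];                                 /* second byte */
        i += extra;
        switch (extra) {
        case 1:
            if ((c & 0x3e) == 0) return 15;               /* overlong 2-byte */
            break;
        case 2:
            if (c == 0xe0 && (d & 0x20) == 0) return 16;  /* overlong 3-byte */
            if (c == 0xed && d >= 0xa0) return 14;        /* surrogate */
            break;
        case 3:
            if (c == 0xf0 && (d & 0x30) == 0) return 17;  /* overlong 4-byte */
            if (c > 0xf4 || (c == 0xf4 && d > 0x8f)) return 13; /* > 0x10ffff */
            break;
        case 4:
            if (c == 0xf8 && (d & 0x38) == 0) return 18;  /* overlong 5-byte */
            return 11;                                    /* 5 bytes long */
        default:
            if (c == 0xfc && (d & 0x3c) == 0) return 19;  /* overlong 6-byte */
            return 12;                                    /* 6 bytes long */
        }
    }
    return 0;
}
```

<P>
With this sketch, the two bytes 0xc0, 0xae from the overlong example yield 15,
and a subject that ends partway through a 3-byte character with one byte
missing yields 1.
</P>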
412*22dc650dSSadaf Ebrahimi<br><b>
413*22dc650dSSadaf EbrahimiErrors in UTF-16 strings
414*22dc650dSSadaf Ebrahimi</b><br>
415*22dc650dSSadaf Ebrahimi<P>
416*22dc650dSSadaf EbrahimiThe following negative error codes are given for invalid UTF-16 strings:
417*22dc650dSSadaf Ebrahimi<pre>
418*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
419*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
420*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
421*22dc650dSSadaf Ebrahimi
422*22dc650dSSadaf Ebrahimi<a name="utf32strings"></a></PRE>
423*22dc650dSSadaf Ebrahimi</P>
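<P>
The three UTF-16 conditions can be sketched as follows. The function
<b>utf16_check()</b> is a hypothetical stand-in, not PCRE2's checker, and its
return values 1 to 3 mirror the numbering of the error codes above (0 means
valid).
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validator for host-byte-order UTF-16. On failure,
   *erroffset is the code unit offset of the offending surrogate. */
int utf16_check(const uint16_t *s, size_t len, size_t *erroffset)
{
    size_t i;
    assert((s != NULL || len == 0) && erroffset != NULL);
    for (i = 0; i < len; i++) {
        uint16_t c = s[i];
        if ((c & 0xF800) != 0xD800) continue;   /* not a surrogate */
        *erroffset = i;
        if (c >= 0xDC00) return 3;              /* isolated low surrogate */
        if (i + 1 >= len) return 1;             /* low surrogate missing at end */
        if ((s[i + 1] & 0xFC00) != 0xDC00)
            return 2;                           /* high not followed by low */
        i++;                                    /* skip the low surrogate */
    }
    return 0;
}
```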
424*22dc650dSSadaf Ebrahimi<br><b>
425*22dc650dSSadaf EbrahimiErrors in UTF-32 strings
426*22dc650dSSadaf Ebrahimi</b><br>
427*22dc650dSSadaf Ebrahimi<P>
428*22dc650dSSadaf EbrahimiThe following negative error codes are given for invalid UTF-32 strings:
429*22dc650dSSadaf Ebrahimi<pre>
430*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
431*22dc650dSSadaf Ebrahimi  PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
432*22dc650dSSadaf Ebrahimi
433*22dc650dSSadaf Ebrahimi<a name="matchinvalid"></a></PRE>
434*22dc650dSSadaf Ebrahimi</P>
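<P>
In UTF-32 each code unit is a complete code point, so the check reduces to
two range tests. A sketch, using the hypothetical name <b>utf32_check()</b>
and return values 1 and 2 that mirror the numbering above:
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validator for UTF-32. Returns 0 if every code unit is a
   valid code point; otherwise 1 or 2, with *erroffset set to the offset of
   the offending code unit. "Non-character" code points such as U+FFFE are
   accepted, as described earlier. */
int utf32_check(const uint32_t *s, size_t len, size_t *erroffset)
{
    size_t i;
    assert((s != NULL || len == 0) && erroffset != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDFFF) { *erroffset = i; return 1; }
        if (s[i] > 0x10FFFF) { *erroffset = i; return 2; }
    }
    return 0;
}
```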
435*22dc650dSSadaf Ebrahimi<br><b>
436*22dc650dSSadaf EbrahimiMATCHING IN INVALID UTF STRINGS
437*22dc650dSSadaf Ebrahimi</b><br>
438*22dc650dSSadaf Ebrahimi<P>
439*22dc650dSSadaf EbrahimiYou can run pattern matches on subject strings that may contain invalid UTF
440*22dc650dSSadaf Ebrahimisequences if you call <b>pcre2_compile()</b> with the PCRE2_MATCH_INVALID_UTF
441*22dc650dSSadaf Ebrahimioption. This is supported by <b>pcre2_match()</b>, including JIT matching, but
442*22dc650dSSadaf Ebrahiminot by <b>pcre2_dfa_match()</b>. When PCRE2_MATCH_INVALID_UTF is set, it forces
443*22dc650dSSadaf EbrahimiPCRE2_UTF to be set as well. Note, however, that the pattern itself must be a
444*22dc650dSSadaf Ebrahimivalid UTF string.
445*22dc650dSSadaf Ebrahimi</P>
446*22dc650dSSadaf Ebrahimi<P>
If you do not set PCRE2_MATCH_INVALID_UTF when calling <b>pcre2_compile()</b>, and
448*22dc650dSSadaf Ebrahimiyou are not certain that your subject strings are valid UTF sequences, you
449*22dc650dSSadaf Ebrahimishould not make use of the JIT "fast path" function <b>pcre2_jit_match()</b>
450*22dc650dSSadaf Ebrahimibecause it bypasses sanity checks, including the one for UTF validity. An
451*22dc650dSSadaf Ebrahimiinvalid string may cause undefined behaviour, including looping, crashing, or
452*22dc650dSSadaf Ebrahimigiving the wrong answer.
453*22dc650dSSadaf Ebrahimi</P>
<P>
Setting PCRE2_MATCH_INVALID_UTF does not affect what <b>pcre2_compile()</b>
generates, but if <b>pcre2_jit_compile()</b> is subsequently called, it does
generate different code. If JIT is not used, the option affects the behaviour
of the interpretive code in <b>pcre2_match()</b>. When PCRE2_MATCH_INVALID_UTF
is set at compile time, PCRE2_NO_UTF_CHECK is ignored at match time.
</P>
<P>
In this mode, an invalid code unit sequence in the subject never matches any
pattern item. It does not match dot, it does not match \p{Any}, it does not
even match negative items such as [^X]. A lookbehind assertion fails if it
encounters an invalid sequence while moving the current point backwards. In
other words, an invalid UTF code unit sequence acts as a barrier which no match
can cross.
</P>
<P>
You can also think of this as the subject being split up into fragments of
valid UTF, delimited internally by invalid code unit sequences. The pattern is
matched fragment by fragment. The result of a successful match, however, is
given as code unit offsets in the entire subject string in the usual way. There
are a few points to consider:
</P>
<P>
The internal boundaries are not interpreted as the beginnings or ends of lines
and so do not match circumflex or dollar characters in the pattern.
</P>
<P>
If <b>pcre2_match()</b> is called with an offset that points to an invalid
UTF sequence, that sequence is skipped, and the match starts at the next valid
UTF character, or at the end of the subject.
</P>
<P>
At internal fragment boundaries, \b and \B behave in the same way as at the
beginning and end of the subject. For example, a sequence such as \bWORD\b
would match an instance of WORD that is surrounded by invalid UTF code units.
</P>
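<P>
The fragment model described above can be sketched for the 8-bit case as
follows. This is an illustration only, not how PCRE2 is implemented: the names
<b>utf8_valid_len()</b> and <b>next_fragment()</b> are hypothetical, the sketch
treats each invalid byte as a separate barrier (which may not be how PCRE2
delimits an invalid sequence), and it reports only non-empty fragments. As in
the text above, offsets are code unit offsets in the entire subject.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: length (1..4) of the valid UTF-8 character at s,
   or 0 if the bytes at s do not form one. Rejects overlong encodings,
   surrogates, and values above 0x10ffff. */
size_t utf8_valid_len(const unsigned char *s, size_t avail)
{
    size_t need;
    uint32_t cp, min;

    if (avail == 0) return 0;
    if (s[0] < 0x80) return 1;
    if ((s[0] & 0xe0) == 0xc0)      { need = 2; cp = s[0] & 0x1f; min = 0x80; }
    else if ((s[0] & 0xf0) == 0xe0) { need = 3; cp = s[0] & 0x0f; min = 0x800; }
    else if ((s[0] & 0xf8) == 0xf0) { need = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                  /* stray continuation byte or 0xf8..0xff */

    if (avail < need) return 0;     /* truncated character */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xc0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3f);
    }
    if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return 0;
    return need;
}

/* Hypothetical helper: find the next non-empty run of valid UTF-8 at or
   after *pos. Returns 1 and sets the half-open range [*start, *end), or
   returns 0 when no further fragment exists. *pos is advanced. */
int next_fragment(const unsigned char *s, size_t len,
                  size_t *pos, size_t *start, size_t *end)
{
    size_t i = *pos;
    size_t n;

    assert(s != NULL);
    while (i < len && utf8_valid_len(s + i, len - i) == 0) i++;  /* skip barrier */
    if (i >= len) { *pos = len; return 0; }

    *start = i;
    while (i < len && (n = utf8_valid_len(s + i, len - i)) != 0) i += n;
    *end = i;
    *pos = i;
    return 1;
}
```

<P>
For a subject consisting of "ab", the byte 0xff, "WORD", the bytes 0xc0 0x80,
and the three-byte encoding of U+20AC, this sketch yields the fragments at
offsets 0 to 2, 3 to 7, and 9 to 12. The middle fragment is the case in which
\bWORD\b can match, because both of its ends are fragment boundaries.
</P>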
<P>
Using PCRE2_MATCH_INVALID_UTF, an application can run matches on arbitrary
data, knowing that any matched strings that are returned are valid UTF. This
can be useful when searching for UTF text in executable or other binary files.
</P>
<P>
Note, however, that the 16-bit and 32-bit PCRE2 libraries process strings as
sequences of uint16_t or uint32_t code units. They cannot find valid UTF
sequences within an arbitrary string of bytes unless such sequences are
suitably aligned.
</P>
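<P>
The alignment point can be illustrated without PCRE2. In the sketch below,
which uses the hypothetical helpers <b>find_unit16()</b> and
<b>make_misaligned()</b>, the text "Hi" is stored as two native-order 16-bit
code units starting at byte offset 1. Reading the buffer as 16-bit units from
byte offset 0 produces units that straddle the real ones, so the text is not
seen; reading from byte offset 1 finds it.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: view bytes[start..] as native-order uint16_t code
   units and return the index of the first unit equal to `wanted`, or -1. */
long find_unit16(const unsigned char *bytes, size_t nbytes,
                 size_t start, uint16_t wanted)
{
    long index = 0;

    assert(bytes != NULL);
    for (size_t i = start; i + sizeof(uint16_t) <= nbytes;
         i += sizeof(uint16_t), index++) {
        uint16_t unit;
        memcpy(&unit, bytes + i, sizeof unit);  /* avoids an unaligned access */
        if (unit == wanted) return index;
    }
    return -1;
}

/* Hypothetical helper: one stray byte, then "Hi" as two native-order 16-bit
   code units, then one padding byte. */
void make_misaligned(unsigned char out[6])
{
    const uint16_t text[2] = { 0x0048, 0x0069 };  /* 'H', 'i' */

    out[0] = 0xaa;
    memcpy(out + 1, text, sizeof text);
    out[5] = 0x00;
}
```

<P>
With this buffer, a search for 0x0048 starting at byte offset 0 finds nothing,
whereas the same search starting at byte offset 1 finds it at unit index 0.
An application scanning binary data with the 16-bit or 32-bit library would
therefore have to try each possible alignment itself.
</P>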
<br><b>
AUTHOR
</b><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><b>
REVISION
</b><br>
<P>
Last updated: 12 October 2023
<br>
Copyright &copy; 1997-2023 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>