#! /usr/bin/python

#                   PCRE2 UNICODE PROPERTY SUPPORT
#                   ------------------------------
#
# This script generates the pcre2_ucd.c file from Unicode data files. This is
# the compressed Unicode property data used by PCRE2. The script was created in
# December 2021 as part of the Unicode data generation refactoring. It is
# basically a re-working of the MultiStage2.py script that was submitted to the
# PCRE project by Peter Kankowski in 2008 as part of a previous upgrading of
# Unicode property support. A number of extensions have since been added. The
# main difference in the 2021 upgrade (apart from comments and layout) is that
# the data tables (e.g. list of script names) are now listed in or generated by
# a separate Python module that is shared with the other Generate scripts.
#
# This script must be run in the "maint" directory. It requires the following
# Unicode data tables: BidiMirroring.txt, CaseFolding.txt,
# DerivedBidiClass.txt, DerivedCoreProperties.txt, DerivedGeneralCategory.txt,
# GraphemeBreakProperty.txt, PropList.txt, PropertyAliases.txt,
# PropertyValueAliases.txt, ScriptExtensions.txt, Scripts.txt, and
# emoji-data.txt. These must be in the Unicode.tables subdirectory.
#
# The emoji-data.txt file is found in the "emoji" subdirectory even though it
# is technically part of a different (but coordinated) standard as shown
# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
# for example:
#
# http://unicode.org/Public/emoji/13.0/ReadMe.txt
#
# DerivedBidiClass.txt and DerivedGeneralCategory.txt are in the "extracted"
# subdirectory of the Unicode database (UCD) on the Unicode web site;
# GraphemeBreakProperty.txt is in the "auxiliary" subdirectory. The other files
# are in the top-level UCD directory.
#
# -----------------------------------------------------------------------------
# Minor modifications made to the original script:
#  Added #! line at start
#  Removed tabs
#  Made it work with Python 2.4 by rewriting two statements that needed 2.5
#  Consequent code tidy
#  Adjusted data file names to take from the Unicode.tables directory
#  Adjusted global table names by prefixing _pcre_.
#  Commented out stuff relating to the casefolding table, which isn't used;
#    removed completely in 2012.
#  Corrected size calculation
#  Add #ifndef SUPPORT_UCP to use dummy tables when no UCP support is needed.
#  Update for PCRE2: name changes, and SUPPORT_UCP is abolished.
#
# Major modifications made to the original script:
#  Added code to add a grapheme break property field to records.
#
#  Added code to search for sets of more than two characters that must match
#  each other caselessly. A new table is output containing these sets, and
#  offsets into the table are added to the main output records. This new
#  code scans CaseFolding.txt instead of UnicodeData.txt, which is no longer
#  used.
#
#  Update for Python3:
#    . Processed with 2to3, but that didn't fix everything
#    . Changed string.strip to str.strip
#    . Added encoding='utf-8' to the open() call
#    . Inserted 'int' before blocksize/ELEMS_PER_LINE because an int is
#        required and the result of the division is a float
#
#  Added code to scan the emoji-data.txt file to find the Extended Pictographic
#  property, which is used by PCRE2 as a grapheme breaking property. This was
#  done when updating to Unicode 11.0.0 (July 2018).
#
#  Added code to add a Script Extensions field to records. This has increased
#  their size from 8 to 12 bytes, only 10 of which are currently used.
#
#  Added code to add a bidi class field to records by scanning the
#  DerivedBidiClass.txt and PropList.txt files. This uses one of the two spare
#  bytes, so now 11 out of 12 are in use.
#
# 01-March-2010:     Updated list of scripts for Unicode 5.2.0
# 30-April-2011:     Updated list of scripts for Unicode 6.0.0
#     July-2012:     Updated list of scripts for Unicode 6.1.0
# 20-August-2012:    Added scan of GraphemeBreakProperty.txt and added a new
#                      field in the record to hold the value. Luckily, the
#                      structure had a hole in it, so the resulting table is
#                      not much bigger than before.
# 18-September-2012: Added code for multiple caseless sets. This uses the
#                      final hole in the structure.
# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
# 13-May-2014:       Updated for PCRE2
# 03-June-2014:      Updated for Python 3
# 20-June-2014:      Updated for Unicode 7.0.0
# 12-August-2014:    Updated to put Unicode version into the file
# 19-June-2015:      Updated for Unicode 8.0.0
# 02-July-2017:      Updated for Unicode 10.0.0
# 03-July-2018:      Updated for Unicode 11.0.0
# 07-July-2018:      Added code to scan emoji-data.txt for the Extended
#                      Pictographic property.
# 01-October-2018:   Added the 'Unknown' script name
# 03-October-2018:   Added new field for Script Extensions
# 27-July-2019:      Updated for Unicode 12.1.0
# 10-March-2020:     Updated for Unicode 13.0.0
# PCRE2-10.39:       Updated for Unicode 14.0.0
# 05-December-2021:  Added code to scan DerivedBidiClass.txt for bidi class,
#                      and also PropList.txt for the Bidi_Control property
# 19-December-2021:  Reworked script extensions lists to be bit maps instead
#                      of zero-terminated lists of script numbers.
# ----------------------------------------------------------------------------
#
# Changes to the refactored script:
#
# 26-December-2021:  Refactoring completed
# 10-January-2022:   Addition of general Boolean property support
# 12-January-2022:   Merge scriptx and bidiclass fields
# 14-January-2022:   Enlarge Boolean property offset to 12 bits
# 28-January-2023:   Remove ASCII "other case" from non-ASCII characters that
#                      are present in caseless sets.
#
# ----------------------------------------------------------------------------
#
#
# The main tables generated by this script are used by macros defined in
# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contain no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), one for each
# Unicode character. Each record contains the script number, script extension
# value, character type, grapheme break type, offset to caseless matching set,
# offset to the character's other case, the bidi class, and offset to bitmap of
# Boolean properties.
#
# A real table covering all Unicode characters would be far too big. It can be
# efficiently compressed by observing that many characters have the same
# record, and many blocks of characters (taking 128 characters in a block) have
# the same set of records as other blocks. This leads to a 2-stage lookup
# process.
#
# This script constructs seven tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
# any valid character. The first list is empty; this is used for characters
# that are not part of any list.
#
# The ucd_digit_sets table contains the code points of the '9' characters in
# each set of 10 decimal digits in Unicode. This is used to ensure that digits
# in script runs all come from the same set. The first element in the vector
# contains the number of subsequent elements, which are in ascending order.
#
# Scripts are partitioned into two groups. Scripts that appear in at least one
# character's script extension list come first, followed by "Unknown" and then
# all the rest. This sorting is done automatically in the GenerateCommon.py
# script. A script's number is its index in the script_names list.
#
# The ucd_script_sets table contains bitmaps that represent lists of scripts
# for Script Extensions properties. Each bitmap consists of a fixed number of
# unsigned 32-bit numbers, enough to allocate a bit for every script that is
# used in any character's extension list, that is, enough for every script
# whose number is less than ucp_Unknown. A character's script extension value
# in its ucd record is an offset into the ucd_script_sets vector. The first
# bitmap has no bits set; characters that have no script extensions have zero
# as their script extensions value so that they use this map.
#
# The ucd_boolprop_sets table contains bitmaps that represent lists of Boolean
# properties. Each bitmap consists of a fixed number of unsigned 32-bit
# numbers, enough to allocate a bit for each supported Boolean property.
#
# The ucd_records table contains one instance of every unique character record
# that is required. The ucd_stage1 table is indexed by a character's block
# number, which is the character's code point divided by 128, since 128 is the
# size of each block. The result of a lookup in ucd_stage1 is a "virtual" block
# number.
#
# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
# the offset of a character within its own block, and the result is the index
# number of the required record in the ucd_records vector.
#
# The following examples are correct for the Unicode 14.0.0 database. Future
# updates may change the actual lookup values.
#
# Example: lowercase "a" (U+0061) is in block 0
#          lookup 0 in stage1 table yields 0
#          lookup 97 (0x61) in the first table in stage2 yields 35
#          record 35 is { 0, 5, 12, 0, -32, 18432, 44 }
#             0 = ucp_Latin   => Latin script
#             5 = ucp_Ll      => Lower case letter
#            12 = ucp_gbOther => Grapheme break property "Other"
#             0               => Not part of a caseless set
#           -32 (-0x20)       => Other case is U+0041
#         18432 = 0x4800      => Combined Bidi class + script extension values
#            44               => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#             9 = ucp_bidiL   => Bidi class left-to-right
#             0               => No special script extension property
#
# Almost all lowercase Latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
#          lookup 96 in stage1 table yields 93
#          lookup 66 (0x42) in table 93 in stage2 yields 819
#          record 819 is { 20, 7, 12, 0, 0, 18432, 82 }
#            20 = ucp_Hiragana => Hiragana script
#             7 = ucp_Lo       => Other letter
#            12 = ucp_gbOther  => Grapheme break property "Other"
#             0                => Not part of a caseless set
#             0                => No other case
#         18432 = 0x4800       => Combined Bidi class + script extension values
#            82                => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#             9 = ucp_bidiL   => Bidi class left-to-right
#             0               => No special script extension property
#
# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
#          lookup 57 in stage1 table yields 55
#          lookup 80 (0x50) in table 55 in stage2 yields 621
#          record 621 is { 84, 12, 3, 0, 0, 26762, 96 }
#            84 = ucp_Inherited => Script inherited from predecessor
#            12 = ucp_Mn        => Non-spacing mark
#             3 = ucp_gbExtend  => Grapheme break property "Extend"
#             0                 => Not part of a caseless set
#             0                 => No other case
#         26762 = 0x688A        => Combined Bidi class + script extension values
#            96                 => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#            13 = ucp_bidiNSM   => Bidi class non-spacing mark
#           138                 => Script Extension list offset = 138
#
# At offset 138 in the ucd_script_sets vector we find a bitmap with bits 1, 8,
# 18, and 47 set. This means that this character is expected to be used with
# any of those scripts, which are Bengali, Devanagari, Kannada, and Grantha.
#
#  Philip Hazel, last updated 14 January 2022.
##############################################################################


# Import standard modules

import re
import string
import sys

# Import common data lists and functions

from GenerateCommon import \
  bidi_classes, \
  bool_properties, \
  bool_propsfiles, \
  bool_props_list_item_size, \
  break_properties, \
  category_names, \
  general_category_names, \
  script_abbrevs, \
  script_list_item_size, \
  script_names, \
  open_output

# Some general parameters

MAX_UNICODE = 0x110000
NOTACHAR = 0xffffffff


# ---------------------------------------------------------------------------
#                         DEFINE FUNCTIONS
# ---------------------------------------------------------------------------


# Parse a line of Scripts.txt, GraphemeBreakProperty.txt or DerivedGeneralCategory.txt

def make_get_names(enum):
  return lambda chardata: enum.index(chardata[1])


# Parse a line of DerivedBidiClass.txt

def get_bidi(chardata):
  if len(chardata[1]) > 3:
    return bidi_classes_long.index(chardata[1])
  else:
    return bidi_classes_short.index(chardata[1])
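

# Illustrative sketch only, not part of the original script: the header comment
# says that the top 5 bits of the combined field hold the Bidi class and the
# rest hold the script extension value. Assuming a 16-bit field (hence a shift
# of 11), the worked examples can be reproduced as below. The function name and
# both constants are hypothetical.

EXAMPLE_BIDI_SHIFT = 11                                # assumed: 16 - 5 bits
EXAMPLE_SCRIPTX_MASK = (1 << EXAMPLE_BIDI_SHIFT) - 1   # assumed: the low 11 bits

def example_split_bidi_scriptx(value):
  # Return (bidi class, script extension value) for a combined field value
  return value >> EXAMPLE_BIDI_SHIFT, value & EXAMPLE_SCRIPTX_MASK

# For instance, example_split_bidi_scriptx(18432) gives (9, 0), and
# example_split_bidi_scriptx(26762) gives (13, 138).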


# Parse a line of CaseFolding.txt

def get_other_case(chardata):
  if chardata[1] == 'C' or chardata[1] == 'S':
    return int(chardata[2], 16) - int(chardata[0], 16)
  return None
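

# Illustrative sketch only, not part of the original script: the header comment
# describes ucd_caseless_sets as lists of code points, each terminated by
# NOTACHAR, with an empty list first. Assuming that layout, one set could be
# read back from a flat list as follows. The function name is hypothetical, and
# the terminator is a default argument so that the sketch stands alone.

def example_read_caseless_set(sets, offset, terminator=0xffffffff):
  members = []
  while sets[offset] != terminator:
    members.append(sets[offset])
    offset += 1
  return members

# With sets = [0xffffffff, 0x4b, 0x6b, 0x212a, 0xffffffff], offset 0 gives the
# empty list and offset 1 gives [0x4b, 0x6b, 0x212a] (K, k and the Kelvin sign).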


# Parse a line of ScriptExtensions.txt

def get_script_extension(chardata):
  global last_script_extension

  offset = len(script_lists) * script_list_item_size
  if last_script_extension == chardata[1]:
    return offset - script_list_item_size

  last_script_extension = chardata[1]
  script_lists.append(tuple(script_abbrevs.index(abbrev) for abbrev in last_script_extension.split(' ')))
  return offset


# Read a whole table in memory, setting/checking the Unicode version

def read_table(file_name, get_value, default_value):
  global unicode_version

  f = re.match(r'^[^/]+/([^.]+)\.txt$', file_name)
  file_base = f.group(1)
  version_pat = r"^# " + re.escape(file_base) + r"-(\d+\.\d+\.\d+)\.txt$"
  file = open(file_name, 'r', encoding='utf-8')
  f = re.match(version_pat, file.readline())
  version = f.group(1)
  if unicode_version == "":
    unicode_version = version
  elif unicode_version != version:
    # Interpolate the file name; passing it as a separate argument would
    # print the literal "%s".
    print("WARNING: Unicode version differs in %s" % file_name, file=sys.stderr)

  table = [default_value] * MAX_UNICODE
  for line in file:
    if file_base == 'DerivedBidiClass':
      line = re.sub(r'# @missing: ', '', line)

    line = re.sub(r'#.*', '', line)
    chardata = list(map(str.strip, line.split(';')))
    if len(chardata) <= 1:
      continue
    value = get_value(chardata)
    if value is None:
      continue
    m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
    char = int(m.group(1), 16)
    if m.group(3) is None:
      last = char
    else:
      last = int(m.group(3), 16)
    for i in range(char, last + 1):
      table[i] = value

  file.close()
  return table
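

# Illustrative sketch only, not part of the original script: read_table()
# accepts either a single code point or a range written as "first..last" in the
# first field of a data line. The same parsing, done without a regular
# expression, might look like this. The function name is hypothetical.

def example_parse_code_point_range(field):
  first, sep, last = field.partition('..')
  start = int(first, 16)
  end = int(last, 16) if sep else start
  return start, end

# For instance, "0041..005A" gives (0x41, 0x5a) and "212A" gives
# (0x212a, 0x212a).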


# Get the smallest possible C language type for the values in a table

def get_type_size(table):
  type_size = [("uint8_t", 1), ("uint16_t", 2), ("uint32_t", 4),
    ("signed char", 1), ("int16_t", 2), ("int32_t", 4)]
  limits = [(0, 255), (0, 65535), (0, 4294967295), (-128, 127),
    (-32768, 32767), (-2147483648, 2147483647)]
  minval = min(table)
  maxval = max(table)
  for num, (minlimit, maxlimit) in enumerate(limits):
    if minlimit <= minval and maxval <= maxlimit:
      return type_size[num]
  raise OverflowError("Too large to fit into C types")


# Get the total size of a list of tables

def get_tables_size(*tables):
  total_size = 0
  for table in tables:
    type, size = get_type_size(table)
    total_size += size * len(table)
  return total_size


# Compress a table into the two stages

def compress_table(table, block_size):
  blocks = {} # Dictionary for finding identical blocks
  stage1 = [] # Stage 1 table contains block numbers (indices into stage 2 table)
  stage2 = [] # Stage 2 table contains the blocks with property values
  table = tuple(table)
  for i in range(0, len(table), block_size):
    block = table[i:i+block_size]
    start = blocks.get(block)
    if start is None:
      # Allocate a new block; integer division keeps block numbers as ints
      start = len(stage2) // block_size
      stage2 += block
      blocks[block] = start
    stage1.append(start)
  return stage1, stage2
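

# Illustrative sketch only, not part of the original script: given the two
# lists returned by compress_table(), a value would be looked up as below,
# mirroring the two-stage process described in the header comment. The function
# name is hypothetical; int() guards against non-integer block numbers.

def example_two_stage_lookup(stage1, stage2, block_size, code_point):
  block = int(stage1[code_point // block_size])   # "virtual" block number
  return stage2[block * block_size + code_point % block_size]

# With stage1 = [0, 1, 0], stage2 = [10, 11, 20, 21] and block_size = 2, code
# points 0 to 5 look up 10, 11, 20, 21, 10 and 11 respectively.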


# Output a table

def write_table(table, table_name, block_size = None):
  type, size = get_type_size(table)
  ELEMS_PER_LINE = 16

  s = "const %s %s[] = { /* %d bytes" % (type, table_name, size * len(table))
  if block_size:
    s += ", block = %d" % block_size
  f.write(s + " */\n")
  table = tuple(table)
  if block_size is None:
    fmt = "%3d," * ELEMS_PER_LINE + " /* U+%04X */\n"
    mult = MAX_UNICODE / len(table)
    for i in range(0, len(table), ELEMS_PER_LINE):
      f.write(fmt % (table[i:i+ELEMS_PER_LINE] + (int(i * mult),)))
  else:
    if block_size > ELEMS_PER_LINE:
      el = ELEMS_PER_LINE
    else:
      el = block_size
    fmt = "%3d," * el + "\n"
    if block_size > ELEMS_PER_LINE:
      fmt = fmt * int(block_size / ELEMS_PER_LINE)
    for i in range(0, len(table), block_size):
      f.write(("\n/* block %d */\n" + fmt) % ((i // block_size,) + table[i:i+block_size]))
  f.write("};\n\n")


# Extract the unique combinations of properties into records

def combine_tables(*tables):
  records = {}
  index = []
  for t in zip(*tables):
    i = records.get(t)
    if i is None:
      i = records[t] = len(records)
    index.append(i)
  return index, records


# Create a record struct

def get_record_size_struct(records):
  size = 0
  structure = 'typedef struct {\n'
  for i in range(len(records[0])):
    record_slice = [record[i] for record in records]
    slice_type, slice_size = get_type_size(record_slice)
    # add padding: round up to the next multiple of slice_size
    size = (size + slice_size - 1) & -slice_size
    size += slice_size
    structure += '%s property_%d;\n' % (slice_type, i)

  # round up to the first item of the next structure in array
  record_slice = [record[0] for record in records]
  slice_type, slice_size = get_type_size(record_slice)
  size = (size + slice_size - 1) & -slice_size

  structure += '} ucd_record;\n*/\n'
  return size, structure
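

# Illustrative sketch only, not part of the original script: the expression
# (size + slice_size - 1) & -slice_size used in get_record_size_struct() rounds
# size up to a multiple of slice_size, which works because every slice_size is
# a power of two. The function name is hypothetical.

def example_align_up(size, alignment):
  # alignment is assumed to be a power of two (1, 2 or 4 in this script)
  return (size + alignment - 1) & -alignment

# For instance, example_align_up(5, 4) is 8, example_align_up(8, 4) is 8, and
# example_align_up(3, 2) is 4.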


# Write records

def write_records(records, record_size):
  f.write('const ucd_record PRIV(ucd_records)[] = { ' + \
    '/* %d bytes, record size %d */\n' % (len(records) * record_size, record_size))
  records = list(zip(list(records.keys()), list(records.values())))
  records.sort(key = lambda x: x[1])
  for i, record in enumerate(records):
    f.write(('  {' + '%6d, ' * len(record[0]) + '}, /* %3d */\n') % (record[0] + (i,)))
  f.write('};\n\n')


# Write a bit set

def write_bitsets(list, item_size):
  for d in list:
    bitwords = [0] * item_size
    for idx in d:
      bitwords[idx // 32] |= 1 << (idx & 31)
    s = " "
    for x in bitwords:
      f.write("%s" % s)
      s = ", "
      f.write("0x%08xu" % x)
    f.write(",\n")
  f.write("};\n\n")
486*22dc650dSSadaf Ebrahimi
487*22dc650dSSadaf Ebrahimi
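# Illustration only (not part of the upstream script): a minimal sketch of the
# bit packing done for each bit set. Index idx sets bit (idx & 31) of word
# (idx // 32). The name example_pack_bitset is hypothetical; it returns the
# words instead of writing them to the output file.

def example_pack_bitset(indexes, item_size):
  bitwords = [0] * item_size
  for idx in indexes:
    bitwords[idx // 32] |= 1 << (idx & 31)
  return bitwords

# For instance, example_pack_bitset([0, 33], 2) gives [1, 2]: index 0 is bit 0
# of the first word and index 33 is bit 1 of the second word.
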
# ---------------------------------------------------------------------------
# This bit of code must have been useful when the original script was being
# developed. Retain it just in case it is ever needed again.

# def test_record_size():
#   tests = [ \
#     ( [(3,), (6,), (6,), (1,)], 1 ), \
#     ( [(300,), (600,), (600,), (100,)], 2 ), \
#     ( [(25, 3), (6, 6), (34, 6), (68, 1)], 2 ), \
#     ( [(300, 3), (6, 6), (340, 6), (690, 1)], 4 ), \
#     ( [(3, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
#     ( [(300, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
#     ( [(3, 100000), (6, 6), (6, 123456), (1, 690)], 8 ), \
#     ( [(100000, 300), (6, 6), (123456, 6), (1, 690)], 8 ), \
#   ]
#   for test in tests:
#     size, struct = get_record_size_struct(test[0])
#     assert(size == test[1])
# test_record_size()
# ---------------------------------------------------------------------------



# ---------------------------------------------------------------------------
#                       MAIN CODE FOR CREATING TABLES
# ---------------------------------------------------------------------------

unicode_version = ""

# Some of the tables imported from GenerateCommon.py have alternate comment
# strings for use by GenerateUcpHeader. The comments are not wanted here, so
# remove them.

bidi_classes_short = bidi_classes[::2]
bidi_classes_long = bidi_classes[1::2]
break_properties = break_properties[::2]
category_names = category_names[::2]

# Create the various tables from Unicode data files

script = read_table('Unicode.tables/Scripts.txt', make_get_names(script_names), script_names.index('Unknown'))
category = read_table('Unicode.tables/DerivedGeneralCategory.txt', make_get_names(category_names), category_names.index('Cn'))
break_props = read_table('Unicode.tables/GraphemeBreakProperty.txt', make_get_names(break_properties), break_properties.index('Other'))
other_case = read_table('Unicode.tables/CaseFolding.txt', get_other_case, 0)
bidi_class = read_table('Unicode.tables/DerivedBidiClass.txt', get_bidi, bidi_classes_short.index('L'))

# The grapheme breaking rules were changed for Unicode 11.0.0 (June 2018). Now
# we need to find the Extended_Pictographic property for emoji characters. This
# can be set as an additional grapheme break property, because the default for
# all the emojis is "other". We scan the emoji-data.txt file and modify the
# break_props table.

file = open('Unicode.tables/emoji-data.txt', 'r', encoding='utf-8')
for line in file:
  line = re.sub(r'#.*', '', line)
  chardata = list(map(str.strip, line.split(';')))
  if len(chardata) <= 1:
    continue
  if chardata[1] != "Extended_Pictographic":
    continue
  m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
  char = int(m.group(1), 16)
  if m.group(3) is None:
    last = char
  else:
    last = int(m.group(3), 16)
  for i in range(char, last + 1):
    if break_props[i] != break_properties.index('Other'):
      # Format the message with %; passing the values as extra print()
      # arguments would output the format string unexpanded.
      print("WARNING: Emoji 0x%x has break property %s, not 'Other'" %
        (i, break_properties[break_props[i]]), file=sys.stderr)
    break_props[i] = break_properties.index('Extended_Pictographic')
file.close()

# Handle script extensions. The get_script_extension() function maintains a
# list of unique bitmaps representing lists of scripts, returning the offset
# in that list. Initialize the list with an empty set, which is used for
# characters that have no script extensions.

script_lists = [[]]
last_script_extension = ""
scriptx_bidi_class = read_table('Unicode.tables/ScriptExtensions.txt', get_script_extension, 0)

# Pack the bidi class into the bits above the 11-bit script extension offset.

for idx in range(len(scriptx_bidi_class)):
  scriptx_bidi_class[idx] = scriptx_bidi_class[idx] | (bidi_class[idx] << 11)
bidi_class = None

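# Illustration only (not part of the upstream script): a minimal sketch of how
# a script extension offset and a bidi class share one 16-bit field. The shift
# of 11 mirrors the "<< 11" used above; the mask 0x7ff is derived from that
# shift. Both function names are hypothetical.

def example_pack_scriptx_bidi(scriptx_offset, bidi):
  # The offset must fit in the low 11 bits for the fields not to overlap.
  return scriptx_offset | (bidi << 11)

def example_unpack_scriptx_bidi(value):
  # Returns (script extension offset, bidi class).
  return value & 0x7ff, value >> 11
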
# Find the Boolean properties of each character. This next bit of magic creates
# a list of empty lists. Using [[]] * MAX_UNICODE gives a list of references to
# the *same* list, which is not what we want.

bprops = [[] for _ in range(MAX_UNICODE)]

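# Illustration only (not part of the upstream script): a small sketch of the
# aliasing pitfall described above. The name example_list_aliasing is
# hypothetical.

def example_list_aliasing():
  shared = [[]] * 3                  # three references to one list
  shared[0].append(1)                # so the append shows up in every element
  separate = [[] for _ in range(3)]  # three independent lists
  separate[0].append(1)              # the append affects only the first one
  return shared, separate
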
# Collect the properties from the various files

for filename in bool_propsfiles:
  try:
    file = open('Unicode.tables/' + filename, 'r')
  except IOError:
    print(f"** Couldn't open {'Unicode.tables/' + filename}\n")
    sys.exit(1)

  for line in file:
    line = re.sub(r'#.*', '', line)
    data = list(map(str.strip, line.split(';')))
    if len(data) <= 1:
      continue

    try:
      ix = bool_properties.index(data[1])
    except ValueError:
      continue

    m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', data[0])
    char = int(m.group(1), 16)
    if m.group(3) is None:
      last = char
    else:
      last = int(m.group(3), 16)

    for i in range(char, last + 1):
      bprops[i].append(ix)

  file.close()

# The ASCII property isn't listed in any files, but it is easy enough to add
# it manually.

ix = bool_properties.index("ASCII")
for i in range(128):
  bprops[i].append(ix)

# The Bidi_Mirrored property isn't listed in any property files. We have to
# deduce it from the file that lists the mirrored characters.

ix = bool_properties.index("Bidi_Mirrored")

try:
  file = open('Unicode.tables/BidiMirroring.txt', 'r')
except IOError:
  print(f"** Couldn't open {'Unicode.tables/BidiMirroring.txt'}\n")
  sys.exit(1)

for line in file:
  line = re.sub(r'#.*', '', line)
  data = list(map(str.strip, line.split(';')))
  if len(data) <= 1:
    continue
  c = int(data[0], 16)
  bprops[c].append(ix)

file.close()

# Scan each character's boolean property list and create a list of unique
# lists, at the same time setting the offset into that list for each character
# in the bool_props vector.

bool_props = [0] * MAX_UNICODE
bool_props_lists = [[]]

for c in range(MAX_UNICODE):
  s = set(bprops[c])
  for i in range(len(bool_props_lists)):
    if s == set(bool_props_lists[i]):
      break
  else:
    bool_props_lists.append(bprops[c])
    i += 1

  bool_props[c] = i * bool_props_list_item_size

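# Illustration only (not part of the upstream script): a minimal sketch of the
# de-duplication performed above, written as a function over a small input.
# Lists are compared as sets, so the order of properties does not matter. The
# name example_index_unique_lists is hypothetical.

def example_index_unique_lists(per_char_lists, item_size):
  unique = [[]]          # entry 0 is the empty list, as in the script
  offsets = []
  for props in per_char_lists:
    wanted = set(props)
    for i, existing in enumerate(unique):
      if wanted == set(existing):
        break
    else:
      unique.append(props)
      i = len(unique) - 1
    offsets.append(i * item_size)
  return offsets, unique

# With an item size of 2, the input [[], [1, 2], [2, 1], [3]] yields offsets
# [0, 2, 2, 4], because [1, 2] and [2, 1] share one entry.
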
# This block of code was added by PH in September 2012. It scans the other_case
# table to find sets of more than two characters that must all match each other
# caselessly. Later in this script a table of these sets is written out.
# However, we have to do this work here in order to compute the offsets in the
# table that are inserted into the main table.

# The CaseFolding.txt file lists pairs, but the common logic for reading data
# sets only one value, so first we go through the table and set "return"
# offsets for those that are not already set.

for c in range(MAX_UNICODE):
  if other_case[c] != 0 and other_case[c + other_case[c]] == 0:
    other_case[c + other_case[c]] = -other_case[c]

# Now scan again and create equivalence sets.

caseless_sets = []

for c in range(MAX_UNICODE):
  o = c + other_case[c]

  # Trigger when this character's other case does not point back here. We
  # now have three characters that are case-equivalent.

  if other_case[o] != -other_case[c]:
    t = o + other_case[o]

    # Scan the existing sets to see if any of the three characters are already
    # part of a set. If so, unite the existing set with the new set.

    appended = 0
    for s in caseless_sets:
      found = 0
      for x in s:
        if x == c or x == o or x == t:
          found = 1

      # Add new characters to an existing set. Each of the three characters is
      # tested separately, so that finding one of them in the set does not
      # stop the others from being added.

      if found:
        for y in [c, o, t]:
          if y not in s:
            s.append(y)
        appended = 1

    # If we have not added to an existing set, create a new one.

    if not appended:
      caseless_sets.append([c, o, t])

# End of loop looking for caseless sets.

# Now scan the sets and set appropriate offsets for the characters.

caseless_offsets = [0] * MAX_UNICODE

offset = 1
for s in caseless_sets:
  for x in s:
    caseless_offsets[x] = offset
  offset += len(s) + 1

# End of block of code for creating offsets for caseless matching sets.

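# Illustration only (not part of the upstream script): a minimal sketch of the
# offset calculation above. The offsets start at 1 because the output table
# written later begins with a single NOTACHAR entry, and each set occupies
# len(s) entries plus one terminating NOTACHAR. The name
# example_caseless_offsets is hypothetical; it returns a dictionary rather
# than filling a MAX_UNICODE-sized list.

def example_caseless_offsets(sets):
  offsets = {}
  offset = 1
  for s in sets:
    for x in s:
      offsets[x] = offset
    offset += len(s) + 1
  return offsets

# For two three-character sets, every member of the first set gets offset 1
# and every member of the second set gets offset 5.
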
# Scan the caseless sets, and for any non-ASCII character that has an ASCII
# character as its "base" other case, remove the other case. This makes it
# easier to handle those characters when the PCRE2 option for not mixing ASCII
# and non-ASCII is enabled. In principle one should perhaps scan for a
# non-ASCII alternative, but in practice these don't exist.

for s in caseless_sets:
  for x in s:
    if x > 127 and x + other_case[x] < 128:
      other_case[x] = 0

# Combine all the tables

table, records = combine_tables(script, category, break_props,
  caseless_offsets, other_case, scriptx_bidi_class, bool_props)

# Find the record size and create a string definition of the structure for
# outputting as a comment.

record_size, record_struct = get_record_size_struct(list(records.keys()))

# Find the optimum block size for the two-stage table

min_size = sys.maxsize
for block_size in [2 ** i for i in range(5,10)]:
  size = len(records) * record_size
  stage1, stage2 = compress_table(table, block_size)
  size += get_tables_size(stage1, stage2)
  #print("/* block size {:3d} => {:5d} bytes */".format(block_size, size))
  if size < min_size:
    min_size = size
    min_stage1, min_stage2 = stage1, stage2
    min_block_size = block_size


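# Illustration only (not part of the upstream script): a generic sketch of a
# two-stage table, to show why the block size matters. It is NOT the
# compress_table() function used above, whose details may differ. Identical
# blocks are stored once in stage2, and stage1 holds one block number per
# block of the original table. Both function names are hypothetical, and the
# sketch assumes len(table) is a multiple of block_size.

def example_two_stage(table, block_size):
  seen = {}              # block contents -> block number in stage2
  stage1 = []
  stage2 = []
  for start in range(0, len(table), block_size):
    block = tuple(table[start:start + block_size])
    if block not in seen:
      seen[block] = len(stage2) // block_size
      stage2.extend(block)
    stage1.append(seen[block])
  return stage1, stage2

def example_two_stage_lookup(stage1, stage2, block_size, c):
  # Find the block for c, then index within that block.
  return stage2[stage1[c // block_size] * block_size + c % block_size]

# Larger blocks shrink stage1 but make repeated blocks less likely, which is
# the trade-off the loop above resolves by trying several block sizes.
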
# ---------------------------------------------------------------------------
#                   MAIN CODE FOR WRITING THE OUTPUT FILE
# ---------------------------------------------------------------------------

# Open the output file (no return on failure). This call also writes standard
# header boilerplate.

f = open_output("pcre2_ucd.c")

# Output this file's heading text

f.write("""\
/* This file contains tables of Unicode properties that are extracted from
Unicode data files. See the comments at the start of maint/GenerateUcd.py for
details.

As well as being part of the PCRE2 library, this file is #included by the
pcre2test program, which redefines the PRIV macro to change table names from
_pcre2_xxx to xxxx, thereby avoiding name clashes with the library. At present,
just one of these tables is actually needed. When compiling the library, some
headers are needed. */

#ifndef PCRE2_PCRE2TEST
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
#include "pcre2_internal.h"
#endif /* PCRE2_PCRE2TEST */

/* The tables herein are needed only when UCP support is built, and in PCRE2
that happens automatically with UTF support. This module should not be
referenced otherwise, so it should not matter whether it is compiled or not.
However a comment was received about space saving - maybe the guy linked all
the modules rather than using a library - so we include a condition to cut out
the tables when not needed. But don't leave a totally empty module because some
compilers barf at that. Instead, just supply some small dummy tables. */

#ifndef SUPPORT_UNICODE
const ucd_record PRIV(ucd_records)[] = {{0,0,0,0,0,0,0}};
const uint16_t PRIV(ucd_stage1)[] = {0};
const uint16_t PRIV(ucd_stage2)[] = {0};
const uint32_t PRIV(ucd_caseless_sets)[] = {0};
#else
\n""")

# --- Output some variable heading stuff ---

f.write("/* Total size: %d bytes, block size: %d. */\n\n" % (min_size, min_block_size))
f.write('const char *PRIV(unicode_version) = "{}";\n\n'.format(unicode_version))

f.write("""\
/* When recompiling tables with a new Unicode version, please check the types
in this structure definition with those in pcre2_internal.h (the actual field
names will be different).
\n""")

f.write(record_struct)

f.write("""
/* If the 32-bit library is run in non-32-bit mode, character values greater
than 0x10ffff may be encountered. For these we set up a special record. */

#if PCRE2_CODE_UNIT_WIDTH == 32
const ucd_record PRIV(dummy_ucd_record)[] = {{
  ucp_Unknown,    /* script */
  ucp_Cn,         /* type unassigned */
  ucp_gbOther,    /* grapheme break property */
  0,              /* case set */
  0,              /* other case */
  0 | (ucp_bidiL << UCD_BIDICLASS_SHIFT), /* script extension and bidi class */
  0,              /* bool properties offset */
  }};
#endif
\n""")

# --- Output the table of caseless character sets ---

f.write("""\
/* This table contains lists of characters that are caseless sets of
more than one character. Each list is terminated by NOTACHAR. */

const uint32_t PRIV(ucd_caseless_sets)[] = {
  NOTACHAR,
""")

for s in caseless_sets:
  s = sorted(s)
  for x in s:
    f.write('  0x%04x,' % x)
  f.write('  NOTACHAR,\n')
f.write('};\n\n')

# --- Other tables are not needed by pcre2test ---

f.write("""\
/* When #included in pcre2test, we don't need the table of digit sets, nor the
large main UCD tables. */

#ifndef PCRE2_PCRE2TEST
\n""")

# --- Read Scripts.txt again for the sets of 10 digits. ---

digitsets = []
file = open('Unicode.tables/Scripts.txt', 'r', encoding='utf-8')

for line in file:
  m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
  if m is None:
    continue
  first = int(m.group(1),16)
  last  = int(m.group(2),16)
  if ((last - first + 1) % 10) != 0:
    # Report on stderr with print(); f.write() does not accept a file argument.
    print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
      file=sys.stderr)
  while first < last:
    digitsets.append(first + 9)
    first += 10
file.close()
digitsets.sort()

f.write("""\
/* This table lists the code points for the '9' characters in each set of
decimal digits. It is used to ensure that all the digits in a script run come
from the same set. */

const uint32_t PRIV(ucd_digit_sets)[] = {
""")

f.write("  %d,  /* Number of subsequent values */" % len(digitsets))
count = 8
for d in digitsets:
  if count == 8:
    f.write("\n ")
    count = 0
  f.write(" 0x%05x," % d)
  count += 1
f.write("\n};\n\n")

f.write("""\
/* This vector is a list of script bitsets for the Script Extension property.
The number of 32-bit words in each bitset is #defined in pcre2_ucp.h as
ucd_script_sets_item_size. */

const uint32_t PRIV(ucd_script_sets)[] = {
""")
write_bitsets(script_lists, script_list_item_size)

f.write("""\
/* This vector is a list of bitsets for Boolean properties. The number of
32-bit words in each bitset is #defined as ucd_boolprop_sets_item_size in
pcre2_ucp.h. */

const uint32_t PRIV(ucd_boolprop_sets)[] = {
""")
write_bitsets(bool_props_lists, bool_props_list_item_size)


# Output the main UCD tables.

f.write("""\
/* These are the main two-stage UCD tables. The fields in each record are:
script (8 bits), character type (8 bits), grapheme break property (8 bits),
offset to multichar other cases or zero (8 bits), offset to other case or zero
(32 bits, signed), bidi class (5 bits) and script extension (11 bits) packed
into a 16-bit field, and offset in binary properties table (16 bits). */
\n""")

write_records(records, record_size)
write_table(min_stage1, 'PRIV(ucd_stage1)')
write_table(min_stage2, 'PRIV(ucd_stage2)', min_block_size)

f.write("#if UCD_BLOCK_SIZE != %d\n" % min_block_size)
f.write("""\
#error Please correct UCD_BLOCK_SIZE in pcre2_internal.h
#endif
#endif  /* SUPPORT_UNICODE */

#endif  /* PCRE2_PCRE2TEST */

/* End of pcre2_ucd.c */
""")

f.close()

# End