#! /usr/bin/python

# PCRE2 UNICODE PROPERTY SUPPORT
# ------------------------------
#
# This script generates the pcre2_ucd.c file from Unicode data files. This is
# the compressed Unicode property data used by PCRE2. The script was created in
# December 2021 as part of the Unicode data generation refactoring. It is
# basically a re-working of the MultiStage2.py script that was submitted to the
# PCRE project by Peter Kankowski in 2008 as part of a previous upgrading of
# Unicode property support. A number of extensions have since been added. The
# main difference in the 2021 upgrade (apart from comments and layout) is that
# the data tables (e.g. list of script names) are now listed in or generated by
# a separate Python module that is shared with the other Generate scripts.
#
# This script must be run in the "maint" directory. It requires the following
# Unicode data tables: BidiMirroring.txt, CaseFolding.txt,
# DerivedBidiClass.txt, DerivedCoreProperties.txt, DerivedGeneralCategory.txt,
# GraphemeBreakProperty.txt, PropList.txt, PropertyAliases.txt,
# PropertyValueAliases.txt, ScriptExtensions.txt, Scripts.txt, and
# emoji-data.txt. These must be in the Unicode.tables subdirectory.
#
# The emoji-data.txt file is found in the "emoji" subdirectory even though it
# is technically part of a different (but coordinated) standard as shown
# in files associated with Unicode Technical Standard #51 ("Unicode Emoji"),
# for example:
#
# http://unicode.org/Public/emoji/13.0/ReadMe.txt
#
# DerivedBidiClass.txt and DerivedGeneralCategory.txt are in the "extracted"
# subdirectory of the Unicode database (UCD) on the Unicode web site;
# GraphemeBreakProperty.txt is in the "auxiliary" subdirectory. The other files
# are in the top-level UCD directory.
#
# -----------------------------------------------------------------------------
# Minor modifications made to the original script:
#  Added #! line at start
#  Removed tabs
#  Made it work with Python 2.4 by rewriting two statements that needed 2.5
#  Consequent code tidy
#  Adjusted data file names to take from the Unicode.tables directory
#  Adjusted global table names by prefixing _pcre_.
#  Commented out stuff relating to the casefolding table, which isn't used;
#    removed completely in 2012.
#  Corrected size calculation
#  Add #ifndef SUPPORT_UCP to use dummy tables when no UCP support is needed.
#  Update for PCRE2: name changes, and SUPPORT_UCP is abolished.
#
# Major modifications made to the original script:
#  Added code to add a grapheme break property field to records.
#
#  Added code to search for sets of more than two characters that must match
#  each other caselessly. A new table is output containing these sets, and
#  offsets into the table are added to the main output records. This new
#  code scans CaseFolding.txt instead of UnicodeData.txt, which is no longer
#  used.
#
#  Update for Python3:
#    . Processed with 2to3, but that didn't fix everything
#    . Changed string.strip to str.strip
#    . Added encoding='utf-8' to the open() call
#    . Inserted 'int' before blocksize/ELEMS_PER_LINE because an int is
#        required and the result of the division is a float
#
#  Added code to scan the emoji-data.txt file to find the Extended Pictographic
#  property, which is used by PCRE2 as a grapheme breaking property. This was
#  done when updating to Unicode 11.0.0 (July 2018).
#
#  Added code to add a Script Extensions field to records. This has increased
#  their size from 8 to 12 bytes, only 10 of which are currently used.
#
#  Added code to add a bidi class field to records by scanning the
#  DerivedBidiClass.txt and PropList.txt files. This uses one of the two spare
#  bytes, so now 11 out of 12 are in use.
#
# 01-March-2010:     Updated list of scripts for Unicode 5.2.0
# 30-April-2011:     Updated list of scripts for Unicode 6.0.0
#     July-2012:     Updated list of scripts for Unicode 6.1.0
# 20-August-2012:    Added scan of GraphemeBreakProperty.txt and added a new
#                      field in the record to hold the value. Luckily, the
#                      structure had a hole in it, so the resulting table is
#                      not much bigger than before.
# 18-September-2012: Added code for multiple caseless sets. This uses the
#                      final hole in the structure.
# 30-September-2012: Added RegionalIndicator break property from Unicode 6.2.0
# 13-May-2014:       Updated for PCRE2
# 03-June-2014:      Updated for Python 3
# 20-June-2014:      Updated for Unicode 7.0.0
# 12-August-2014:    Updated to put Unicode version into the file
# 19-June-2015:      Updated for Unicode 8.0.0
# 02-July-2017:      Updated for Unicode 10.0.0
# 03-July-2018:      Updated for Unicode 11.0.0
# 07-July-2018:      Added code to scan emoji-data.txt for the Extended
#                      Pictographic property.
# 01-October-2018:   Added the 'Unknown' script name
# 03-October-2018:   Added new field for Script Extensions
# 27-July-2019:      Updated for Unicode 12.1.0
# 10-March-2020:     Updated for Unicode 13.0.0
# PCRE2-10.39:       Updated for Unicode 14.0.0
# 05-December-2021:  Added code to scan DerivedBidiClass.txt for bidi class,
#                      and also PropList.txt for the Bidi_Control property
# 19-December-2021:  Reworked script extensions lists to be bit maps instead
#                      of zero-terminated lists of script numbers.
# ----------------------------------------------------------------------------
#
# Changes to the refactored script:
#
# 26-December-2021:  Refactoring completed
# 10-January-2022:   Addition of general Boolean property support
# 12-January-2022:   Merge scriptx and bidiclass fields
# 14-January-2022:   Enlarge Boolean property offset to 12 bits
# 28-January-2023:   Remove ASCII "other case" from non-ASCII characters that
#                      are present in caseless sets.
#
# ----------------------------------------------------------------------------
#
#
# The main tables generated by this script are used by macros defined in
# pcre2_internal.h. They look up Unicode character properties using short
# sequences of code that contain no branches, which makes for greater speed.
#
# Conceptually, there is a table of records (of type ucd_record), one for each
# Unicode character. Each record contains the script number, script extension
# value, character type, grapheme break type, offset to caseless matching set,
# offset to the character's other case, the bidi class, and offset to bitmap of
# Boolean properties.
#
# A real table covering all Unicode characters would be far too big. It can be
# efficiently compressed by observing that many characters have the same
# record, and many blocks of characters (taking 128 characters in a block) have
# the same set of records as other blocks. This leads to a 2-stage lookup
# process.
#
# This script constructs seven tables. The ucd_caseless_sets table contains
# lists of characters that all match each other caselessly. Each list is
# in order, and is terminated by NOTACHAR (0xffffffff), which is larger than
# any valid character. The first list is empty; this is used for characters
# that are not part of any list.
#
# The ucd_digit_sets table contains the code points of the '9' characters in
# each set of 10 decimal digits in Unicode. This is used to ensure that digits
# in script runs all come from the same set. The first element in the vector
# contains the number of subsequent elements, which are in ascending order.
#
# Scripts are partitioned into two groups. Scripts that appear in at least one
# character's script extension list come first, followed by "Unknown" and then
# all the rest. This sorting is done automatically in the GenerateCommon.py
# script. A script's number is its index in the script_names list.
#
# The ucd_script_sets table contains bitmaps that represent lists of scripts
# for Script Extensions properties. Each bitmap consists of a fixed number of
# unsigned 32-bit numbers, enough to allocate a bit for every script that is
# used in any character's extension list, that is, enough for every script
# whose number is less than ucp_Unknown. A character's script extension value
# in its ucd record is an offset into the ucd_script_sets vector. The first
# bitmap has no bits set; characters that have no script extensions have zero
# as their script extensions value so that they use this map.
#
# The ucd_boolprop_sets table contains bitmaps that represent lists of Boolean
# properties. Each bitmap consists of a fixed number of unsigned 32-bit
# numbers, enough to allocate a bit for each supported Boolean property.
#
# The ucd_records table contains one instance of every unique character record
# that is required. The ucd_stage1 table is indexed by a character's block
# number, which is the character's code point divided by 128, since 128 is the
# size of each block. The result of a lookup in ucd_stage1 is a "virtual" block
# number.
#
# The ucd_stage2 table is a table of "virtual" blocks; each block is indexed by
# the offset of a character within its own block, and the result is the index
# number of the required record in the ucd_records vector.
#
# The following examples are correct for the Unicode 14.0.0 database. Future
# updates may change the actual lookup values.
#
# Example: lowercase "a" (U+0061) is in block 0
#          lookup 0 in stage1 table yields 0
#          lookup 97 (0x61) in the first table in stage2 yields 35
#          record 35 is { 0, 5, 12, 0, -32, 18432, 44 }
#             0 = ucp_Latin   => Latin script
#             5 = ucp_Ll      => Lower case letter
#            12 = ucp_gbOther => Grapheme break property "Other"
#             0               => Not part of a caseless set
#           -32 (-0x20)       => Other case is U+0041
#         18432 = 0x4800      => Combined Bidi class + script extension values
#            44               => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#             9 = ucp_bidiL   => Bidi class left-to-right
#             0               => No special script extension property
#
# Almost all lowercase Latin characters resolve to the same record. One or two
# are different because they are part of a multi-character caseless set (for
# example, k, K and the Kelvin symbol are such a set).
#
# Example: hiragana letter A (U+3042) is in block 96 (0x60)
#          lookup 96 in stage1 table yields 93
#          lookup 66 (0x42) in table 93 in stage2 yields 819
#          record 819 is { 20, 7, 12, 0, 0, 18432, 82 }
#            20 = ucp_Hiragana => Hiragana script
#             7 = ucp_Lo       => Other letter
#            12 = ucp_gbOther  => Grapheme break property "Other"
#             0                => Not part of a caseless set
#             0                => No other case
#         18432 = 0x4800       => Combined Bidi class + script extension values
#            82                => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#             9 = ucp_bidiL   => Bidi class left-to-right
#             0               => No special script extension property
#
# Example: vedic tone karshana (U+1CD0) is in block 57 (0x39)
#          lookup 57 in stage1 table yields 55
#          lookup 80 (0x50) in table 55 in stage2 yields 621
#          record 621 is { 84, 12, 3, 0, 0, 26762, 96 }
#            84 = ucp_Inherited => Script inherited from predecessor
#            12 = ucp_Mn        => Non-spacing mark
#             3 = ucp_gbExtend  => Grapheme break property "Extend"
#             0                 => Not part of a caseless set
#             0                 => No other case
#         26762 = 0x688A        => Combined Bidi class + script extension values
#            96                 => Offset to Boolean properties
#
# The top 5 bits of the sixth field are the Bidi class, with the rest being the
# script extension value, giving:
#
#            13 = ucp_bidiNSM   => Bidi class non-spacing mark
#           138                 => Script Extension list offset = 138
#
# At offset 138 in the ucd_script_sets vector we find a bitmap with bits 1, 8,
# 18, and 47 set. This means that this character is expected to be used with
# any of those scripts, which are Bengali, Devanagari, Kannada, and Grantha.
#
# Philip Hazel, last updated 14 January 2022.
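#
# ILLUSTRATIVE SKETCH (not used by the rest of this script)
#
# The 2-stage lookup and the field decoding walked through in the examples
# above can be sketched in Python as follows. This is only a sketch under
# stated assumptions: the names sketch_compress, sketch_lookup,
# sketch_split_bidi_scriptx and sketch_bitmap_has are hypothetical and exist
# nowhere else in PCRE2; the real lookups are done by C macros in
# pcre2_internal.h. The 11-bit shift is inferred from the worked examples
# (18432 -> 9 and 0, 26762 -> 13 and 138), assuming a 16-bit combined field.

def sketch_compress(table, block_size):
  blocks = {}   # Maps the contents of a block to its "virtual" block number
  stage1 = []   # One "virtual" block number for each real block
  stage2 = []   # The unique blocks, concatenated
  for i in range(0, len(table), block_size):
    block = tuple(table[i:i+block_size])
    if block not in blocks:
      blocks[block] = len(stage2) // block_size
      stage2.extend(block)
    stage1.append(blocks[block])
  return stage1, stage2

def sketch_lookup(stage1, stage2, block_size, c):
  # Index stage1 by block number, then stage2 by the offset within the block
  return stage2[stage1[c // block_size] * block_size + c % block_size]

def sketch_split_bidi_scriptx(value):
  # Top 5 bits are the Bidi class; the remaining 11 bits are the script
  # extension offset (assumed 16-bit field, see above)
  return value >> 11, value & 0x7ff

def sketch_bitmap_has(words, n):
  # Test bit n in a bitmap held as a list of unsigned 32-bit words, using the
  # same bit layout that write_bitsets() below produces
  return (words[n // 32] >> (n & 31)) & 1 == 1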
##############################################################################


# Import standard modules

import re
import string
import sys

# Import common data lists and functions

from GenerateCommon import \
  bidi_classes, \
  bool_properties, \
  bool_propsfiles, \
  bool_props_list_item_size, \
  break_properties, \
  category_names, \
  general_category_names, \
  script_abbrevs, \
  script_list_item_size, \
  script_names, \
  open_output

# Some general parameters

MAX_UNICODE = 0x110000
NOTACHAR = 0xffffffff


# ---------------------------------------------------------------------------
#                         DEFINE FUNCTIONS
# ---------------------------------------------------------------------------


# Parse a line of Scripts.txt, GraphemeBreakProperty.txt or DerivedGeneralCategory.txt

def make_get_names(enum):
  return lambda chardata: enum.index(chardata[1])


# Parse a line of DerivedBidiClass.txt

def get_bidi(chardata):
  if len(chardata[1]) > 3:
    return bidi_classes_long.index(chardata[1])
  else:
    return bidi_classes_short.index(chardata[1])


# Parse a line of CaseFolding.txt

def get_other_case(chardata):
  if chardata[1] == 'C' or chardata[1] == 'S':
    return int(chardata[2], 16) - int(chardata[0], 16)
  return None


# Parse a line of ScriptExtensions.txt

def get_script_extension(chardata):
  global last_script_extension

  offset = len(script_lists) * script_list_item_size
  if last_script_extension == chardata[1]:
    return offset - script_list_item_size

  last_script_extension = chardata[1]
  script_lists.append(tuple(script_abbrevs.index(abbrev) for abbrev in last_script_extension.split(' ')))
  return offset


# Read a whole table in memory, setting/checking the Unicode version

def read_table(file_name, get_value, default_value):
  global unicode_version

  f = re.match(r'^[^/]+/([^.]+)\.txt$', file_name)
  file_base = f.group(1)
  version_pat = r"^# " + re.escape(file_base) + r"-(\d+\.\d+\.\d+)\.txt$"
  file = open(file_name, 'r', encoding='utf-8')
  f = re.match(version_pat, file.readline())
  version = f.group(1)
  if unicode_version == "":
    unicode_version = version
  elif unicode_version != version:
    # Use % so that the file name is substituted into the message
    print("WARNING: Unicode version differs in %s" % file_name, file=sys.stderr)

  table = [default_value] * MAX_UNICODE
  for line in file:
    if file_base == 'DerivedBidiClass':
      line = re.sub(r'# @missing: ', '', line)

    line = re.sub(r'#.*', '', line)
    chardata = list(map(str.strip, line.split(';')))
    if len(chardata) <= 1:
      continue
    value = get_value(chardata)
    if value is None:
      continue
    m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
    char = int(m.group(1), 16)
    if m.group(3) is None:
      last = char
    else:
      last = int(m.group(3), 16)
    for i in range(char, last + 1):
      table[i] = value

  file.close()
  return table


# Get the smallest possible C language type for the values in a table

def get_type_size(table):
  type_size = [("uint8_t", 1), ("uint16_t", 2), ("uint32_t", 4),
    ("signed char", 1), ("int16_t", 2), ("int32_t", 4)]
  limits = [(0, 255), (0, 65535), (0, 4294967295), (-128, 127),
    (-32768, 32767), (-2147483648, 2147483647)]
  minval = min(table)
  maxval = max(table)
  for num, (minlimit, maxlimit) in enumerate(limits):
    if minlimit <= minval and maxval <= maxlimit:
      return type_size[num]
  raise OverflowError("Too large to fit into C types")


# Get the total size of a list of tables

def get_tables_size(*tables):
  total_size = 0
  for table in tables:
    type, size = get_type_size(table)
    total_size += size * len(table)
  return total_size


# Compress a table into the two stages

def compress_table(table, block_size):
  blocks = {} # Dictionary for finding identical blocks
  stage1 = [] # Stage 1 table contains block numbers (indices into stage 2 table)
  stage2 = [] # Stage 2 table contains the blocks with property values
  table = tuple(table)
  for i in range(0, len(table), block_size):
    block = table[i:i+block_size]
    start = blocks.get(block)
    if start is None:
      # Allocate a new block (integer division keeps block numbers as ints)
      start = len(stage2) // block_size
      stage2 += block
      blocks[block] = start
    stage1.append(start)
  return stage1, stage2


# Output a table

def write_table(table, table_name, block_size = None):
  type, size = get_type_size(table)
  ELEMS_PER_LINE = 16

  s = "const %s %s[] = { /* %d bytes" % (type, table_name, size * len(table))
  if block_size:
    s += ", block = %d" % block_size
  f.write(s + " */\n")
  table = tuple(table)
  if block_size is None:
    fmt = "%3d," * ELEMS_PER_LINE + " /* U+%04X */\n"
    mult = MAX_UNICODE / len(table)
    for i in range(0, len(table), ELEMS_PER_LINE):
      f.write(fmt % (table[i:i+ELEMS_PER_LINE] + (int(i * mult),)))
  else:
    if block_size > ELEMS_PER_LINE:
      el = ELEMS_PER_LINE
    else:
      el = block_size
    fmt = "%3d," * el + "\n"
    if block_size > ELEMS_PER_LINE:
      fmt = fmt * int(block_size / ELEMS_PER_LINE)
    for i in range(0, len(table), block_size):
      f.write(("\n/* block %d */\n" + fmt) % ((i // block_size,) + table[i:i+block_size]))
  f.write("};\n\n")


# Extract the unique combinations of properties into records

def combine_tables(*tables):
  records = {}
  index = []
  for t in zip(*tables):
    i = records.get(t)
    if i is None:
      i = records[t] = len(records)
    index.append(i)
  return index, records


# Create a record struct

def get_record_size_struct(records):
  size = 0
  structure = 'typedef struct {\n'
  for i in range(len(records[0])):
    record_slice = [record[i] for record in records]
    slice_type, slice_size = get_type_size(record_slice)
    # add padding: round up to the nearest multiple of slice_size
    size = (size + slice_size - 1) & -slice_size
    size += slice_size
    structure += '%s property_%d;\n' % (slice_type, i)

  # round up to the first item of the next structure in array
  record_slice = [record[0] for record in records]
  slice_type, slice_size = get_type_size(record_slice)
  size = (size + slice_size - 1) & -slice_size

  structure += '} ucd_record;\n*/\n'
  return size, structure


# Write records

def write_records(records, record_size):
  f.write('const ucd_record PRIV(ucd_records)[] = { ' + \
    '/* %d bytes, record size %d */\n' % (len(records) * record_size, record_size))
  records = list(zip(list(records.keys()), list(records.values())))
  records.sort(key = lambda x: x[1])
  for i, record in enumerate(records):
    f.write((' {' + '%6d, ' * len(record[0]) + '}, /* %3d */\n') % (record[0] + (i,)))
  f.write('};\n\n')


# Write a bit set

def write_bitsets(list, item_size):
  for d in list:
    bitwords = [0] * item_size
    for idx in d:
      bitwords[idx // 32] |= 1 << (idx & 31)
    s = " "
    for x in bitwords:
      f.write("%s" % s)
      s = ", "
      f.write("0x%08xu" % x)
    f.write(",\n")
  f.write("};\n\n")


# ---------------------------------------------------------------------------
# This bit of code must have been useful when the original script was being
# developed. Retain it just in case it is ever needed again.

# def test_record_size():
#   tests = [ \
#     ( [(3,), (6,), (6,), (1,)], 1 ), \
#     ( [(300,), (600,), (600,), (100,)], 2 ), \
#     ( [(25, 3), (6, 6), (34, 6), (68, 1)], 2 ), \
#     ( [(300, 3), (6, 6), (340, 6), (690, 1)], 4 ), \
#     ( [(3, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
#     ( [(300, 300), (6, 6), (6, 340), (1, 690)], 4 ), \
#     ( [(3, 100000), (6, 6), (6, 123456), (1, 690)], 8 ), \
#     ( [(100000, 300), (6, 6), (123456, 6), (1, 690)], 8 ), \
#   ]
#   for test in tests:
#     size, struct = get_record_size_struct(test[0])
#     assert(size == test[1])
# test_record_size()
# ---------------------------------------------------------------------------



# ---------------------------------------------------------------------------
#                       MAIN CODE FOR CREATING TABLES
# ---------------------------------------------------------------------------

unicode_version = ""

# Some of the tables imported from GenerateCommon.py have alternate comment
# strings for use by GenerateUcpHeader. The comments are not wanted here, so
# remove them.

bidi_classes_short = bidi_classes[::2]
bidi_classes_long = bidi_classes[1::2]
break_properties = break_properties[::2]
category_names = category_names[::2]

# Create the various tables from Unicode data files

script = read_table('Unicode.tables/Scripts.txt', make_get_names(script_names), script_names.index('Unknown'))
category = read_table('Unicode.tables/DerivedGeneralCategory.txt', make_get_names(category_names), category_names.index('Cn'))
break_props = read_table('Unicode.tables/GraphemeBreakProperty.txt', make_get_names(break_properties), break_properties.index('Other'))
other_case = read_table('Unicode.tables/CaseFolding.txt', get_other_case, 0)
bidi_class = read_table('Unicode.tables/DerivedBidiClass.txt', get_bidi, bidi_classes_short.index('L'))

# The grapheme breaking rules were changed for Unicode 11.0.0 (June 2018). Now
# we need to find the Extended_Pictographic property for emoji characters. This
# can be set as an additional grapheme break property, because the default for
# all the emojis is "other". We scan the emoji-data.txt file and modify the
# break-props table.

file = open('Unicode.tables/emoji-data.txt', 'r', encoding='utf-8')
for line in file:
  line = re.sub(r'#.*', '', line)
  chardata = list(map(str.strip, line.split(';')))
  if len(chardata) <= 1:
    continue
  if chardata[1] != "Extended_Pictographic":
    continue
  m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', chardata[0])
  char = int(m.group(1), 16)
  if m.group(3) is None:
    last = char
  else:
    last = int(m.group(3), 16)
  for i in range(char, last + 1):
    if break_props[i] != break_properties.index('Other'):
      # The values must be applied with %; passing them as extra print()
      # arguments would output the format string unformatted.
      print("WARNING: Emoji 0x%x has break property %s, not 'Other'" %
        (i, break_properties[break_props[i]]), file=sys.stderr)
    break_props[i] = break_properties.index('Extended_Pictographic')
file.close()

# Handle script extensions. The get_script_extension() function maintains a
# list of unique bitmaps representing lists of scripts, returning the offset
# in that list. Initialize the list with an empty set, which is used for
# characters that have no script extensions.

script_lists = [[]]
last_script_extension = ""
scriptx_bidi_class = read_table('Unicode.tables/ScriptExtensions.txt', get_script_extension, 0)

for idx in range(len(scriptx_bidi_class)):
  scriptx_bidi_class[idx] = scriptx_bidi_class[idx] | (bidi_class[idx] << 11)
bidi_class = None

# Find the Boolean properties of each character. This next bit of magic creates
# a list of empty lists. Using [[]] * MAX_UNICODE gives a list of references to
# the *same* list, which is not what we want.

bprops = [[] for _ in range(MAX_UNICODE)]

# Collect the properties from the various files

for filename in bool_propsfiles:
  try:
    file = open('Unicode.tables/' + filename, 'r')
  except IOError:
    print(f"** Couldn't open {'Unicode.tables/' + filename}\n")
    sys.exit(1)

  for line in file:
    line = re.sub(r'#.*', '', line)
    data = list(map(str.strip, line.split(';')))
    if len(data) <= 1:
      continue

    try:
      ix = bool_properties.index(data[1])
    except ValueError:
      continue

    m = re.match(r'([0-9a-fA-F]+)(\.\.([0-9a-fA-F]+))?$', data[0])
    char = int(m.group(1), 16)
    if m.group(3) is None:
      last = char
    else:
      last = int(m.group(3), 16)

    for i in range(char, last + 1):
      bprops[i].append(ix)

  file.close()

# The ASCII property isn't listed in any files, but it is easy enough to add
# it manually.

ix = bool_properties.index("ASCII")
for i in range(128):
  bprops[i].append(ix)

# The Bidi_Mirrored property isn't listed in any property files. We have to
# deduce it from the file that lists the mirrored characters.

ix = bool_properties.index("Bidi_Mirrored")

try:
  file = open('Unicode.tables/BidiMirroring.txt', 'r')
except IOError:
  print(f"** Couldn't open {'Unicode.tables/BidiMirroring.txt'}\n")
  sys.exit(1)

for line in file:
  line = re.sub(r'#.*', '', line)
  data = list(map(str.strip, line.split(';')))
  if len(data) <= 1:
    continue
  c = int(data[0], 16)
  bprops[c].append(ix)

file.close()

# Scan each character's Boolean property list and create a list of unique
# lists, at the same time setting, for each character, the offset of its list
# in the bool_props vector.

bool_props = [0] * MAX_UNICODE
bool_props_lists = [[]]

for c in range(MAX_UNICODE):
  s = set(bprops[c])
  for i in range(len(bool_props_lists)):
    if s == set(bool_props_lists[i]):
      break
  else:
    bool_props_lists.append(bprops[c])
    i += 1

  bool_props[c] = i * bool_props_list_item_size

# This block of code was added by PH in September 2012. It scans the other_case
# table to find sets of more than two characters that must all match each other
# caselessly. Later in this script a table of these sets is written out.
# However, we have to do this work here in order to compute the offsets in the
# table that are inserted into the main table.

# The CaseFolding.txt file lists pairs, but the common logic for reading data
# sets only one value, so first we go through the table and set "return"
# offsets for those that are not already set.

for c in range(MAX_UNICODE):
  if other_case[c] != 0 and other_case[c + other_case[c]] == 0:
    other_case[c + other_case[c]] = -other_case[c]

# Now scan again and create equivalence sets.

caseless_sets = []

for c in range(MAX_UNICODE):
  o = c + other_case[c]

  # Trigger when this character's other case does not point back here. We
  # now have three characters that are case-equivalent.

  if other_case[o] != -other_case[c]:
    t = o + other_case[o]

    # Scan the existing sets to see if any of the three characters are already
    # part of a set. If so, unite the existing set with the new set.

    appended = 0
    for s in caseless_sets:
      found = 0
      for x in s:
        if x == c or x == o or x == t:
          found = 1

      # Add new characters to an existing set. Each of the three characters
      # is tested for membership separately, so that finding one of them does
      # not prevent the others from being added.

      if found:
        for y in [c, o, t]:
          if y not in s:
            s.append(y)
        appended = 1

    # If we have not added to an existing set, create a new one.

    if not appended:
      caseless_sets.append([c, o, t])

# End of loop looking for caseless sets.

# Now scan the sets and set appropriate offsets for the characters.

caseless_offsets = [0] * MAX_UNICODE

offset = 1
for s in caseless_sets:
  for x in s:
    caseless_offsets[x] = offset
  offset += len(s) + 1

# End of block of code for creating offsets for caseless matching sets.

# Scan the caseless sets, and for any non-ASCII character that has an ASCII
# character as its "base" other case, remove the other case. This makes it
# easier to handle those characters when the PCRE2 option for not mixing ASCII
# and non-ASCII is enabled. In principle one should perhaps scan for a
# non-ASCII alternative, but in practice these don't exist.

for s in caseless_sets:
  for x in s:
    if x > 127 and x + other_case[x] < 128:
      other_case[x] = 0

# Combine all the tables

table, records = combine_tables(script, category, break_props,
  caseless_offsets, other_case, scriptx_bidi_class, bool_props)

# Find the record size and create a string definition of the structure for
# outputting as a comment.

record_size, record_struct = get_record_size_struct(list(records.keys()))

# Find the optimum block size for the two-stage table

min_size = sys.maxsize
for block_size in [2 ** i for i in range(5,10)]:
  size = len(records) * record_size
  stage1, stage2 = compress_table(table, block_size)
  size += get_tables_size(stage1, stage2)
  #print("/* block size {:3d} => {:5d} bytes */".format(block_size, size))
  if size < min_size:
    min_size = size
    min_stage1, min_stage2 = stage1, stage2
    min_block_size = block_size


# ---------------------------------------------------------------------------
#                   MAIN CODE FOR WRITING THE OUTPUT FILE
# ---------------------------------------------------------------------------

# Open the output file (no return on failure). This call also writes standard
# header boilerplate.

f = open_output("pcre2_ucd.c")

# Output this file's heading text

f.write("""\
/* This file contains tables of Unicode properties that are extracted from
Unicode data files. See the comments at the start of maint/GenerateUcd.py for
details.

As well as being part of the PCRE2 library, this file is #included by the
pcre2test program, which redefines the PRIV macro to change table names from
_pcre2_xxx to xxxx, thereby avoiding name clashes with the library. At present,
just one of these tables is actually needed. When compiling the library, some
headers are needed. */

#ifndef PCRE2_PCRE2TEST
#ifdef HAVE_CONFIG_H
#include "config.h"
#endif
#include "pcre2_internal.h"
#endif /* PCRE2_PCRE2TEST */

/* The tables herein are needed only when UCP support is built, and in PCRE2
that happens automatically with UTF support. This module should not be
referenced otherwise, so it should not matter whether it is compiled or not.
However a comment was received about space saving - maybe the guy linked all
the modules rather than using a library - so we include a condition to cut out
the tables when not needed. But don't leave a totally empty module because some
compilers barf at that. Instead, just supply some small dummy tables. */

#ifndef SUPPORT_UNICODE
const ucd_record PRIV(ucd_records)[] = {{0,0,0,0,0,0,0}};
const uint16_t PRIV(ucd_stage1)[] = {0};
const uint16_t PRIV(ucd_stage2)[] = {0};
const uint32_t PRIV(ucd_caseless_sets)[] = {0};
#else
\n""")

# --- Output some variable heading stuff ---

f.write("/* Total size: %d bytes, block size: %d. */\n\n" % (min_size, min_block_size))
f.write('const char *PRIV(unicode_version) = "{}";\n\n'.format(unicode_version))

f.write("""\
/* When recompiling tables with a new Unicode version, please check the types
in this structure definition with those in pcre2_internal.h (the actual field
names will be different).
\n""")

f.write(record_struct)

f.write("""
/* If the 32-bit library is run in non-32-bit mode, character values greater
than 0x10ffff may be encountered. For these we set up a special record. */

#if PCRE2_CODE_UNIT_WIDTH == 32
const ucd_record PRIV(dummy_ucd_record)[] = {{
  ucp_Unknown,    /* script */
  ucp_Cn,         /* type unassigned */
  ucp_gbOther,    /* grapheme break property */
  0,              /* case set */
  0,              /* other case */
  0 | (ucp_bidiL << UCD_BIDICLASS_SHIFT), /* script extension and bidi class */
  0,              /* bool properties offset */
  }};
#endif
\n""")

# --- Output the table of caseless character sets ---

f.write("""\
/* This table contains lists of characters that are caseless sets of
more than one character. Each list is terminated by NOTACHAR. */

const uint32_t PRIV(ucd_caseless_sets)[] = {
  NOTACHAR,
""")

for s in caseless_sets:
  s = sorted(s)
  for x in s:
    f.write('  0x%04x,' % x)
  f.write('  NOTACHAR,\n')
f.write('};\n\n')

# --- Other tables are not needed by pcre2test ---

f.write("""\
/* When #included in pcre2test, we don't need the table of digit sets, nor the
large main UCD tables. */

#ifndef PCRE2_PCRE2TEST
\n""")

# --- Read Scripts.txt again for the sets of 10 digits. ---

digitsets = []
file = open('Unicode.tables/Scripts.txt', 'r', encoding='utf-8')

for line in file:
  m = re.match(r'([0-9a-fA-F]+)\.\.([0-9a-fA-F]+)\s+;\s+\S+\s+#\s+Nd\s+', line)
  if m is None:
    continue
  first = int(m.group(1),16)
  last = int(m.group(2),16)
  if ((last - first + 1) % 10) != 0:
    # Report on stderr with print(); f.write() takes no "file" argument, and
    # the message does not belong in the generated C file.
    print("ERROR: %04x..%04x does not contain a multiple of 10 characters" % (first, last),
      file=sys.stderr)
  while first < last:
    digitsets.append(first + 9)
    first += 10
file.close()
digitsets.sort()

f.write("""\
/* This table lists the code points for the '9' characters in each set of
decimal digits. It is used to ensure that all the digits in a script run come
from the same set. */

const uint32_t PRIV(ucd_digit_sets)[] = {
""")

f.write("  %d,  /* Number of subsequent values */" % len(digitsets))
count = 8
for d in digitsets:
  if count == 8:
    f.write("\n ")
    count = 0
  f.write(" 0x%05x," % d)
  count += 1
f.write("\n};\n\n")

f.write("""\
/* This vector is a list of script bitsets for the Script Extension property.
The number of 32-bit words in each bitset is #defined in pcre2_ucp.h as
ucd_script_sets_item_size. */

const uint32_t PRIV(ucd_script_sets)[] = {
""")
write_bitsets(script_lists, script_list_item_size)

f.write("""\
/* This vector is a list of bitsets for Boolean properties. The number of
32-bit words in each bitset is #defined as ucd_boolprop_sets_item_size in
pcre2_ucp.h. */

const uint32_t PRIV(ucd_boolprop_sets)[] = {
""")
write_bitsets(bool_props_lists, bool_props_list_item_size)


# Output the main UCD tables.

f.write("""\
/* These are the main two-stage UCD tables. The fields in each record are:
script (8 bits), character type (8 bits), grapheme break property (8 bits),
offset to multichar other cases or zero (8 bits), offset to other case or zero
(32 bits, signed), bidi class (5 bits) and script extension (11 bits) packed
into a 16-bit field, and offset in binary properties table (16 bits). */
\n""")

write_records(records, record_size)
write_table(min_stage1, 'PRIV(ucd_stage1)')
write_table(min_stage2, 'PRIV(ucd_stage2)', min_block_size)

# The first labelled #endif below closes the inner "#ifndef PCRE2_PCRE2TEST"
# block; the final one closes the "#ifndef SUPPORT_UNICODE ... #else" block.

f.write("#if UCD_BLOCK_SIZE != %d\n" % min_block_size)
f.write("""\
#error Please correct UCD_BLOCK_SIZE in pcre2_internal.h
#endif
#endif  /* PCRE2_PCRE2TEST */

#endif  /* SUPPORT_UNICODE */

/* End of pcre2_ucd.c */
""")

f.close()

# End