<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
    <para>
      But, for other scripts (often unceremoniously called
      <emphasis>complex scripts</emphasis>), any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
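    </para>
    <para>
      As a rough illustration of the simple case described above, the
      following Python sketch lays out a left-to-right run by doing
      nothing more than accumulating advances. The advance-width table,
      the fallback width, and the function name are all made up for
      this example; a real shaping engine works on glyph IDs and reads
      the advances from the font.
    </para>
    <programlisting language="python"><![CDATA[
```python
# Hypothetical advance widths in font units, keyed by character.
# A real shaper maps codepoints to glyph IDs and reads advances from the font.
ADVANCES = {"c": 500, "a": 550, "t": 300}

# Hypothetical fallback width for characters missing from the table.
DEFAULT_ADVANCE = 600


def shape_simple(text):
    """Return (character, x_position) pairs for a left-to-right run."""
    positions = []
    pen_x = 0  # current horizontal position of the "pen"
    for ch in text:
        positions.append((ch, pen_x))
        # Move the pen forward by this character's advance width.
        pen_x += ADVANCES.get(ch, DEFAULT_ADVANCE)
    return positions
```
]]></programlisting>
    <para>
      With this made-up table, <literal>shape_simple("cat")</literal>
      places the three glyphs at x positions 0, 500, and 1050. Scripts
      that need substitution, reordering, or mark positioning cannot
      be handled by a loop this simple.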
    </para>
  </section>

  <section id="script-specific-shaping">
    <title>Script-specific shaping</title>
    <para>
      In many scripts, transforming the input sequence into the final
      layout requires some combination of operations, such as
      context-dependent substitutions, context-dependent mark
      positioning, glyph-to-glyph joining, glyph reordering, or glyph
      stacking.
    </para>
    <para>
      In some scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Still other scripts do not require these
      operations. However, correctly shaping a text run in
      any script may still involve Unicode normalization,
      ligature substitutions, mark positioning, kerning, and applying
      other font features.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          adjusts the horizontal position of a glyph, its vertical
          position, or both. This adjustment is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks.
      The UIPC property sub-categorizes Mark codepoints by the relative
      visual position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts, along with punctuation, numbers,
      symbols, white-space characters, and other codepoints that do
      not belong to any script. Real-world text may also be marked up
      with formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides shaping models for the following scripts:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          scripts that have no script-specific shaping model, and may
          also be used as a fallback for handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, and Telugu.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports scripts not covered by one of
          the script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines to match glyph
      sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
  </section>
</chapter>