<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi  CDATA  #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="shaping-concepts">
  <title>Shaping concepts</title>
  <section id="text-shaping-concepts">
    <title>Text shaping</title>
    <para>
      Text shaping is the process of transforming a sequence of Unicode
      codepoints that represent individual characters (letters,
      diacritics, tone marks, numbers, symbols, etc.) into the
      orthographically and linguistically correct two-dimensional layout
      of glyph shapes taken from a specified font.
    </para>
    <para>
      For some writing systems (or <emphasis>scripts</emphasis>) and
      languages, the process is simple, requiring the shaper to do
      little more than advance the horizontal position forward by the
      correct amount for each successive glyph.
    </para>
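    <para>
      The simple case described above can be sketched in a few lines.
      The following Python fragment is an illustration only, not
      HarfBuzz code: the <literal>ADVANCES</literal> table and its
      widths are hypothetical values standing in for the advance widths
      that a real shaper would read from the font.
    </para>
    <programlisting language="python"><![CDATA[# Toy pen-advance layout for a "simple" script.
# ADVANCES is a hypothetical table of advance widths in font units;
# a real shaper reads these values from the font itself.
ADVANCES = {"H": 722, "i": 278, "!": 333}


def layout_simple(text, advances, default_advance=500):
    """Place each glyph at the current pen position, then advance the pen."""
    x = 0
    positions = []
    for ch in text:
        positions.append((ch, x))
        x += advances.get(ch, default_advance)
    return positions, x


positions, total_width = layout_simple("Hi!", ADVANCES)
for glyph, x in positions:
    print(glyph, x)
print("total width:", total_width)
]]></programlisting>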
    <para>
      But, for other scripts (often unceremoniously called
      <emphasis>complex scripts</emphasis>), any combination of
      several shaping operations may be required, and the rules for how
      and when they are applied vary from script to script. HarfBuzz and
      other shaping engines implement these rules.
    </para>
    <para>
      The exact rules and necessary operations for a particular script
      constitute a shaping <emphasis>model</emphasis>. OpenType
      specifies a set of shaping models that covers all of
      Unicode. Other shaping models are available, however, including
      Graphite and Apple Advanced Typography (AAT).
    </para>
  </section>

  <section id="script-specific-shaping">
    <title>Script-specific shaping</title>
    <para>
      In many scripts, transforming the input sequence into the final
      layout requires some combination of operations, such as
      context-dependent substitutions, context-dependent mark
      positioning, glyph-to-glyph joining, glyph reordering, or glyph
      stacking.
    </para>
    <para>
      In some scripts, the shaping rules require that a text
      run be divided into syllables before the operations can be
      applied. Other scripts may apply shaping operations over
      entire words or over the entire text run, with no subdivision
      required.
    </para>
    <para>
      Other scripts do not require these operations. However,
      correctly shaping a text run in any script may still involve
      Unicode normalization, ligature substitutions, mark positioning,
      kerning, and the application of other font features.
    </para>
  </section>

  <section id="shaping-operations">
    <title>Shaping operations</title>
    <para>
      Shaping a text run involves transforming the
      input sequence of Unicode codepoints with some combination of
      operations that is specified in the shaping model for the
      script.
    </para>
    <para>
      The specific conditions that trigger a given operation for a
      text run vary from script to script, as do the order in which
      the operations are performed and the codepoints that are
      affected. However, the same general set of shaping operations is
      common to all of the script shaping models.
    </para>

    <itemizedlist>
      <listitem>
        <para>
          A <emphasis>reordering</emphasis> operation moves a glyph
          from its original ("logical") position in the sequence to
          some other ("visual") position.
        </para>
        <para>
          The shaping model for a given script might involve
          more than one reordering step.
        </para>
      </listitem>

      <listitem>
        <para>
          A <emphasis>joining</emphasis> operation replaces a glyph
          with an alternate form that is designed to connect with one
          or more of the adjacent glyphs in the sequence.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>substitution</emphasis> operation
          replaces either a single glyph or a subsequence of several
          glyphs with an alternate glyph. This substitution is
          performed when the original glyph or subsequence of glyphs
          occurs in a specified position with respect to the
          surrounding sequence. For example, one substitution might be
          performed only when the target glyph is the first glyph in
          the sequence, while another substitution is performed only
          when a different target glyph occurs immediately after a
          particular string pattern.
        </para>
        <para>
          The shaping model for a given script might involve
          multiple contextual-substitution operations, each applying
          to different target glyphs and patterns and each performed
          in a separate step.
        </para>
      </listitem>

      <listitem>
        <para>
          A contextual <emphasis>positioning</emphasis> operation
          moves the horizontal and/or vertical position of a
          glyph. This positioning move is performed when the glyph
          occurs in a specified position with respect to the
          surrounding sequence.
        </para>
        <para>
          Many contextual positioning operations are used to place
          <emphasis>mark</emphasis> glyphs (such as diacritics, vowel
          signs, and tone markers) with respect to
          <emphasis>base</emphasis> glyphs. However, some
          scripts may use contextual positioning operations to
          correctly place base glyphs as well, such as
          when the script uses <emphasis>stacking</emphasis> characters.
        </para>
      </listitem>

    </itemizedlist>
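    <para>
      Two of the operations listed above, reordering and contextual
      substitution, can be illustrated with a deliberately simplified
      Python sketch. It is not how HarfBuzz implements either
      operation. The function names and the glyph name
      <literal>a.init</literal> are hypothetical, and the reordering
      step only swaps a pre-base vowel sign with the single glyph
      before it, whereas a real shaping model works on whole
      syllables.
    </para>
    <programlisting language="python"><![CDATA[# U+093F DEVANAGARI VOWEL SIGN I is stored after its consonant in
# logical order but is drawn to the left of it in visual order.
PRE_BASE_VOWEL_SIGNS = {"\u093F"}


def reorder_pre_base(glyphs):
    """Swap each pre-base vowel sign with the glyph immediately before it.

    Simplification: a real shaping model moves the sign in front of the
    whole consonant cluster of the syllable, not just one glyph.
    """
    out = list(glyphs)
    for i in range(1, len(out)):
        if out[i] in PRE_BASE_VOWEL_SIGNS:
            out[i - 1], out[i] = out[i], out[i - 1]
    return out


def substitute_initial(glyphs, target, replacement):
    """Replace `target` only when it is the first glyph in the sequence."""
    out = list(glyphs)
    if out and out[0] == target:
        out[0] = replacement
    return out


# Logical order: KA (U+0915) followed by VOWEL SIGN I (U+093F).
visual = reorder_pre_base(["\u0915", "\u093F"])

# Hypothetical glyph names: only the first "a" receives the initial form.
substituted = substitute_initial(["a", "b", "a"], "a", "a.init")
]]></programlisting>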
  </section>

  <section id="unicode-character-categories">
    <title>Unicode character categories</title>
    <para>
      Shaping models are typically specified with respect to how
      scripts are defined in the Unicode standard.
    </para>
    <para>
      Every codepoint in the Unicode Character Database (UCD) is
      assigned a <emphasis>Unicode General Category</emphasis> (UGC),
      which provides the most fundamental information about the
      codepoint: whether the codepoint represents a
      <emphasis>Letter</emphasis>, a <emphasis>Mark</emphasis>, a
      <emphasis>Number</emphasis>, <emphasis>Punctuation</emphasis>, a
      <emphasis>Symbol</emphasis>, a <emphasis>Separator</emphasis>,
      or something else (<emphasis>Other</emphasis>).
    </para>
    <para>
      These UGC properties are "Major" categories. Each codepoint is
      further assigned to a "minor" category within its Major
      category, such as "Letter, uppercase" (<literal>Lu</literal>) or
      "Letter, modifier" (<literal>Lm</literal>).
    </para>
    <para>
      Shaping models are concerned primarily with Letter and Mark
      codepoints. The minor categories of Mark codepoints are
      particularly important for shaping. Marks can be nonspacing
      (<literal>Mn</literal>), spacing combining
      (<literal>Mc</literal>), or enclosing (<literal>Me</literal>).
    </para>
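    <para>
      The Major and minor categories can be inspected directly. As an
      illustration, the sketch below uses the
      <literal>unicodedata</literal> module from the Python standard
      library rather than the HarfBuzz API. The helper name
      <literal>describe</literal> is made up for this example, and the
      results reflect the Unicode version bundled with the Python
      interpreter.
    </para>
    <programlisting language="python"><![CDATA[import unicodedata


def describe(ch):
    """Return (Major category, minor category) for one character.

    unicodedata.category() returns the two-letter minor category;
    its first letter is the Major category.
    """
    minor = unicodedata.category(ch)
    return minor[0], minor


samples = [
    "A",       # LATIN CAPITAL LETTER A       (Letter, uppercase)
    "\u02B0",  # MODIFIER LETTER SMALL H      (Letter, modifier)
    "\u0301",  # COMBINING ACUTE ACCENT       (Mark, nonspacing)
    "\u093E",  # DEVANAGARI VOWEL SIGN AA     (Mark, spacing combining)
    "\u20DD",  # COMBINING ENCLOSING CIRCLE   (Mark, enclosing)
]

for ch in samples:
    major, minor = describe(ch)
    print("U+%04X" % ord(ch), major, minor)
]]></programlisting>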
    <para>
      In addition to the UGC property, codepoints in the Indic and
      Southeast Asian scripts are also assigned
      <emphasis>Unicode Indic Syllabic Category</emphasis> (UISC) and
      <emphasis>Unicode Indic Positional Category</emphasis> (UIPC)
      properties that provide more detailed information needed for
      shaping.
    </para>
    <para>
      The UISC property sub-categorizes Letters and Marks according to
      common script-shaping behaviors. For example, UISC distinguishes
      between consonant letters, vowel letters, and vowel marks. The
      UIPC property sub-categorizes Mark codepoints by the relative visual
      position that they occupy (above, below, right, left, or in
      multiple positions).
    </para>
    <para>
      Some scripts require that the text run be split into
      syllables. What constitutes a valid syllable in these
      scripts is specified in regular expressions, formed from the
      Letter and Mark codepoints, that take the UISC and UIPC
      properties into account.
    </para>
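    <para>
      To show the general idea of a syllable regular expression, the
      Python sketch below classifies a few Devanagari codepoint ranges
      into one-letter classes and matches a pattern over the resulting
      class string. This is a toy grammar written for this example. It
      is far simpler than the syllable definitions in the actual
      shaping models, and the hard-coded ranges stand in for real UISC
      and UIPC lookups, which the Python standard library does not
      provide.
    </para>
    <programlisting language="python"><![CDATA[import re


def classify(ch):
    """Map a character to a one-letter class (toy stand-in for UISC)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:
        return "C"  # consonant letter
    if cp == 0x094D:
        return "H"  # virama (halant)
    if 0x093E <= cp <= 0x094C:
        return "M"  # dependent vowel sign
    if 0x0904 <= cp <= 0x0914:
        return "V"  # independent vowel letter
    return "X"      # anything else


# Toy syllable: zero or more consonant+virama pairs, a consonant, and an
# optional vowel sign; or an independent vowel; or any single leftover.
SYLLABLE = re.compile(r"(?:CH)*CM?|V|.")


def split_syllables(text):
    """Split text into toy syllables by matching over its class string."""
    classes = "".join(classify(ch) for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# KA, VIRAMA, SSA, TA, VIRAMA, RA, VOWEL SIGN I, YA
word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"
syllables = split_syllables(word)
]]></programlisting>
    <para>
      With this toy grammar, the sample word splits into three
      clusters: KA+VIRAMA+SSA, TA+VIRAMA+RA+VOWEL SIGN I, and YA.
    </para>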

  </section>

  <section id="text-runs">
    <title>Text runs</title>
    <para>
      Real-world text usually contains codepoints from a mixture of
      different Unicode scripts (including punctuation, numbers, symbols,
      white-space characters, and other codepoints that do not belong
      to any script). Real-world text may also be marked up with
      formatting that changes font properties (including the font,
      font style, and font size).
    </para>
    <para>
      For shaping purposes, all real-world text streams must first be
      segmented into runs that have a uniform set of properties.
    </para>
    <para>
      In particular, shaping models always assume that every codepoint
      in a text run has the same <emphasis>direction</emphasis>,
      <emphasis>script</emphasis> tag, and
      <emphasis>language</emphasis> tag.
    </para>
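    <para>
      A rough sketch of script-based segmentation follows. It is only a
      heuristic illustration in Python: the standard library does not
      expose the Unicode Script property, so the hypothetical helper
      <literal>crude_script</literal> guesses a script from the first
      word of each letter's character name. Characters that are not
      letters (spaces, digits, punctuation, marks) simply join the
      current run. Direction, language, and font changes, which also
      delimit runs, are ignored here.
    </para>
    <programlisting language="python"><![CDATA[import unicodedata


def crude_script(ch):
    """Heuristic stand-in for the Unicode Script property (an assumption).

    Letters report the first word of their character name, for example
    "LATIN" or "HEBREW". Everything else reports None.
    """
    if unicodedata.category(ch).startswith("L"):
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return None


def segment_runs(text):
    """Group text into (script, substring) runs."""
    runs = []  # each entry is [script, list_of_characters]
    for ch in text:
        script = crude_script(ch)
        if runs and (script is None or script == runs[-1][0]):
            # Neutral characters and same-script letters extend the run.
            runs[-1][1].append(ch)
        elif runs and runs[-1][0] is None:
            # A run of leading neutrals adopts the first real script.
            runs[-1][0] = script
            runs[-1][1].append(ch)
        else:
            runs.append([script, [ch]])
    return [(script, "".join(chars)) for script, chars in runs]


# Latin letters, a space, four Hebrew letters, a space, and digits.
runs = segment_runs("abc \u05E9\u05DC\u05D5\u05DD 123")
]]></programlisting>
    <para>
      In this sketch the trailing space and digits stay attached to the
      Hebrew run because neutral characters never start a new run. A
      production itemizer would resolve such characters using the
      Unicode Script property and the bidirectional algorithm instead.
    </para>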
  </section>

  <section id="opentype-shaping-models">
    <title>OpenType shaping models</title>
    <para>
      OpenType provides the following shaping models:
    </para>

    <itemizedlist>
      <listitem>
        <para>
          The <emphasis>default</emphasis> shaping model handles all
          scripts with no script-specific shaping model, and may also
          be used as a fallback for handling unrecognized scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Indic</emphasis> shaping model handles the Indic
          scripts Bengali, Devanagari, Gujarati, Gurmukhi, Kannada,
          Malayalam, Oriya, Tamil, and Telugu.
        </para>
        <para>
          The Indic shaping model was revised significantly in
          2005. To denote the change, a new set of <emphasis>script
          tags</emphasis> was assigned for Bengali, Devanagari,
          Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and
          Telugu. For the sake of clarity, the term "Indic2" is
          sometimes used to refer to the current, revised shaping
          model.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Arabic</emphasis> shaping model supports
          Arabic, Mongolian, N'Ko, Syriac, and several other connected
          or cursive scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Thai/Lao</emphasis> shaping model supports
          the Thai and Lao scripts.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Khmer</emphasis> shaping model supports the
          Khmer script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Myanmar</emphasis> shaping model supports the
          Myanmar (or Burmese) script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Tibetan</emphasis> shaping model supports the
          Tibetan script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hangul</emphasis> shaping model supports the
          Hangul script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Hebrew</emphasis> shaping model supports the
          Hebrew script.
        </para>
      </listitem>

      <listitem>
        <para>
          The <emphasis>Universal Shaping Engine</emphasis> (USE)
          shaping model supports scripts not covered by one of the
          script-specific shaping models above, including
          Javanese, Balinese, Buginese, Batak, Chakma, Lepcha, Modi,
          Phags-pa, Tagalog, Siddham, Sundanese, Tai Le, Tai Tham, Tai
          Viet, and many others.
        </para>
      </listitem>

      <listitem>
        <para>
          Text runs that do not fall under one of the above shaping
          models may still require processing by a shaping engine. Of
          particular note is <emphasis>Emoji</emphasis> shaping, which
          may involve variation-selector sequences and glyph
          substitution. Emoji shaping is handled by the default
          shaping model.
        </para>
      </listitem>

    </itemizedlist>
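    <para>
      The list above amounts to a lookup from script to shaping model,
      with the default model as the fallback. The Python sketch below
      encodes only the scripts named in this section, so it is
      incomplete by construction. The model labels are plain strings
      chosen for this example and are not HarfBuzz identifiers.
    </para>
    <programlisting language="python"><![CDATA[# Scripts named in this section, grouped by shaping model.
# The lists are partial: the Arabic and USE models cover more scripts
# than the ones mentioned in the text.
SCRIPTS_BY_MODEL = {
    "indic": ["Bengali", "Devanagari", "Gujarati", "Gurmukhi", "Kannada",
              "Malayalam", "Oriya", "Tamil", "Telugu"],
    "arabic": ["Arabic", "Mongolian", "N'Ko", "Syriac"],
    "thai/lao": ["Thai", "Lao"],
    "khmer": ["Khmer"],
    "myanmar": ["Myanmar"],
    "tibetan": ["Tibetan"],
    "hangul": ["Hangul"],
    "hebrew": ["Hebrew"],
    "use": ["Javanese", "Balinese", "Buginese", "Batak", "Chakma",
            "Lepcha", "Modi", "Phags-pa", "Tagalog", "Siddham",
            "Sundanese", "Tai Le", "Tai Tham", "Tai Viet"],
}

# Invert the table so that each script maps to its model.
MODEL_FOR_SCRIPT = {
    script: model
    for model, scripts in SCRIPTS_BY_MODEL.items()
    for script in scripts
}


def shaping_model(script):
    """Return the model label for a script, falling back to "default"."""
    return MODEL_FOR_SCRIPT.get(script, "default")


for name in ["Devanagari", "Syriac", "Javanese", "Latin"]:
    print(name, "->", shaping_model(name))
]]></programlisting>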

  </section>

  <section id="graphite-shaping">
    <title>Graphite shaping</title>
    <para>
      In contrast to OpenType shaping, Graphite shaping does not
      specify a predefined set of shaping models or a set of supported
      scripts.
    </para>
    <para>
      Instead, each Graphite font contains a complete set of rules that
      implement the required shaping model for the intended
      script. These rules include finite-state machines to match
      sequences of codepoints to the shaping operations to perform.
    </para>
    <para>
      Graphite shaping can perform the same shaping operations used in
      OpenType shaping, as well as other functions that have not been
      defined for OpenType shaping.
    </para>
  </section>

  <section id="aat-shaping">
    <title>AAT shaping</title>
    <para>
      In contrast to OpenType shaping, AAT shaping does not specify a
      predefined set of shaping models or a set of supported scripts.
    </para>
    <para>
      Instead, each AAT font includes a complete set of rules that
      implement the desired shaping model for the intended
      script. These rules include finite-state machines that match
      glyph sequences to the shaping operations to perform.
    </para>
    <para>
      Notably, AAT shaping rules are expressed for glyphs in the font,
      not for Unicode codepoints. AAT shaping can perform the same
      shaping operations used in OpenType shaping, as well as other
      functions that have not been defined for OpenType shaping.
    </para>
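    <para>
      The idea of a finite-state machine that matches glyph sequences
      and triggers an operation can be sketched as follows. This Python
      fragment is a conceptual toy, not the binary state-table format
      that AAT fonts actually store. The states, the transition table,
      and the glyph names (<literal>f</literal>, <literal>i</literal>,
      <literal>f_i</literal>) are all hypothetical.
    </para>
    <programlisting language="python"><![CDATA[# (current state, input glyph) -> (next state, action)
# Any pair not listed falls back to ("start", "emit").
TRANSITIONS = {
    ("start", "f"): ("saw_f", "emit"),
    ("saw_f", "f"): ("saw_f", "emit"),
    ("saw_f", "i"): ("start", "ligate"),
}


def run_machine(glyphs):
    """Run the toy state machine over a list of glyph names.

    "emit" copies the glyph to the output. "ligate" replaces the
    pending "f" at the end of the output with the ligature "f_i".
    """
    state = "start"
    out = []
    for glyph in glyphs:
        state, action = TRANSITIONS.get((state, glyph), ("start", "emit"))
        if action == "ligate":
            # Only reachable from "saw_f", so out[-1] is the pending "f".
            out[-1] = "f_i"
        else:
            out.append(glyph)
    return out


shaped = run_machine(["o", "f", "f", "i", "c", "e"])
]]></programlisting>
    <para>
      Note that the machine operates on glyph names, not on codepoints,
      which mirrors the way AAT rules are expressed.
    </para>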
  </section>
</chapter>