<?xml version="1.0"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
               "http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd" [
  <!ENTITY % local.common.attrib "xmlns:xi CDATA #FIXED 'http://www.w3.org/2003/XInclude'">
  <!ENTITY version SYSTEM "version.xml">
]>
<chapter id="clusters">
  <title>Clusters</title>
  <section id="clusters-and-shaping">
    <title>Clusters and shaping</title>
    <para>
      In text shaping, a <emphasis>cluster</emphasis> is a sequence of
      characters that needs to be treated as a single, indivisible
      unit. A single letter or symbol can be a cluster of its
      own. Other clusters correspond to longer subsequences of the
      input code points (such as a ligature or conjunct form)
      and require the shaper to ensure that the cluster is not
      broken during the shaping process.
    </para>
    <para>
      A cluster is distinct from a <emphasis>grapheme</emphasis>,
      which is the smallest unit of meaning in a writing system or
      script.
    </para>
    <para>
      The definitions of the two terms are similar. However, clusters
      are only relevant for script shaping and glyph layout. In
      contrast, graphemes are a property of the underlying script, and
      are of interest when client programs implement orthographic
      or linguistic functionality.
    </para>
    <para>
      For example, two individual letters are often two separate
      graphemes. When two letters form a ligature, however, they
      combine into a single glyph. They are then part of the same
      cluster and are treated as a unit by the shaping engine,
      even though the two original, underlying letters remain separate
      graphemes.
    </para>
    <para>
      HarfBuzz is concerned with clusters, <emphasis>not</emphasis>
      with graphemes, although client programs using HarfBuzz
      may still care about graphemes for other reasons from time to time.
    </para>
    <para>
      During the shaping process, there are several shaping operations
      that may merge adjacent characters (for example, when two code
      points form a ligature or a conjunct form and are replaced by a
      single glyph) or split one character into several (for example,
      when decomposing a code point through the
      <literal>ccmp</literal> feature). Operations like these alter
      clusters; HarfBuzz tracks the changes to ensure that no clusters
      get lost or broken during shaping.
    </para>
    <para>
      HarfBuzz records cluster information independently from how
      shaping operations affect the individual glyphs returned in an
      output buffer. Consequently, a client program using HarfBuzz can
      utilize the cluster information to implement features such as:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          Correctly positioning the cursor within a shaped text run,
          even when characters have formed ligatures, composed or
          decomposed, reordered, or undergone other shaping operations.
        </para>
      </listitem>
      <listitem>
        <para>
          Correctly highlighting a text selection that includes some,
          but not all, of the characters in a word.
        </para>
      </listitem>
      <listitem>
        <para>
          Applying text attributes (such as color or underlining) to
          part, but not all, of a word.
        </para>
      </listitem>
      <listitem>
        <para>
          Generating output document formats (such as PDF) with
          embedded text that can be fully extracted.
        </para>
      </listitem>
      <listitem>
        <para>
          Determining the mapping between input characters and output
          glyphs, such as which glyphs are ligatures.
        </para>
      </listitem>
      <listitem>
        <para>
          Performing line-breaking, justification, and other
          line-level or paragraph-level operations that must be done
          after shaping is complete, but which require examining
          character-level properties.
        </para>
      </listitem>
    </itemizedlist>
  </section>
  <section id="working-with-harfbuzz-clusters">
    <title>Working with HarfBuzz clusters</title>
    <para>
      When you add text to a HarfBuzz buffer, each code point must be
      assigned a <emphasis>cluster value</emphasis>.
    </para>
    <para>
      This cluster value is an arbitrary number; HarfBuzz uses it only
      to distinguish between clusters. Many client programs will use
      the index of each code point in the input text stream as the
      cluster value. This is for the sake of convenience; the actual
      value does not matter.
    </para>
    <para>
      Some of the shaping operations performed by HarfBuzz
      (such as reordering, composition, decomposition, and substitution)
      may alter the cluster values of some characters. The
      final cluster values in the buffer at the end of the shaping
      process will indicate to client programs which subsequences of
      glyphs represent a cluster and, therefore, must not be
      separated.
    </para>
    <para>
      In addition, client programs can query the final cluster values
      to discern other potentially important information about the
      glyphs in the output buffer (such as whether or not a ligature
      was formed).
    </para>
    <para>
      For example, if the initial sequence of cluster values was:
    </para>
    <programlisting>
      0,1,2,3,4
    </programlisting>
    <para>
      and the final sequence of cluster values is:
    </para>
    <programlisting>
      0,0,3,3
    </programlisting>
    <para>
      then there are two clusters in the output buffer: the first
      cluster includes the first two glyphs, and the second cluster
      includes the third and fourth glyphs. It is also evident that a
      ligature or conjunct has been formed, because there are fewer
      glyphs in the output buffer (four) than there were code points
      in the input buffer (five).
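    </para>
    <para>
      The grouping described above can be sketched in a few lines of
      Python. This is only an illustrative model and not part of the
      HarfBuzz API: the function name
      <literal>group_glyphs_by_cluster</literal> is hypothetical, and
      the input is assumed to be the list of final cluster values
      that a client has already read from a shaped buffer.
    </para>
    <programlisting><![CDATA[
```python
from itertools import groupby


def group_glyphs_by_cluster(final_clusters):
    """Group adjacent glyph indices that share the same cluster value.

    Returns a list of (cluster_value, [glyph_index, ...]) tuples in
    buffer order.
    """
    groups = []
    indexed = enumerate(final_clusters)
    for value, run in groupby(indexed, key=lambda pair: pair[1]):
        groups.append((value, [glyph_index for glyph_index, _ in run]))
    return groups


# Final cluster values from the example in the text.
final = [0, 0, 3, 3]
for value, glyphs in group_glyphs_by_cluster(final):
    print("cluster", value, "-> glyphs", glyphs)
```
]]></programlisting>
    <para>
      For the final cluster values <literal>0,0,3,3</literal>, the
      sketch reports two groups: glyphs 0 and 1 in cluster 0, and
      glyphs 2 and 3 in cluster 3.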
    </para>
    <para>
      Although client programs using HarfBuzz are free to assign
      initial cluster values in any manner they choose to, HarfBuzz
      does offer some useful guarantees if the cluster values are
      assigned in a monotonic (either non-decreasing or non-increasing)
      order.
    </para>
    <para>
      For buffers in the left-to-right (LTR)
      or top-to-bottom (TTB) text flow direction,
      HarfBuzz will preserve the monotonic property: client programs
      are guaranteed that monotonically increasing initial cluster
      values will be returned as monotonically increasing final
      cluster values.
    </para>
    <para>
      For buffers in the right-to-left (RTL)
      or bottom-to-top (BTT) text flow direction,
      the directionality of the buffer itself is reversed for final
      output as a matter of design. Therefore, HarfBuzz inverts the
      monotonic property: client programs are guaranteed that
      monotonically increasing initial cluster values will be
      returned as monotonically <emphasis>decreasing</emphasis> final
      cluster values.
    </para>
    <para>
      Client programs can adjust how HarfBuzz handles clusters during
      shaping by setting the
      <literal>cluster_level</literal> of the
      buffer. HarfBuzz offers three <emphasis>levels</emphasis> of
      clustering support for this property:
    </para>
    <itemizedlist>
      <listitem>
        <para><emphasis>Level 0</emphasis> is the default.
        </para>
        <para>
          The distinguishing feature of level 0 behavior is that, at
          the beginning of processing the buffer, all code points that
          are categorized as <emphasis>marks</emphasis>,
          <emphasis>modifier symbols</emphasis>, or
          <emphasis>Emoji extended pictographic</emphasis> modifiers,
          as well as the <emphasis>Zero Width Joiner</emphasis> and
          <emphasis>Zero Width Non-Joiner</emphasis> code points, are
          assigned the cluster value of the closest preceding code
          point from a <emphasis>different</emphasis> category.
        </para>
        <para>
          In essence, whenever a base character is followed by a mark
          character or a sequence of mark characters, those marks are
          reassigned to the same initial cluster value as the base
          character. This reassignment is referred to as
          "merging" the affected clusters. This behavior is based on
          the Grapheme Cluster Boundary specification in <ulink
          url="https://www.unicode.org/reports/tr29/#Regex_Definitions">Unicode
          Standard Annex #29</ulink>.
        </para>
        <para>
          This cluster level is suitable for code that also uses
          HarfBuzz cluster values as an approximation of the Unicode
          Grapheme Cluster Boundaries.
        </para>
        <para>
          Client programs can specify level 0 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_GRAPHEMES</literal>.
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis>Level 1</emphasis> tweaks the old behavior
          slightly to produce better results. Therefore, level 1
          clustering is recommended for code that is not required to
          implement backward compatibility with the old HarfBuzz.
        </para>
        <para>
          <emphasis>Level 1</emphasis> differs from level 0 by not merging the
          clusters of marks and other modifier code points with the
          preceding "base" code point's cluster.
          By preserving the
          separate cluster values of these marks and modifier code
          points, script shapers can perform additional operations
          that might lead to improved results (for example, coloring
          mark glyphs differently than their base).
        </para>
        <para>
          Client programs can specify level 1 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_MONOTONE_CHARACTERS</literal>.
        </para>
      </listitem>
      <listitem>
        <para>
          <emphasis>Level 2</emphasis> differs significantly in how it
          treats cluster values. In level 2, HarfBuzz never merges
          clusters.
        </para>
        <para>
          This difference can be seen most clearly when HarfBuzz processes
          ligature substitutions and glyph decompositions. In level 0
          and level 1, ligatures and glyph decomposition both involve
          merging clusters; in level 2, neither of these operations
          triggers a merge.
        </para>
        <para>
          Client programs can specify level 2 behavior for a buffer by
          setting its <literal>cluster_level</literal> to
          <literal>HB_BUFFER_CLUSTER_LEVEL_CHARACTERS</literal>.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      As mentioned earlier, client programs using HarfBuzz often
      assign initial cluster values in a buffer by reusing the indices
      of the code points in the input text. This gives a sequence of
      cluster values that is monotonically increasing (for example,
      0,1,2,3,4).
    </para>
    <para>
      It is not <emphasis>required</emphasis> that the cluster values
      in a buffer be monotonically increasing. However, if the initial
      cluster values in a buffer are monotonic and the buffer is
      configured to use cluster level 0 or 1, then HarfBuzz
      guarantees that the final cluster values in the shaped buffer
      will also be monotonic. No such guarantee is made for cluster
      level 2.
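    </para>
    <para>
      A client that depends on this guarantee could check it with a
      small helper such as the following Python sketch. The helper
      name <literal>is_monotonic</literal> is hypothetical and not
      part of HarfBuzz; it assumes the cluster values have already
      been copied out of the buffer into a plain list.
    </para>
    <programlisting><![CDATA[
```python
def is_monotonic(cluster_values):
    """Return True if the values never decrease or never increase."""
    pairs = list(zip(cluster_values, cluster_values[1:]))
    non_decreasing = all(a <= b for a, b in pairs)
    non_increasing = all(a >= b for a, b in pairs)
    return non_decreasing or non_increasing


print(is_monotonic([0, 0, 3, 3]))     # True: non-decreasing, as in LTR output
print(is_monotonic([4, 3, 3, 0]))     # True: non-increasing, as in RTL output
print(is_monotonic([0, 3, 1, 2, 4]))  # False: neither direction holds
```
]]></programlisting>
    <para>
      A sequence such as <literal>0,3,1,2,4</literal> fails this
      check; under the guarantee above it could only come from a
      buffer shaped at cluster level 2.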
    </para>
    <para>
      In levels 0 and 1, HarfBuzz implements the following conceptual
      model for cluster values:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          If the sequence of input cluster values is monotonic, the
          sequence of cluster values will remain monotonic.
        </para>
      </listitem>
      <listitem>
        <para>
          Each cluster value represents a single cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          Each cluster contains one or more glyphs and one or more
          characters.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      In practice, this model offers several benefits. Assuming that
      the initial cluster values were monotonically increasing
      and distinct before shaping began, then, in the final output:
    </para>
    <itemizedlist spacing="compact">
      <listitem>
        <para>
          All adjacent glyphs having the same final cluster
          value belong to the same cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          Each character belongs to the cluster that has the highest
          cluster value <emphasis>not larger than</emphasis> its
          initial cluster value.
        </para>
      </listitem>
    </itemizedlist>
  </section>

  <section id="a-clustering-example-for-levels-0-and-1">
    <title>A clustering example for levels 0 and 1</title>
    <para>
      The basic shaping operations affect clusters in a predictable
      manner when using level 0 or level 1:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          When two or more clusters <emphasis>merge</emphasis>, the
          resulting merged cluster takes as its cluster value the
          <emphasis>minimum</emphasis> of the incoming cluster values.
        </para>
      </listitem>
      <listitem>
        <para>
          When a cluster <emphasis>decomposes</emphasis>, all of the
          resulting child clusters inherit as their cluster value the
          cluster value of the parent cluster.
        </para>
      </listitem>
      <listitem>
        <para>
          When a character is <emphasis>reordered</emphasis>, the
          reordered character and all clusters that the character
          moves past as part of the reordering are merged into one cluster.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      The functionality, guarantees, and benefits of level 0 and level
      1 behavior can be seen with some examples. First, let us examine
      what happens with cluster values when shaping involves cluster
      merging with ligatures and decomposition.
    </para>

    <para>
      Let's say we start with the following character sequence (top row) and
      initial cluster values (bottom row):
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      During shaping, HarfBuzz maps these characters to glyphs from
      the font. For simplicity, let us assume that each character maps
      to the corresponding, identical-looking glyph:
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      Now if, for example, <literal>B</literal> and <literal>C</literal>
      form a ligature, then the clusters to which they belong
      "merge". This merged cluster takes for its cluster
      value the minimum of all the cluster values of the clusters that
      went into the ligature. In this case, we get:
    </para>
    <programlisting>
      A,BC,D,E
      0,1 ,3,4
    </programlisting>
    <para>
      because 1 is the minimum of the set {1,2}, which were the
      cluster values of <literal>B</literal> and
      <literal>C</literal>.
    </para>
    <para>
      Next, let us say that the <literal>BC</literal> ligature glyph
      decomposes into three components, and <literal>D</literal> also
      decomposes into two components.
      Whenever a cluster decomposes,
      its components each inherit the cluster value of their parent:
    </para>
    <programlisting>
      A,BC0,BC1,BC2,D0,D1,E
      0,1  ,1  ,1  ,3 ,3 ,4
    </programlisting>
    <para>
      Next, if <literal>BC2</literal> and <literal>D0</literal> form a
      ligature, then their clusters (cluster values 1 and 3) merge into
      <literal>min(1,3) = 1</literal>:
    </para>
    <programlisting>
      A,BC0,BC1,BC2D0,D1,E
      0,1  ,1  ,1    ,1 ,4
    </programlisting>
    <para>
      Note that the entirety of cluster 3 merges into cluster 1, not
      just the <literal>D0</literal> glyph. This reflects the fact
      that the cluster <emphasis>must</emphasis> be treated as an
      indivisible unit.
    </para>
    <para>
      At this point, cluster 1 means: the character sequence
      <literal>BCD</literal> is represented by glyphs
      <literal>BC0,BC1,BC2D0,D1</literal> and cannot be broken down any
      further.
    </para>
  </section>
  <section id="reordering-in-levels-0-and-1">
    <title>Reordering in levels 0 and 1</title>
    <para>
      Another common operation in some shapers is glyph
      reordering. In order to maintain a monotonic cluster sequence
      when glyph reordering takes place, HarfBuzz merges the clusters
      of everything in the reordering sequence.
    </para>
    <para>
      For example, let us again start with the character sequence (top
      row) and initial cluster values (bottom row):
    </para>
    <programlisting>
      A,B,C,D,E
      0,1,2,3,4
    </programlisting>
    <para>
      If <literal>D</literal> is reordered to the position immediately
      before <literal>B</literal>, then HarfBuzz merges the
      <literal>B</literal>, <literal>C</literal>, and
      <literal>D</literal> clusters, that is, all the clusters between
      the final position of the reordered glyph and its original
      position.
      This means that we get:
    </para>
    <programlisting>
      A,D,B,C,E
      0,1,1,1,4
    </programlisting>
    <para>
      as the final cluster sequence.
    </para>
    <para>
      Merging this many clusters is not ideal, but it is the only
      sensible way for HarfBuzz to maintain the guarantee that the
      sequence of cluster values remains monotonic and to retain the
      true relationship between glyphs and characters.
    </para>
  </section>
  <section id="the-distinction-between-levels-0-and-1">
    <title>The distinction between levels 0 and 1</title>
    <para>
      The preceding examples demonstrate the main effects of using
      cluster levels 0 and 1. The only difference between the two
      levels is this: in level 0, at the very beginning of the shaping
      process, HarfBuzz merges the cluster of each base character
      with the clusters of all Unicode marks (combining or not) and
      modifiers that follow it.
    </para>
    <para>
      For example, let us start with the following character sequence
      (top row) and accompanying initial cluster values (bottom row):
    </para>
    <programlisting>
      A,acute,B
      0,1    ,2
    </programlisting>
    <para>
      The <literal>acute</literal> is a Unicode mark. If HarfBuzz is
      using cluster level 0 on this sequence, then the
      <literal>A</literal> and <literal>acute</literal> clusters will
      merge, and the result will become:
    </para>
    <programlisting>
      A,acute,B
      0,0    ,2
    </programlisting>
    <para>
      This merger is performed before any other script-shaping
      steps.
    </para>
    <para>
      This initial cluster merging is the default behavior of the
      Windows shaping engine, and the old HarfBuzz codebase copied
      that behavior to maintain compatibility. Consequently, it has
      remained the default behavior in the new HarfBuzz codebase.
    </para>
    <para>
      But this initial cluster-merging behavior makes it impossible
      for client programs to implement some features (such as
      coloring diacritic marks differently from their base
      characters). That is why, in level 1, HarfBuzz does not perform
      the initial merging step.
    </para>
    <para>
      For client programs that rely on HarfBuzz cluster values to
      perform cursor positioning, level 0 is more convenient. But
      relying on cluster boundaries for cursor positioning is wrong: cursor
      positions should be determined based on Unicode grapheme
      boundaries, not on shaping-cluster boundaries. As such, using
      level 1 clustering behavior is recommended.
    </para>
    <para>
      One final facet of levels 0 and 1 is worth noting. HarfBuzz
      currently does not allow any
      <emphasis>multiple-substitution</emphasis> GSUB lookups to
      replace a glyph with zero glyphs (in other words, to delete a
      glyph).
    </para>
    <para>
      But, in some other situations, glyphs can be deleted. In
      those cases, if the glyph being deleted is the last glyph of its
      cluster, HarfBuzz makes sure to merge the deleted glyph's
      cluster with a neighboring cluster.
    </para>
    <para>
      This is done primarily to make sure that the first cluster of a
      run always has a cluster value pointing to the start of the text
      for the run; more than one client program currently relies on this
      guarantee.
    </para>
    <para>
      Incidentally, Apple's CoreText does something different to
      maintain the same promise: it inserts a glyph with id 65535 at
      the beginning of the glyph string if the glyph corresponding to
      the first character in the run was deleted. HarfBuzz might do
      something similar in the future.
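    </para>
    <para>
      To make the initial level 0 merging step described at the start
      of this section concrete, the following Python sketch models it
      outside of HarfBuzz. It is a simplification under stated
      assumptions: the function name
      <literal>merge_marks_into_bases</literal> is hypothetical, only
      code points in a Unicode mark category (plus the Zero Width
      Joiner and Zero Width Non-Joiner) are treated as mergeable, and
      modifier symbols and Emoji modifiers are ignored.
    </para>
    <programlisting><![CDATA[
```python
import unicodedata

ZWNJ = "\u200c"  # Zero Width Non-Joiner
ZWJ = "\u200d"   # Zero Width Joiner


def merge_marks_into_bases(text, clusters):
    """Model the level 0 pre-shaping merge on a list of cluster values.

    Each mark (general category Mn, Mc, or Me), ZWJ, or ZWNJ takes the
    cluster value of the code point before it, so a run of marks ends
    up sharing the cluster value of its base character.
    """
    merged = list(clusters)
    for i in range(1, len(text)):
        ch = text[i]
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark or ch in (ZWNJ, ZWJ):
            merged[i] = merged[i - 1]
    return merged


# "A", COMBINING ACUTE ACCENT, "B" with initial cluster values 0,1,2.
print(merge_marks_into_bases("A\u0301B", [0, 1, 2]))
```
]]></programlisting>
    <para>
      Applied to the <literal>A,acute,B</literal> example, the sketch
      turns the initial cluster values <literal>0,1,2</literal> into
      <literal>0,0,2</literal>; at level 1, the values would be left
      as they were.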
    </para>
  </section>
  <section id="level-2">
    <title>Level 2</title>
    <para>
      HarfBuzz's level 2 cluster behavior uses a significantly
      different model than that of level 0 and level 1.
    </para>
    <para>
      The level 2 behavior is easy to describe, but it may be
      difficult to understand in practical terms. In brief, level 2
      performs no merging of clusters whatsoever.
    </para>
    <para>
      This means that there is no initial base-and-mark merging step
      (as is done in level 0), and it means that reordering moves and
      ligature substitutions do not trigger a cluster merge.
    </para>
    <para>
      Only one shaping operation directly affects clusters when using
      level 2:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          When a cluster <emphasis>decomposes</emphasis>, all of the
          resulting child clusters inherit as their cluster value the
          cluster value of the parent cluster.
        </para>
      </listitem>
    </itemizedlist>
    <para>
      When glyphs do form a ligature (or when some other feature
      substitutes multiple glyphs with one glyph), the cluster value
      of the first glyph is retained as the cluster value for the
      resulting ligature.
    </para>
    <para>
      This occurrence sounds similar to a cluster merge, but it is
      different. In particular, no subsequent characters
      (including marks and modifiers) are affected. They retain
      their previous cluster values.
    </para>
    <para>
      Level 2 cluster behavior is ultimately less complex than level 0
      or level 1, but there are several cases for which processing
      cluster values produced at level 2 may be tricky.
    </para>
    <section id="ligatures-with-combining-marks-in-level-2">
      <title>Ligatures with combining marks in level 2</title>
      <para>
        The first example of how HarfBuzz's level 2 cluster behavior
        can be tricky is when the text to be shaped includes combining
        marks attached to ligatures.
      </para>
      <para>
        Let us start with an input sequence with the following
        characters (top row) and initial cluster values (bottom row):
      </para>
      <programlisting>
        A,acute,B,breve,C,circumflex
        0,1    ,2,3    ,4,5
      </programlisting>
      <para>
        If the sequence <literal>A,B,C</literal> forms a ligature,
        then these are the cluster values HarfBuzz will return under
        the various cluster levels:
      </para>
      <para>
        Level 0:
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
        0  ,0    ,0    ,0
      </programlisting>
      <para>
        Level 1:
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
        0  ,0    ,0    ,5
      </programlisting>
      <para>
        Level 2:
      </para>
      <programlisting>
        ABC,acute,breve,circumflex
        0  ,1    ,3    ,5
      </programlisting>
      <para>
        Making sense of the level 2 result is the hardest for a client
        program, because there is nothing in the cluster values that
        indicates that <literal>B</literal> and <literal>C</literal>
        formed a ligature with <literal>A</literal>.
      </para>
      <para>
        In contrast, the "merged" cluster values of the mark glyphs
        that are seen in the level 0 and level 1 output are evidence
        that a ligature substitution took place.
      </para>
    </section>
    <section id="reordering-in-level-2">
      <title>Reordering in level 2</title>
      <para>
        Another example of how HarfBuzz's level 2 cluster behavior
        can be tricky is when glyphs reorder. Consider an input sequence
        with the following characters (top row) and initial cluster
        values (bottom row):
      </para>
      <programlisting>
        A,B,C,D,E
        0,1,2,3,4
      </programlisting>
      <para>
        Now imagine <literal>D</literal> moves before
        <literal>B</literal> in a reordering operation.
        The cluster
        values will then be:
      </para>
      <programlisting>
        A,D,B,C,E
        0,3,1,2,4
      </programlisting>
      <para>
        Next, if <literal>D</literal> forms a ligature with
        <literal>B</literal>, the output is:
      </para>
      <programlisting>
        A,DB,C,E
        0,3 ,2,4
      </programlisting>
      <para>
        However, in a different scenario, in which the shaping rules
        of the script instead caused <literal>A</literal> and
        <literal>B</literal> to form a ligature
        <emphasis>before</emphasis> the <literal>D</literal> reordered, the
        result would be:
      </para>
      <programlisting>
        AB,D,C,E
        0 ,3,2,4
      </programlisting>
      <para>
        There is no way for a client program to differentiate between
        these two scenarios based on the cluster values
        alone. Consequently, client programs that use level 2 might
        need to undertake additional work in order to manage cursor
        positioning, text attributes, or other desired features.
      </para>
    </section>
    <section id="other-considerations-in-level-2">
      <title>Other considerations in level 2</title>
      <para>
        There may be other problems encountered with ligatures under
        level 2, such as if the direction of the text is forced to
        the opposite of its natural direction (for example, Arabic text
        that is forced into left-to-right directionality). But,
        generally speaking, these other scenarios are minor corner
        cases that are too obscure for most client programs to need to
        worry about.
      </para>
    </section>
  </section>
</chapter>