<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<meta name="Copyright (C) 2016 and later: Unicode, Inc. and others. License & terms of use: http://www.unicode.org/copyright.html">
<meta name="COPYRIGHT" content="Copyright 2006-2007, International Business Machines Corporation and others. All Rights Reserved.">
<title>New Transliteration Test Files</title>
</head>

<body bgcolor="#FFFFFF">

<h2>New Transliteration Test Files</h2>
<p>The Test_*.html files show the transliteration of characters for given
languages. The sample for each language consists of "What Is Unicode"
in Thai, followed by other available text. The text is broken apart into
sentences for ease of viewing (note: we know of some problems with the sentence
rules for Japanese and Chinese). The left column is the original, and the right
is the romanization. The program also converts back to the original script. If
there is a discrepancy between the source and the reverse transformation, that
is indicated by making the background <font color="#FF0000"><b>red</b></font>
from that point on.</p>
<blockquote>
  <p><i><b>Note: </b>If you have some more text that you would like added to the
  sample, just let me know. I am particularly interested in name lists, since
  they are the typical source.</i></p>
</blockquote>
<h3>Standards</h3>
<p>The goal is to follow a given standard, such as ISO* or UNGEGN wherever
possible. We also need to round-trip, so in some cases, that means adding some
additional accent marks to disambiguate characters. And often the source
standards are missing some characters, such as characters with combining Hamzas
in Arabic.
Remember that the goal for these is transliteration (unambiguously
representing all the letters in the original), not transcription (representing
the best pronunciation).</p>
<ul>
  <li><b><a href="Test_Thai-Latin.html">Thai</a>:</b> ISO 11940 < <a href="http://homepage.mac.com/sirbinks/pdf/Thai.r2.pdf">http://homepage.mac.com/sirbinks/pdf/Thai.r2.pdf</a> > plus a few items:
    <ul>
      <li>Accents may be added to the Latin for disambiguation.</li>
      <li>In the next release, we'd like to do the UNGEGN version < <a href="http://www.eki.ee/wgrs/rom1_th.pdf">http://www.eki.ee/wgrs/rom1_th.pdf</a> > which is probably more useful (and readable), and more
        closely follows the Thai standard.</li>
      <li>Spaces are provided at word-breaks, using the Thai BreakIterator.</li>
      <li>An inherent vowel (ọ) is added, as in UNGEGN. The dot is for
        disambiguation.
        <ul>
          <li><i>Note: if the inherent vowel positions cannot be algorithmically
            determined, let me know and I will remove them.</i></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><b><a href="Test_Arabic-Latin.html">Arabic</a>: </b>Generally follows
    UNGEGN < <a href="http://www.eki.ee/wgrs/rom1_ar.pdf">http://www.eki.ee/wgrs/rom1_ar.pdf</a> >
    <ul>
      <li>Accents may be added to the Latin for disambiguation.</li>
      <li>Occasionally deviates in the direction of ISO 233 < <a href="http://homepage.mac.com/sirbinks/pdf/Arabic.pdf">http://homepage.mac.com/sirbinks/pdf/Arabic.pdf</a> >
        <ul>
          <li>with underdot instead of cedilla for letters like SAD, since those
            are explicitly in Unicode for transliteration of Arabic</li>
          <li>adding extra non-Arabic-language letters, like PEH. Note: not all
            extended Arabic characters are handled yet.</li>
        </ul>
      </li>
      <li>Does <i>not</i> do assimilation of "al", nor hyphenation of
        it.
        <ul>
          <li>While it could be done, we need to determine whether a prefix
            "al" could occur other than as the definite article (since
            no space is used).</li>
        </ul>
      </li>
      <li>This is transliteration. For <i>transcription</i> one would want an
        engine that added points appropriately to the Arabic.</li>
    </ul>
  </li>
  <li><b><a href="Test_Hebrew-Latin.html">Hebrew</a></b><b>: </b>Generally
    follows UNGEGN < <a href="http://www.eki.ee/wgrs/rom1_he.pdf">http://www.eki.ee/wgrs/rom1_he.pdf</a> >, with some exceptions:
    <ul>
      <li>Accents may be added to the Latin for disambiguation.</li>
      <li>Combinations of dagesh, shin/sin dot that would produce different
        letters are not yet called out.</li>
      <li>Note that the final forms are not preserved. Thus, when going from
        Latin to Hebrew, a character is given final form depending on its
        position.
        <ul>
          <li>E.g. מםמם => mmmm =>
            מממם</li>
        </ul>
      </li>
      <li>This is transliteration. For <i>transcription</i> one would want an
        engine that added points appropriately to the Hebrew.</li>
      <li>See also < <a href="http://homepage.mac.com/sirbinks/pdf/Hebrew.r1.pdf">http://homepage.mac.com/sirbinks/pdf/Hebrew.r1.pdf</a> > for the ISO version. The Chicago Manual of Style has a clear table
        of mappings for the vowel marks.</li>
    </ul>
  </li>
  <li><b><a href="Test_Han-Latin.html">Han</a>:</b> Uses the <a href="http://www.mandarintools.com/cedict.html">CEDICT</a>
    data plus Unicode Unihan <i>kMandarin</i> values for pinyin. Doesn't
    roundtrip!
    <ul>
      <li><i>Note: </i>the Chinese pronunciation of Han characters varies by
        context and grammar, though nowhere near as much as in Japanese.
        <ul>
          <li>Ideally we'd have an underlying engine for this.
In 2.4 we will
            have a plug-in interface so that people can add one, such as the
            IBM engine.</li>
          <li>The data from CEDICT and Unihan don't list the most frequent
            choice first, so we will be updating that.</li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="Test_Greek-Latin_UNGEGN.html"><b>Greek/UNGEGN</b></a>: Uses a
    modern Greek transliteration, based on the UNGEGN rules at < <a href="http://www.eki.ee/wgrs/rom1_el.pdf">http://www.eki.ee/wgrs/rom1_el.pdf</a> >. This version will not roundtrip ancient Greek.</li>
  <li><a href="Test_Greek-Latin.html"><b>Greek</b></a>: Uses a classic Greek
    transliteration. This version will not roundtrip modern Greek.</li>
</ul>
<h3><b>Notes</b></h3>
<ol>
  <li>For readability, the files have a few other things besides just the
    transliteration:
    <ul>
      <li>The first word of each sentence is titlecased, as are names (where we
        have a name-list, such as in Thai).</li>
      <li>The Latin in the original is mapped to the private-use zone before
        conversion, and then back again after conversion. This does have the downside
        that any rules (such as in Han) that need to know the context (e.g. for
        inserting spaces or capitalization) will gum up a little bit. This is
        just an artifact of the test display.</li>
    </ul>
  </li>
  <li>I don't think that ISO 11940 is a particularly good way to romanize, but
    it is at least complete and a standard. So what I am interested in just for
    now is whether the samples in the file follow it (with the above
    exceptions).</li>
  <li>Some of the files also have a set of characters at the end, one character
    per row, with a following row listing the hex and name.</li>
  <li>The source rules for all of these are at the following URL. So if you want
    to know the details of how the characters are handled, that is the place to
    look.
    <ul>
      <li><a href="http://source.icu-project.org/repos/icu/icu/trunk/source/data/translit/">http://source.icu-project.org/repos/icu/icu/trunk/source/data/translit/</a></li>
    </ul>
  </li>
</ol>

</body>

</html>