# String Support for Emboss

GitHub Issue [#28](https://github.com/google/emboss/issues/28)

## Background

It is somewhat common to embed short strings into binary structures; examples
include serial numbers and firmware revisions, although in some cases even
things like IP addresses are encoded as ASCII text embedded in a larger binary
message.

Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
is, arrays of 8-bit uints. This is more-or-less functional, but can be awkward
for things like text format output, and provides no way to add assertions to
string fields.

String support is complicated by the fact that there are several common ways of
delimiting strings:

1.  Length determined by another field -- that is, the size of the string is
    explicit.
2.  The string is *terminated* by a specific byte value, usually `'\0'`. In
    this case, there may be additional "garbage" bytes after the terminator,
    which should not be considered to be part of the string.
3.  The string is *padded* by a specific byte value, usually 32 (`' '`). In
    this case, the "padding" character can usually occur inside the string,
    and only trailing padding characters should be trimmed off.
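
To make the difference between the terminated and padded conventions concrete,
here is a minimal sketch of how each could be read from a fixed-size field. The
function names (`ReadTerminated`, `ReadPadded`, `DelimiterExamples`) are
hypothetical and not part of Emboss, and the sketch assumes C++17 for
`std::string_view`. The length-determined convention needs no helper: the value
is simply the whole field.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Terminated: the value is everything before the first terminator byte; any
// bytes after the terminator are ignored.  If no terminator is present, the
// whole field is returned.
inline std::string_view ReadTerminated(std::string_view field,
                                       char terminator = '\0') {
  const std::size_t end = field.find(terminator);
  return end == std::string_view::npos ? field : field.substr(0, end);
}

// Padded: only trailing padding bytes are trimmed, so padding bytes inside
// the value are preserved.
inline std::string_view ReadPadded(std::string_view field, char padding = ' ') {
  const std::size_t last = field.find_last_not_of(padding);
  return last == std::string_view::npos ? std::string_view()
                                        : field.substr(0, last + 1);
}

// Small self-check showing both conventions on 8-byte fields.
inline void DelimiterExamples() {
  // "AB", a terminator, then leftover garbage bytes.
  const std::string_view terminated("AB\0garbg", 8);
  assert(ReadTerminated(terminated) == "AB");
  // The interior space survives; the trailing spaces do not.
  const std::string_view padded("A B     ", 8);
  assert(ReadPadded(padded) == "A B");
}
```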

For both terminated and padded strings, some formats allow the string to run to
the very end of its field, with no terminator/padding, and some require the
terminator/padding. In general, it seems that terminated strings are more
likely to require the terminator, while padded strings can usually be entered
with no padding.

There are, no doubt, other ways of delimiting strings. These seem to be rare
and sui generis, and can often be handled by modeling them as length-determined
strings, then applying the necessary logic in code.

There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
("Latin-1"), UTF-8, UTF-16, etc. UTF-16 seems to be rare outside of
Windows-based software and Java. Hardware almost always appears to use ASCII
(encoded as one character per byte, with the high bit always clear), although
Java ME-based systems may use UTF-16.


## Proposal

### Bytestrings Only

All strings in Emboss should be considered to be opaque blobs of bytes;
interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.

UTF-16 strings are explicitly not handled by this proposal. In principle, one
could add a "byte width" parameter to the string types, or use a prefix like `W`
to indicate "wide string" types, but it does not seem important for now. This
decision can be revisited later.


### New Built-In Types

Add three new types to the Prelude (names subject to change):

1.  `FixString`, a string whose contents should be the entire field containing
    the `FixString`. When writing to a `FixString`, the value must be exactly
    the same length as the field.

    `CouldWriteValue()` should return `true` for all strings that are exactly
    the correct length.

    `FixString` is very close to a notional `Blob` type or the current
    `UInt:8[]` type, except for differences in text format.

2.  `ZString`, a terminated string. A `ZString` with no arguments uses a null
    byte (`'\0'`) as the terminator. An optional argument can be used to
    specify the terminator -- a `ZString(36)`, for example, would be terminated
    by `$`. When reading, the value returned is all bytes up to, but not
    including, the first terminator byte. When writing, for compatibility, the
    entire field should be written, using the terminator value for padding if
    there is extra space. A second optional parameter can be used to specify
    that the terminator is not required: `ZString(0, false)` can fill the
    underlying field with no terminator.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *contain* any instances of the terminator
    byte.

3.  `PaddedString`, a padded string. A `PaddedString` with no arguments uses
    space (`' '`, 32) as the padding value. An optional argument can be used to
    specify the padding -- a `PaddedString(0)`, for example, would be padded
    with null bytes. When reading, the end of the string is discovered by
    walking *backwards* from the end until a non-padding byte is found, then
    returning all bytes from the start of the string through that byte. When
    writing, any excess bytes will be filled with the padding value.

    Although, technically, "at least one byte of padding" could be enforced by
    making the `PaddedString` one byte shorter and following it with a one-byte
    field whose value *must* be the padding byte, for convenience `PaddedString`
    should take a second optional parameter to specify that the padding *is*
    required: `PaddedString(32, true)` must have at least one space at the end.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *end with* the padding byte.


### String Constants

String constants (used in constructs such as `[requires: this == "abcd"]`) may
take two forms:

1.  `"A quoted string using C-style escapes like \n"`

    The following standard C89 escapes (as interpreted by an ASCII Unix
    compiler) should be allowed:

    *   `\0` => 0
    *   `\a` => 7
    *   `\b` => 8
    *   `\t` => 9
    *   `\n` => 10
    *   `\v` => 11
    *   `\f` => 12
    *   `\r` => 13
    *   `\"` => 34
    *   `\'` => 39
    *   `\?` => 63 (part of the C standard, but rarely used)
    *   `\\` => 92
    *   <code>\x*hh*</code> => 0x*hh*

    In addition, the following non-C-standard escapes should be allowed:

    *   `\e` => 27 (not actually standard, but common)
    *   <code>\d*nnn*</code> => *nnn*
    *   <code>\x{*hh*}</code> => 0x*hh*
    *   <code>\d{*nnn*}</code> => *nnn*

    Note that the standard C escape <code>\\*nnn*</code> is explicitly not
    supported. C treats *nnn* as octal, which is often surprising, and modern
    languages (the cutoff date appears to be about 1993 -- right between Python
    2 and Java) have largely dropped support for the octal escapes.

    Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
    (nearly) universal among popular programming languages. <code>\x*hh*</code>
    is very common, though not universal. <code>\u*nnnn*</code>, where *nnnn*
    is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
    common, but only for text strings.

    To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
    required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
    should be required to have exactly 3 decimal digits. The braced versions --
    <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
    digits, but should be required to evaluate to a value in the range 0 to 255:
    that is, `\d{000000100}` should be allowed, but `\d{256}` should not.

    `\` characters should not be allowed outside of the escape sequences
    specified here.

    For now, only 7-bit ASCII printable characters (byte values 32 through 126)
    should be allowed in `"quoted strings"`, even though `.emb` files generally
    allow UTF-8. This requirement may be relaxed in the future.

2.  A list of bytes in `{}`, where each byte is either a single-quoted character
    (`'a'`) or a numeric constant (e.g., `0x20` or `32`).

    For ease of transition from existing `UInt:8[]` fields, explicit index
    markers (`[8]:`) in the list should be allowed if the index exactly matches
    the current cursor index; this matches output from the current Emboss text
    format for `UInt:8[]`.
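
The escape rules for `"quoted strings"` above can be sketched as a small
decoder. This is an illustration only, not the Emboss front end: the names
`DigitValue`, `ParseNumericEscape`, and `DecodeQuotedBody` are hypothetical,
C++17 is assumed, and the input is assumed to be the string body with the
surrounding quotes already removed. The text above does not say how `\0`
followed by a digit should be handled; this sketch treats it as a null byte
followed by a literal digit.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

namespace {

// Value of a digit in the given base (10 or 16), or -1 if `c` is not one.
int DigitValue(char c, int base) {
  int v = -1;
  if (c >= '0' && c <= '9') v = c - '0';
  else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
  else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
  return v < base ? v : -1;
}

// Parses the digits of a numeric escape starting at text[*pos].  Un-braced
// escapes take exactly `fixed_digits` digits; braced escapes take one or more.
// Returns -1 on a malformed escape or a value outside 0..255.
int ParseNumericEscape(std::string_view text, std::size_t* pos, int base,
                       std::size_t fixed_digits) {
  assert(base == 10 || base == 16);
  const bool braced = *pos < text.size() && text[*pos] == '{';
  if (braced) ++*pos;
  int value = 0;
  std::size_t digits = 0;
  while (*pos < text.size() && (braced || digits < fixed_digits)) {
    const int d = DigitValue(text[*pos], base);
    if (d < 0) break;
    value = value * base + d;
    if (value > 255) return -1;
    ++*pos;
    ++digits;
  }
  if (braced) {
    if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return -1;
    ++*pos;
  } else if (digits != fixed_digits) {
    return -1;
  }
  return value;
}

}  // namespace

// Decodes the body of a "quoted string" into raw bytes.  Returns std::nullopt
// if the body breaks any of the rules listed above.
std::optional<std::string> DecodeQuotedBody(std::string_view text) {
  // Single-character escapes and their byte values, in matching order.
  static constexpr std::string_view kNames = "0abtnvfr\"'?\\e";
  static constexpr char kValues[] = {0,  7,  8,  9,  10, 11, 12,
                                     13, 34, 39, 63, 92, 27};
  std::string out;
  std::size_t pos = 0;
  while (pos < text.size()) {
    const char c = text[pos++];
    if (c != '\\') {
      // Only printable 7-bit ASCII may appear unescaped.
      if (c < 32 || c > 126) return std::nullopt;
      out.push_back(c);
      continue;
    }
    if (pos >= text.size()) return std::nullopt;  // Dangling backslash.
    const char e = text[pos++];
    if (e == 'x' || e == 'd') {
      const int value = e == 'x' ? ParseNumericEscape(text, &pos, 16, 2)
                                 : ParseNumericEscape(text, &pos, 10, 3);
      if (value < 0) return std::nullopt;
      out.push_back(static_cast<char>(value));
      continue;
    }
    const std::size_t index = kNames.find(e);
    if (index == std::string_view::npos) return std::nullopt;  // e.g. \101
    out.push_back(kValues[index]);
  }
  return out;
}
```

Under these assumptions, `\x41\d066\x{043}\d{000000100}` decodes to the four
bytes `ABCd`, while `\d{256}`, `\x4`, and the octal-style `\101` are rejected.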

The existing parameter system will need to be extended to allow default values,
and to allow `external` types to accept parameters if they do not already.


### String Field Methods (C++)

#### C++ String Type Parameterization

All methods that accept or return a string value should be templated on the C++
type to use (`std::string`, `std::string_view`, `char *`, etc.).

For methods that accept a string parameter (`Write`, etc.), the template
argument should be inferred, and they can be called without specifying the type.

For methods that only return a string value (`Read`, etc.), the template
argument would need to be specified: `Read<std::string_view>()`.

`char *` should not be accepted as a return type, due to problems with ensuring
that there is actually a null byte at the end of the string.

As an input type, `char *` is likely to need explicit specialization.

In many (most? all?) cases, methods should have no problem with some types that
are not really "string" types, such as `std::vector<char>`.

String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
`std::basic_string<unsigned char>`) should be explicitly supported.

If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
that it might be easy to hit undefined behavior with something like
`Read<std::string_view>()`, since the iterator type returned by `begin()` and
`end()` would not correctly model `std::contiguous_iterator`. The cautious
approach would be to disable `Read()` and `UncheckedRead()` if the backing
storage is not `ContiguousBuffer`; readout to something like `std::string` could
still be explicitly performed using the `begin()`/`end()` iterators.
Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
explicitly limited to a small set of known-good types, such as `std::string` and
`std::vector<char>`.


#### Methods

`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
as one would expect.

`ToString()` should be an alias for `Read()`, to ease conversion from
`UInt:8[]`.

`CouldWriteValue()` should be defined as specified in the previous section.

`Ok()` should return `true` if the string has storage (though it could be
zero-length storage) and the bytes match the requirements (e.g., if a terminator
or padding byte is required, `Ok()` should only return `true` if such a byte is
present).

`Size()` should return the (logical) length of the string in bytes.
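
As a rough illustration of the `CouldWriteValue()` rules given for the three
proposed types, the checks could be written as free functions like the ones
below. All names here (`FitsField`, `CouldWriteFixString`, `CouldWriteZString`,
`CouldWritePaddedString`) are hypothetical, and C++17 is assumed. One point is
an interpretation rather than something the proposal states outright: when a
terminator or padding byte is required, this sketch reserves one byte of the
field for it, in line with the `MaxSize()` description in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if a value of `value_size` bytes fits in the field, reserving one byte
// when a terminator or padding byte is required (an assumption of this sketch).
inline bool FitsField(std::size_t value_size, std::size_t field_size,
                      bool delimiter_required) {
  return value_size + (delimiter_required ? 1 : 0) <= field_size;
}

// FixString: the value must fill the field exactly.
inline bool CouldWriteFixString(std::string_view value,
                                std::size_t field_size) {
  return value.size() == field_size;
}

// ZString: the value must fit and must not contain the terminator anywhere.
inline bool CouldWriteZString(std::string_view value, std::size_t field_size,
                              char terminator = '\0',
                              bool terminator_required = true) {
  return FitsField(value.size(), field_size, terminator_required) &&
         value.find(terminator) == std::string_view::npos;
}

// PaddedString: the value must fit and must not end with the padding byte,
// since trailing padding would be trimmed off when the value is read back.
inline bool CouldWritePaddedString(std::string_view value,
                                   std::size_t field_size, char padding = ' ',
                                   bool padding_required = false) {
  return FitsField(value.size(), field_size, padding_required) &&
         (value.empty() || value.back() != padding);
}

// Small self-check of the rules above.
inline void CouldWriteExamples() {
  assert(CouldWriteFixString("abcd", 4));
  assert(!CouldWriteZString("abcd", 4));  // No room for the terminator.
  assert(CouldWriteZString("abcd", 4, '\0', false));
  assert(!CouldWritePaddedString("abc ", 8));  // Trailing space would be lost.
}
```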

`MaxSize()` should return `BackingStorage().SizeInBytes()`, or
`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
terminator byte.

`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
C++ container type.

`operator[]` should return the value of a single byte at the specified offset.


#### `emboss::String` Type

(This section should not be considered particularly authoritative; the actual
implementation could differ greatly if another strategy turns out to be
easier or less complex in practice.)

Because values retrieved from the different string types can be used
interchangeably at the expression layer (e.g., `let s = condition ? z_string :
fix_string`), there must be a way for all views over strings to return a common
type. This is complicated by two requirements:

1.  `emboss::String` should not allocate memory.
2.  `emboss::String` needs to handle backing storage that is not
    `ContiguousBuffer`. It also needs to handle constant strings (`let x =
    "string"`), and be able to assign `Storage`-based strings to constant
    strings and vice versa.

To satisfy the first requirement, `emboss::String` will need to hold a reference
to the underlying storage, not actually copy bytes.

One way to satisfy the second requirement would be to simply copy the string's
bytes out to a new buffer, but that conflicts with the first requirement.
Instead, it should be a sum type over a `Storage` type parameter and a constant
string, like:

```c++
template <typename Storage>
class String {
 public:
  String();
  String(const char *data, int size);
  String(Storage);
  // ... operator= ...
  constexpr int size() const;
  constexpr char operator[](int index) const {
    // Alternative 0 is a constant string; alternative 1 is a Storage view.
    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
                                 : backports::Get<1>(storage_).data()[index];
  }
  // ... begin(), end(), etc. ...

 private:
  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
  // requires C++17.
  backports::Variant<const char *, Storage> storage_;
};
```

At least for now, `emboss::String` does not need to be exposed as a documented,
supported API -- user code can use `Read<std::string_view>()` and similar
operations as needed, with full knowledge of the underlying storage type.

Comparisons and assignments between `emboss::String`s with different `Storage`
type parameters do not need to be supported, since they cannot be generated by
the code generator -- C++ codegen would only need those operations for
`emboss::String`s that are derived from the same parent structure.


### Handling in Other Languages

C++ is unusual in that it does not differentiate at a language level between
text strings and byte strings. Most other languages have different types for
byte strings and text strings.

For all languages that differentiate, Emboss strings should be treated as byte
strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).

Other than this caveat, Emboss string support should be straightforward in other
languages.


### Text Format

Text format output should use the `"quoted string"` style. Byte values outside
the range 32 through 126 should be emitted as escapes. Values with standard
shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
should be emitted. It may be desirable to allow some `[text_format]` control
over the output in the future.
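
The output rule above can be sketched as follows. `EscapeForTextFormat` and
`EscapeExample` are hypothetical names, and C++17 is assumed. The sketch makes
three choices the text does not spell out: `"` and `\` are escaped even though
they fall inside the 32 through 126 range (otherwise the output could not be
read back), byte 27 is emitted as `\x1b` because `\e` is not a standard
shorthand, and hex digits are lowercase.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Renders raw bytes as a "quoted string" for text format output.
inline std::string EscapeForTextFormat(std::string_view bytes) {
  static constexpr char kHex[] = "0123456789abcdef";
  std::string out = "\"";
  for (const char c : bytes) {
    const unsigned char b = static_cast<unsigned char>(c);
    switch (b) {
      // Standard shorthand escapes.
      case 0:    out += "\\0";  break;
      case 7:    out += "\\a";  break;
      case 8:    out += "\\b";  break;
      case 9:    out += "\\t";  break;
      case 10:   out += "\\n";  break;
      case 11:   out += "\\v";  break;
      case 12:   out += "\\f";  break;
      case 13:   out += "\\r";  break;
      // Printable, but must be escaped inside a quoted string.
      case '"':  out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      default:
        if (b >= 32 && b <= 126) {
          out.push_back(c);  // Printable ASCII passes through unchanged.
        } else {
          // Everything else becomes a hex escape with exactly two digits.
          out += "\\x";
          out.push_back(kHex[b >> 4]);
          out.push_back(kHex[b & 0x0f]);
        }
    }
  }
  out.push_back('"');
  return out;
}

// Small self-check: 'v', '1', newline, byte 6.
inline void EscapeExample() {
  assert(EscapeForTextFormat(std::string_view("v1\n\x06", 4)) ==
         "\"v1\\n\\x06\"");
}
```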

Text format input should allow both `"quoted string"` and list-of-bytes styles,
with exactly the same rules as string constants in an `.emb` file, except that
bytes > 126 might be allowed in a `"quoted string"`.


### Expressions

#### Type System Changes

In order to facilitate `[requires]` on string types, the new types should have a
new 'string' expression type.


#### Runtime Representation

In this proposal, no string manipulations are allowed, so temporary strings
(which might require memory allocation) will not be necessary.


#### String Attribute Representation

Attribute values are currently represented by a special `AttributeValue` type
which can hold either an `Expression` or a `String`. With a string expression
type, `AttributeValue` can be replaced by a plain `Expression`. This will
require changes to everything that touches `AttributeValue`.

Alternately, `AttributeValue` could be left in the IR with only `Expression`,
in which case only code that touches string attributes (`[byte_order]` and
`[(cpp) namespace]`) needs to change.


#### String Comparisons

Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
since these can be handled by passing references to existing memory.

Equality and inequality (`==` and `!=`) should be defined in the expected way:
two strings are equal iff they are the same length and the corresponding bytes
in each string have the same value, and they are unequal if they are not equal.

For ordering, strings should be compared lexically, using the binary value of
each byte, with no regard for semantic collation. That is, `"Z" < "a"`, since
`'Z'` is 90 and `'a'` is 97.

When one string is a strict prefix of another string, the shorter string should
be "less than" the longer; e.g., `"abc" < "abcdef"`. This is the same as the
natural ordering for zero-terminated strings.


#### Future String Operations

It may be desirable, at some future point, to allow various string
manipulations, such as concatenation or repetition, at least for compile-time
strings.

A substring operation should be possible without requiring memory allocation.

Indexing into a string (`str[offset]`) should be allowed if/when indexing into
an array is finally supported.
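
The ordering described under String Comparisons can be sketched as a three-way
comparison that never copies or allocates. `CompareBytes` and
`CompareExamples` are hypothetical names, and C++17 is assumed. Bytes are
compared as unsigned values so that, for example, byte 0x80 sorts after byte
0x7f regardless of whether `char` is signed on the target platform.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns a negative value if a < b, zero if a == b, and a positive value if
// a > b, comparing byte by byte with no regard for collation.
inline int CompareBytes(std::string_view a, std::string_view b) {
  const std::size_t common = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < common; ++i) {
    const unsigned char x = static_cast<unsigned char>(a[i]);
    const unsigned char y = static_cast<unsigned char>(b[i]);
    if (x != y) return x < y ? -1 : 1;
  }
  // All shared bytes match, so a strict prefix sorts before the longer string.
  if (a.size() == b.size()) return 0;
  return a.size() < b.size() ? -1 : 1;
}

// Small self-check using the examples from the text.
inline void CompareExamples() {
  assert(CompareBytes("Z", "a") < 0);         // 90 < 97
  assert(CompareBytes("abc", "abcdef") < 0);  // Strict prefix is "less than".
  assert(CompareBytes("abc", "abc") == 0);
}
```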


### Arrays of Strings

In some cases, it may be desirable to have an array of strings, like:

```
struct Foo:
  0 [+100]  ZString[10]  list
```

Although somewhat awkward, the existing explicit-length syntax should work:

```
struct Foo:
  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
```