# String Support for Emboss

GitHub Issue [#28](https://github.com/google/emboss/issues/28)

## Background

It is somewhat common to embed short strings into binary structures; examples
include serial numbers and firmware revisions, although in some cases even
things like IP addresses are encoded as ASCII text embedded in a larger binary
message.

Historically, we have modeled such fields in Emboss by using `UInt:8[]`; that
is, arrays of 8-bit uints.  This is more-or-less functional, but can be awkward
for things like text format output, and provides no way to add assertions to
string fields.

String support is complicated by the fact that there are several common ways of
delimiting strings:

1.  Length determined by another field -- that is, the size of the string is
    explicit.
2.  The string is *terminated* by a specific byte value, usually `'\0'`.  In
    this case, there may be additional "garbage" bytes after the terminator,
    which should not be considered to be part of the string.
3.  The string is *padded* by a specific byte value, usually 32 (`' '`).  In
    this case, the "padding" character can usually occur inside the string,
    and only trailing padding characters should be trimmed off.

For both terminated and padded strings, some formats allow the string to run to
the very end of its field, with no terminator/padding, and some require the
terminator/padding.  In general, it seems that terminated strings are more
likely to require the terminator, while padded strings can usually be entered
with no padding.

There are, no doubt, other ways of delimiting strings.  These seem to be rare
and sui generis, and can often be handled by modeling them as length-determined
strings, then applying the necessary logic in code.

There are also multiple *encodings* for strings, such as ASCII, ISO/IEC 8859-1
("Latin-1"), UTF-8, UTF-16, etc.  UTF-16 seems to be rare outside of
Windows-based software and Java.  Hardware almost always appears to use ASCII
(encoded as one character per byte, with the high bit always clear), although
Java ME-based systems may use UTF-16.


## Proposal

### Bytestrings Only

All strings in Emboss should be considered to be opaque blobs of bytes;
interpretation as ASCII, Latin-1, UTF-8, etc. should be left to the application.

UTF-16 strings are explicitly not handled by this proposal.  In principle, one
could add a "byte width" parameter to the string types, or use a prefix like `W`
to indicate "wide string" types, but it does not seem important for now.  This
decision can be revisited later.


### New Built-In Types

Add three new types to the Prelude (names subject to change):

1.  `FixString`, a string whose contents should be the entire field containing
    the `FixString`.  When writing to a `FixString`, the value must be exactly
    the same length as the field.

    `CouldWriteValue()` should return `true` for all strings that are exactly
    the correct length.

    `FixString` is very close to a notional `Blob` type or the current
    `UInt:8[]` type, except for differences in text format.

2.  `ZString`, a terminated string.  A `ZString` with no arguments uses a null
    byte (`'\0'`) as the terminator.  An optional argument can be used to
    specify the terminator -- a `ZString(36)`, for example, would be terminated
    by `$`.  When reading, the value returned is all bytes up to, but not
    including, the first terminator byte.  When writing, for compatibility, the
    entire field should be written, using the terminator value for padding if
    there is extra space.  A second optional parameter can be used to specify
    that the terminator is not required: `ZString(0, false)` can fill the
    underlying field with no terminator.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *contain* any instances of the terminator
    byte.

3.  `PaddedString`, a padded string.  A `PaddedString` with no arguments uses
    space (`' '`, 32) as the padding value.  An optional argument can be used to
    specify the padding -- a `PaddedString(0)`, for example, would be padded
    with null bytes.  When reading, the end of the string is discovered by
    walking *backwards* from the end of the field until a non-padding byte is
    found, then returning all bytes from the start of the field up to and
    including that byte.  When writing, any excess bytes will be filled with
    the padding value.

    Although, technically, "at least one byte of padding" could be enforced by
    making the `PaddedString` one byte shorter and following it with a one-byte
    field whose value *must* be the padding byte, for convenience `PaddedString`
    should take a second optional parameter to specify that padding *is*
    required: `PaddedString(32, true)` must have at least one space at the end.

    `CouldWriteValue()` should return `true` if the value is no longer than the
    field and the value does not *end with* the padding byte.

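The read and `CouldWriteValue()` rules for `ZString` and `PaddedString` can be
sketched in C++ roughly as follows.  This is only an illustration, assuming
C++17 and contiguous field bytes; the helper names are hypothetical and are not
part of the Emboss runtime.  `max_size` stands for whatever `MaxSize()` would
report for the field.

```c++
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration only; not part of the Emboss runtime.

// ZString read: all bytes before the first terminator, or the whole field if
// no terminator byte is present.
inline std::string_view ReadZString(std::string_view field,
                                    char terminator = '\0') {
  const std::size_t end = field.find(terminator);
  return end == std::string_view::npos ? field : field.substr(0, end);
}

// PaddedString read: walk backwards past trailing padding bytes and return
// everything up to and including the last non-padding byte.
inline std::string_view ReadPaddedString(std::string_view field,
                                         char padding = ' ') {
  const std::size_t last = field.find_last_not_of(padding);
  return last == std::string_view::npos ? std::string_view()
                                        : field.substr(0, last + 1);
}

// ZString: the value must fit and must not contain the terminator anywhere.
inline bool CouldWriteZString(std::string_view value, std::size_t max_size,
                              char terminator = '\0') {
  return value.size() <= max_size &&
         value.find(terminator) == std::string_view::npos;
}

// PaddedString: the value must fit and must not end with the padding byte.
inline bool CouldWritePaddedString(std::string_view value,
                                   std::size_t max_size, char padding = ' ') {
  return value.size() <= max_size &&
         (value.empty() || value.back() != padding);
}
```

Note that padding bytes inside a `PaddedString` value are preserved; only the
trailing run is trimmed, so a value that itself ends in the padding byte would
not round-trip.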

### String Constants

String constants (used in constructs such as `[requires: this == "abcd"]`) may
take two forms:

1.  `"A quoted string using C-style escapes like \n"`

    The following standard C89 escapes (as interpreted by an ASCII Unix
    compiler) should be allowed:

    *   `\0` => 0
    *   `\a` => 7
    *   `\b` => 8
    *   `\t` => 9
    *   `\n` => 10
    *   `\v` => 11
    *   `\f` => 12
    *   `\r` => 13
    *   `\"` => 34
    *   `\'` => 39
    *   `\?` => 63 (part of the C standard, but rarely used)
    *   `\\` => 92
    *   <code>\x*hh*</code> => 0x*hh*

    The following non-C-standard escapes should also be allowed:

    *   `\e` => 27 (not actually standard, but common)
    *   <code>\d*nnn*</code> => *nnn*
    *   <code>\x{*hh*}</code> => 0x*hh*
    *   <code>\d{*nnn*}</code> => *nnn*

    Note that the standard C escape <code>\\*nnn*</code> is explicitly not
    supported.  C treats *nnn* as octal, which is often surprising, and modern
    languages (the cutoff date appears to be about 1993 -- right between Python
    2 and Java) have largely dropped support for the octal escapes.

    Based on a brief survey, only `\n`, `\t`, `\"`, `\\`, and `\'` appear to be
    (nearly) universal among popular programming languages.  <code>\x*hh*</code>
    is very common, though not universal.  <code>\u*nnnn*</code>, where *nnnn*
    is a Unicode hex value to be encoded as UTF-8 or UTF-16, also appears to be
    common, but only for text strings.

    To avoid ambiguity, the un-braced <code>\x*hh*</code> escape should be
    required to have exactly 2 hex digits, and the <code>\d*nnn*</code> escape
    should be required to have exactly 3 decimal digits.  The braced versions --
    <code>\x{*hh*}</code> and <code>\d{*nnn*}</code> -- could have any number of
    digits, but should be required to evaluate to a value in the range 0 to 255:
    that is, `\d{000000100}` should be allowed, but `\d{256}` should not.

    `\` characters should not be allowed outside of the escape sequences
    specified here.

    For now, only 7-bit ASCII printable characters (byte values 32 through 126)
    should be allowed in `"quoted strings"`, even though `.emb` files generally
    allow UTF-8.  This requirement may be relaxed in the future.

2.  A list of bytes in `{}`, where each byte is either a single-quoted character
    (`'a'`) or a numeric constant (e.g., `0x20` or `32`).

    For ease of transition from existing `UInt:8[]` fields, explicit index
    markers (`[8]:`) in the list should be allowed if the index exactly matches
    the current cursor index; this matches output from the current Emboss text
    format for `UInt:8[]`.

The existing parameter system will need to be extended to allow default values,
and to allow `external` types to accept parameters if they do not already.

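As an illustration of the escape rules for `"quoted string"` constants, here is
a minimal C++17 sketch of a decoder for the text between the quotes.  The
function names are hypothetical, and two details are assumptions rather than
part of this proposal: un-braced `\d` escapes above 255 are rejected, and hex
digits are accepted in either case.  The real Emboss front end is not written
in C++ and may be structured quite differently.

```c++
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

namespace {

// Single-character escapes and their byte values, in matching order.
constexpr std::string_view kSimpleEscapes = "0abtnvfr\"'?\\e";
constexpr char kSimpleValues[] = {0, 7, 8, 9, 10, 11, 12, 13, 34, 39, 63, 92, 27};

// Returns the value of `c` as a digit in `base` (10 or 16), or -1.
int DigitValue(char c, int base) {
  if (c >= '0' && c <= '9') return c - '0';
  if (base == 16 && c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (base == 16 && c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Parses the digits after `\x` or `\d`, starting at text[*pos].  Un-braced
// escapes need exactly `fixed_digits` digits; braced escapes take any number
// of digits.  Values above 255 are rejected in both forms.
bool ParseNumericEscape(std::string_view text, std::size_t* pos, int base,
                        std::size_t fixed_digits, int* value) {
  const bool braced = *pos < text.size() && text[*pos] == '{';
  if (braced) ++*pos;
  std::size_t digits = 0;
  *value = 0;
  while (*pos < text.size() && (braced || digits < fixed_digits)) {
    const int digit = DigitValue(text[*pos], base);
    if (digit < 0) break;
    *value = *value * base + digit;
    if (*value > 255) return false;
    ++digits;
    ++*pos;
  }
  if (!braced) return digits == fixed_digits;
  if (digits == 0 || *pos >= text.size() || text[*pos] != '}') return false;
  ++*pos;
  return true;
}

// Decodes the text between the quotes; returns nullopt on any rule violation.
std::optional<std::string> DecodeQuotedString(std::string_view text) {
  std::string out;
  std::size_t i = 0;
  while (i < text.size()) {
    const char c = text[i++];
    if (c != '\\') {
      if (c < 32 || c > 126) return std::nullopt;  // Printable ASCII only.
      out.push_back(c);
      continue;
    }
    if (i >= text.size()) return std::nullopt;  // Dangling backslash.
    const char escape = text[i++];
    int value = 0;
    const std::size_t simple = kSimpleEscapes.find(escape);
    if (simple != std::string_view::npos) {
      value = kSimpleValues[simple];
    } else if (escape == 'x') {
      if (!ParseNumericEscape(text, &i, 16, 2, &value)) return std::nullopt;
    } else if (escape == 'd') {
      if (!ParseNumericEscape(text, &i, 10, 3, &value)) return std::nullopt;
    } else {
      return std::nullopt;  // Includes the unsupported octal \nnn form.
    }
    out.push_back(static_cast<char>(value));
  }
  return out;
}

}  // namespace
```

Because `\0` is a single-character escape here rather than the start of an
octal sequence, `\012` decodes as a null byte followed by the characters `1`
and `2`.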

### String Field Methods (C++)

#### C++ String Type Parameterization

All methods that accept or return a string value should be templated on the C++
type to use (`std::string`, `std::string_view`, `char *`, etc.).

For methods that accept a string parameter (`Write`, etc.), the template
argument should be inferred, and they can be called without specifying the type.

For methods that only return a string value (`Read`, etc.), the template
argument would need to be specified: `Read<std::string_view>()`.

`char *` should not be accepted as a return type, due to problems with ensuring
that there is actually a null byte at the end of the string.

As an input type, `char *` is likely to need explicit specialization.

In many (most? all?) cases, methods should have no problem with some types that
are not really "string" types, such as `std::vector<char>`.

String types that use `signed char` or `unsigned char` instead of `char` (e.g.,
`std::basic_string<unsigned char>`) should be explicitly supported.

If the `BackingStorage` is not `ContiguousBuffer` (or some equivalent), it seems
that it might be easy to hit undefined behavior with something like
`Read<std::string_view>()`, since the iterator type returned by `begin()` and
`end()` would not correctly model `std::contiguous_iterator`.  The cautious
approach would be to disable `Read()` and `UncheckedRead()` if the backing
storage is not `ContiguousBuffer`; readout to something like `std::string` could
still be explicitly performed using the `begin()`/`end()` iterators.
Alternately, for non-`ContiguousBuffer` backing storage, `Read()` could be
explicitly limited to a small set of known-good types, such as `std::string` and
`std::vector<char>`.

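The following C++17 sketch shows one way the templated `Read`/`Write` shape
described above could look for a `FixString` over contiguous bytes.  The class
`FixStringViewSketch` is a hypothetical stand-in, not the generated Emboss view
type, and it ignores non-contiguous backing storage entirely.

```c++
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>
#include <string_view>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a generated view over a contiguous FixString
// field; the real Emboss view classes may look quite different.
class FixStringViewSketch {
 public:
  FixStringViewSketch(char* data, std::size_t size)
      : data_(data), size_(size) {}

  const char* begin() const { return data_; }
  const char* end() const { return data_ + size_; }

  // The result type cannot be deduced, so callers must name it, e.g.
  // view.Read<std::string>() or view.Read<std::vector<char>>().
  template <typename StringType>
  StringType Read() const {
    if constexpr (std::is_same_v<StringType, std::string_view>) {
      // Only valid because this sketch assumes contiguous backing storage.
      return std::string_view(data_, size_);
    } else {
      return StringType(begin(), end());
    }
  }

  // The argument type is deduced: view.Write(std::string("abcd")).
  // A FixString value must be exactly as long as the field.
  template <typename StringType>
  bool Write(const StringType& value) {
    if (static_cast<std::size_t>(std::size(value)) != size_) return false;
    std::copy(std::begin(value), std::end(value), data_);
    return true;
  }

  // `char *` input gets its own overload, since a bare pointer carries no
  // length; this sketch assumes a null-terminated argument.
  bool Write(const char* value) { return Write(std::string_view(value)); }

 private:
  char* data_;
  std::size_t size_;
};
```

Without the `const char *` overload, a string literal would be deduced as a
`char` array whose size includes the trailing null byte, which is one reason
`char *` input is likely to need special handling.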

#### Methods

`Read()`, `UncheckedRead()`, `Write()`, and `UncheckedWrite()` should be defined
as one would expect.

`ToString()` should be an alias for `Read()`, to ease conversion from
`UInt:8[]`.

`CouldWriteValue()` should be defined as specified in the previous section.

`Ok()` should return `true` if the string has storage (though it could be
zero-length storage) and the bytes match the requirements (e.g., if a terminator
or padding byte is required, `Ok()` should only return `true` if such a byte is
present).

`Size()` should return the (logical) length of the string in bytes.

`MaxSize()` should return `BackingStorage().SizeInBytes()` or
`BackingStorage().SizeInBytes() - 1` if the string requires a padding or
terminator byte.

`begin()`, `end()`, `rbegin()`, `rend()` should be defined as expected for a
C++ container type.

`operator[]` should return the value of a single byte at the specified offset.


#### `emboss::String` Type

(This section should not be considered particularly authoritative; the actual
implementation could differ greatly if another strategy turns out to be
easier or less complex in practice.)

Because values retrieved from the different string types can be used
interchangeably at the expression layer (e.g., `let s = condition ? z_string :
fix_string`), there must be a way for all views over strings to return a common
type.  This is complicated by two requirements:

1.  `emboss::String` should not allocate memory.
2.  `emboss::String` needs to handle backing storage that is not
    `ContiguousBuffer`.  It also needs to handle constant strings (`let x =
    "string"`), and be able to assign `Storage`-based strings to constant
    strings and vice versa.

To satisfy the first requirement, `emboss::String` will need to hold a reference
to the underlying storage, not actually copy bytes.

One way to satisfy the second requirement would be to simply copy the string's
bytes out to a new buffer, but that conflicts with the first requirement.
Instead, it should be a sum type over a `Storage` type parameter and a constant
string, like:

```c++
template <typename Storage>
class String {
 public:
  String();
  String(const char *data, int size);
  String(Storage);
  // ... operator= ...
  constexpr int size() const;
  constexpr char operator[](int index) const {
    // Alternative 0 is a constant string; alternative 1 is backing storage.
    return storage_.Index() == 0 ? backports::Get<0>(storage_)[index]
                                 : backports::Get<1>(storage_).data()[index];
  }
  // ... begin(), end(), etc. ...

 private:
  // TODO: replace backports::Variant with std::variant in 2027, when Emboss
  // requires C++17.
  backports::Variant<const char *, Storage> storage_;
};
```

At least for now, `emboss::String` does not need to be exposed as a documented,
supported API -- user code can use `Read<std::string_view>()` and similar
operations as needed, with full knowledge of the underlying storage type.

Comparisons and assignments between `emboss::String`s with different `Storage`
type parameters do not need to be supported, since they cannot be generated by
the code generator -- C++ codegen would only need those operations for
`emboss::String`s that are derived from the same parent structure.


### Handling in Other Languages

C++ is unusual in that it does not differentiate at a language level between
text strings and byte strings.  Most other languages have different types for
byte strings and text strings.

For all languages that differentiate, Emboss strings should be treated as byte
strings or byte arrays (Python 3 `bytes`, Rust `Vec<u8>`, Proto `bytes`, etc.).

Other than this caveat, Emboss string support should be straightforward in other
languages.


### Text Format

Text format output should use the `"quoted string"` style.  Byte values outside
the range 32 through 126 should be emitted as escapes.  Values with standard
shorthand escapes (10 => `'\n'`, 0 => `'\0'`, etc.) should be emitted as such.
For other values, hex escapes with exactly two digits (e.g., `\x06`, not `\x6`)
should be emitted.  It may be desirable to allow some `[text_format]` control
over the output in the future.

Text format input should allow both `"quoted string"` and list-of-bytes styles,
with exactly the same rules as string constants in an `.emb` file, except that
bytes > 126 might be allowed in a `"quoted string"`.

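A minimal C++17 sketch of the output rule follows.  The function name
`ToTextFormat` is hypothetical, and several choices are assumptions not fixed
by this proposal: `"` and `\` are escaped so that the output can be read back,
27 is emitted as `\x1b` rather than the non-standard `\e`, and hex digits are
lowercase.

```c++
#include <string>
#include <string_view>

// Hypothetical helper: renders raw bytes in the "quoted string" text format.
std::string ToTextFormat(std::string_view bytes) {
  static constexpr char kHex[] = "0123456789abcdef";
  std::string out = "\"";
  for (const char c : bytes) {
    const unsigned char b = static_cast<unsigned char>(c);
    switch (b) {
      // Standard shorthand escapes.
      case 0: out += "\\0"; break;
      case 7: out += "\\a"; break;
      case 8: out += "\\b"; break;
      case 9: out += "\\t"; break;
      case 10: out += "\\n"; break;
      case 11: out += "\\v"; break;
      case 12: out += "\\f"; break;
      case 13: out += "\\r"; break;
      // Printable, but must be escaped inside a quoted string.
      case '"': out += "\\\""; break;
      case '\\': out += "\\\\"; break;
      default:
        if (b >= 32 && b <= 126) {
          out.push_back(c);  // Printable ASCII is emitted as-is.
        } else {
          out += "\\x";  // Always exactly two hex digits.
          out.push_back(kHex[b >> 4]);
          out.push_back(kHex[b & 0x0f]);
        }
    }
  }
  out.push_back('"');
  return out;
}
```

Because un-braced hex escapes always have exactly two digits and octal escapes
are not supported, output such as `\x06a` or `\01` stays unambiguous when read
back.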

### Expressions

#### Type System Changes

In order to facilitate `[requires]` on string types, the new types should have a
new 'string' expression type.


#### Runtime Representation

In this proposal, no string manipulations are allowed, so temporary strings
(which might require memory allocation) will not be necessary.


#### String Attribute Representation

Attribute values are currently represented by a special `AttributeValue` type
which can hold either an `Expression` or a `String`.  With a string expression
type, `AttributeValue` can be replaced by a plain `Expression`.  This will
require changes to everything that touches `AttributeValue`.

Alternately, `AttributeValue` could be left in the IR with only `Expression`,
in which case only code that touches string attributes (`[byte_order]` and
`[(cpp) namespace]`) needs to change.


#### String Comparisons

Comparison operations (`==`, `<`, `>`, `>=`, `<=`, `!=`) should be allowed,
since these can be handled by passing references to existing memory.

Equality and inequality (`==` and `!=`) should be defined in the expected way:
two strings are equal iff they are the same length and the corresponding bytes
in each string have the same value, and they are unequal if they are not equal.

For ordering, strings should be compared lexically, using the binary value of
each byte, with no regard for semantic collation.  That is, `"Z" < "a"`, since
`'Z'` is 90 and `'a'` is 97.

When one string is a strict prefix of another string, the shorter string should
be "less than" the longer; e.g., `"abc" < "abcdef"`.  This is the same as the
natural ordering for zero-terminated strings.

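These comparison rules can be sketched in C++17 as below.  The function names
are hypothetical, and the sketch assumes contiguous bytes so that
`std::string_view` can be used; the point it illustrates is that bytes are
compared as unsigned values, regardless of whether `char` is signed on the
target platform.

```c++
#include <algorithm>
#include <string_view>

// Hypothetical helpers for illustration only.

// Equal iff same length and all corresponding bytes match.
inline bool StringEqual(std::string_view a, std::string_view b) {
  return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Lexical ordering on unsigned byte values; a strict prefix sorts first.
inline bool StringLess(std::string_view a, std::string_view b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<unsigned char>(x) < static_cast<unsigned char>(y);
      });
}
```

The remaining operators can be derived from these two: for example, `a >= b`
is `!StringLess(a, b)`.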

#### Future String Operations

It may be desirable, at some future point, to allow various string
manipulations, such as concatenation or repetition, at least for compile-time
strings.

A substring operation should be possible without requiring memory allocation.

Indexing into a string (`str[offset]`) should be allowed if/when indexing into
an array is finally supported.


### Arrays of Strings

In some cases, it may be desirable to have an array of strings, like:

```
struct Foo:
  0 [+100]  ZString[10]  list
```

Although somewhat awkward, the existing explicit-length syntax should work:

```
struct Foo:
  0 [+100]  ZString:80[10]  list  # 10 10-byte (80-bit) strings
```