xref: /aosp_15_r20/external/emboss/doc/design_docs/division_and_modulus.md (revision 99e0aae7469b87d12f0ad23e61142c2d74c1ef70)
1# Design Sketch: Integer Division and Modulus Operators
2
3## Overview
4
5Currently, Emboss does not support division, which limits what it can express
6in its expression language.  In particular, it is awkward to handle things like
7arrays of 16-bit values whose size is specified in bytes.
8
9The reason Emboss does not yet have support for division is that division is
10tricky to handle correctly.  This document attempts to explain the pitfalls and
11the proper solutions.
12
13Note that Emboss does not support non-integer arithmetic, so this document only
14talks about *integer* division.
15
16
17## Syntax
18
19### Symbol
20
21Many programming languages use `/` for all division operations, whether they
22are "integer" division, floating point division, rational arithmetic division,
23matrix division, or anything else.
24
25Very, very few languages (Haskell) use an operator other than `%` for
26modulus/remainder, and no popular language overloads `%` with a second meaning.
27
28(Note that Haskell defines `%` as *rational number division*.  `mod` and `rem`
29are used for modulus and remainder, respectively, and `div` and `quot` are used
30for truncating and flooring division.)
31
32`%` is technically modulus in some languages, and remainder in others, but in
33all cases the equality `x == (x / n * n) + x % n` holds, and does not seem to
34confuse programmers.
35
36
37#### C, C++, C#, Java, Go, Python 2, Rust, SQL, ...
38
39All of these languages use `/` for multiple kinds of division, including
40integer division.
41
42Python 2 deprecated `/` for integer division; in a Python 2 program with `from
43__future__ import division`, `/` is floating-point division.
44
45C++ and Python both allow `/` to be overloaded for user-defined types.
46
47
48#### JavaScript, TypeScript, Perl 5
49
50These languages use `/` for floating-point division, and do not directly
51support integer division; both operands and result must be cast to int:
52
53    ((l | 0) / (r | 0)) | 0  // JS, TS
54
55    int(int($l) / int($r))    # Perl
56
57
58#### OCaml, Haskell, Python 3
59
60These languages use separate operators for integer and non-integer division.
61
62Integer division:
63
64    l `div` r    -- Haskell
65    l / r        (* OCaml *)
66    l // r        # Python 3
67
68Non-integer division:
69
70    l / r        -- Haskell
71    l /. r       (* OCaml *)
72    l / r         # Python 3
73
74
75#### Emboss
76
77Emboss should use `//` for division, following Python 3's example.
78
79This is the only unambiguous integer division operator used by a TIOBE top-20
80language, *and* Python 3's `//` has the same division semantics (flooring
81division) as Emboss -- most other programming languages use truncating
82(round-toward-0) division.
83
84If Emboss ever gets support for floating-point arithmetic, it should probably
85use OCaml's `/.` operator for division on floats.
86
87Because `/` does different things in different popular languages (flooring
88integer division, truncating integer division, floating-point division, or
89multiple types, depending on operand types), Emboss should not use it.
90
91Emboss should use `%` for modulus, following the example of almost every other
92language.
93
94
95### Precedence
96
97In most programming languages, division is a left-associative operator with
98equal precedence to multiplication; i.e. this:
99
100    a * b / c * d
101
102is equivalent to this:
103
104    ((a * b) / c) * d
105
106However, a nontrivial percentage of programmers tend to read it as:
107
108    (a * b) / (c * d)
109
110This may be because, after grade school, most division is expressed in fraction
111form, like:
112
113    a * b
114    -----
115    c * d
116
117Interestingly, fewer programmers seem to mis-read extra division:
118
119    a * b / c / d
120
121is usually (correctly) parsed as:
122
123    ((a * b) / c) / d
124
125Given that confusion, in Emboss, the construct:
126
127    a * b / c * d
128
129should be a syntax error.
130
131The constructs:
132
133    a * b / c / d
134    a * b / c + d
135
136could be syntax errors, although the second one (`... + d`), in particular,
137seems to get parsed correctly by all programmers in my polling.
138
139This is in keeping with two general rules of Emboss design:
140
1411.  It is better to lock things down at first and gradually ease restrictions
142    than the other way around.
1432.  Emboss syntax should be as unambiguous as possible, even when it makes
144    implementation trickier or `.emb` files slightly wordier.  (See, for
145    example, the "separate-but-equal" precedence of the `&&` and `||`
146    operators in Emboss.)
147
148
149## Semantics
150
151### Flooring Division vs Truncating Division
152
153Roughly speaking, there are two common forms of "integer division:" *flooring*
154(sometimes called *Euclidean*) and *truncating* (sometimes called
155*round-toward-0*) division.
156
157| Operation  | Flooring Result  | Truncating Result |
158| ---------- | ---------------- | ----------------- |
159| +8 / +3    | 2                | 2                 |
160| +8 / -3    | -3               | -2                |
161| -8 / +3    | -3               | -2                |
162| -8 / -3    | -3               | 2                 |
163| +8 % +3    | 2                | 2                 |
164| +8 % -3    | -1               | 2                 |
165| -8 % +3    | 1                | -2                |
166| -8 % -3    | -2               | -2                |
167
168Most programming languages and most CPUs implement truncating division,
169[however, truncating division is irregular around
1700](https://dl.acm.org/doi/pdf/10.1145/128861.128862), which prevents some kinds
171of expression rewrites (e.g., `(x + 6) / 3` is not the same as `x / 3 + 2` when
172`x == -4`) and would force Emboss's bounds tracking to have wider bounds in
173some common cases, such as modulus by a known constant, where the dividend could
174be either positive or negative.
175
176Emboss should use flooring division.
177
178
179### Undefined Results
180
181Division is the first operation in Emboss's expression language which can have
182an undefined result: all other operations are total.
183
184Emboss does have some notion of "invalid value," which it uses at runtime to
185handle fields whose backing store does not exist or whose existence condition is
186false, however, "invalid value" is not fully propagated in the expression type
187system, and it has a somewhat different meaning than "undefined."
188
189Currently, integer types in the Emboss expression language are sets of the
190form:
191
192    {x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧
193             (x ≤ max ∨ max = infinity) ∧
194             x MOD m = n ∧
195             0 ≤ n < m ∧
196             min ∈ ℤ ∪ {-infinity} ∧
197             max ∈ ℤ ∪ {infinity} ∧
198             m ∈ ℤ ∪ {infinity} ∧
199             n ∈ ℤ}
200
201where `min`, `max`, `m`, and `n` are parameters.  The special values `infinity`
202and `-infinity` for `max` and `min` indicate that there is no known upper or
203lower bound, and the special value `infinity` for `m` indicates that `x` has a
204constant value equal to `n`.
205
206This will have to change to something like:
207
208    {x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨
209                            (min ≤ x ≤ max ∧ x MOD m = n)) ∧
210                           0 ≤ n < m ∧
211                           can_be_undefined ∈ �� ∧
212                           min ∈ ℤ ∪ {-infinity, undefined} ∧
213                           max ∈ ℤ ∪ {infinity, undefined} ∧
214                           m ∈ ℤ ∪ {infinity} ∧
215                           n ∈ ℤ ∪ {undefined}}
216
217where `can_be_undefined` is a new boolean parameter.
218
219This also leaks out into boolean types, which are currently either {true,
220false}, {true}, or {false}: they can now be {true, false}, {true}, {false},
221{true, false, undefined}, {true, undefined}, {false, undefined}, or
222{undefined}.
223
224
225### Expression Bound Calculations
226
227#### Division
228
229For an expression x of the form `l // r`, Emboss must figure out the five
230parameters of the type of x (`x_min`, `x_max`, `x_m`, `x_n`,
231`x_can_be_undefined`) from the five parameters of each of the types of `l` and
232`r`.
233
234    x_can_be_undefined = l_can_be_undefined ∨
235                         r_can_be_undefined ∨
236                         r_min ≤ 0 ≤ r_max
237
238That is, `x` can be undefined if either input is undefined, or if there could
239be division by zero.
240
241If `r_min = 0 = r_max`, then:
242
243    x_max = undefined
244    x_min = undefined
245    x_m = infinity
246    x_n = undefined
247
248Otherwise, the remaining parameters can be computed by parts, although for
249`x_max` and `x_min` it is simpler to simply check all pairs of {`l_min`, `-1`,
250`1`, `l_max`} // {`r_min`, `-1`, `1`, `r_max`} (removing zeroes from the
251right-hand side, and removing `-1` and `1` if they do not fall between the
252`min` and `max`).
253
254`x_m` is:
255
256    infinity      if l_m = r_m = infinity else
257    GCD(l_m, r_n) if r_m = infinity       else
258    1
259
260`x_n` is
261
262    l_n // r_n                     if l_m = r_m = infinity else
263    (l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity       else
264    0
265
266
267#### Modulus
268
269Modulus is somewhat simpler.
270
271    x_can_be_undefined = l_can_be_undefined ∨
272                         r_can_be_undefined ∨
273                         r_min ≤ 0 ≤ r_max
274
275If `l_m = r_m = infinity`, `x_m = infinity` and `x_min = x_max = x_n = l_n %
276r_n`.
277
278Otherwise:
279
280    x_max = max(0, r_max - 1)
281    x_min = min(0, r_min + 1)
282    x_m = 1
283    x_n = 0
284
285Note that, as with some other bounds calculations in Emboss, there are some
286special cases where `l` or `r` is a very small set where it is technically
287possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have
288the bounds `x_max = 3` and `x_min = 2`.  However, Emboss does not need to find
289the absolute tightest bound; it only needs to find a reasonable bound.
290
291
292## Other Notes
293
294### `IsComplete()`
295
296The *intention* of `IsComplete()` is that it should return `true` iff adding
297more bytes cannot change the result of `Ok()` -- that is, iff the structure is
298"complete" in the sense that there are enough bytes to hold the structure, even
299if the structure is broken in some other way.
300
301If the size of the structure is a dynamic value which involves division by
302zero, the second definition of `IsComplete()` becomes ill-defined, in that it
303becomes impossible to know if there are enough bytes for the structure:
304
305    struct Foo:
306      0 [+2]               UInt      divisor
307      2 [+512 // divisor]  UInt:8[]  payload
308
309If `divisor == 0`, the size of `payload` is undefined, so the structure's
310completeness is undefined.
311
312On the other hand, if `divisor == 0`, then the structure can never be `Ok()`,
313so in that sense it is 'complete': there is no way to add more bytes and get a
314structure that is functional.  There may be another way for the client software
315to recover, such as scanning an input stream for a new start-of-message marker.
316
317I (bolms@) think that the best option is:
318
3191.  Add a new method `Maybe<bool> IsUndefined()` to virtual field views (and
320    maybe to `UIntView`, `IntView`, and `FlagView`).
3212.  If `IntrinsicSizeInBytes().IsUndefined()`, `IsComplete()` should return
322    `true`.  (Note that `IntrinsicSizeInBytes()` is just the way the
323    `$size_in_bytes` implicit virtual field is spelled in C++.)
324    *   This will require a bunch of plumbing into the whole expression
325        generation logic -- probably changing `Maybe<T>` to have a 'why not'
326        reason when there is no `T`.
3273.  Otherwise, proceed with the current `IsComplete()` logic.
3284.  Update the documentation for `IntrinsicSizeInBytes()` and `IsComplete()` to
329    note this caveat.
330
331In practice, it is rare to have a dynamic value that could be zero as a divisor
332in a field position or length: such divisions are almost always either division
333by a constant or division by one of a small set of known divisors, often powers
334of 2 (and in many of those cases, a shift operation would fit better).
335