1# Design Sketch: Integer Division and Modulus Operators 2 3## Overview 4 5Currently, Emboss does not support division, which limits what it can express 6in its expression language. In particular, it is awkward to handle things like 7arrays of 16-bit values whose size is specified in bytes. 8 9The reason Emboss does not yet have support for division is that division is 10tricky to handle correctly. This document attempts to explain the pitfalls and 11the proper solutions. 12 13Note that Emboss does not support non-integer arithmetic, so this document only 14talks about *integer* division. 15 16 17## Syntax 18 19### Symbol 20 21Many programming languages use `/` for all division operations, whether they 22are "integer" division, floating point division, rational arithmetic division, 23matrix division, or anything else. 24 25Very, very few languages (Haskell) use an operator other than `%` for 26modulus/remainder, and no popular language overloads `%` with a second meaning. 27 28(Note that Haskell defines `%` as *rational number division*. `mod` and `rem` 29are used for modulus and remainder, respectively, and `div` and `quot` are used 30for truncating and flooring division.) 31 32`%` is technically modulus in some languages, and remainder in others, but in 33all cases the equality `x == (x / n * n) + x % n` holds, and does not seem to 34confuse programmers. 35 36 37#### C, C++, C#, Java, Go, Python 2, Rust, SQL, ... 38 39All of these languages use `/` for multiple kinds of division, including 40integer division. 41 42Python 2 deprecated `/` for integer division; in a Python 2 program with `from 43__future__ import division`, `/` is floating-point division. 44 45C++ and Python both allow `/` to be overloaded for user-defined types. 46 47 48#### JavaScript, TypeScript, Perl 5 49 50These languages use `/` for floating-point division, and do not directly 51support integer division; both operands and result must be cast to int: 52 53 ((l | 0) / (r | 0)) | 0 // JS, TS 54 55 int(int($l) / int($r)) # Perl 56 57 58#### OCaml, Haskell, Python 3 59 60These languages use separate operators for integer and non-integer division. 61 62Integer division: 63 64 l `div` r -- Haskell 65 l / r (* OCaml *) 66 l // r # Python 3 67 68Non-integer division: 69 70 l / r -- Haskell 71 l /. r (* OCaml *) 72 l / r # Python 3 73 74 75#### Emboss 76 77Emboss should use `//` for division, following Python 3's example. 78 79This is the only unambiguous integer division operator used by a TIOBE top-20 80language, *and* Python 3's `//` has the same division semantics (flooring 81division) as Emboss -- most other programming languages use truncating 82(round-toward-0) division. 83 84If Emboss ever gets support for floating-point arithmetic, it should probably 85use OCaml's `/.` operator for division on floats. 86 87Because `/` does different things in different popular languages (flooring 88integer division, truncating integer division, floating-point division, or 89multiple types, depending on operand types), Emboss should not use it. 90 91Emboss should use `%` for modulus, following the example of almost every other 92language. 93 94 95### Precedence 96 97In most programming languages, division is a left-associative operator with 98equal precedence to multiplication; i.e. this: 99 100 a * b / c * d 101 102is equivalent to this: 103 104 ((a * b) / c) * d 105 106However, a nontrivial percentage of programmers tend to read it as: 107 108 (a * b) / (c * d) 109 110This may be because, after grade school, most division is expressed in fraction 111form, like: 112 113 a * b 114 ----- 115 c * d 116 117Interestingly, fewer programmers seem to mis-read extra division: 118 119 a * b / c / d 120 121is usually (correctly) parsed as: 122 123 ((a * b) / c) / d 124 125Given that confusion, in Emboss, the construct: 126 127 a * b / c * d 128 129should be a syntax error. 130 131The constructs: 132 133 a * b / c / d 134 a * b / c + d 135 136could be syntax errors, although the second one (`... + d`), in particular, 137seems to get parsed correctly by all programmers in my polling. 138 139This is in keeping with two general rules of Emboss design: 140 1411. It is better to lock things down at first and gradually ease restrictions 142 than the other way around. 1432. Emboss syntax should be as unambiguous as possible, even when it makes 144 implementation trickier or `.emb` files slightly wordier. (See, for 145 example, the "separate-but-equal" precedence of the `&&` and `||` 146 operators in Emboss.) 147 148 149## Semantics 150 151### Flooring Division vs Truncating Division 152 153Roughly speaking, there are two common forms of "integer division:" *flooring* 154(sometimes called *Euclidean*) and *truncating* (sometimes called 155*round-toward-0*) division. 156 157| Operation | Flooring Result | Truncating Result | 158| ---------- | ---------------- | ----------------- | 159| +8 / +3 | 2 | 2 | 160| +8 / -3 | -3 | -2 | 161| -8 / +3 | -3 | -2 | 162| -8 / -3 | -3 | 2 | 163| +8 % +3 | 2 | 2 | 164| +8 % -3 | -1 | 2 | 165| -8 % +3 | 1 | -2 | 166| -8 % -3 | -2 | -2 | 167 168Most programming languages and most CPUs implement truncating division, 169[however, truncating division is irregular around 1700](https://dl.acm.org/doi/pdf/10.1145/128861.128862), which prevents some kinds 171of expression rewrites (e.g., `(x + 6) / 3` is not the same as `x / 3 + 2` when 172`x == -4`) and would force Emboss's bounds tracking to have wider bounds in 173some common cases, such as modulus by a known constant, where the dividend could 174be either positive or negative. 175 176Emboss should use flooring division. 177 178 179### Undefined Results 180 181Division is the first operation in Emboss's expression language which can have 182an undefined result: all other operations are total. 183 184Emboss does have some notion of "invalid value," which it uses at runtime to 185handle fields whose backing store does not exist or whose existence condition is 186false, however, "invalid value" is not fully propagated in the expression type 187system, and it has a somewhat different meaning than "undefined." 188 189Currently, integer types in the Emboss expression language are sets of the 190form: 191 192 {x ∈ ℤ | (min ≤ x ∨ min = -infinity) ∧ 193 (x ≤ max ∨ max = infinity) ∧ 194 x MOD m = n ∧ 195 0 ≤ n < m ∧ 196 min ∈ ℤ ∪ {-infinity} ∧ 197 max ∈ ℤ ∪ {infinity} ∧ 198 m ∈ ℤ ∪ {infinity} ∧ 199 n ∈ ℤ} 200 201where `min`, `max`, `m`, and `n` are parameters. The special values `infinity` 202and `-infinity` for `max` and `min` indicate that there is no known upper or 203lower bound, and the special value `infinity` for `m` indicates that `x` has a 204constant value equal to `n`. 205 206This will have to change to something like: 207 208 {x ∈ ℤ ∪ {undefined} | ((can_be_undefined ∧ x = undefined) ∨ 209 (min ≤ x ≤ max ∧ x MOD m = n)) ∧ 210 0 ≤ n < m ∧ 211 can_be_undefined ∈ ∧ 212 min ∈ ℤ ∪ {-infinity, undefined} ∧ 213 max ∈ ℤ ∪ {infinity, undefined} ∧ 214 m ∈ ℤ ∪ {infinity} ∧ 215 n ∈ ℤ ∪ {undefined}} 216 217where `can_be_undefined` is a new boolean parameter. 218 219This also leaks out into boolean types, which are currently either {true, 220false}, {true}, or {false}: they can now be {true, false}, {true}, {false}, 221{true, false, undefined}, {true, undefined}, {false, undefined}, or 222{undefined}. 223 224 225### Expression Bound Calculations 226 227#### Division 228 229For an expression x of the form `l // r`, Emboss must figure out the five 230parameters of the type of x (`x_min`, `x_max`, `x_m`, `x_n`, 231`x_can_be_undefined`) from the five parameters of each of the types of `l` and 232`r`. 233 234 x_can_be_undefined = l_can_be_undefined ∨ 235 r_can_be_undefined ∨ 236 r_min ≤ 0 ≤ r_max 237 238That is, `x` can be undefined if either input is undefined, or if there could 239be division by zero. 240 241If `r_min = 0 = r_max`, then: 242 243 x_max = undefined 244 x_min = undefined 245 x_m = infinity 246 x_n = undefined 247 248Otherwise, the remaining parameters can be computed by parts, although for 249`x_max` and `x_min` it is simpler to simply check all pairs of {`l_min`, `-1`, 250`1`, `l_max`} // {`r_min`, `-1`, `1`, `r_max`} (removing zeroes from the 251right-hand side, and removing `-1` and `1` if they do not fall between the 252`min` and `max`). 253 254`x_m` is: 255 256 infinity if l_m = r_m = infinity else 257 GCD(l_m, r_n) if r_m = infinity else 258 1 259 260`x_n` is 261 262 l_n // r_n if l_m = r_m = infinity else 263 (l_n // r_n) MOD GCD(l_m, r_n) if r_m = infinity else 264 0 265 266 267#### Modulus 268 269Modulus is somewhat simpler. 270 271 x_can_be_undefined = l_can_be_undefined ∨ 272 r_can_be_undefined ∨ 273 r_min ≤ 0 ≤ r_max 274 275If `l_m = r_m = infinity`, `x_m = infinity` and `x_min = x_max = x_n = l_n % 276r_n`. 277 278Otherwise: 279 280 x_max = max(0, r_max - 1) 281 x_min = min(0, r_min + 1) 282 x_m = 1 283 x_n = 0 284 285Note that, as with some other bounds calculations in Emboss, there are some 286special cases where `l` or `r` is a very small set where it is technically 287possible to find a narrower bound: for example, if 6 ≤ l ≤ 7, l % 4 could have 288the bounds `x_max = 3` and `x_min = 2`. However, Emboss does not need to find 289the absolute tightest bound; it only needs to find a reasonable bound. 290 291 292## Other Notes 293 294### `IsComplete()` 295 296The *intention* of `IsComplete()` is that it should return `true` iff adding 297more bytes cannot change the result of `Ok()` -- that is, iff the structure is 298"complete" in the sense that there are enough bytes to hold the structure, even 299if the structure is broken in some other way. 300 301If the size of the structure is a dynamic value which involves division by 302zero, the second definition of `IsComplete()` becomes ill-defined, in that it 303becomes impossible to know if there are enough bytes for the structure: 304 305 struct Foo: 306 0 [+2] UInt divisor 307 2 [+512 // divisor] UInt:8[] payload 308 309If `divisor == 0`, the size of `payload` is undefined, so the structure's 310completeness is undefined. 311 312On the other hand, if `divisor == 0`, then the structure can never be `Ok()`, 313so in that sense it is 'complete': there is no way to add more bytes and get a 314structure that is functional. There may be another way for the client software 315to recover, such as scanning an input stream for a new start-of-message marker. 316 317I (bolms@) think that the best option is: 318 3191. Add a new method `Maybe<bool> IsUndefined()` to virtual field views (and 320 maybe to `UIntView`, `IntView`, and `FlagView`). 3212. If `IntrinsicSizeInBytes().IsUndefined()`, `IsComplete()` should return 322 `true`. (Note that `IntrinsicSizeInBytes()` is just the way the 323 `$size_in_bytes` implicit virtual field is spelled in C++.) 324 * This will require a bunch of plumbing into the whole expression 325 generation logic -- probably changing `Maybe<T>` to have a 'why not' 326 reason when there is no `T`. 3273. Otherwise, proceed with the current `IsComplete()` logic. 3284. Update the documentation for `IntrinsicSizeInBytes()` and `IsComplete()` to 329 note this caveat. 330 331In practice, it is rare to have a dynamic value that could be zero as a divisor 332in a field position or length: such divisions are almost always either division 333by a constant or division by one of a small set of known divisors, often powers 334of 2 (and in many of those cases, a shift operation would fit better). 335