1*01826a49SYabin CuiZstandard Compression Format 2*01826a49SYabin Cui============================ 3*01826a49SYabin Cui 4*01826a49SYabin Cui### Notices 5*01826a49SYabin Cui 6*01826a49SYabin CuiCopyright (c) Meta Platforms, Inc. and affiliates. 7*01826a49SYabin Cui 8*01826a49SYabin CuiPermission is granted to copy and distribute this document 9*01826a49SYabin Cuifor any purpose and without charge, 10*01826a49SYabin Cuiincluding translations into other languages 11*01826a49SYabin Cuiand incorporation into compilations, 12*01826a49SYabin Cuiprovided that the copyright notice and this notice are preserved, 13*01826a49SYabin Cuiand that any substantive changes or deletions from the original 14*01826a49SYabin Cuiare clearly marked. 15*01826a49SYabin CuiDistribution of this document is unlimited. 16*01826a49SYabin Cui 17*01826a49SYabin Cui### Version 18*01826a49SYabin Cui 19*01826a49SYabin Cui0.4.0 (2023-06-05) 20*01826a49SYabin Cui 21*01826a49SYabin Cui 22*01826a49SYabin CuiIntroduction 23*01826a49SYabin Cui------------ 24*01826a49SYabin Cui 25*01826a49SYabin CuiThe purpose of this document is to define a lossless compressed data format, 26*01826a49SYabin Cuithat is independent of CPU type, operating system, 27*01826a49SYabin Cuifile system and character set, suitable for 28*01826a49SYabin Cuifile compression, pipe and streaming compression, 29*01826a49SYabin Cuiusing the [Zstandard algorithm](https://facebook.github.io/zstd/). 30*01826a49SYabin CuiThe text of the specification assumes a basic background in programming 31*01826a49SYabin Cuiat the level of bits and other primitive data representations. 32*01826a49SYabin Cui 33*01826a49SYabin CuiThe data can be produced or consumed, 34*01826a49SYabin Cuieven for an arbitrarily long sequentially presented input data stream, 35*01826a49SYabin Cuiusing only an a priori bounded amount of intermediate storage, 36*01826a49SYabin Cuiand hence can be used in data communications. 37*01826a49SYabin CuiThe format uses the Zstandard compression method, 38*01826a49SYabin Cuiand optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/), 39*01826a49SYabin Cuifor detection of data corruption. 40*01826a49SYabin Cui 41*01826a49SYabin CuiThe data format defined by this specification 42*01826a49SYabin Cuidoes not attempt to allow random access to compressed data. 43*01826a49SYabin Cui 44*01826a49SYabin CuiUnless otherwise indicated below, 45*01826a49SYabin Cuia compliant compressor must produce data sets 46*01826a49SYabin Cuithat conform to the specifications presented here. 47*01826a49SYabin CuiIt doesn’t need to support all options though. 48*01826a49SYabin Cui 49*01826a49SYabin CuiA compliant decompressor must be able to decompress 50*01826a49SYabin Cuiat least one working set of parameters 51*01826a49SYabin Cuithat conforms to the specifications presented here. 52*01826a49SYabin CuiIt may also ignore informative fields, such as checksum. 53*01826a49SYabin CuiWhenever it does not support a parameter defined in the compressed stream, 54*01826a49SYabin Cuiit must produce a non-ambiguous error code and associated error message 55*01826a49SYabin Cuiexplaining which parameter is unsupported. 56*01826a49SYabin Cui 57*01826a49SYabin CuiThis specification is intended for use by implementers of software 58*01826a49SYabin Cuito compress data into Zstandard format and/or decompress data from Zstandard format. 59*01826a49SYabin CuiThe Zstandard format is supported by an open source reference implementation, 60*01826a49SYabin Cuiwritten in portable C, and available at : https://github.com/facebook/zstd . 61*01826a49SYabin Cui 62*01826a49SYabin Cui 63*01826a49SYabin Cui### Overall conventions 64*01826a49SYabin CuiIn this document: 65*01826a49SYabin Cui- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. 66*01826a49SYabin Cui- the naming convention for identifiers is `Mixed_Case_With_Underscores` 67*01826a49SYabin Cui 68*01826a49SYabin Cui### Definitions 69*01826a49SYabin CuiContent compressed by Zstandard is transformed into a Zstandard __frame__. 70*01826a49SYabin CuiMultiple frames can be appended into a single file or stream. 71*01826a49SYabin CuiA frame is completely independent, has a defined beginning and end, 72*01826a49SYabin Cuiand a set of parameters which tells the decoder how to decompress it. 73*01826a49SYabin Cui 74*01826a49SYabin CuiA frame encapsulates one or multiple __blocks__. 75*01826a49SYabin CuiEach block contains arbitrary content, which is described by its header, 76*01826a49SYabin Cuiand has a guaranteed maximum content size, which depends on frame parameters. 77*01826a49SYabin CuiUnlike frames, each block depends on previous blocks for proper decoding. 78*01826a49SYabin CuiHowever, each block can be decompressed without waiting for its successor, 79*01826a49SYabin Cuiallowing streaming operations. 80*01826a49SYabin Cui 81*01826a49SYabin CuiOverview 82*01826a49SYabin Cui--------- 83*01826a49SYabin Cui- [Frames](#frames) 84*01826a49SYabin Cui - [Zstandard frames](#zstandard-frames) 85*01826a49SYabin Cui - [Blocks](#blocks) 86*01826a49SYabin Cui - [Literals Section](#literals-section) 87*01826a49SYabin Cui - [Sequences Section](#sequences-section) 88*01826a49SYabin Cui - [Sequence Execution](#sequence-execution) 89*01826a49SYabin Cui - [Skippable frames](#skippable-frames) 90*01826a49SYabin Cui- [Entropy Encoding](#entropy-encoding) 91*01826a49SYabin Cui - [FSE](#fse) 92*01826a49SYabin Cui - [Huffman Coding](#huffman-coding) 93*01826a49SYabin Cui- [Dictionary Format](#dictionary-format) 94*01826a49SYabin Cui 95*01826a49SYabin CuiFrames 96*01826a49SYabin Cui------ 97*01826a49SYabin CuiZstandard compressed data is made of one or more __frames__. 98*01826a49SYabin CuiEach frame is independent and can be decompressed independently of other frames. 99*01826a49SYabin CuiThe decompressed content of multiple concatenated frames is the concatenation of 100*01826a49SYabin Cuieach frame decompressed content. 101*01826a49SYabin Cui 102*01826a49SYabin CuiThere are two frame formats defined by Zstandard: 103*01826a49SYabin Cui Zstandard frames and Skippable frames. 104*01826a49SYabin CuiZstandard frames contain compressed data, while 105*01826a49SYabin Cuiskippable frames contain custom user metadata. 106*01826a49SYabin Cui 107*01826a49SYabin Cui## Zstandard frames 108*01826a49SYabin CuiThe structure of a single Zstandard frame is following: 109*01826a49SYabin Cui 110*01826a49SYabin Cui| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] | 111*01826a49SYabin Cui|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:| 112*01826a49SYabin Cui| 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes | 113*01826a49SYabin Cui 114*01826a49SYabin Cui__`Magic_Number`__ 115*01826a49SYabin Cui 116*01826a49SYabin Cui4 Bytes, __little-endian__ format. 117*01826a49SYabin CuiValue : 0xFD2FB528 118*01826a49SYabin CuiNote: This value was selected to be less probable to find at the beginning of some random file. 119*01826a49SYabin CuiIt avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.), 120*01826a49SYabin Cuicontains byte values outside of ASCII range, 121*01826a49SYabin Cuiand doesn't map into UTF8 space. 122*01826a49SYabin CuiIt reduces the chances that a text file represent this value by accident. 123*01826a49SYabin Cui 124*01826a49SYabin Cui__`Frame_Header`__ 125*01826a49SYabin Cui 126*01826a49SYabin Cui2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header). 127*01826a49SYabin Cui 128*01826a49SYabin Cui__`Data_Block`__ 129*01826a49SYabin Cui 130*01826a49SYabin CuiDetailed in [`Blocks`](#blocks). 131*01826a49SYabin CuiThat’s where compressed data is stored. 132*01826a49SYabin Cui 133*01826a49SYabin Cui__`Content_Checksum`__ 134*01826a49SYabin Cui 135*01826a49SYabin CuiAn optional 32-bit checksum, only present if `Content_Checksum_flag` is set. 136*01826a49SYabin CuiThe content checksum is the result 137*01826a49SYabin Cuiof [xxh64() hash function](https://cyan4973.github.io/xxHash/) 138*01826a49SYabin Cuidigesting the original (decoded) data as input, and a seed of zero. 139*01826a49SYabin CuiThe low 4 bytes of the checksum are stored in __little-endian__ format. 140*01826a49SYabin Cui 141*01826a49SYabin Cui### `Frame_Header` 142*01826a49SYabin Cui 143*01826a49SYabin CuiThe `Frame_Header` has a variable size, with a minimum of 2 bytes, 144*01826a49SYabin Cuiand up to 14 bytes depending on optional parameters. 145*01826a49SYabin CuiThe structure of `Frame_Header` is following: 146*01826a49SYabin Cui 147*01826a49SYabin Cui| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] | 148*01826a49SYabin Cui| ------------------------- | --------------------- | ----------------- | ---------------------- | 149*01826a49SYabin Cui| 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes | 150*01826a49SYabin Cui 151*01826a49SYabin Cui#### `Frame_Header_Descriptor` 152*01826a49SYabin Cui 153*01826a49SYabin CuiThe first header's byte is called the `Frame_Header_Descriptor`. 154*01826a49SYabin CuiIt describes which other fields are present. 155*01826a49SYabin CuiDecoding this byte is enough to tell the size of `Frame_Header`. 156*01826a49SYabin Cui 157*01826a49SYabin Cui| Bit number | Field name | 158*01826a49SYabin Cui| ---------- | ---------- | 159*01826a49SYabin Cui| 7-6 | `Frame_Content_Size_flag` | 160*01826a49SYabin Cui| 5 | `Single_Segment_flag` | 161*01826a49SYabin Cui| 4 | `Unused_bit` | 162*01826a49SYabin Cui| 3 | `Reserved_bit` | 163*01826a49SYabin Cui| 2 | `Content_Checksum_flag` | 164*01826a49SYabin Cui| 1-0 | `Dictionary_ID_flag` | 165*01826a49SYabin Cui 166*01826a49SYabin CuiIn this table, bit 7 is the highest bit, while bit 0 is the lowest one. 167*01826a49SYabin Cui 168*01826a49SYabin Cui__`Frame_Content_Size_flag`__ 169*01826a49SYabin Cui 170*01826a49SYabin CuiThis is a 2-bits flag (`= Frame_Header_Descriptor >> 6`), 171*01826a49SYabin Cuispecifying if `Frame_Content_Size` (the decompressed data size) 172*01826a49SYabin Cuiis provided within the header. 173*01826a49SYabin Cui`Flag_Value` provides `FCS_Field_Size`, 174*01826a49SYabin Cuiwhich is the number of bytes used by `Frame_Content_Size` 175*01826a49SYabin Cuiaccording to the following table: 176*01826a49SYabin Cui 177*01826a49SYabin Cui| `Flag_Value` | 0 | 1 | 2 | 3 | 178*01826a49SYabin Cui| -------------- | ------ | --- | --- | --- | 179*01826a49SYabin Cui|`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 | 180*01826a49SYabin Cui 181*01826a49SYabin CuiWhen `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` : 182*01826a49SYabin Cuiif `Single_Segment_flag` is set, `FCS_Field_Size` is 1. 183*01826a49SYabin CuiOtherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided. 184*01826a49SYabin Cui 185*01826a49SYabin Cui__`Single_Segment_flag`__ 186*01826a49SYabin Cui 187*01826a49SYabin CuiIf this flag is set, 188*01826a49SYabin Cuidata must be regenerated within a single continuous memory segment. 189*01826a49SYabin Cui 190*01826a49SYabin CuiIn this case, `Window_Descriptor` byte is skipped, 191*01826a49SYabin Cuibut `Frame_Content_Size` is necessarily present. 192*01826a49SYabin CuiAs a consequence, the decoder must allocate a memory segment 193*01826a49SYabin Cuiof size equal or larger than `Frame_Content_Size`. 194*01826a49SYabin Cui 195*01826a49SYabin CuiIn order to preserve the decoder from unreasonable memory requirements, 196*01826a49SYabin Cuia decoder is allowed to reject a compressed frame 197*01826a49SYabin Cuiwhich requests a memory size beyond decoder's authorized range. 198*01826a49SYabin Cui 199*01826a49SYabin CuiFor broader compatibility, decoders are recommended to support 200*01826a49SYabin Cuimemory sizes of at least 8 MB. 201*01826a49SYabin CuiThis is only a recommendation, 202*01826a49SYabin Cuieach decoder is free to support higher or lower limits, 203*01826a49SYabin Cuidepending on local limitations. 204*01826a49SYabin Cui 205*01826a49SYabin Cui__`Unused_bit`__ 206*01826a49SYabin Cui 207*01826a49SYabin CuiA decoder compliant with this specification version shall not interpret this bit. 208*01826a49SYabin CuiIt might be used in any future version, 209*01826a49SYabin Cuito signal a property which is transparent to properly decode the frame. 210*01826a49SYabin CuiAn encoder compliant with this specification version must set this bit to zero. 211*01826a49SYabin Cui 212*01826a49SYabin Cui__`Reserved_bit`__ 213*01826a49SYabin Cui 214*01826a49SYabin CuiThis bit is reserved for some future feature. 215*01826a49SYabin CuiIts value _must be zero_. 216*01826a49SYabin CuiA decoder compliant with this specification version must ensure it is not set. 217*01826a49SYabin CuiThis bit may be used in a future revision, 218*01826a49SYabin Cuito signal a feature that must be interpreted to decode the frame correctly. 219*01826a49SYabin Cui 220*01826a49SYabin Cui__`Content_Checksum_flag`__ 221*01826a49SYabin Cui 222*01826a49SYabin CuiIf this flag is set, a 32-bits `Content_Checksum` will be present at frame's end. 223*01826a49SYabin CuiSee `Content_Checksum` paragraph. 224*01826a49SYabin Cui 225*01826a49SYabin Cui__`Dictionary_ID_flag`__ 226*01826a49SYabin Cui 227*01826a49SYabin CuiThis is a 2-bits flag (`= FHD & 3`), 228*01826a49SYabin Cuitelling if a dictionary ID is provided within the header. 229*01826a49SYabin CuiIt also specifies the size of this field as `DID_Field_Size`. 230*01826a49SYabin Cui 231*01826a49SYabin Cui|`Flag_Value` | 0 | 1 | 2 | 3 | 232*01826a49SYabin Cui| -------------- | --- | --- | --- | --- | 233*01826a49SYabin Cui|`DID_Field_Size`| 0 | 1 | 2 | 4 | 234*01826a49SYabin Cui 235*01826a49SYabin Cui#### `Window_Descriptor` 236*01826a49SYabin Cui 237*01826a49SYabin CuiProvides guarantees on minimum memory buffer required to decompress a frame. 238*01826a49SYabin CuiThis information is important for decoders to allocate enough memory. 239*01826a49SYabin Cui 240*01826a49SYabin CuiThe `Window_Descriptor` byte is optional. 241*01826a49SYabin CuiWhen `Single_Segment_flag` is set, `Window_Descriptor` is not present. 242*01826a49SYabin CuiIn this case, `Window_Size` is `Frame_Content_Size`, 243*01826a49SYabin Cuiwhich can be any value from 0 to 2^64-1 bytes (16 ExaBytes). 244*01826a49SYabin Cui 245*01826a49SYabin Cui| Bit numbers | 7-3 | 2-0 | 246*01826a49SYabin Cui| ----------- | ---------- | ---------- | 247*01826a49SYabin Cui| Field name | `Exponent` | `Mantissa` | 248*01826a49SYabin Cui 249*01826a49SYabin CuiThe minimum memory buffer size is called `Window_Size`. 250*01826a49SYabin CuiIt is described by the following formulas : 251*01826a49SYabin Cui``` 252*01826a49SYabin CuiwindowLog = 10 + Exponent; 253*01826a49SYabin CuiwindowBase = 1 << windowLog; 254*01826a49SYabin CuiwindowAdd = (windowBase / 8) * Mantissa; 255*01826a49SYabin CuiWindow_Size = windowBase + windowAdd; 256*01826a49SYabin Cui``` 257*01826a49SYabin CuiThe minimum `Window_Size` is 1 KB. 258*01826a49SYabin CuiThe maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB. 259*01826a49SYabin Cui 260*01826a49SYabin CuiIn general, larger `Window_Size` tend to improve compression ratio, 261*01826a49SYabin Cuibut at the cost of memory usage. 262*01826a49SYabin Cui 263*01826a49SYabin CuiTo properly decode compressed data, 264*01826a49SYabin Cuia decoder will need to allocate a buffer of at least `Window_Size` bytes. 265*01826a49SYabin Cui 266*01826a49SYabin CuiIn order to preserve decoder from unreasonable memory requirements, 267*01826a49SYabin Cuia decoder is allowed to reject a compressed frame 268*01826a49SYabin Cuiwhich requests a memory size beyond decoder's authorized range. 269*01826a49SYabin Cui 270*01826a49SYabin CuiFor improved interoperability, 271*01826a49SYabin Cuiit's recommended for decoders to support `Window_Size` of up to 8 MB, 272*01826a49SYabin Cuiand it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB. 273*01826a49SYabin CuiIt's merely a recommendation though, 274*01826a49SYabin Cuidecoders are free to support larger or lower limits, 275*01826a49SYabin Cuidepending on local limitations. 276*01826a49SYabin Cui 277*01826a49SYabin Cui#### `Dictionary_ID` 278*01826a49SYabin Cui 279*01826a49SYabin CuiThis is a variable size field, which contains 280*01826a49SYabin Cuithe ID of the dictionary required to properly decode the frame. 281*01826a49SYabin Cui`Dictionary_ID` field is optional. When it's not present, 282*01826a49SYabin Cuiit's up to the decoder to know which dictionary to use. 283*01826a49SYabin Cui 284*01826a49SYabin Cui`Dictionary_ID` field size is provided by `DID_Field_Size`. 285*01826a49SYabin Cui`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`. 286*01826a49SYabin Cui1 byte can represent an ID 0-255. 287*01826a49SYabin Cui2 bytes can represent an ID 0-65535. 288*01826a49SYabin Cui4 bytes can represent an ID 0-4294967295. 289*01826a49SYabin CuiFormat is __little-endian__. 290*01826a49SYabin Cui 291*01826a49SYabin CuiIt's allowed to represent a small ID (for example `13`) 292*01826a49SYabin Cuiwith a large 4-bytes dictionary ID, even if it is less efficient. 293*01826a49SYabin Cui 294*01826a49SYabin CuiA value of `0` has same meaning as no `Dictionary_ID`, 295*01826a49SYabin Cuiin which case the frame may or may not need a dictionary to be decoded, 296*01826a49SYabin Cuiand the ID of such a dictionary is not specified. 297*01826a49SYabin CuiThe decoder must know this information by other means. 298*01826a49SYabin Cui 299*01826a49SYabin Cui#### `Frame_Content_Size` 300*01826a49SYabin Cui 301*01826a49SYabin CuiThis is the original (uncompressed) size. This information is optional. 302*01826a49SYabin Cui`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`. 303*01826a49SYabin Cui`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`. 304*01826a49SYabin Cui`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes. 305*01826a49SYabin Cui 306*01826a49SYabin Cui| `FCS_Field_Size` | Range | 307*01826a49SYabin Cui| ---------------- | ---------- | 308*01826a49SYabin Cui| 0 | unknown | 309*01826a49SYabin Cui| 1 | 0 - 255 | 310*01826a49SYabin Cui| 2 | 256 - 65791| 311*01826a49SYabin Cui| 4 | 0 - 2^32-1 | 312*01826a49SYabin Cui| 8 | 0 - 2^64-1 | 313*01826a49SYabin Cui 314*01826a49SYabin Cui`Frame_Content_Size` format is __little-endian__. 315*01826a49SYabin CuiWhen `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly. 316*01826a49SYabin CuiWhen `FCS_Field_Size` is 2, _the offset of 256 is added_. 317*01826a49SYabin CuiIt's allowed to represent a small size (for example `18`) using any compatible variant. 318*01826a49SYabin Cui 319*01826a49SYabin Cui 320*01826a49SYabin CuiBlocks 321*01826a49SYabin Cui------- 322*01826a49SYabin Cui 323*01826a49SYabin CuiAfter `Magic_Number` and `Frame_Header`, there are some number of blocks. 324*01826a49SYabin CuiEach frame must have at least one block, 325*01826a49SYabin Cuibut there is no upper limit on the number of blocks per frame. 326*01826a49SYabin Cui 327*01826a49SYabin CuiThe structure of a block is as follows: 328*01826a49SYabin Cui 329*01826a49SYabin Cui| `Block_Header` | `Block_Content` | 330*01826a49SYabin Cui|:--------------:|:---------------:| 331*01826a49SYabin Cui| 3 bytes | n bytes | 332*01826a49SYabin Cui 333*01826a49SYabin Cui__`Block_Header`__ 334*01826a49SYabin Cui 335*01826a49SYabin Cui`Block_Header` uses 3 bytes, written using __little-endian__ convention. 336*01826a49SYabin CuiIt contains 3 fields : 337*01826a49SYabin Cui 338*01826a49SYabin Cui| `Last_Block` | `Block_Type` | `Block_Size` | 339*01826a49SYabin Cui|:------------:|:------------:|:------------:| 340*01826a49SYabin Cui| bit 0 | bits 1-2 | bits 3-23 | 341*01826a49SYabin Cui 342*01826a49SYabin Cui__`Last_Block`__ 343*01826a49SYabin Cui 344*01826a49SYabin CuiThe lowest bit signals if this block is the last one. 345*01826a49SYabin CuiThe frame will end after this last block. 346*01826a49SYabin CuiIt may be followed by an optional `Content_Checksum` 347*01826a49SYabin Cui(see [Zstandard Frames](#zstandard-frames)). 348*01826a49SYabin Cui 349*01826a49SYabin Cui__`Block_Type`__ 350*01826a49SYabin Cui 351*01826a49SYabin CuiThe next 2 bits represent the `Block_Type`. 352*01826a49SYabin Cui`Block_Type` influences the meaning of `Block_Size`. 353*01826a49SYabin CuiThere are 4 block types : 354*01826a49SYabin Cui 355*01826a49SYabin Cui| Value | 0 | 1 | 2 | 3 | 356*01826a49SYabin Cui| ------------ | ----------- | ----------- | ------------------ | --------- | 357*01826a49SYabin Cui| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`| 358*01826a49SYabin Cui 359*01826a49SYabin Cui- `Raw_Block` - this is an uncompressed block. 360*01826a49SYabin Cui `Block_Content` contains `Block_Size` bytes. 361*01826a49SYabin Cui 362*01826a49SYabin Cui- `RLE_Block` - this is a single byte, repeated `Block_Size` times. 363*01826a49SYabin Cui `Block_Content` consists of a single byte. 364*01826a49SYabin Cui On the decompression side, this byte must be repeated `Block_Size` times. 365*01826a49SYabin Cui 366*01826a49SYabin Cui- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks), 367*01826a49SYabin Cui explained later on. 368*01826a49SYabin Cui `Block_Size` is the length of `Block_Content`, the compressed data. 369*01826a49SYabin Cui The decompressed size is not known, 370*01826a49SYabin Cui but its maximum possible value is guaranteed (see below) 371*01826a49SYabin Cui 372*01826a49SYabin Cui- `Reserved` - this is not a block. 373*01826a49SYabin Cui This value cannot be used with current version of this specification. 374*01826a49SYabin Cui If such a value is present, it is considered corrupted data. 375*01826a49SYabin Cui 376*01826a49SYabin Cui__`Block_Size`__ 377*01826a49SYabin Cui 378*01826a49SYabin CuiThe upper 21 bits of `Block_Header` represent the `Block_Size`. 379*01826a49SYabin Cui 380*01826a49SYabin CuiWhen `Block_Type` is `Compressed_Block` or `Raw_Block`, 381*01826a49SYabin Cui`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`). 382*01826a49SYabin Cui 383*01826a49SYabin CuiWhen `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1, 384*01826a49SYabin Cui`Block_Size` represents the number of times this byte must be repeated. 385*01826a49SYabin Cui 386*01826a49SYabin Cui`Block_Size` is limited by `Block_Maximum_Size` (see below). 387*01826a49SYabin Cui 388*01826a49SYabin Cui__`Block_Content`__ and __`Block_Maximum_Size`__ 389*01826a49SYabin Cui 390*01826a49SYabin CuiThe size of `Block_Content` is limited by `Block_Maximum_Size`, 391*01826a49SYabin Cuiwhich is the smallest of: 392*01826a49SYabin Cui- `Window_Size` 393*01826a49SYabin Cui- 128 KB 394*01826a49SYabin Cui 395*01826a49SYabin Cui`Block_Maximum_Size` is constant for a given frame. 396*01826a49SYabin CuiThis maximum is applicable to both the decompressed size 397*01826a49SYabin Cuiand the compressed size of any block in the frame. 398*01826a49SYabin Cui 399*01826a49SYabin CuiThe reasoning for this limit is that a decoder can read this information 400*01826a49SYabin Cuiat the beginning of a frame and use it to allocate buffers. 401*01826a49SYabin CuiThe guarantees on the size of blocks ensure that 402*01826a49SYabin Cuithe buffers will be large enough for any following block of the valid frame. 403*01826a49SYabin Cui 404*01826a49SYabin Cui 405*01826a49SYabin CuiCompressed Blocks 406*01826a49SYabin Cui----------------- 407*01826a49SYabin CuiTo decompress a compressed block, the compressed size must be provided 408*01826a49SYabin Cuifrom `Block_Size` field within `Block_Header`. 409*01826a49SYabin Cui 410*01826a49SYabin CuiA compressed block consists of 2 sections : 411*01826a49SYabin Cui- [Literals Section](#literals-section) 412*01826a49SYabin Cui- [Sequences Section](#sequences-section) 413*01826a49SYabin Cui 414*01826a49SYabin CuiThe results of the two sections are then combined to produce the decompressed 415*01826a49SYabin Cuidata in [Sequence Execution](#sequence-execution) 416*01826a49SYabin Cui 417*01826a49SYabin Cui#### Prerequisites 418*01826a49SYabin CuiTo decode a compressed block, the following elements are necessary : 419*01826a49SYabin Cui- Previous decoded data, up to a distance of `Window_Size`, 420*01826a49SYabin Cui or beginning of the Frame, whichever is smaller. 421*01826a49SYabin Cui- List of "recent offsets" from previous `Compressed_Block`. 422*01826a49SYabin Cui- The previous Huffman tree, required by `Treeless_Literals_Block` type 423*01826a49SYabin Cui- Previous FSE decoding tables, required by `Repeat_Mode` 424*01826a49SYabin Cui for each symbol type (literals lengths, match lengths, offsets) 425*01826a49SYabin Cui 426*01826a49SYabin CuiNote that decoding tables aren't always from the previous `Compressed_Block`. 427*01826a49SYabin Cui 428*01826a49SYabin Cui- Every decoding table can come from a dictionary. 429*01826a49SYabin Cui- The Huffman tree comes from the previous `Compressed_Literals_Block`. 430*01826a49SYabin Cui 431*01826a49SYabin CuiLiterals Section 432*01826a49SYabin Cui---------------- 433*01826a49SYabin CuiAll literals are regrouped in the first part of the block. 434*01826a49SYabin CuiThey can be decoded first, and then copied during [Sequence Execution], 435*01826a49SYabin Cuior they can be decoded on the flow during [Sequence Execution]. 436*01826a49SYabin Cui 437*01826a49SYabin CuiLiterals can be stored uncompressed or compressed using Huffman prefix codes. 438*01826a49SYabin CuiWhen compressed, a tree description may optionally be present, 439*01826a49SYabin Cuifollowed by 1 or 4 streams. 440*01826a49SYabin Cui 441*01826a49SYabin Cui| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] | 442*01826a49SYabin Cui| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- | 443*01826a49SYabin Cui 444*01826a49SYabin Cui 445*01826a49SYabin Cui### `Literals_Section_Header` 446*01826a49SYabin Cui 447*01826a49SYabin CuiHeader is in charge of describing how literals are packed. 448*01826a49SYabin CuiIt's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, 449*01826a49SYabin Cuiusing __little-endian__ convention. 450*01826a49SYabin Cui 451*01826a49SYabin Cui| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] | 452*01826a49SYabin Cui| --------------------- | ------------- | ------------------ | ------------------- | 453*01826a49SYabin Cui| 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits | 454*01826a49SYabin Cui 455*01826a49SYabin CuiIn this representation, bits on the left are the lowest bits. 456*01826a49SYabin Cui 457*01826a49SYabin Cui__`Literals_Block_Type`__ 458*01826a49SYabin Cui 459*01826a49SYabin CuiThis field uses 2 lowest bits of first byte, describing 4 different block types : 460*01826a49SYabin Cui 461*01826a49SYabin Cui| `Literals_Block_Type` | Value | 462*01826a49SYabin Cui| --------------------------- | ----- | 463*01826a49SYabin Cui| `Raw_Literals_Block` | 0 | 464*01826a49SYabin Cui| `RLE_Literals_Block` | 1 | 465*01826a49SYabin Cui| `Compressed_Literals_Block` | 2 | 466*01826a49SYabin Cui| `Treeless_Literals_Block` | 3 | 467*01826a49SYabin Cui 468*01826a49SYabin Cui- `Raw_Literals_Block` - Literals are stored uncompressed. 469*01826a49SYabin Cui- `RLE_Literals_Block` - Literals consist of a single byte value 470*01826a49SYabin Cui repeated `Regenerated_Size` times. 471*01826a49SYabin Cui- `Compressed_Literals_Block` - This is a standard Huffman-compressed block, 472*01826a49SYabin Cui starting with a Huffman tree description. 473*01826a49SYabin Cui In this mode, there are at least 2 different literals represented in the Huffman tree description. 474*01826a49SYabin Cui See details below. 475*01826a49SYabin Cui- `Treeless_Literals_Block` - This is a Huffman-compressed block, 476*01826a49SYabin Cui using Huffman tree _from previous Huffman-compressed literals block_. 477*01826a49SYabin Cui `Huffman_Tree_Description` will be skipped. 478*01826a49SYabin Cui Note: If this mode is triggered without any previous Huffman-table in the frame 479*01826a49SYabin Cui (or [dictionary](#dictionary-format)), this should be treated as data corruption. 480*01826a49SYabin Cui 481*01826a49SYabin Cui__`Size_Format`__ 482*01826a49SYabin Cui 483*01826a49SYabin Cui`Size_Format` is divided into 2 families : 484*01826a49SYabin Cui 485*01826a49SYabin Cui- For `Raw_Literals_Block` and `RLE_Literals_Block`, 486*01826a49SYabin Cui it's only necessary to decode `Regenerated_Size`. 487*01826a49SYabin Cui There is no `Compressed_Size` field. 488*01826a49SYabin Cui- For `Compressed_Block` and `Treeless_Literals_Block`, 489*01826a49SYabin Cui it's required to decode both `Compressed_Size` 490*01826a49SYabin Cui and `Regenerated_Size` (the decompressed size). 491*01826a49SYabin Cui It's also necessary to decode the number of streams (1 or 4). 492*01826a49SYabin Cui 493*01826a49SYabin CuiFor values spanning several bytes, convention is __little-endian__. 494*01826a49SYabin Cui 495*01826a49SYabin Cui__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : 496*01826a49SYabin Cui 497*01826a49SYabin Cui`Size_Format` uses 1 _or_ 2 bits. 498*01826a49SYabin CuiIts value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3` 499*01826a49SYabin Cui 500*01826a49SYabin Cui- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit. 501*01826a49SYabin Cui `Regenerated_Size` uses 5 bits (0-31). 502*01826a49SYabin Cui `Literals_Section_Header` uses 1 byte. 503*01826a49SYabin Cui `Regenerated_Size = Literals_Section_Header[0]>>3` 504*01826a49SYabin Cui- `Size_Format` == 01 : `Size_Format` uses 2 bits. 505*01826a49SYabin Cui `Regenerated_Size` uses 12 bits (0-4095). 506*01826a49SYabin Cui `Literals_Section_Header` uses 2 bytes. 507*01826a49SYabin Cui `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)` 508*01826a49SYabin Cui- `Size_Format` == 11 : `Size_Format` uses 2 bits. 509*01826a49SYabin Cui `Regenerated_Size` uses 20 bits (0-1048575). 510*01826a49SYabin Cui `Literals_Section_Header` uses 3 bytes. 511*01826a49SYabin Cui `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)` 512*01826a49SYabin Cui 513*01826a49SYabin CuiOnly Stream1 is present for these cases. 514*01826a49SYabin CuiNote : it's allowed to represent a short value (for example `27`) 515*01826a49SYabin Cuiusing a long format, even if it's less efficient. 516*01826a49SYabin Cui 517*01826a49SYabin Cui__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ : 518*01826a49SYabin Cui 519*01826a49SYabin Cui`Size_Format` always uses 2 bits. 520*01826a49SYabin Cui 521*01826a49SYabin Cui- `Size_Format` == 00 : _A single stream_. 522*01826a49SYabin Cui Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023). 523*01826a49SYabin Cui `Literals_Section_Header` uses 3 bytes. 524*01826a49SYabin Cui- `Size_Format` == 01 : 4 streams. 525*01826a49SYabin Cui Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023). 526*01826a49SYabin Cui `Literals_Section_Header` uses 3 bytes. 527*01826a49SYabin Cui- `Size_Format` == 10 : 4 streams. 528*01826a49SYabin Cui Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383). 529*01826a49SYabin Cui `Literals_Section_Header` uses 4 bytes. 530*01826a49SYabin Cui- `Size_Format` == 11 : 4 streams. 531*01826a49SYabin Cui Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143). 532*01826a49SYabin Cui `Literals_Section_Header` uses 5 bytes. 533*01826a49SYabin Cui 534*01826a49SYabin CuiBoth `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention. 535*01826a49SYabin CuiNote: `Compressed_Size` __includes__ the size of the Huffman Tree description 536*01826a49SYabin Cui_when_ it is present. 537*01826a49SYabin CuiNote 2: `Compressed_Size` can never be `==0`. 538*01826a49SYabin CuiEven in single-stream scenario, assuming an empty content, it must be `>=1`, 539*01826a49SYabin Cuisince it contains at least the final end bit flag. 540*01826a49SYabin CuiIn 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10` 541*01826a49SYabin Cui(6 bytes for the jump table, + 4x1 bytes for the 4 streams). 542*01826a49SYabin Cui 543*01826a49SYabin Cui4 streams is faster than 1 stream in decompression speed, 544*01826a49SYabin Cuiby exploiting instruction level parallelism. 545*01826a49SYabin CuiBut it's also more expensive, 546*01826a49SYabin Cuicosting on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table. 547*01826a49SYabin Cui 548*01826a49SYabin CuiIn general, use the 4 streams mode when there are more literals to decode, 549*01826a49SYabin Cuito favor higher decompression speeds. 550*01826a49SYabin CuiNote that beyond >1KB of literals, the 4 streams mode is compulsory. 551*01826a49SYabin Cui 552*01826a49SYabin CuiNote that a minimum of 6 bytes is required for the 4 streams mode. 553*01826a49SYabin CuiThat's a technical minimum, but it's not recommended to employ the 4 streams mode 554*01826a49SYabin Cuifor such a small quantity, that would be wasteful. 555*01826a49SYabin CuiA more practical lower bound would be around ~256 bytes. 556*01826a49SYabin Cui 557*01826a49SYabin Cui#### Raw Literals Block 558*01826a49SYabin CuiThe data in Stream1 is `Regenerated_Size` bytes long, 559*01826a49SYabin Cuiit contains the raw literals data to be used during [Sequence Execution]. 560*01826a49SYabin Cui 561*01826a49SYabin Cui#### RLE Literals Block 562*01826a49SYabin CuiStream1 consists of a single byte which should be repeated `Regenerated_Size` times 563*01826a49SYabin Cuito generate the decoded literals. 564*01826a49SYabin Cui 565*01826a49SYabin Cui#### Compressed Literals Block and Treeless Literals Block 566*01826a49SYabin CuiBoth of these modes contain Huffman encoded data. 567*01826a49SYabin Cui 568*01826a49SYabin CuiFor `Treeless_Literals_Block`, 569*01826a49SYabin Cuithe Huffman table comes from previously compressed literals block, 570*01826a49SYabin Cuior from a dictionary. 571*01826a49SYabin Cui 572*01826a49SYabin Cui 573*01826a49SYabin Cui### `Huffman_Tree_Description` 574*01826a49SYabin CuiThis section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). 575*01826a49SYabin CuiThe tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256. 576*01826a49SYabin CuiThe format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). 577*01826a49SYabin CuiThe size of `Huffman_Tree_Description` is determined during decoding process, 578*01826a49SYabin Cuiit must be used to determine where streams begin. 579*01826a49SYabin Cui`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`. 580*01826a49SYabin Cui 581*01826a49SYabin Cui 582*01826a49SYabin Cui### Jump Table 583*01826a49SYabin CuiThe Jump Table is only present when there are 4 Huffman-coded streams. 584*01826a49SYabin Cui 585*01826a49SYabin CuiReminder : Huffman compressed data consists of either 1 or 4 streams. 586*01826a49SYabin Cui 587*01826a49SYabin CuiIf only one stream is present, it is a single bitstream occupying the entire 588*01826a49SYabin Cuiremaining portion of the literals block, encoded as described in 589*01826a49SYabin Cui[Huffman-Coded Streams](#huffman-coded-streams). 590*01826a49SYabin Cui 591*01826a49SYabin CuiIf there are four streams, `Literals_Section_Header` only provided 592*01826a49SYabin Cuienough information to know the decompressed and compressed sizes 593*01826a49SYabin Cuiof all four streams _combined_. 594*01826a49SYabin CuiThe decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`, 595*01826a49SYabin Cuiexcept for the last stream which may be up to 3 bytes smaller, 596*01826a49SYabin Cuito reach a total decompressed size as specified in `Regenerated_Size`. 597*01826a49SYabin Cui 598*01826a49SYabin CuiThe compressed size of each stream is provided explicitly in the Jump Table. 599*01826a49SYabin CuiJump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields, 600*01826a49SYabin Cuidescribing the compressed sizes of the first three streams. 601*01826a49SYabin Cui`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams: 602*01826a49SYabin Cui 603*01826a49SYabin Cui`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`. 604*01826a49SYabin Cui 605*01826a49SYabin Cui`Stream4_Size` is necessarily `>= 1`. Therefore, 606*01826a49SYabin Cuiif `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`, 607*01826a49SYabin Cuidata is considered corrupted. 608*01826a49SYabin Cui 609*01826a49SYabin CuiEach of these 4 bitstreams is then decoded independently as a Huffman-Coded stream, 610*01826a49SYabin Cuias described in [Huffman-Coded Streams](#huffman-coded-streams) 611*01826a49SYabin Cui 612*01826a49SYabin Cui 613*01826a49SYabin CuiSequences Section 614*01826a49SYabin Cui----------------- 615*01826a49SYabin CuiA compressed block is a succession of _sequences_ . 616*01826a49SYabin CuiA sequence is a literal copy command, followed by a match copy command. 617*01826a49SYabin CuiA literal copy command specifies a length. 618*01826a49SYabin CuiIt is the number of bytes to be copied (or extracted) from the Literals Section. 619*01826a49SYabin CuiA match copy command specifies an offset and a length. 620*01826a49SYabin Cui 621*01826a49SYabin CuiWhen all _sequences_ are decoded, 622*01826a49SYabin Cuiif there are literals left in the _literals section_, 623*01826a49SYabin Cuithese bytes are added at the end of the block. 624*01826a49SYabin Cui 625*01826a49SYabin CuiThis is described in more detail in [Sequence Execution](#sequence-execution). 626*01826a49SYabin Cui 627*01826a49SYabin CuiThe `Sequences_Section` regroup all symbols required to decode commands. 628*01826a49SYabin CuiThere are 3 symbol types : literals lengths, offsets and match lengths. 629*01826a49SYabin CuiThey are encoded together, interleaved, in a single _bitstream_. 630*01826a49SYabin Cui 631*01826a49SYabin CuiThe `Sequences_Section` starts by a header, 632*01826a49SYabin Cuifollowed by optional probability tables for each symbol type, 633*01826a49SYabin Cuifollowed by the bitstream. 634*01826a49SYabin Cui 635*01826a49SYabin Cui| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | 636*01826a49SYabin Cui| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | 637*01826a49SYabin Cui 638*01826a49SYabin CuiTo decode the `Sequences_Section`, it's required to know its size. 639*01826a49SYabin CuiIts size is deduced from the size of `Literals_Section`: 640*01826a49SYabin Cui`Sequences_Section_Size = Block_Size - Literals_Section_Size`. 641*01826a49SYabin Cui 642*01826a49SYabin Cui 643*01826a49SYabin Cui#### `Sequences_Section_Header` 644*01826a49SYabin Cui 645*01826a49SYabin CuiConsists of 2 items: 646*01826a49SYabin Cui- `Number_of_Sequences` 647*01826a49SYabin Cui- Symbol compression modes 648*01826a49SYabin Cui 649*01826a49SYabin Cui__`Number_of_Sequences`__ 650*01826a49SYabin Cui 651*01826a49SYabin CuiThis is a variable size field using between 1 and 3 bytes. 652*01826a49SYabin CuiLet's call its first byte `byte0`. 653*01826a49SYabin Cui- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte. 654*01826a49SYabin Cui- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes. 655*01826a49SYabin Cui Note that the 2 bytes format fully overlaps the 1 byte format. 656*01826a49SYabin Cui- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes. 657*01826a49SYabin Cui 658*01826a49SYabin Cui`if (Number_of_Sequences == 0)` : there are no sequences. 659*01826a49SYabin Cui The sequence section stops immediately, 660*01826a49SYabin Cui FSE tables used in `Repeat_Mode` aren't updated. 661*01826a49SYabin Cui Block's decompressed content is defined solely by the Literals Section content. 662*01826a49SYabin Cui 663*01826a49SYabin Cui__Symbol compression modes__ 664*01826a49SYabin Cui 665*01826a49SYabin CuiThis is a single byte, defining the compression mode of each symbol type. 666*01826a49SYabin Cui 667*01826a49SYabin Cui|Bit number| 7-6 | 5-4 | 3-2 | 1-0 | 668*01826a49SYabin Cui| -------- | ----------------------- | -------------- | -------------------- | ---------- | 669*01826a49SYabin Cui|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` | 670*01826a49SYabin Cui 671*01826a49SYabin CuiThe last field, `Reserved`, must be all-zeroes. 672*01826a49SYabin Cui 673*01826a49SYabin Cui`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of 674*01826a49SYabin Cuiliterals lengths, offsets, and match lengths symbols respectively. 675*01826a49SYabin Cui 676*01826a49SYabin CuiThey follow the same enumeration : 677*01826a49SYabin Cui 678*01826a49SYabin Cui| Value | 0 | 1 | 2 | 3 | 679*01826a49SYabin Cui| ------------------ | ----------------- | ---------- | --------------------- | ------------- | 680*01826a49SYabin Cui| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` | 681*01826a49SYabin Cui 682*01826a49SYabin Cui- `Predefined_Mode` : A predefined FSE distribution table is used, defined in 683*01826a49SYabin Cui [default distributions](#default-distributions). 684*01826a49SYabin Cui No distribution table will be present. 685*01826a49SYabin Cui- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value. 686*01826a49SYabin Cui This symbol will be used for all sequences. 687*01826a49SYabin Cui- `FSE_Compressed_Mode` : standard FSE compression. 688*01826a49SYabin Cui A distribution table will be present. 689*01826a49SYabin Cui The format of this distribution table is described in [FSE Table Description](#fse-table-description). 690*01826a49SYabin Cui Note that the maximum allowed accuracy log for literals length and match length tables is 9, 691*01826a49SYabin Cui and the maximum accuracy log for the offsets table is 8. 692*01826a49SYabin Cui `FSE_Compressed_Mode` must not be used when only one symbol is present, 693*01826a49SYabin Cui `RLE_Mode` should be used instead (although any other mode will work). 694*01826a49SYabin Cui- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, 695*01826a49SYabin Cui or if this is the first block, table in the dictionary will be used. 696*01826a49SYabin Cui Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. 697*01826a49SYabin Cui It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`. 698*01826a49SYabin Cui No distribution table will be present. 699*01826a49SYabin Cui If this mode is used without any previous sequence table in the frame 700*01826a49SYabin Cui (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. 701*01826a49SYabin Cui 702*01826a49SYabin Cui#### The codes for literals lengths, match lengths, and offsets. 703*01826a49SYabin Cui 704*01826a49SYabin CuiEach symbol is a _code_ in its own context, 705*01826a49SYabin Cuiwhich specifies `Baseline` and `Number_of_Bits` to add. 706*01826a49SYabin Cui_Codes_ are FSE compressed, 707*01826a49SYabin Cuiand interleaved with raw additional bits in the same bitstream. 708*01826a49SYabin Cui 709*01826a49SYabin Cui##### Literals length codes 710*01826a49SYabin Cui 711*01826a49SYabin CuiLiterals length codes are values ranging from `0` to `35` included. 712*01826a49SYabin CuiThey define lengths from 0 to 131071 bytes. 713*01826a49SYabin CuiThe literals length is equal to the decoded `Baseline` plus 714*01826a49SYabin Cuithe result of reading `Number_of_Bits` bits from the bitstream, 715*01826a49SYabin Cuias a __little-endian__ value. 716*01826a49SYabin Cui 717*01826a49SYabin Cui| `Literals_Length_Code` | 0-15 | 718*01826a49SYabin Cui| ---------------------- | ---------------------- | 719*01826a49SYabin Cui| length | `Literals_Length_Code` | 720*01826a49SYabin Cui| `Number_of_Bits` | 0 | 721*01826a49SYabin Cui 722*01826a49SYabin Cui| `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 723*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 724*01826a49SYabin Cui| `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | 725*01826a49SYabin Cui| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 726*01826a49SYabin Cui 727*01826a49SYabin Cui| `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 728*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 729*01826a49SYabin Cui| `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | 730*01826a49SYabin Cui| `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 731*01826a49SYabin Cui 732*01826a49SYabin Cui| `Literals_Length_Code` | 32 | 33 | 34 | 35 | 733*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- | 734*01826a49SYabin Cui| `Baseline` | 8192 |16384 |32768 |65536 | 735*01826a49SYabin Cui| `Number_of_Bits` | 13 | 14 | 15 | 16 | 736*01826a49SYabin Cui 737*01826a49SYabin Cui 738*01826a49SYabin Cui##### Match length codes 739*01826a49SYabin Cui 740*01826a49SYabin CuiMatch length codes are values ranging from `0` to `52` included. 741*01826a49SYabin CuiThey define lengths from 3 to 131074 bytes. 742*01826a49SYabin CuiThe match length is equal to the decoded `Baseline` plus 743*01826a49SYabin Cuithe result of reading `Number_of_Bits` bits from the bitstream, 744*01826a49SYabin Cuias a __little-endian__ value. 745*01826a49SYabin Cui 746*01826a49SYabin Cui| `Match_Length_Code` | 0-31 | 747*01826a49SYabin Cui| ------------------- | ----------------------- | 748*01826a49SYabin Cui| value | `Match_Length_Code` + 3 | 749*01826a49SYabin Cui| `Number_of_Bits` | 0 | 750*01826a49SYabin Cui 751*01826a49SYabin Cui| `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 752*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 753*01826a49SYabin Cui| `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | 754*01826a49SYabin Cui| `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | 755*01826a49SYabin Cui 756*01826a49SYabin Cui| `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 757*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | 758*01826a49SYabin Cui| `Baseline` | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 | 759*01826a49SYabin Cui| `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | 760*01826a49SYabin Cui 761*01826a49SYabin Cui| `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | 762*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- | 763*01826a49SYabin Cui| `Baseline` | 4099 | 8195 |16387 |32771 |65539 | 764*01826a49SYabin Cui| `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | 765*01826a49SYabin Cui 766*01826a49SYabin Cui##### Offset codes 767*01826a49SYabin Cui 768*01826a49SYabin CuiOffset codes are values ranging from `0` to `N`. 769*01826a49SYabin Cui 770*01826a49SYabin CuiA decoder is free to limit its maximum `N` supported. 771*01826a49SYabin CuiRecommendation is to support at least up to `22`. 772*01826a49SYabin CuiFor information, at the time of this writing. 773*01826a49SYabin Cuithe reference decoder supports a maximum `N` value of `31`. 774*01826a49SYabin Cui 775*01826a49SYabin CuiAn offset code is also the number of additional bits to read in __little-endian__ fashion, 776*01826a49SYabin Cuiand can be translated into an `Offset_Value` using the following formulas : 777*01826a49SYabin Cui 778*01826a49SYabin Cui``` 779*01826a49SYabin CuiOffset_Value = (1 << offsetCode) + readNBits(offsetCode); 780*01826a49SYabin Cuiif (Offset_Value > 3) offset = Offset_Value - 3; 781*01826a49SYabin Cui``` 782*01826a49SYabin CuiIt means that maximum `Offset_Value` is `(2^(N+1))-1` 783*01826a49SYabin Cuisupporting back-reference distances up to `(2^(N+1))-4`, 784*01826a49SYabin Cuibut is limited by [maximum back-reference distance](#window_descriptor). 785*01826a49SYabin Cui 786*01826a49SYabin Cui`Offset_Value` from 1 to 3 are special : they define "repeat codes". 787*01826a49SYabin CuiThis is described in more detail in [Repeat Offsets](#repeat-offsets). 788*01826a49SYabin Cui 789*01826a49SYabin Cui#### Decoding Sequences 790*01826a49SYabin CuiFSE bitstreams are read in reverse direction than written. In zstd, 791*01826a49SYabin Cuithe compressor writes bits forward into a block and the decompressor 792*01826a49SYabin Cuimust read the bitstream _backwards_. 793*01826a49SYabin Cui 794*01826a49SYabin CuiTo find the start of the bitstream it is therefore necessary to 795*01826a49SYabin Cuiknow the offset of the last byte of the block which can be found 796*01826a49SYabin Cuiby counting `Block_Size` bytes after the block header. 797*01826a49SYabin Cui 798*01826a49SYabin CuiAfter writing the last bit containing information, the compressor 799*01826a49SYabin Cuiwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of 800*01826a49SYabin Cuipadding. The last byte of the compressed bitstream cannot be `0` for 801*01826a49SYabin Cuithat reason. 802*01826a49SYabin Cui 803*01826a49SYabin CuiWhen decompressing, the last byte containing the padding is the first 804*01826a49SYabin Cuibyte to read. The decompressor needs to skip 0-7 initial `0`-bits and 805*01826a49SYabin Cuithe first `1`-bit it occurs. Afterwards, the useful part of the bitstream 806*01826a49SYabin Cuibegins. 807*01826a49SYabin Cui 808*01826a49SYabin CuiFSE decoding requires a 'state' to be carried from symbol to symbol. 809*01826a49SYabin CuiFor more explanation on FSE decoding, see the [FSE section](#fse). 810*01826a49SYabin Cui 811*01826a49SYabin CuiFor sequence decoding, a separate state keeps track of each 812*01826a49SYabin Cuiliteral lengths, offsets, and match lengths symbols. 813*01826a49SYabin CuiSome FSE primitives are also used. 814*01826a49SYabin CuiFor more details on the operation of these primitives, see the [FSE section](#fse). 815*01826a49SYabin Cui 816*01826a49SYabin Cui##### Starting states 817*01826a49SYabin CuiThe bitstream starts with initial FSE state values, 818*01826a49SYabin Cuieach using the required number of bits in their respective _accuracy_, 819*01826a49SYabin Cuidecoded previously from their normalized distribution. 820*01826a49SYabin Cui 821*01826a49SYabin CuiIt starts by `Literals_Length_State`, 822*01826a49SYabin Cuifollowed by `Offset_State`, 823*01826a49SYabin Cuiand finally `Match_Length_State`. 824*01826a49SYabin Cui 825*01826a49SYabin CuiReminder : always keep in mind that all values are read _backward_, 826*01826a49SYabin Cuiso the 'start' of the bitstream is at the highest position in memory, 827*01826a49SYabin Cuiimmediately before the last `1`-bit for padding. 828*01826a49SYabin Cui 829*01826a49SYabin CuiAfter decoding the starting states, a single sequence is decoded 830*01826a49SYabin Cui`Number_Of_Sequences` times. 831*01826a49SYabin CuiThese sequences are decoded in order from first to last. 832*01826a49SYabin CuiSince the compressor writes the bitstream in the forward direction, 833*01826a49SYabin Cuithis means the compressor must encode the sequences starting with the last 834*01826a49SYabin Cuione and ending with the first. 835*01826a49SYabin Cui 836*01826a49SYabin Cui##### Decoding a sequence 837*01826a49SYabin CuiFor each of the symbol types, the FSE state can be used to determine the appropriate code. 838*01826a49SYabin CuiThe code then defines the `Baseline` and `Number_of_Bits` to read for each type. 839*01826a49SYabin CuiSee the [description of the codes] for how to determine these values. 840*01826a49SYabin Cui 841*01826a49SYabin Cui[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets 842*01826a49SYabin Cui 843*01826a49SYabin CuiDecoding starts by reading the `Number_of_Bits` required to decode `Offset`. 844*01826a49SYabin CuiIt then does the same for `Match_Length`, and then for `Literals_Length`. 845*01826a49SYabin CuiThis sequence is then used for [sequence execution](#sequence-execution). 846*01826a49SYabin Cui 847*01826a49SYabin CuiIf it is not the last sequence in the block, 848*01826a49SYabin Cuithe next operation is to update states. 849*01826a49SYabin CuiUsing the rules pre-calculated in the decoding tables, 850*01826a49SYabin Cui`Literals_Length_State` is updated, 851*01826a49SYabin Cuifollowed by `Match_Length_State`, 852*01826a49SYabin Cuiand then `Offset_State`. 853*01826a49SYabin CuiSee the [FSE section](#fse) for details on how to update states from the bitstream. 854*01826a49SYabin Cui 855*01826a49SYabin CuiThis operation will be repeated `Number_of_Sequences` times. 856*01826a49SYabin CuiAt the end, the bitstream shall be entirely consumed, 857*01826a49SYabin Cuiotherwise the bitstream is considered corrupted. 858*01826a49SYabin Cui 859*01826a49SYabin Cui#### Default Distributions 860*01826a49SYabin CuiIf `Predefined_Mode` is selected for a symbol type, 861*01826a49SYabin Cuiits FSE decoding table is generated from a predefined distribution table defined here. 862*01826a49SYabin CuiFor details on how to convert this distribution into a decoding table, see the [FSE section]. 863*01826a49SYabin Cui 864*01826a49SYabin Cui[FSE section]: #from-normalized-distribution-to-decoding-tables 865*01826a49SYabin Cui 866*01826a49SYabin Cui##### Literals Length 867*01826a49SYabin CuiThe decoding table uses an accuracy log of 6 bits (64 states). 868*01826a49SYabin Cui``` 869*01826a49SYabin Cuishort literalsLength_defaultDistribution[36] = 870*01826a49SYabin Cui { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 871*01826a49SYabin Cui 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, 872*01826a49SYabin Cui -1,-1,-1,-1 }; 873*01826a49SYabin Cui``` 874*01826a49SYabin Cui 875*01826a49SYabin Cui##### Match Length 876*01826a49SYabin CuiThe decoding table uses an accuracy log of 6 bits (64 states). 877*01826a49SYabin Cui``` 878*01826a49SYabin Cuishort matchLengths_defaultDistribution[53] = 879*01826a49SYabin Cui { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 880*01826a49SYabin Cui 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 881*01826a49SYabin Cui 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, 882*01826a49SYabin Cui -1,-1,-1,-1,-1 }; 883*01826a49SYabin Cui``` 884*01826a49SYabin Cui 885*01826a49SYabin Cui##### Offset Codes 886*01826a49SYabin CuiThe decoding table uses an accuracy log of 5 bits (32 states), 887*01826a49SYabin Cuiand supports a maximum `N` value of 28, allowing offset values up to 536,870,908 . 888*01826a49SYabin Cui 889*01826a49SYabin CuiIf any sequence in the compressed block requires a larger offset than this, 890*01826a49SYabin Cuiit's not possible to use the default distribution to represent it. 891*01826a49SYabin Cui``` 892*01826a49SYabin Cuishort offsetCodes_defaultDistribution[29] = 893*01826a49SYabin Cui { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 894*01826a49SYabin Cui 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; 895*01826a49SYabin Cui``` 896*01826a49SYabin Cui 897*01826a49SYabin Cui 898*01826a49SYabin CuiSequence Execution 899*01826a49SYabin Cui------------------ 900*01826a49SYabin CuiOnce literals and sequences have been decoded, 901*01826a49SYabin Cuithey are combined to produce the decoded content of a block. 902*01826a49SYabin Cui 903*01826a49SYabin CuiEach sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`), 904*01826a49SYabin Cuidecoded as described in the [Sequences Section](#sequences-section). 905*01826a49SYabin CuiTo execute a sequence, first copy `literals_length` bytes 906*01826a49SYabin Cuifrom the decoded literals to the output. 907*01826a49SYabin Cui 908*01826a49SYabin CuiThen `match_length` bytes are copied from previous decoded data. 909*01826a49SYabin CuiThe offset to copy from is determined by `offset_value`: 910*01826a49SYabin Cuiif `offset_value > 3`, then the offset is `offset_value - 3`. 911*01826a49SYabin CuiIf `offset_value` is from 1-3, the offset is a special repeat offset value. 912*01826a49SYabin CuiSee the [repeat offset](#repeat-offsets) section for how the offset is determined 913*01826a49SYabin Cuiin this case. 914*01826a49SYabin Cui 915*01826a49SYabin CuiThe offset is defined as from the current position, so an offset of 6 916*01826a49SYabin Cuiand a match length of 3 means that 3 bytes should be copied from 6 bytes back. 917*01826a49SYabin CuiNote that all offsets leading to previously decoded data 918*01826a49SYabin Cuimust be smaller than `Window_Size` defined in `Frame_Header_Descriptor`. 919*01826a49SYabin Cui 920*01826a49SYabin Cui#### Repeat offsets 921*01826a49SYabin CuiAs seen in [Sequence Execution](#sequence-execution), 922*01826a49SYabin Cuithe first 3 values define a repeated offset and we will call them 923*01826a49SYabin Cui`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. 924*01826a49SYabin CuiThey are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". 925*01826a49SYabin Cui 926*01826a49SYabin CuiIf `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc. 927*01826a49SYabin Cui 928*01826a49SYabin CuiThere is an exception though, when current sequence's `literals_length = 0`. 929*01826a49SYabin CuiIn this case, repeated offsets are shifted by one, 930*01826a49SYabin Cuiso an `offset_value` of 1 means `Repeated_Offset2`, 931*01826a49SYabin Cuian `offset_value` of 2 means `Repeated_Offset3`, 932*01826a49SYabin Cuiand an `offset_value` of 3 means `Repeated_Offset1 - 1`. 933*01826a49SYabin Cui 934*01826a49SYabin CuiIn the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the 935*01826a49SYabin Cuidata is considered corrupted. 936*01826a49SYabin Cui 937*01826a49SYabin CuiFor the first block, the starting offset history is populated with following values : 938*01826a49SYabin Cui`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8, 939*01826a49SYabin Cuiunless a dictionary is used, in which case they come from the dictionary. 940*01826a49SYabin Cui 941*01826a49SYabin CuiThen each block gets its starting offset history from the ending values of the most recent `Compressed_Block`. 942*01826a49SYabin CuiNote that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history. 943*01826a49SYabin Cui 944*01826a49SYabin Cui[Offset Codes]: #offset-codes 945*01826a49SYabin Cui 946*01826a49SYabin Cui###### Offset updates rules 947*01826a49SYabin Cui 948*01826a49SYabin CuiDuring the execution of the sequences of a `Compressed_Block`, the 949*01826a49SYabin Cui`Repeated_Offsets`' values are kept up to date, so that they always represent 950*01826a49SYabin Cuithe three most-recently used offsets. In order to achieve that, they are 951*01826a49SYabin Cuiupdated after executing each sequence in the following way: 952*01826a49SYabin Cui 953*01826a49SYabin CuiWhen the sequence's `offset_value` does not refer to one of the 954*01826a49SYabin Cui`Repeated_Offsets`--when it has value greater than 3, or when it has value 3 955*01826a49SYabin Cuiand the sequence's `literals_length` is zero--the `Repeated_Offsets`' values 956*01826a49SYabin Cuiare shifted back one, and `Repeated_Offset1` takes on the value of the 957*01826a49SYabin Cuijust-used offset. 958*01826a49SYabin Cui 959*01826a49SYabin CuiOtherwise, when the sequence's `offset_value` refers to one of the 960*01826a49SYabin Cui`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the 961*01826a49SYabin Cuisequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered 962*01826a49SYabin Cuiso that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and 963*01826a49SYabin Cuithe existing values are pushed back from the first `Repeated_Offset` through to 964*01826a49SYabin Cuithe `Repeated_Offset` selected by the `offset_value`. This effectively performs 965*01826a49SYabin Cuia single-stepped wrapping rotation of the values of these offsets, so that 966*01826a49SYabin Cuitheir order again reflects the recency of their use. 967*01826a49SYabin Cui 968*01826a49SYabin CuiThe following table shows the values of the `Repeated_Offsets` as a series of 969*01826a49SYabin Cuisequences are applied to them: 970*01826a49SYabin Cui 971*01826a49SYabin Cui| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment | 972*01826a49SYabin Cui|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:| 973*01826a49SYabin Cui| | | 1 | 4 | 8 | starting values | 974*01826a49SYabin Cui| 1114 | 11 | 1111 | 1 | 4 | non-repeat | 975*01826a49SYabin Cui| 1 | 22 | 1111 | 1 | 4 | repeat 1: no change | 976*01826a49SYabin Cui| 2225 | 22 | 2222 | 1111 | 1 | non-repeat | 977*01826a49SYabin Cui| 1114 | 111 | 1111 | 2222 | 1111 | non-repeat | 978*01826a49SYabin Cui| 3336 | 33 | 3333 | 1111 | 2222 | non-repeat | 979*01826a49SYabin Cui| 2 | 22 | 1111 | 3333 | 2222 | repeat 2: swap 1 & 2 | 980*01826a49SYabin Cui| 3 | 33 | 2222 | 1111 | 3333 | repeat 3: rotate 3 to 1 | 981*01826a49SYabin Cui| 3 | 0 | 2221 | 2222 | 1111 | special case : insert `repeat1 - 1` | 982*01826a49SYabin Cui| 1 | 0 | 2222 | 2221 | 1111 | == repeat 2 | 983*01826a49SYabin Cui 984*01826a49SYabin Cui 985*01826a49SYabin CuiSkippable Frames 986*01826a49SYabin Cui---------------- 987*01826a49SYabin Cui 988*01826a49SYabin Cui| `Magic_Number` | `Frame_Size` | `User_Data` | 989*01826a49SYabin Cui|:--------------:|:------------:|:-----------:| 990*01826a49SYabin Cui| 4 bytes | 4 bytes | n bytes | 991*01826a49SYabin Cui 992*01826a49SYabin CuiSkippable frames allow the insertion of user-defined metadata 993*01826a49SYabin Cuiinto a flow of concatenated frames. 994*01826a49SYabin Cui 995*01826a49SYabin CuiSkippable frames defined in this specification are compatible with [LZ4] ones. 996*01826a49SYabin Cui 997*01826a49SYabin Cui[LZ4]:https://lz4.github.io/lz4/ 998*01826a49SYabin Cui 999*01826a49SYabin CuiFrom a compliant decoder perspective, skippable frames need just be skipped, 1000*01826a49SYabin Cuiand their content ignored, resuming decoding after the skippable frame. 1001*01826a49SYabin Cui 1002*01826a49SYabin CuiIt can be noted that a skippable frame 1003*01826a49SYabin Cuican be used to watermark a stream of concatenated frames 1004*01826a49SYabin Cuiembedding any kind of tracking information (even just a UUID). 1005*01826a49SYabin CuiUsers wary of such possibility should scan the stream of concatenated frames 1006*01826a49SYabin Cuiin an attempt to detect such frame for analysis or removal. 1007*01826a49SYabin Cui 1008*01826a49SYabin Cui__`Magic_Number`__ 1009*01826a49SYabin Cui 1010*01826a49SYabin Cui4 Bytes, __little-endian__ format. 1011*01826a49SYabin CuiValue : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. 1012*01826a49SYabin CuiAll 16 values are valid to identify a skippable frame. 1013*01826a49SYabin CuiThis specification doesn't detail any specific tagging for skippable frames. 1014*01826a49SYabin Cui 1015*01826a49SYabin Cui__`Frame_Size`__ 1016*01826a49SYabin Cui 1017*01826a49SYabin CuiThis is the size, in bytes, of the following `User_Data` 1018*01826a49SYabin Cui(without including the magic number nor the size field itself). 1019*01826a49SYabin CuiThis field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits. 1020*01826a49SYabin CuiThis means `User_Data` can’t be bigger than (2^32-1) bytes. 1021*01826a49SYabin Cui 1022*01826a49SYabin Cui__`User_Data`__ 1023*01826a49SYabin Cui 1024*01826a49SYabin CuiThe `User_Data` can be anything. Data will just be skipped by the decoder. 1025*01826a49SYabin Cui 1026*01826a49SYabin Cui 1027*01826a49SYabin Cui 1028*01826a49SYabin CuiEntropy Encoding 1029*01826a49SYabin Cui---------------- 1030*01826a49SYabin CuiTwo types of entropy encoding are used by the Zstandard format: 1031*01826a49SYabin CuiFSE, and Huffman coding. 1032*01826a49SYabin CuiHuffman is used to compress literals, 1033*01826a49SYabin Cuiwhile FSE is used for all other symbols 1034*01826a49SYabin Cui(`Literals_Length_Code`, `Match_Length_Code`, offset codes) 1035*01826a49SYabin Cuiand to compress Huffman headers. 1036*01826a49SYabin Cui 1037*01826a49SYabin Cui 1038*01826a49SYabin CuiFSE 1039*01826a49SYabin Cui--- 1040*01826a49SYabin CuiFSE, short for Finite State Entropy, is an entropy codec based on [ANS]. 1041*01826a49SYabin CuiFSE encoding/decoding involves a state that is carried over between symbols, 1042*01826a49SYabin Cuiso decoding must be done in the opposite direction as encoding. 1043*01826a49SYabin CuiTherefore, all FSE bitstreams are read from end to beginning. 1044*01826a49SYabin CuiNote that the order of the bits in the stream is not reversed, 1045*01826a49SYabin Cuiwe just read the elements in the reverse order they are written. 1046*01826a49SYabin Cui 1047*01826a49SYabin CuiFor additional details on FSE, see [Finite State Entropy]. 1048*01826a49SYabin Cui 1049*01826a49SYabin Cui[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/ 1050*01826a49SYabin Cui 1051*01826a49SYabin CuiFSE decoding involves a decoding table which has a power of 2 size, and contain three elements: 1052*01826a49SYabin Cui`Symbol`, `Num_Bits`, and `Baseline`. 1053*01826a49SYabin CuiThe `log2` of the table size is its `Accuracy_Log`. 1054*01826a49SYabin CuiAn FSE state value represents an index in this table. 1055*01826a49SYabin Cui 1056*01826a49SYabin CuiTo obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value. 1057*01826a49SYabin CuiThe next symbol in the stream is the `Symbol` indicated in the table for that state. 1058*01826a49SYabin CuiTo obtain the next state value, 1059*01826a49SYabin Cuithe decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`. 1060*01826a49SYabin Cui 1061*01826a49SYabin Cui[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems 1062*01826a49SYabin Cui 1063*01826a49SYabin Cui### FSE Table Description 1064*01826a49SYabin CuiTo decode FSE streams, it is necessary to construct the decoding table. 1065*01826a49SYabin CuiThe Zstandard format encodes FSE table descriptions as follows: 1066*01826a49SYabin Cui 1067*01826a49SYabin CuiAn FSE distribution table describes the probabilities of all symbols 1068*01826a49SYabin Cuifrom `0` to the last present one (included) 1069*01826a49SYabin Cuion a normalized scale of `1 << Accuracy_Log` . 1070*01826a49SYabin CuiNote that there must be two or more symbols with nonzero probability. 1071*01826a49SYabin Cui 1072*01826a49SYabin CuiIt's a bitstream which is read forward, in __little-endian__ fashion. 1073*01826a49SYabin CuiIt's not necessary to know bitstream exact size, 1074*01826a49SYabin Cuiit will be discovered and reported by the decoding process. 1075*01826a49SYabin Cui 1076*01826a49SYabin CuiThe bitstream starts by reporting on which scale it operates. 1077*01826a49SYabin CuiLet's `low4Bits` designate the lowest 4 bits of the first byte : 1078*01826a49SYabin Cui`Accuracy_Log = low4bits + 5`. 1079*01826a49SYabin Cui 1080*01826a49SYabin CuiThen follows each symbol value, from `0` to last present one. 1081*01826a49SYabin CuiThe number of bits used by each field is variable. 1082*01826a49SYabin CuiIt depends on : 1083*01826a49SYabin Cui 1084*01826a49SYabin Cui- Remaining probabilities + 1 : 1085*01826a49SYabin Cui __example__ : 1086*01826a49SYabin Cui Presuming an `Accuracy_Log` of 8, 1087*01826a49SYabin Cui and presuming 100 probabilities points have already been distributed, 1088*01826a49SYabin Cui the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive). 1089*01826a49SYabin Cui Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)` 1090*01826a49SYabin Cui is the smallest integer `T` that satisfies `(1 << T) > N`. 1091*01826a49SYabin Cui 1092*01826a49SYabin Cui- Value decoded : small values use 1 less bit : 1093*01826a49SYabin Cui __example__ : 1094*01826a49SYabin Cui Presuming values from 0 to 157 (inclusive) are possible, 1095*01826a49SYabin Cui 255-157 = 98 values are remaining in an 8-bits field. 1096*01826a49SYabin Cui They are used this way : 1097*01826a49SYabin Cui first 98 values (hence from 0 to 97) use only 7 bits, 1098*01826a49SYabin Cui values from 98 to 157 use 8 bits. 1099*01826a49SYabin Cui This is achieved through this scheme : 1100*01826a49SYabin Cui 1101*01826a49SYabin Cui | Value read | Value decoded | Number of bits used | 1102*01826a49SYabin Cui | ---------- | ------------- | ------------------- | 1103*01826a49SYabin Cui | 0 - 97 | 0 - 97 | 7 | 1104*01826a49SYabin Cui | 98 - 127 | 98 - 127 | 8 | 1105*01826a49SYabin Cui | 128 - 225 | 0 - 97 | 7 | 1106*01826a49SYabin Cui | 226 - 255 | 128 - 157 | 8 | 1107*01826a49SYabin Cui 1108*01826a49SYabin CuiSymbols probabilities are read one by one, in order. 1109*01826a49SYabin Cui 1110*01826a49SYabin CuiProbability is obtained from Value decoded by following formula : 1111*01826a49SYabin Cui`Proba = value - 1` 1112*01826a49SYabin Cui 1113*01826a49SYabin CuiIt means value `0` becomes negative probability `-1`. 1114*01826a49SYabin Cui`-1` is a special probability, which means "less than 1". 1115*01826a49SYabin CuiIts effect on distribution table is described in the [next section]. 1116*01826a49SYabin CuiFor the purpose of calculating total allocated probability points, it counts as one. 1117*01826a49SYabin Cui 1118*01826a49SYabin Cui[next section]:#from-normalized-distribution-to-decoding-tables 1119*01826a49SYabin Cui 1120*01826a49SYabin CuiWhen a symbol has a __probability__ of `zero`, 1121*01826a49SYabin Cuiit is followed by a 2-bits repeat flag. 1122*01826a49SYabin CuiThis repeat flag tells how many probabilities of zeroes follow the current one. 1123*01826a49SYabin CuiIt provides a number ranging from 0 to 3. 1124*01826a49SYabin CuiIf it is a 3, another 2-bits repeat flag follows, and so on. 1125*01826a49SYabin Cui 1126*01826a49SYabin CuiWhen last symbol reaches cumulated total of `1 << Accuracy_Log`, 1127*01826a49SYabin Cuidecoding is complete. 1128*01826a49SYabin CuiIf the last symbol makes cumulated total go above `1 << Accuracy_Log`, 1129*01826a49SYabin Cuidistribution is considered corrupted. 1130*01826a49SYabin CuiIf this process results in a non-zero probability for a value outside of the 1131*01826a49SYabin Cuivalid range of values that the FSE table is defined for, even if that value is 1132*01826a49SYabin Cuinot used, then the data is considered corrupted. 1133*01826a49SYabin Cui 1134*01826a49SYabin CuiThen the decoder can tell how many bytes were used in this process, 1135*01826a49SYabin Cuiand how many symbols are present. 1136*01826a49SYabin CuiThe bitstream consumes a round number of bytes. 1137*01826a49SYabin CuiAny remaining bit within the last byte is just unused. 1138*01826a49SYabin Cui 1139*01826a49SYabin Cui#### From normalized distribution to decoding tables 1140*01826a49SYabin Cui 1141*01826a49SYabin CuiThe distribution of normalized probabilities is enough 1142*01826a49SYabin Cuito create a unique decoding table. 1143*01826a49SYabin Cui 1144*01826a49SYabin CuiIt follows the following build rule : 1145*01826a49SYabin Cui 1146*01826a49SYabin CuiThe table has a size of `Table_Size = 1 << Accuracy_Log`. 1147*01826a49SYabin CuiEach cell describes the symbol decoded, 1148*01826a49SYabin Cuiand instructions to get the next state (`Number_of_Bits` and `Baseline`). 1149*01826a49SYabin Cui 1150*01826a49SYabin CuiSymbols are scanned in their natural order for "less than 1" probabilities. 1151*01826a49SYabin CuiSymbols with this probability are being attributed a single cell, 1152*01826a49SYabin Cuistarting from the end of the table and retreating. 1153*01826a49SYabin CuiThese symbols define a full state reset, reading `Accuracy_Log` bits. 1154*01826a49SYabin Cui 1155*01826a49SYabin CuiThen, all remaining symbols, sorted in natural order, are allocated cells. 1156*01826a49SYabin CuiStarting from symbol `0` (if it exists), and table position `0`, 1157*01826a49SYabin Cuieach symbol gets allocated as many cells as its probability. 1158*01826a49SYabin CuiCell allocation is spread, not linear : 1159*01826a49SYabin Cuieach successor position follows this rule : 1160*01826a49SYabin Cui 1161*01826a49SYabin Cui``` 1162*01826a49SYabin Cuiposition += (tableSize>>1) + (tableSize>>3) + 3; 1163*01826a49SYabin Cuiposition &= tableSize-1; 1164*01826a49SYabin Cui``` 1165*01826a49SYabin Cui 1166*01826a49SYabin CuiA position is skipped if already occupied by a "less than 1" probability symbol. 1167*01826a49SYabin Cui`position` does not reset between symbols, it simply iterates through 1168*01826a49SYabin Cuieach position in the table, switching to the next symbol when enough 1169*01826a49SYabin Cuistates have been allocated to the current one. 1170*01826a49SYabin Cui 1171*01826a49SYabin CuiThe process guarantees that the table is entirely filled. 1172*01826a49SYabin CuiEach cell corresponds to a state value, which contains the symbol being decoded. 1173*01826a49SYabin Cui 1174*01826a49SYabin CuiTo add the `Number_of_Bits` and `Baseline` required to retrieve next state, 1175*01826a49SYabin Cuiit's first necessary to sort all occurrences of each symbol in state order. 1176*01826a49SYabin CuiLower states will need 1 more bit than higher ones. 1177*01826a49SYabin CuiThe process is repeated for each symbol. 1178*01826a49SYabin Cui 1179*01826a49SYabin Cui__Example__ : 1180*01826a49SYabin CuiPresuming a symbol has a probability of 5, 1181*01826a49SYabin Cuiit receives 5 cells, corresponding to 5 state values. 1182*01826a49SYabin CuiThese state values are then sorted in natural order. 1183*01826a49SYabin Cui 1184*01826a49SYabin CuiNext power of 2 after 5 is 8. 1185*01826a49SYabin CuiSpace of probabilities must be divided into 8 equal parts. 1186*01826a49SYabin CuiPresuming the `Accuracy_Log` is 7, it defines a space of 128 states. 1187*01826a49SYabin CuiDivided by 8, each share is 16 large. 1188*01826a49SYabin Cui 1189*01826a49SYabin CuiIn order to reach 8 shares, 8-5=3 lowest states will count "double", 1190*01826a49SYabin Cuidoubling their shares (32 in width), hence requiring one more bit. 1191*01826a49SYabin Cui 1192*01826a49SYabin CuiBaseline is assigned starting from the higher states using fewer bits, 1193*01826a49SYabin Cuiincreasing at each state, then resuming at the first state, 1194*01826a49SYabin Cuieach state takes its allocated width from Baseline. 1195*01826a49SYabin Cui 1196*01826a49SYabin Cui| state order | 0 | 1 | 2 | 3 | 4 | 1197*01826a49SYabin Cui| ---------------- | ----- | ----- | ------ | ---- | ------ | 1198*01826a49SYabin Cui| state value | 1 | 39 | 77 | 84 | 122 | 1199*01826a49SYabin Cui| width | 32 | 32 | 32 | 16 | 16 | 1200*01826a49SYabin Cui| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | 1201*01826a49SYabin Cui| range number | 2 | 4 | 6 | 0 | 1 | 1202*01826a49SYabin Cui| `Baseline` | 32 | 64 | 96 | 0 | 16 | 1203*01826a49SYabin Cui| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | 1204*01826a49SYabin Cui 1205*01826a49SYabin CuiDuring decoding, the next state value is determined from current state value, 1206*01826a49SYabin Cuiby reading the required `Number_of_Bits`, and adding the specified `Baseline`. 1207*01826a49SYabin Cui 1208*01826a49SYabin CuiSee [Appendix A] for the results of this process applied to the default distributions. 1209*01826a49SYabin Cui 1210*01826a49SYabin Cui[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes 1211*01826a49SYabin Cui 1212*01826a49SYabin Cui 1213*01826a49SYabin CuiHuffman Coding 1214*01826a49SYabin Cui-------------- 1215*01826a49SYabin CuiZstandard Huffman-coded streams are read backwards, 1216*01826a49SYabin Cuisimilar to the FSE bitstreams. 1217*01826a49SYabin CuiTherefore, to find the start of the bitstream, it is required to 1218*01826a49SYabin Cuiknow the offset of the last byte of the Huffman-coded stream. 1219*01826a49SYabin Cui 1220*01826a49SYabin CuiAfter writing the last bit containing information, the compressor 1221*01826a49SYabin Cuiwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of 1222*01826a49SYabin Cuipadding. The last byte of the compressed bitstream cannot be `0` for 1223*01826a49SYabin Cuithat reason. 1224*01826a49SYabin Cui 1225*01826a49SYabin CuiWhen decompressing, the last byte containing the padding is the first 1226*01826a49SYabin Cuibyte to read. The decompressor needs to skip 0-7 initial `0`-bits and 1227*01826a49SYabin Cuithe first `1`-bit it occurs. Afterwards, the useful part of the bitstream 1228*01826a49SYabin Cuibegins. 1229*01826a49SYabin Cui 1230*01826a49SYabin CuiThe bitstream contains Huffman-coded symbols in __little-endian__ order, 1231*01826a49SYabin Cuiwith the codes defined by the method below. 1232*01826a49SYabin Cui 1233*01826a49SYabin Cui### Huffman Tree Description 1234*01826a49SYabin Cui 1235*01826a49SYabin CuiPrefix coding represents symbols from an a priori known alphabet 1236*01826a49SYabin Cuiby bit sequences (codewords), one codeword for each symbol, 1237*01826a49SYabin Cuiin a manner such that different symbols may be represented 1238*01826a49SYabin Cuiby bit sequences of different lengths, 1239*01826a49SYabin Cuibut a parser can always parse an encoded string 1240*01826a49SYabin Cuiunambiguously symbol-by-symbol. 1241*01826a49SYabin Cui 1242*01826a49SYabin CuiGiven an alphabet with known symbol frequencies, 1243*01826a49SYabin Cuithe Huffman algorithm allows the construction of an optimal prefix code 1244*01826a49SYabin Cuiusing the fewest bits of any possible prefix codes for that alphabet. 1245*01826a49SYabin Cui 1246*01826a49SYabin CuiPrefix code must not exceed a maximum code length. 1247*01826a49SYabin CuiMore bits improve accuracy but cost more header size, 1248*01826a49SYabin Cuiand require more memory or more complex decoding operations. 1249*01826a49SYabin CuiThis specification limits maximum code length to 11 bits. 1250*01826a49SYabin Cui 1251*01826a49SYabin Cui#### Representation 1252*01826a49SYabin Cui 1253*01826a49SYabin CuiAll literal values from zero (included) to last present one (excluded) 1254*01826a49SYabin Cuiare represented by `Weight` with values from `0` to `Max_Number_of_Bits`. 1255*01826a49SYabin CuiTransformation from `Weight` to `Number_of_Bits` follows this formula : 1256*01826a49SYabin Cui``` 1257*01826a49SYabin CuiNumber_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 1258*01826a49SYabin Cui``` 1259*01826a49SYabin CuiWhen a literal value is not present, it receives a `Weight` of 0. 1260*01826a49SYabin CuiThe least frequent symbol receives a `Weight` of 1. 1261*01826a49SYabin CuiIf no literal has a `Weight` of 1, then the data is considered corrupted. 1262*01826a49SYabin CuiIf there are not at least two literals with non-zero `Weight`, then the data 1263*01826a49SYabin Cuiis considered corrupted. 1264*01826a49SYabin CuiThe most frequent symbol receives a `Weight` anywhere between 1 and 11 (max). 1265*01826a49SYabin CuiThe last symbol's `Weight` is deduced from previously retrieved Weights, 1266*01826a49SYabin Cuiby completing to the nearest power of 2. It's necessarily non 0. 1267*01826a49SYabin CuiIf it's not possible to reach a clean power of 2 with a single `Weight` value, 1268*01826a49SYabin Cuithe Huffman Tree Description is considered invalid. 1269*01826a49SYabin CuiThis final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. 1270*01826a49SYabin Cui`Max_Number_of_Bits` must be <= 11, 1271*01826a49SYabin Cuiotherwise the representation is considered corrupted. 1272*01826a49SYabin Cui 1273*01826a49SYabin Cui__Example__ : 1274*01826a49SYabin CuiLet's presume the following Huffman tree must be described : 1275*01826a49SYabin Cui 1276*01826a49SYabin Cui| literal value | 0 | 1 | 2 | 3 | 4 | 5 | 1277*01826a49SYabin Cui| ---------------- | --- | --- | --- | --- | --- | --- | 1278*01826a49SYabin Cui| `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | 1279*01826a49SYabin Cui 1280*01826a49SYabin CuiThe tree depth is 4, since its longest elements uses 4 bits 1281*01826a49SYabin Cui(longest elements are the one with smallest frequency). 1282*01826a49SYabin CuiLiteral value `5` will not be listed, as it can be determined from previous values 0-4, 1283*01826a49SYabin Cuinor will values above `5` as they are all 0. 1284*01826a49SYabin CuiValues from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. 1285*01826a49SYabin CuiWeight formula is : 1286*01826a49SYabin Cui``` 1287*01826a49SYabin CuiWeight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 1288*01826a49SYabin Cui``` 1289*01826a49SYabin CuiIt gives the following series of weights : 1290*01826a49SYabin Cui 1291*01826a49SYabin Cui| literal value | 0 | 1 | 2 | 3 | 4 | 1292*01826a49SYabin Cui| ------------- | --- | --- | --- | --- | --- | 1293*01826a49SYabin Cui| `Weight` | 4 | 3 | 2 | 0 | 1 | 1294*01826a49SYabin Cui 1295*01826a49SYabin CuiThe decoder will do the inverse operation : 1296*01826a49SYabin Cuihaving collected weights of literal symbols from `0` to `4`, 1297*01826a49SYabin Cuiit knows the last literal, `5`, is present with a non-zero `Weight`. 1298*01826a49SYabin CuiThe `Weight` of `5` can be determined by advancing to the next power of 2. 1299*01826a49SYabin CuiThe sum of `2^(Weight-1)` (excluding 0's) is : 1300*01826a49SYabin Cui`8 + 4 + 2 + 0 + 1 = 15`. 1301*01826a49SYabin CuiNearest larger power of 2 value is 16. 1302*01826a49SYabin CuiTherefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`. 1303*01826a49SYabin Cui 1304*01826a49SYabin Cui#### Huffman Tree header 1305*01826a49SYabin Cui 1306*01826a49SYabin CuiThis is a single byte value (0-255), 1307*01826a49SYabin Cuiwhich describes how the series of weights is encoded. 1308*01826a49SYabin Cui 1309*01826a49SYabin Cui- if `headerByte` < 128 : 1310*01826a49SYabin Cui the series of weights is compressed using FSE (see below). 1311*01826a49SYabin Cui The length of the FSE-compressed series is equal to `headerByte` (0-127). 1312*01826a49SYabin Cui 1313*01826a49SYabin Cui- if `headerByte` >= 128 : 1314*01826a49SYabin Cui + the series of weights uses a direct representation, 1315*01826a49SYabin Cui where each `Weight` is encoded directly as a 4 bits field (0-15). 1316*01826a49SYabin Cui + They are encoded forward, 2 weights to a byte, 1317*01826a49SYabin Cui first weight taking the top four bits and second one taking the bottom four. 1318*01826a49SYabin Cui * e.g. the following operations could be used to read the weights: 1319*01826a49SYabin Cui `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc. 1320*01826a49SYabin Cui + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes, 1321*01826a49SYabin Cui meaning it uses only full bytes even if `Number_of_Weights` is odd. 1322*01826a49SYabin Cui + `Number_of_Weights = headerByte - 127`. 1323*01826a49SYabin Cui * Note that maximum `Number_of_Weights` is 255-127 = 128, 1324*01826a49SYabin Cui therefore, only up to 128 `Weight` can be encoded using direct representation. 1325*01826a49SYabin Cui * Since the last non-zero `Weight` is _not_ encoded, 1326*01826a49SYabin Cui this scheme is compatible with alphabet sizes of up to 129 symbols, 1327*01826a49SYabin Cui hence including literal symbol 128. 1328*01826a49SYabin Cui * If any literal symbol > 128 has a non-zero `Weight`, 1329*01826a49SYabin Cui direct representation is not possible. 1330*01826a49SYabin Cui In such case, it's necessary to use FSE compression. 1331*01826a49SYabin Cui 1332*01826a49SYabin Cui 1333*01826a49SYabin Cui#### Finite State Entropy (FSE) compression of Huffman weights 1334*01826a49SYabin Cui 1335*01826a49SYabin CuiIn this case, the series of Huffman weights is compressed using FSE compression. 1336*01826a49SYabin CuiIt's a single bitstream with 2 interleaved states, 1337*01826a49SYabin Cuisharing a single distribution table. 1338*01826a49SYabin Cui 1339*01826a49SYabin CuiTo decode an FSE bitstream, it is necessary to know its compressed size. 1340*01826a49SYabin CuiCompressed size is provided by `headerByte`. 1341*01826a49SYabin CuiIt's also necessary to know its _maximum possible_ decompressed size, 1342*01826a49SYabin Cuiwhich is `255`, since literal values span from `0` to `255`, 1343*01826a49SYabin Cuiand last symbol's `Weight` is not represented. 1344*01826a49SYabin Cui 1345*01826a49SYabin CuiAn FSE bitstream starts by a header, describing probabilities distribution. 1346*01826a49SYabin CuiIt will create a Decoding Table. 1347*01826a49SYabin CuiFor a list of Huffman weights, the maximum accuracy log is 6 bits. 1348*01826a49SYabin CuiFor more description see the [FSE header description](#fse-table-description) 1349*01826a49SYabin Cui 1350*01826a49SYabin CuiThe Huffman header compression uses 2 states, 1351*01826a49SYabin Cuiwhich share the same FSE distribution table. 1352*01826a49SYabin CuiThe first state (`State1`) encodes the even indexed symbols, 1353*01826a49SYabin Cuiand the second (`State2`) encodes the odd indexed symbols. 1354*01826a49SYabin Cui`State1` is initialized first, and then `State2`, and they take turns 1355*01826a49SYabin Cuidecoding a single symbol and updating their state. 1356*01826a49SYabin CuiFor more details on these FSE operations, see the [FSE section](#fse). 1357*01826a49SYabin Cui 1358*01826a49SYabin CuiThe number of symbols to decode is determined 1359*01826a49SYabin Cuiby tracking bitStream overflow condition: 1360*01826a49SYabin CuiIf updating state after decoding a symbol would require more bits than 1361*01826a49SYabin Cuiremain in the stream, it is assumed that extra bits are 0. Then, 1362*01826a49SYabin Cuisymbols for each of the final states are decoded and the process is complete. 1363*01826a49SYabin Cui 1364*01826a49SYabin CuiIf this process would produce more weights than the maximum number of decoded 1365*01826a49SYabin Cuiweights (255), then the data is considered corrupted. 1366*01826a49SYabin Cui 1367*01826a49SYabin Cui#### Conversion from weights to Huffman prefix codes 1368*01826a49SYabin Cui 1369*01826a49SYabin CuiAll present symbols shall now have a `Weight` value. 1370*01826a49SYabin CuiIt is possible to transform weights into `Number_of_Bits`, using this formula: 1371*01826a49SYabin Cui``` 1372*01826a49SYabin CuiNumber_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0 1373*01826a49SYabin Cui``` 1374*01826a49SYabin CuiSymbols are sorted by `Weight`. 1375*01826a49SYabin CuiWithin same `Weight`, symbols keep natural sequential order. 1376*01826a49SYabin CuiSymbols with a `Weight` of zero are removed. 1377*01826a49SYabin CuiThen, starting from lowest `Weight`, prefix codes are distributed in sequential order. 1378*01826a49SYabin Cui 1379*01826a49SYabin Cui__Example__ : 1380*01826a49SYabin CuiLet's presume the following list of weights has been decoded : 1381*01826a49SYabin Cui 1382*01826a49SYabin Cui| Literal | 0 | 1 | 2 | 3 | 4 | 5 | 1383*01826a49SYabin Cui| -------- | --- | --- | --- | --- | --- | --- | 1384*01826a49SYabin Cui| `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | 1385*01826a49SYabin Cui 1386*01826a49SYabin CuiSorted by weight and then natural sequential order, 1387*01826a49SYabin Cuiit gives the following distribution : 1388*01826a49SYabin Cui 1389*01826a49SYabin Cui| Literal | 3 | 4 | 5 | 2 | 1 | 0 | 1390*01826a49SYabin Cui| ---------------- | --- | --- | --- | --- | --- | ---- | 1391*01826a49SYabin Cui| `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | 1392*01826a49SYabin Cui| `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | 1393*01826a49SYabin Cui| prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | 1394*01826a49SYabin Cui 1395*01826a49SYabin Cui### Huffman-coded Streams 1396*01826a49SYabin Cui 1397*01826a49SYabin CuiGiven a Huffman decoding table, 1398*01826a49SYabin Cuiit's possible to decode a Huffman-coded stream. 1399*01826a49SYabin Cui 1400*01826a49SYabin CuiEach bitstream must be read _backward_, 1401*01826a49SYabin Cuithat is starting from the end down to the beginning. 1402*01826a49SYabin CuiTherefore it's necessary to know the size of each bitstream. 1403*01826a49SYabin Cui 1404*01826a49SYabin CuiIt's also necessary to know exactly which _bit_ is the last one. 1405*01826a49SYabin CuiThis is detected by a final bit flag : 1406*01826a49SYabin Cuithe highest bit of latest byte is a final-bit-flag. 1407*01826a49SYabin CuiConsequently, a last byte of `0` is not possible. 1408*01826a49SYabin CuiAnd the final-bit-flag itself is not part of the useful bitstream. 1409*01826a49SYabin CuiHence, the last byte contains between 0 and 7 useful bits. 1410*01826a49SYabin Cui 1411*01826a49SYabin CuiStarting from the end, 1412*01826a49SYabin Cuiit's possible to read the bitstream in a __little-endian__ fashion, 1413*01826a49SYabin Cuikeeping track of already used bits. Since the bitstream is encoded in reverse 1414*01826a49SYabin Cuiorder, starting from the end read symbols in forward order. 1415*01826a49SYabin Cui 1416*01826a49SYabin CuiFor example, if the literal sequence "0145" was encoded using above prefix code, 1417*01826a49SYabin Cuiit would be encoded (in reverse order) as: 1418*01826a49SYabin Cui 1419*01826a49SYabin Cui|Symbol | 5 | 4 | 1 | 0 | Padding | 1420*01826a49SYabin Cui|--------|------|------|----|---|---------| 1421*01826a49SYabin Cui|Encoding|`0000`|`0001`|`01`|`1`| `00001` | 1422*01826a49SYabin Cui 1423*01826a49SYabin CuiResulting in following 2-bytes bitstream : 1424*01826a49SYabin Cui``` 1425*01826a49SYabin Cui00010000 00001101 1426*01826a49SYabin Cui``` 1427*01826a49SYabin Cui 1428*01826a49SYabin CuiHere is an alternative representation with the symbol codes separated by underscore: 1429*01826a49SYabin Cui``` 1430*01826a49SYabin Cui0001_0000 00001_1_01 1431*01826a49SYabin Cui``` 1432*01826a49SYabin Cui 1433*01826a49SYabin CuiReading highest `Max_Number_of_Bits` bits, 1434*01826a49SYabin Cuiit's possible to compare extracted value to decoding table, 1435*01826a49SYabin Cuidetermining the symbol to decode and number of bits to discard. 1436*01826a49SYabin Cui 1437*01826a49SYabin CuiThe process continues up to reading the required number of symbols per stream. 1438*01826a49SYabin CuiIf a bitstream is not entirely and exactly consumed, 1439*01826a49SYabin Cuihence reaching exactly its beginning position with _all_ bits consumed, 1440*01826a49SYabin Cuithe decoding process is considered faulty. 1441*01826a49SYabin Cui 1442*01826a49SYabin Cui 1443*01826a49SYabin CuiDictionary Format 1444*01826a49SYabin Cui----------------- 1445*01826a49SYabin Cui 1446*01826a49SYabin CuiZstandard is compatible with "raw content" dictionaries, 1447*01826a49SYabin Cuifree of any format restriction, except that they must be at least 8 bytes. 1448*01826a49SYabin CuiThese dictionaries function as if they were just the `Content` part 1449*01826a49SYabin Cuiof a formatted dictionary. 1450*01826a49SYabin Cui 1451*01826a49SYabin CuiBut dictionaries created by `zstd --train` follow a format, described here. 1452*01826a49SYabin Cui 1453*01826a49SYabin Cui__Pre-requisites__ : a dictionary has a size, 1454*01826a49SYabin Cui defined either by a buffer limit, or a file size. 1455*01826a49SYabin Cui 1456*01826a49SYabin Cui| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` | 1457*01826a49SYabin Cui| -------------- | --------------- | ---------------- | --------- | 1458*01826a49SYabin Cui 1459*01826a49SYabin Cui__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format 1460*01826a49SYabin Cui 1461*01826a49SYabin Cui__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format. 1462*01826a49SYabin Cui `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`). 1463*01826a49SYabin Cui It's used by decoders to check if they use the correct dictionary. 1464*01826a49SYabin Cui 1465*01826a49SYabin Cui_Reserved ranges :_ 1466*01826a49SYabin CuiIf the dictionary is going to be distributed in a public environment, 1467*01826a49SYabin Cuithe following ranges of `Dictionary_ID` are reserved for some future registrar 1468*01826a49SYabin Cuiand shall not be used : 1469*01826a49SYabin Cui 1470*01826a49SYabin Cui - low range : <= 32767 1471*01826a49SYabin Cui - high range : >= (2^31) 1472*01826a49SYabin Cui 1473*01826a49SYabin CuiOutside of these ranges, any value of `Dictionary_ID` 1474*01826a49SYabin Cuiwhich is both `>= 32768` and `< (1<<31)` can be used freely, 1475*01826a49SYabin Cuieven in public environment. 1476*01826a49SYabin Cui 1477*01826a49SYabin Cui 1478*01826a49SYabin Cui__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks]. 1479*01826a49SYabin Cui See the relevant [FSE](#fse-table-description) 1480*01826a49SYabin Cui and [Huffman](#huffman-tree-description) sections for how to decode these tables. 1481*01826a49SYabin Cui They are stored in following order : 1482*01826a49SYabin Cui Huffman tables for literals, FSE table for offsets, 1483*01826a49SYabin Cui FSE table for match lengths, and FSE table for literals lengths. 1484*01826a49SYabin Cui These tables populate the Repeat Stats literals mode and 1485*01826a49SYabin Cui Repeat distribution mode for sequence decoding. 1486*01826a49SYabin Cui It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`), 1487*01826a49SYabin Cui stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes. 1488*01826a49SYabin Cui Each recent offset must have a value <= dictionary content size, and cannot equal 0. 1489*01826a49SYabin Cui 1490*01826a49SYabin Cui__`Content`__ : The rest of the dictionary is its content. 1491*01826a49SYabin Cui The content act as a "past" in front of data to compress or decompress, 1492*01826a49SYabin Cui so it can be referenced in sequence commands. 1493*01826a49SYabin Cui As long as the amount of data decoded from this frame is less than or 1494*01826a49SYabin Cui equal to `Window_Size`, sequence commands may specify offsets longer 1495*01826a49SYabin Cui than the total length of decoded output so far to reference back to the 1496*01826a49SYabin Cui dictionary, even parts of the dictionary with offsets larger than `Window_Size`. 1497*01826a49SYabin Cui After the total output has surpassed `Window_Size` however, 1498*01826a49SYabin Cui this is no longer allowed and the dictionary is no longer accessible. 1499*01826a49SYabin Cui 1500*01826a49SYabin Cui[compressed blocks]: #the-format-of-compressed_block 1501*01826a49SYabin Cui 1502*01826a49SYabin CuiIf a dictionary is provided by an external source, 1503*01826a49SYabin Cuiit should be loaded with great care, its content considered untrusted. 1504*01826a49SYabin Cui 1505*01826a49SYabin Cui 1506*01826a49SYabin Cui 1507*01826a49SYabin CuiAppendix A - Decoding tables for predefined codes 1508*01826a49SYabin Cui------------------------------------------------- 1509*01826a49SYabin Cui 1510*01826a49SYabin CuiThis appendix contains FSE decoding tables 1511*01826a49SYabin Cuifor the predefined literal length, match length, and offset codes. 1512*01826a49SYabin CuiThe tables have been constructed using the algorithm as given above in chapter 1513*01826a49SYabin Cui"from normalized distribution to decoding tables". 1514*01826a49SYabin CuiThe tables here can be used as examples 1515*01826a49SYabin Cuito crosscheck that an implementation build its decoding tables correctly. 1516*01826a49SYabin Cui 1517*01826a49SYabin Cui#### Literal Length Code: 1518*01826a49SYabin Cui 1519*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base | 1520*01826a49SYabin Cui| ----- | ------ | -------------- | ---- | 1521*01826a49SYabin Cui| 0 | 0 | 4 | 0 | 1522*01826a49SYabin Cui| 1 | 0 | 4 | 16 | 1523*01826a49SYabin Cui| 2 | 1 | 5 | 32 | 1524*01826a49SYabin Cui| 3 | 3 | 5 | 0 | 1525*01826a49SYabin Cui| 4 | 4 | 5 | 0 | 1526*01826a49SYabin Cui| 5 | 6 | 5 | 0 | 1527*01826a49SYabin Cui| 6 | 7 | 5 | 0 | 1528*01826a49SYabin Cui| 7 | 9 | 5 | 0 | 1529*01826a49SYabin Cui| 8 | 10 | 5 | 0 | 1530*01826a49SYabin Cui| 9 | 12 | 5 | 0 | 1531*01826a49SYabin Cui| 10 | 14 | 6 | 0 | 1532*01826a49SYabin Cui| 11 | 16 | 5 | 0 | 1533*01826a49SYabin Cui| 12 | 18 | 5 | 0 | 1534*01826a49SYabin Cui| 13 | 19 | 5 | 0 | 1535*01826a49SYabin Cui| 14 | 21 | 5 | 0 | 1536*01826a49SYabin Cui| 15 | 22 | 5 | 0 | 1537*01826a49SYabin Cui| 16 | 24 | 5 | 0 | 1538*01826a49SYabin Cui| 17 | 25 | 5 | 32 | 1539*01826a49SYabin Cui| 18 | 26 | 5 | 0 | 1540*01826a49SYabin Cui| 19 | 27 | 6 | 0 | 1541*01826a49SYabin Cui| 20 | 29 | 6 | 0 | 1542*01826a49SYabin Cui| 21 | 31 | 6 | 0 | 1543*01826a49SYabin Cui| 22 | 0 | 4 | 32 | 1544*01826a49SYabin Cui| 23 | 1 | 4 | 0 | 1545*01826a49SYabin Cui| 24 | 2 | 5 | 0 | 1546*01826a49SYabin Cui| 25 | 4 | 5 | 32 | 1547*01826a49SYabin Cui| 26 | 5 | 5 | 0 | 1548*01826a49SYabin Cui| 27 | 7 | 5 | 32 | 1549*01826a49SYabin Cui| 28 | 8 | 5 | 0 | 1550*01826a49SYabin Cui| 29 | 10 | 5 | 32 | 1551*01826a49SYabin Cui| 30 | 11 | 5 | 0 | 1552*01826a49SYabin Cui| 31 | 13 | 6 | 0 | 1553*01826a49SYabin Cui| 32 | 16 | 5 | 32 | 1554*01826a49SYabin Cui| 33 | 17 | 5 | 0 | 1555*01826a49SYabin Cui| 34 | 19 | 5 | 32 | 1556*01826a49SYabin Cui| 35 | 20 | 5 | 0 | 1557*01826a49SYabin Cui| 36 | 22 | 5 | 32 | 1558*01826a49SYabin Cui| 37 | 23 | 5 | 0 | 1559*01826a49SYabin Cui| 38 | 25 | 4 | 0 | 1560*01826a49SYabin Cui| 39 | 25 | 4 | 16 | 1561*01826a49SYabin Cui| 40 | 26 | 5 | 32 | 1562*01826a49SYabin Cui| 41 | 28 | 6 | 0 | 1563*01826a49SYabin Cui| 42 | 30 | 6 | 0 | 1564*01826a49SYabin Cui| 43 | 0 | 4 | 48 | 1565*01826a49SYabin Cui| 44 | 1 | 4 | 16 | 1566*01826a49SYabin Cui| 45 | 2 | 5 | 32 | 1567*01826a49SYabin Cui| 46 | 3 | 5 | 32 | 1568*01826a49SYabin Cui| 47 | 5 | 5 | 32 | 1569*01826a49SYabin Cui| 48 | 6 | 5 | 32 | 1570*01826a49SYabin Cui| 49 | 8 | 5 | 32 | 1571*01826a49SYabin Cui| 50 | 9 | 5 | 32 | 1572*01826a49SYabin Cui| 51 | 11 | 5 | 32 | 1573*01826a49SYabin Cui| 52 | 12 | 5 | 32 | 1574*01826a49SYabin Cui| 53 | 15 | 6 | 0 | 1575*01826a49SYabin Cui| 54 | 17 | 5 | 32 | 1576*01826a49SYabin Cui| 55 | 18 | 5 | 32 | 1577*01826a49SYabin Cui| 56 | 20 | 5 | 32 | 1578*01826a49SYabin Cui| 57 | 21 | 5 | 32 | 1579*01826a49SYabin Cui| 58 | 23 | 5 | 32 | 1580*01826a49SYabin Cui| 59 | 24 | 5 | 32 | 1581*01826a49SYabin Cui| 60 | 35 | 6 | 0 | 1582*01826a49SYabin Cui| 61 | 34 | 6 | 0 | 1583*01826a49SYabin Cui| 62 | 33 | 6 | 0 | 1584*01826a49SYabin Cui| 63 | 32 | 6 | 0 | 1585*01826a49SYabin Cui 1586*01826a49SYabin Cui#### Match Length Code: 1587*01826a49SYabin Cui 1588*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base | 1589*01826a49SYabin Cui| ----- | ------ | -------------- | ---- | 1590*01826a49SYabin Cui| 0 | 0 | 6 | 0 | 1591*01826a49SYabin Cui| 1 | 1 | 4 | 0 | 1592*01826a49SYabin Cui| 2 | 2 | 5 | 32 | 1593*01826a49SYabin Cui| 3 | 3 | 5 | 0 | 1594*01826a49SYabin Cui| 4 | 5 | 5 | 0 | 1595*01826a49SYabin Cui| 5 | 6 | 5 | 0 | 1596*01826a49SYabin Cui| 6 | 8 | 5 | 0 | 1597*01826a49SYabin Cui| 7 | 10 | 6 | 0 | 1598*01826a49SYabin Cui| 8 | 13 | 6 | 0 | 1599*01826a49SYabin Cui| 9 | 16 | 6 | 0 | 1600*01826a49SYabin Cui| 10 | 19 | 6 | 0 | 1601*01826a49SYabin Cui| 11 | 22 | 6 | 0 | 1602*01826a49SYabin Cui| 12 | 25 | 6 | 0 | 1603*01826a49SYabin Cui| 13 | 28 | 6 | 0 | 1604*01826a49SYabin Cui| 14 | 31 | 6 | 0 | 1605*01826a49SYabin Cui| 15 | 33 | 6 | 0 | 1606*01826a49SYabin Cui| 16 | 35 | 6 | 0 | 1607*01826a49SYabin Cui| 17 | 37 | 6 | 0 | 1608*01826a49SYabin Cui| 18 | 39 | 6 | 0 | 1609*01826a49SYabin Cui| 19 | 41 | 6 | 0 | 1610*01826a49SYabin Cui| 20 | 43 | 6 | 0 | 1611*01826a49SYabin Cui| 21 | 45 | 6 | 0 | 1612*01826a49SYabin Cui| 22 | 1 | 4 | 16 | 1613*01826a49SYabin Cui| 23 | 2 | 4 | 0 | 1614*01826a49SYabin Cui| 24 | 3 | 5 | 32 | 1615*01826a49SYabin Cui| 25 | 4 | 5 | 0 | 1616*01826a49SYabin Cui| 26 | 6 | 5 | 32 | 1617*01826a49SYabin Cui| 27 | 7 | 5 | 0 | 1618*01826a49SYabin Cui| 28 | 9 | 6 | 0 | 1619*01826a49SYabin Cui| 29 | 12 | 6 | 0 | 1620*01826a49SYabin Cui| 30 | 15 | 6 | 0 | 1621*01826a49SYabin Cui| 31 | 18 | 6 | 0 | 1622*01826a49SYabin Cui| 32 | 21 | 6 | 0 | 1623*01826a49SYabin Cui| 33 | 24 | 6 | 0 | 1624*01826a49SYabin Cui| 34 | 27 | 6 | 0 | 1625*01826a49SYabin Cui| 35 | 30 | 6 | 0 | 1626*01826a49SYabin Cui| 36 | 32 | 6 | 0 | 1627*01826a49SYabin Cui| 37 | 34 | 6 | 0 | 1628*01826a49SYabin Cui| 38 | 36 | 6 | 0 | 1629*01826a49SYabin Cui| 39 | 38 | 6 | 0 | 1630*01826a49SYabin Cui| 40 | 40 | 6 | 0 | 1631*01826a49SYabin Cui| 41 | 42 | 6 | 0 | 1632*01826a49SYabin Cui| 42 | 44 | 6 | 0 | 1633*01826a49SYabin Cui| 43 | 1 | 4 | 32 | 1634*01826a49SYabin Cui| 44 | 1 | 4 | 48 | 1635*01826a49SYabin Cui| 45 | 2 | 4 | 16 | 1636*01826a49SYabin Cui| 46 | 4 | 5 | 32 | 1637*01826a49SYabin Cui| 47 | 5 | 5 | 32 | 1638*01826a49SYabin Cui| 48 | 7 | 5 | 32 | 1639*01826a49SYabin Cui| 49 | 8 | 5 | 32 | 1640*01826a49SYabin Cui| 50 | 11 | 6 | 0 | 1641*01826a49SYabin Cui| 51 | 14 | 6 | 0 | 1642*01826a49SYabin Cui| 52 | 17 | 6 | 0 | 1643*01826a49SYabin Cui| 53 | 20 | 6 | 0 | 1644*01826a49SYabin Cui| 54 | 23 | 6 | 0 | 1645*01826a49SYabin Cui| 55 | 26 | 6 | 0 | 1646*01826a49SYabin Cui| 56 | 29 | 6 | 0 | 1647*01826a49SYabin Cui| 57 | 52 | 6 | 0 | 1648*01826a49SYabin Cui| 58 | 51 | 6 | 0 | 1649*01826a49SYabin Cui| 59 | 50 | 6 | 0 | 1650*01826a49SYabin Cui| 60 | 49 | 6 | 0 | 1651*01826a49SYabin Cui| 61 | 48 | 6 | 0 | 1652*01826a49SYabin Cui| 62 | 47 | 6 | 0 | 1653*01826a49SYabin Cui| 63 | 46 | 6 | 0 | 1654*01826a49SYabin Cui 1655*01826a49SYabin Cui#### Offset Code: 1656*01826a49SYabin Cui 1657*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base | 1658*01826a49SYabin Cui| ----- | ------ | -------------- | ---- | 1659*01826a49SYabin Cui| 0 | 0 | 5 | 0 | 1660*01826a49SYabin Cui| 1 | 6 | 4 | 0 | 1661*01826a49SYabin Cui| 2 | 9 | 5 | 0 | 1662*01826a49SYabin Cui| 3 | 15 | 5 | 0 | 1663*01826a49SYabin Cui| 4 | 21 | 5 | 0 | 1664*01826a49SYabin Cui| 5 | 3 | 5 | 0 | 1665*01826a49SYabin Cui| 6 | 7 | 4 | 0 | 1666*01826a49SYabin Cui| 7 | 12 | 5 | 0 | 1667*01826a49SYabin Cui| 8 | 18 | 5 | 0 | 1668*01826a49SYabin Cui| 9 | 23 | 5 | 0 | 1669*01826a49SYabin Cui| 10 | 5 | 5 | 0 | 1670*01826a49SYabin Cui| 11 | 8 | 4 | 0 | 1671*01826a49SYabin Cui| 12 | 14 | 5 | 0 | 1672*01826a49SYabin Cui| 13 | 20 | 5 | 0 | 1673*01826a49SYabin Cui| 14 | 2 | 5 | 0 | 1674*01826a49SYabin Cui| 15 | 7 | 4 | 16 | 1675*01826a49SYabin Cui| 16 | 11 | 5 | 0 | 1676*01826a49SYabin Cui| 17 | 17 | 5 | 0 | 1677*01826a49SYabin Cui| 18 | 22 | 5 | 0 | 1678*01826a49SYabin Cui| 19 | 4 | 5 | 0 | 1679*01826a49SYabin Cui| 20 | 8 | 4 | 16 | 1680*01826a49SYabin Cui| 21 | 13 | 5 | 0 | 1681*01826a49SYabin Cui| 22 | 19 | 5 | 0 | 1682*01826a49SYabin Cui| 23 | 1 | 5 | 0 | 1683*01826a49SYabin Cui| 24 | 6 | 4 | 16 | 1684*01826a49SYabin Cui| 25 | 10 | 5 | 0 | 1685*01826a49SYabin Cui| 26 | 16 | 5 | 0 | 1686*01826a49SYabin Cui| 27 | 28 | 5 | 0 | 1687*01826a49SYabin Cui| 28 | 27 | 5 | 0 | 1688*01826a49SYabin Cui| 29 | 26 | 5 | 0 | 1689*01826a49SYabin Cui| 30 | 25 | 5 | 0 | 1690*01826a49SYabin Cui| 31 | 24 | 5 | 0 | 1691*01826a49SYabin Cui 1692*01826a49SYabin Cui 1693*01826a49SYabin Cui 1694*01826a49SYabin CuiAppendix B - Resources for implementers 1695*01826a49SYabin Cui------------------------------------------------- 1696*01826a49SYabin Cui 1697*01826a49SYabin CuiAn open source reference implementation is available on : 1698*01826a49SYabin Cuihttps://github.com/facebook/zstd 1699*01826a49SYabin Cui 1700*01826a49SYabin CuiThe project contains a frame generator, called [decodeCorpus], 1701*01826a49SYabin Cuiwhich can be used by any 3rd-party implementation 1702*01826a49SYabin Cuito verify that a tested decoder is compliant with the specification. 1703*01826a49SYabin Cui 1704*01826a49SYabin Cui[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing 1705*01826a49SYabin Cui 1706*01826a49SYabin Cui`decodeCorpus` generates random valid frames. 1707*01826a49SYabin CuiA compliant decoder should be able to decode them all, 1708*01826a49SYabin Cuior at least provide a meaningful error code explaining for which reason it cannot 1709*01826a49SYabin Cui(memory limit restrictions for example). 1710*01826a49SYabin Cui 1711*01826a49SYabin Cui 1712*01826a49SYabin CuiVersion changes 1713*01826a49SYabin Cui--------------- 1714*01826a49SYabin Cui- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov 1715*01826a49SYabin Cui- 0.3.9 : clarifications for Huffman-compressed literal sizes. 1716*01826a49SYabin Cui- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions. 1717*01826a49SYabin Cui- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878 1718*01826a49SYabin Cui- 0.3.6 : clarifications for Dictionary_ID 1719*01826a49SYabin Cui- 0.3.5 : clarifications for Block_Maximum_Size 1720*01826a49SYabin Cui- 0.3.4 : clarifications for FSE decoding table 1721*01826a49SYabin Cui- 0.3.3 : clarifications for field Block_Size 1722*01826a49SYabin Cui- 0.3.2 : remove additional block size restriction on compressed blocks 1723*01826a49SYabin Cui- 0.3.1 : minor clarification regarding offset history update rules 1724*01826a49SYabin Cui- 0.3.0 : minor edits to match RFC8478 1725*01826a49SYabin Cui- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz 1726*01826a49SYabin Cui- 0.2.8 : clarifications for IETF RFC discuss 1727*01826a49SYabin Cui- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell 1728*01826a49SYabin Cui- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz 1729*01826a49SYabin Cui- 0.2.5 : minor typos and clarifications 1730*01826a49SYabin Cui- 0.2.4 : section restructuring, by Sean Purcell 1731*01826a49SYabin Cui- 0.2.3 : clarified several details, by Sean Purcell 1732*01826a49SYabin Cui- 0.2.2 : added predefined codes, by Johannes Rudolph 1733*01826a49SYabin Cui- 0.2.1 : clarify field names, by Przemyslaw Skibinski 1734*01826a49SYabin Cui- 0.2.0 : numerous format adjustments for zstd v0.8+ 1735*01826a49SYabin Cui- 0.1.2 : limit Huffman tree depth to 11 bits 1736*01826a49SYabin Cui- 0.1.1 : reserved dictID ranges 1737*01826a49SYabin Cui- 0.1.0 : initial release 1738