xref: /aosp_15_r20/external/zstd/doc/zstd_compression_format.md (revision 01826a4963a0d8a59bc3812d29bdf0fb76416722)
1*01826a49SYabin CuiZstandard Compression Format
2*01826a49SYabin Cui============================
3*01826a49SYabin Cui
4*01826a49SYabin Cui### Notices
5*01826a49SYabin Cui
6*01826a49SYabin CuiCopyright (c) Meta Platforms, Inc. and affiliates.
7*01826a49SYabin Cui
8*01826a49SYabin CuiPermission is granted to copy and distribute this document
9*01826a49SYabin Cuifor any purpose and without charge,
10*01826a49SYabin Cuiincluding translations into other languages
11*01826a49SYabin Cuiand incorporation into compilations,
12*01826a49SYabin Cuiprovided that the copyright notice and this notice are preserved,
13*01826a49SYabin Cuiand that any substantive changes or deletions from the original
14*01826a49SYabin Cuiare clearly marked.
15*01826a49SYabin CuiDistribution of this document is unlimited.
16*01826a49SYabin Cui
17*01826a49SYabin Cui### Version
18*01826a49SYabin Cui
19*01826a49SYabin Cui0.4.0 (2023-06-05)
20*01826a49SYabin Cui
21*01826a49SYabin Cui
22*01826a49SYabin CuiIntroduction
23*01826a49SYabin Cui------------
24*01826a49SYabin Cui
25*01826a49SYabin CuiThe purpose of this document is to define a lossless compressed data format,
26*01826a49SYabin Cuithat is independent of CPU type, operating system,
27*01826a49SYabin Cuifile system and character set, suitable for
28*01826a49SYabin Cuifile compression, pipe and streaming compression,
29*01826a49SYabin Cuiusing the [Zstandard algorithm](https://facebook.github.io/zstd/).
30*01826a49SYabin CuiThe text of the specification assumes a basic background in programming
31*01826a49SYabin Cuiat the level of bits and other primitive data representations.
32*01826a49SYabin Cui
33*01826a49SYabin CuiThe data can be produced or consumed,
34*01826a49SYabin Cuieven for an arbitrarily long sequentially presented input data stream,
35*01826a49SYabin Cuiusing only an a priori bounded amount of intermediate storage,
36*01826a49SYabin Cuiand hence can be used in data communications.
37*01826a49SYabin CuiThe format uses the Zstandard compression method,
38*01826a49SYabin Cuiand optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/),
39*01826a49SYabin Cuifor detection of data corruption.
40*01826a49SYabin Cui
41*01826a49SYabin CuiThe data format defined by this specification
42*01826a49SYabin Cuidoes not attempt to allow random access to compressed data.
43*01826a49SYabin Cui
44*01826a49SYabin CuiUnless otherwise indicated below,
45*01826a49SYabin Cuia compliant compressor must produce data sets
46*01826a49SYabin Cuithat conform to the specifications presented here.
47*01826a49SYabin CuiIt doesn’t need to support all options though.
48*01826a49SYabin Cui
49*01826a49SYabin CuiA compliant decompressor must be able to decompress
50*01826a49SYabin Cuiat least one working set of parameters
51*01826a49SYabin Cuithat conforms to the specifications presented here.
52*01826a49SYabin CuiIt may also ignore informative fields, such as checksum.
53*01826a49SYabin CuiWhenever it does not support a parameter defined in the compressed stream,
54*01826a49SYabin Cuiit must produce a non-ambiguous error code and associated error message
55*01826a49SYabin Cuiexplaining which parameter is unsupported.
56*01826a49SYabin Cui
57*01826a49SYabin CuiThis specification is intended for use by implementers of software
58*01826a49SYabin Cuito compress data into Zstandard format and/or decompress data from Zstandard format.
59*01826a49SYabin CuiThe Zstandard format is supported by an open source reference implementation,
60*01826a49SYabin Cuiwritten in portable C, and available at : https://github.com/facebook/zstd .
61*01826a49SYabin Cui
62*01826a49SYabin Cui
63*01826a49SYabin Cui### Overall conventions
64*01826a49SYabin CuiIn this document:
65*01826a49SYabin Cui- square brackets i.e. `[` and `]` are used to indicate optional fields or parameters.
66*01826a49SYabin Cui- the naming convention for identifiers is `Mixed_Case_With_Underscores`
67*01826a49SYabin Cui
68*01826a49SYabin Cui### Definitions
69*01826a49SYabin CuiContent compressed by Zstandard is transformed into a Zstandard __frame__.
70*01826a49SYabin CuiMultiple frames can be appended into a single file or stream.
71*01826a49SYabin CuiA frame is completely independent, has a defined beginning and end,
72*01826a49SYabin Cuiand a set of parameters which tells the decoder how to decompress it.
73*01826a49SYabin Cui
74*01826a49SYabin CuiA frame encapsulates one or multiple __blocks__.
75*01826a49SYabin CuiEach block contains arbitrary content, which is described by its header,
76*01826a49SYabin Cuiand has a guaranteed maximum content size, which depends on frame parameters.
77*01826a49SYabin CuiUnlike frames, each block depends on previous blocks for proper decoding.
78*01826a49SYabin CuiHowever, each block can be decompressed without waiting for its successor,
79*01826a49SYabin Cuiallowing streaming operations.
80*01826a49SYabin Cui
81*01826a49SYabin CuiOverview
82*01826a49SYabin Cui---------
83*01826a49SYabin Cui- [Frames](#frames)
84*01826a49SYabin Cui  - [Zstandard frames](#zstandard-frames)
85*01826a49SYabin Cui    - [Blocks](#blocks)
86*01826a49SYabin Cui      - [Literals Section](#literals-section)
87*01826a49SYabin Cui      - [Sequences Section](#sequences-section)
88*01826a49SYabin Cui      - [Sequence Execution](#sequence-execution)
89*01826a49SYabin Cui  - [Skippable frames](#skippable-frames)
90*01826a49SYabin Cui- [Entropy Encoding](#entropy-encoding)
91*01826a49SYabin Cui  - [FSE](#fse)
92*01826a49SYabin Cui  - [Huffman Coding](#huffman-coding)
93*01826a49SYabin Cui- [Dictionary Format](#dictionary-format)
94*01826a49SYabin Cui
95*01826a49SYabin CuiFrames
96*01826a49SYabin Cui------
97*01826a49SYabin CuiZstandard compressed data is made of one or more __frames__.
98*01826a49SYabin CuiEach frame is independent and can be decompressed independently of other frames.
99*01826a49SYabin CuiThe decompressed content of multiple concatenated frames is the concatenation of
100*01826a49SYabin Cuieach frame decompressed content.
101*01826a49SYabin Cui
102*01826a49SYabin CuiThere are two frame formats defined by Zstandard:
103*01826a49SYabin Cui  Zstandard frames and Skippable frames.
104*01826a49SYabin CuiZstandard frames contain compressed data, while
105*01826a49SYabin Cuiskippable frames contain custom user metadata.
106*01826a49SYabin Cui
107*01826a49SYabin Cui## Zstandard frames
108*01826a49SYabin CuiThe structure of a single Zstandard frame is following:
109*01826a49SYabin Cui
110*01826a49SYabin Cui| `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] |
111*01826a49SYabin Cui|:--------------:|:--------------:|:----------:| ------------------ |:--------------------:|
112*01826a49SYabin Cui|  4 bytes       |  2-14 bytes    |  n bytes   |                    |     0-4 bytes        |
113*01826a49SYabin Cui
114*01826a49SYabin Cui__`Magic_Number`__
115*01826a49SYabin Cui
116*01826a49SYabin Cui4 Bytes, __little-endian__ format.
117*01826a49SYabin CuiValue : 0xFD2FB528
118*01826a49SYabin CuiNote: This value was selected to be less probable to find at the beginning of some random file.
119*01826a49SYabin CuiIt avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.),
120*01826a49SYabin Cuicontains byte values outside of ASCII range,
121*01826a49SYabin Cuiand doesn't map into UTF8 space.
122*01826a49SYabin CuiIt reduces the chances that a text file represent this value by accident.
123*01826a49SYabin Cui
124*01826a49SYabin Cui__`Frame_Header`__
125*01826a49SYabin Cui
126*01826a49SYabin Cui2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header).
127*01826a49SYabin Cui
128*01826a49SYabin Cui__`Data_Block`__
129*01826a49SYabin Cui
130*01826a49SYabin CuiDetailed in [`Blocks`](#blocks).
131*01826a49SYabin CuiThat’s where compressed data is stored.
132*01826a49SYabin Cui
133*01826a49SYabin Cui__`Content_Checksum`__
134*01826a49SYabin Cui
135*01826a49SYabin CuiAn optional 32-bit checksum, only present if `Content_Checksum_flag` is set.
136*01826a49SYabin CuiThe content checksum is the result
137*01826a49SYabin Cuiof [xxh64() hash function](https://cyan4973.github.io/xxHash/)
138*01826a49SYabin Cuidigesting the original (decoded) data as input, and a seed of zero.
139*01826a49SYabin CuiThe low 4 bytes of the checksum are stored in __little-endian__ format.
140*01826a49SYabin Cui
141*01826a49SYabin Cui### `Frame_Header`
142*01826a49SYabin Cui
143*01826a49SYabin CuiThe `Frame_Header` has a variable size, with a minimum of 2 bytes,
144*01826a49SYabin Cuiand up to 14 bytes depending on optional parameters.
145*01826a49SYabin CuiThe structure of `Frame_Header` is following:
146*01826a49SYabin Cui
147*01826a49SYabin Cui| `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] |
148*01826a49SYabin Cui| ------------------------- | --------------------- | ----------------- | ---------------------- |
149*01826a49SYabin Cui| 1 byte                    | 0-1 byte              | 0-4 bytes         | 0-8 bytes              |
150*01826a49SYabin Cui
151*01826a49SYabin Cui#### `Frame_Header_Descriptor`
152*01826a49SYabin Cui
153*01826a49SYabin CuiThe first header's byte is called the `Frame_Header_Descriptor`.
154*01826a49SYabin CuiIt describes which other fields are present.
155*01826a49SYabin CuiDecoding this byte is enough to tell the size of `Frame_Header`.
156*01826a49SYabin Cui
157*01826a49SYabin Cui| Bit number | Field name                |
158*01826a49SYabin Cui| ---------- | ----------                |
159*01826a49SYabin Cui| 7-6        | `Frame_Content_Size_flag` |
160*01826a49SYabin Cui| 5          | `Single_Segment_flag`     |
161*01826a49SYabin Cui| 4          | `Unused_bit`              |
162*01826a49SYabin Cui| 3          | `Reserved_bit`            |
163*01826a49SYabin Cui| 2          | `Content_Checksum_flag`   |
164*01826a49SYabin Cui| 1-0        | `Dictionary_ID_flag`      |
165*01826a49SYabin Cui
166*01826a49SYabin CuiIn this table, bit 7 is the highest bit, while bit 0 is the lowest one.
167*01826a49SYabin Cui
168*01826a49SYabin Cui__`Frame_Content_Size_flag`__
169*01826a49SYabin Cui
170*01826a49SYabin CuiThis is a 2-bits flag (`= Frame_Header_Descriptor >> 6`),
171*01826a49SYabin Cuispecifying if `Frame_Content_Size` (the decompressed data size)
172*01826a49SYabin Cuiis provided within the header.
173*01826a49SYabin Cui`Flag_Value` provides `FCS_Field_Size`,
174*01826a49SYabin Cuiwhich is the number of bytes used by `Frame_Content_Size`
175*01826a49SYabin Cuiaccording to the following table:
176*01826a49SYabin Cui
177*01826a49SYabin Cui|  `Flag_Value`  |    0   |  1  |  2  |  3  |
178*01826a49SYabin Cui| -------------- | ------ | --- | --- | --- |
179*01826a49SYabin Cui|`FCS_Field_Size`| 0 or 1 |  2  |  4  |  8  |
180*01826a49SYabin Cui
181*01826a49SYabin CuiWhen `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` :
182*01826a49SYabin Cuiif `Single_Segment_flag` is set, `FCS_Field_Size` is 1.
183*01826a49SYabin CuiOtherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided.
184*01826a49SYabin Cui
185*01826a49SYabin Cui__`Single_Segment_flag`__
186*01826a49SYabin Cui
187*01826a49SYabin CuiIf this flag is set,
188*01826a49SYabin Cuidata must be regenerated within a single continuous memory segment.
189*01826a49SYabin Cui
190*01826a49SYabin CuiIn this case, `Window_Descriptor` byte is skipped,
191*01826a49SYabin Cuibut `Frame_Content_Size` is necessarily present.
192*01826a49SYabin CuiAs a consequence, the decoder must allocate a memory segment
193*01826a49SYabin Cuiof size equal or larger than `Frame_Content_Size`.
194*01826a49SYabin Cui
195*01826a49SYabin CuiIn order to preserve the decoder from unreasonable memory requirements,
196*01826a49SYabin Cuia decoder is allowed to reject a compressed frame
197*01826a49SYabin Cuiwhich requests a memory size beyond decoder's authorized range.
198*01826a49SYabin Cui
199*01826a49SYabin CuiFor broader compatibility, decoders are recommended to support
200*01826a49SYabin Cuimemory sizes of at least 8 MB.
201*01826a49SYabin CuiThis is only a recommendation,
202*01826a49SYabin Cuieach decoder is free to support higher or lower limits,
203*01826a49SYabin Cuidepending on local limitations.
204*01826a49SYabin Cui
205*01826a49SYabin Cui__`Unused_bit`__
206*01826a49SYabin Cui
207*01826a49SYabin CuiA decoder compliant with this specification version shall not interpret this bit.
208*01826a49SYabin CuiIt might be used in any future version,
209*01826a49SYabin Cuito signal a property which is transparent to properly decode the frame.
210*01826a49SYabin CuiAn encoder compliant with this specification version must set this bit to zero.
211*01826a49SYabin Cui
212*01826a49SYabin Cui__`Reserved_bit`__
213*01826a49SYabin Cui
214*01826a49SYabin CuiThis bit is reserved for some future feature.
215*01826a49SYabin CuiIts value _must be zero_.
216*01826a49SYabin CuiA decoder compliant with this specification version must ensure it is not set.
217*01826a49SYabin CuiThis bit may be used in a future revision,
218*01826a49SYabin Cuito signal a feature that must be interpreted to decode the frame correctly.
219*01826a49SYabin Cui
220*01826a49SYabin Cui__`Content_Checksum_flag`__
221*01826a49SYabin Cui
222*01826a49SYabin CuiIf this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
223*01826a49SYabin CuiSee `Content_Checksum` paragraph.
224*01826a49SYabin Cui
225*01826a49SYabin Cui__`Dictionary_ID_flag`__
226*01826a49SYabin Cui
227*01826a49SYabin CuiThis is a 2-bits flag (`= FHD & 3`),
228*01826a49SYabin Cuitelling if a dictionary ID is provided within the header.
229*01826a49SYabin CuiIt also specifies the size of this field as `DID_Field_Size`.
230*01826a49SYabin Cui
231*01826a49SYabin Cui|`Flag_Value`    |  0  |  1  |  2  |  3  |
232*01826a49SYabin Cui| -------------- | --- | --- | --- | --- |
233*01826a49SYabin Cui|`DID_Field_Size`|  0  |  1  |  2  |  4  |
234*01826a49SYabin Cui
235*01826a49SYabin Cui#### `Window_Descriptor`
236*01826a49SYabin Cui
237*01826a49SYabin CuiProvides guarantees on minimum memory buffer required to decompress a frame.
238*01826a49SYabin CuiThis information is important for decoders to allocate enough memory.
239*01826a49SYabin Cui
240*01826a49SYabin CuiThe `Window_Descriptor` byte is optional.
241*01826a49SYabin CuiWhen `Single_Segment_flag` is set, `Window_Descriptor` is not present.
242*01826a49SYabin CuiIn this case, `Window_Size` is `Frame_Content_Size`,
243*01826a49SYabin Cuiwhich can be any value from 0 to 2^64-1 bytes (16 ExaBytes).
244*01826a49SYabin Cui
245*01826a49SYabin Cui| Bit numbers |     7-3    |     2-0    |
246*01826a49SYabin Cui| ----------- | ---------- | ---------- |
247*01826a49SYabin Cui| Field name  | `Exponent` | `Mantissa` |
248*01826a49SYabin Cui
249*01826a49SYabin CuiThe minimum memory buffer size is called `Window_Size`.
250*01826a49SYabin CuiIt is described by the following formulas :
251*01826a49SYabin Cui```
252*01826a49SYabin CuiwindowLog = 10 + Exponent;
253*01826a49SYabin CuiwindowBase = 1 << windowLog;
254*01826a49SYabin CuiwindowAdd = (windowBase / 8) * Mantissa;
255*01826a49SYabin CuiWindow_Size = windowBase + windowAdd;
256*01826a49SYabin Cui```
257*01826a49SYabin CuiThe minimum `Window_Size` is 1 KB.
258*01826a49SYabin CuiThe maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.
259*01826a49SYabin Cui
260*01826a49SYabin CuiIn general, larger `Window_Size` tend to improve compression ratio,
261*01826a49SYabin Cuibut at the cost of memory usage.
262*01826a49SYabin Cui
263*01826a49SYabin CuiTo properly decode compressed data,
264*01826a49SYabin Cuia decoder will need to allocate a buffer of at least `Window_Size` bytes.
265*01826a49SYabin Cui
266*01826a49SYabin CuiIn order to preserve decoder from unreasonable memory requirements,
267*01826a49SYabin Cuia decoder is allowed to reject a compressed frame
268*01826a49SYabin Cuiwhich requests a memory size beyond decoder's authorized range.
269*01826a49SYabin Cui
270*01826a49SYabin CuiFor improved interoperability,
271*01826a49SYabin Cuiit's recommended for decoders to support `Window_Size` of up to 8 MB,
272*01826a49SYabin Cuiand it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB.
273*01826a49SYabin CuiIt's merely a recommendation though,
274*01826a49SYabin Cuidecoders are free to support larger or lower limits,
275*01826a49SYabin Cuidepending on local limitations.
276*01826a49SYabin Cui
277*01826a49SYabin Cui#### `Dictionary_ID`
278*01826a49SYabin Cui
279*01826a49SYabin CuiThis is a variable size field, which contains
280*01826a49SYabin Cuithe ID of the dictionary required to properly decode the frame.
281*01826a49SYabin Cui`Dictionary_ID` field is optional. When it's not present,
282*01826a49SYabin Cuiit's up to the decoder to know which dictionary to use.
283*01826a49SYabin Cui
284*01826a49SYabin Cui`Dictionary_ID` field size is provided by `DID_Field_Size`.
285*01826a49SYabin Cui`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
286*01826a49SYabin Cui1 byte can represent an ID 0-255.
287*01826a49SYabin Cui2 bytes can represent an ID 0-65535.
288*01826a49SYabin Cui4 bytes can represent an ID 0-4294967295.
289*01826a49SYabin CuiFormat is __little-endian__.
290*01826a49SYabin Cui
291*01826a49SYabin CuiIt's allowed to represent a small ID (for example `13`)
292*01826a49SYabin Cuiwith a large 4-bytes dictionary ID, even if it is less efficient.
293*01826a49SYabin Cui
294*01826a49SYabin CuiA value of `0` has same meaning as no `Dictionary_ID`,
295*01826a49SYabin Cuiin which case the frame may or may not need a dictionary to be decoded,
296*01826a49SYabin Cuiand the ID of such a dictionary is not specified.
297*01826a49SYabin CuiThe decoder must know this information by other means.
298*01826a49SYabin Cui
299*01826a49SYabin Cui#### `Frame_Content_Size`
300*01826a49SYabin Cui
301*01826a49SYabin CuiThis is the original (uncompressed) size. This information is optional.
302*01826a49SYabin Cui`Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`.
303*01826a49SYabin Cui`FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`.
304*01826a49SYabin Cui`FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes.
305*01826a49SYabin Cui
306*01826a49SYabin Cui| `FCS_Field_Size` |    Range   |
307*01826a49SYabin Cui| ---------------- | ---------- |
308*01826a49SYabin Cui|        0         |   unknown  |
309*01826a49SYabin Cui|        1         |   0 - 255  |
310*01826a49SYabin Cui|        2         | 256 - 65791|
311*01826a49SYabin Cui|        4         | 0 - 2^32-1 |
312*01826a49SYabin Cui|        8         | 0 - 2^64-1 |
313*01826a49SYabin Cui
314*01826a49SYabin Cui`Frame_Content_Size` format is __little-endian__.
315*01826a49SYabin CuiWhen `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly.
316*01826a49SYabin CuiWhen `FCS_Field_Size` is 2, _the offset of 256 is added_.
317*01826a49SYabin CuiIt's allowed to represent a small size (for example `18`) using any compatible variant.
318*01826a49SYabin Cui
319*01826a49SYabin Cui
320*01826a49SYabin CuiBlocks
321*01826a49SYabin Cui-------
322*01826a49SYabin Cui
323*01826a49SYabin CuiAfter `Magic_Number` and `Frame_Header`, there are some number of blocks.
324*01826a49SYabin CuiEach frame must have at least one block,
325*01826a49SYabin Cuibut there is no upper limit on the number of blocks per frame.
326*01826a49SYabin Cui
327*01826a49SYabin CuiThe structure of a block is as follows:
328*01826a49SYabin Cui
329*01826a49SYabin Cui| `Block_Header` | `Block_Content` |
330*01826a49SYabin Cui|:--------------:|:---------------:|
331*01826a49SYabin Cui|    3 bytes     |     n bytes     |
332*01826a49SYabin Cui
333*01826a49SYabin Cui__`Block_Header`__
334*01826a49SYabin Cui
335*01826a49SYabin Cui`Block_Header` uses 3 bytes, written using __little-endian__ convention.
336*01826a49SYabin CuiIt contains 3 fields :
337*01826a49SYabin Cui
338*01826a49SYabin Cui| `Last_Block` | `Block_Type` | `Block_Size` |
339*01826a49SYabin Cui|:------------:|:------------:|:------------:|
340*01826a49SYabin Cui|    bit 0     |  bits 1-2    |  bits 3-23   |
341*01826a49SYabin Cui
342*01826a49SYabin Cui__`Last_Block`__
343*01826a49SYabin Cui
344*01826a49SYabin CuiThe lowest bit signals if this block is the last one.
345*01826a49SYabin CuiThe frame will end after this last block.
346*01826a49SYabin CuiIt may be followed by an optional `Content_Checksum`
347*01826a49SYabin Cui(see [Zstandard Frames](#zstandard-frames)).
348*01826a49SYabin Cui
349*01826a49SYabin Cui__`Block_Type`__
350*01826a49SYabin Cui
351*01826a49SYabin CuiThe next 2 bits represent the `Block_Type`.
352*01826a49SYabin Cui`Block_Type` influences the meaning of `Block_Size`.
353*01826a49SYabin CuiThere are 4 block types :
354*01826a49SYabin Cui
355*01826a49SYabin Cui|    Value     |      0      |      1      |         2          |     3     |
356*01826a49SYabin Cui| ------------ | ----------- | ----------- | ------------------ | --------- |
357*01826a49SYabin Cui| `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`|
358*01826a49SYabin Cui
359*01826a49SYabin Cui- `Raw_Block` - this is an uncompressed block.
360*01826a49SYabin Cui  `Block_Content` contains `Block_Size` bytes.
361*01826a49SYabin Cui
362*01826a49SYabin Cui- `RLE_Block` - this is a single byte, repeated `Block_Size` times.
363*01826a49SYabin Cui  `Block_Content` consists of a single byte.
364*01826a49SYabin Cui  On the decompression side, this byte must be repeated `Block_Size` times.
365*01826a49SYabin Cui
366*01826a49SYabin Cui- `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks),
367*01826a49SYabin Cui  explained later on.
368*01826a49SYabin Cui  `Block_Size` is the length of `Block_Content`, the compressed data.
369*01826a49SYabin Cui  The decompressed size is not known,
370*01826a49SYabin Cui  but its maximum possible value is guaranteed (see below)
371*01826a49SYabin Cui
372*01826a49SYabin Cui- `Reserved` - this is not a block.
373*01826a49SYabin Cui  This value cannot be used with current version of this specification.
374*01826a49SYabin Cui  If such a value is present, it is considered corrupted data.
375*01826a49SYabin Cui
376*01826a49SYabin Cui__`Block_Size`__
377*01826a49SYabin Cui
378*01826a49SYabin CuiThe upper 21 bits of `Block_Header` represent the `Block_Size`.
379*01826a49SYabin Cui
380*01826a49SYabin CuiWhen `Block_Type` is `Compressed_Block` or `Raw_Block`,
381*01826a49SYabin Cui`Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`).
382*01826a49SYabin Cui
383*01826a49SYabin CuiWhen `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1,
384*01826a49SYabin Cui`Block_Size` represents the number of times this byte must be repeated.
385*01826a49SYabin Cui
386*01826a49SYabin Cui`Block_Size` is limited by `Block_Maximum_Size` (see below).
387*01826a49SYabin Cui
388*01826a49SYabin Cui__`Block_Content`__ and __`Block_Maximum_Size`__
389*01826a49SYabin Cui
390*01826a49SYabin CuiThe size of `Block_Content` is limited by `Block_Maximum_Size`,
391*01826a49SYabin Cuiwhich is the smallest of:
392*01826a49SYabin Cui-  `Window_Size`
393*01826a49SYabin Cui-  128 KB
394*01826a49SYabin Cui
395*01826a49SYabin Cui`Block_Maximum_Size` is constant for a given frame.
396*01826a49SYabin CuiThis maximum is applicable to both the decompressed size
397*01826a49SYabin Cuiand the compressed size of any block in the frame.
398*01826a49SYabin Cui
399*01826a49SYabin CuiThe reasoning for this limit is that a decoder can read this information
400*01826a49SYabin Cuiat the beginning of a frame and use it to allocate buffers.
401*01826a49SYabin CuiThe guarantees on the size of blocks ensure that
402*01826a49SYabin Cuithe buffers will be large enough for any following block of the valid frame.
403*01826a49SYabin Cui
404*01826a49SYabin Cui
405*01826a49SYabin CuiCompressed Blocks
406*01826a49SYabin Cui-----------------
407*01826a49SYabin CuiTo decompress a compressed block, the compressed size must be provided
408*01826a49SYabin Cuifrom `Block_Size` field within `Block_Header`.
409*01826a49SYabin Cui
410*01826a49SYabin CuiA compressed block consists of 2 sections :
411*01826a49SYabin Cui- [Literals Section](#literals-section)
412*01826a49SYabin Cui- [Sequences Section](#sequences-section)
413*01826a49SYabin Cui
414*01826a49SYabin CuiThe results of the two sections are then combined to produce the decompressed
415*01826a49SYabin Cuidata in [Sequence Execution](#sequence-execution)
416*01826a49SYabin Cui
417*01826a49SYabin Cui#### Prerequisites
418*01826a49SYabin CuiTo decode a compressed block, the following elements are necessary :
419*01826a49SYabin Cui- Previous decoded data, up to a distance of `Window_Size`,
420*01826a49SYabin Cui  or beginning of the Frame, whichever is smaller.
421*01826a49SYabin Cui- List of "recent offsets" from previous `Compressed_Block`.
422*01826a49SYabin Cui- The previous Huffman tree, required by `Treeless_Literals_Block` type
423*01826a49SYabin Cui- Previous FSE decoding tables, required by `Repeat_Mode`
424*01826a49SYabin Cui  for each symbol type (literals lengths, match lengths, offsets)
425*01826a49SYabin Cui
426*01826a49SYabin CuiNote that decoding tables aren't always from the previous `Compressed_Block`.
427*01826a49SYabin Cui
428*01826a49SYabin Cui- Every decoding table can come from a dictionary.
429*01826a49SYabin Cui- The Huffman tree comes from the previous `Compressed_Literals_Block`.
430*01826a49SYabin Cui
431*01826a49SYabin CuiLiterals Section
432*01826a49SYabin Cui----------------
433*01826a49SYabin CuiAll literals are regrouped in the first part of the block.
434*01826a49SYabin CuiThey can be decoded first, and then copied during [Sequence Execution],
435*01826a49SYabin Cuior they can be decoded on the flow during [Sequence Execution].
436*01826a49SYabin Cui
437*01826a49SYabin CuiLiterals can be stored uncompressed or compressed using Huffman prefix codes.
438*01826a49SYabin CuiWhen compressed, a tree description may optionally be present,
439*01826a49SYabin Cuifollowed by 1 or 4 streams.
440*01826a49SYabin Cui
441*01826a49SYabin Cui| `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] |
442*01826a49SYabin Cui| ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- |
443*01826a49SYabin Cui
444*01826a49SYabin Cui
445*01826a49SYabin Cui### `Literals_Section_Header`
446*01826a49SYabin Cui
447*01826a49SYabin CuiHeader is in charge of describing how literals are packed.
448*01826a49SYabin CuiIt's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes,
449*01826a49SYabin Cuiusing __little-endian__ convention.
450*01826a49SYabin Cui
451*01826a49SYabin Cui| `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] |
452*01826a49SYabin Cui| --------------------- | ------------- | ------------------ | ------------------- |
453*01826a49SYabin Cui|       2 bits          |  1 - 2 bits   |    5 - 20 bits     |     0 - 18 bits     |
454*01826a49SYabin Cui
455*01826a49SYabin CuiIn this representation, bits on the left are the lowest bits.
456*01826a49SYabin Cui
457*01826a49SYabin Cui__`Literals_Block_Type`__
458*01826a49SYabin Cui
459*01826a49SYabin CuiThis field uses 2 lowest bits of first byte, describing 4 different block types :
460*01826a49SYabin Cui
461*01826a49SYabin Cui| `Literals_Block_Type`       | Value |
462*01826a49SYabin Cui| --------------------------- | ----- |
463*01826a49SYabin Cui| `Raw_Literals_Block`        |   0   |
464*01826a49SYabin Cui| `RLE_Literals_Block`        |   1   |
465*01826a49SYabin Cui| `Compressed_Literals_Block` |   2   |
466*01826a49SYabin Cui| `Treeless_Literals_Block`   |   3   |
467*01826a49SYabin Cui
468*01826a49SYabin Cui- `Raw_Literals_Block` - Literals are stored uncompressed.
469*01826a49SYabin Cui- `RLE_Literals_Block` - Literals consist of a single byte value
470*01826a49SYabin Cui        repeated `Regenerated_Size` times.
471*01826a49SYabin Cui- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
472*01826a49SYabin Cui        starting with a Huffman tree description.
473*01826a49SYabin Cui        In this mode, there are at least 2 different literals represented in the Huffman tree description.
474*01826a49SYabin Cui        See details below.
475*01826a49SYabin Cui- `Treeless_Literals_Block` - This is a Huffman-compressed block,
476*01826a49SYabin Cui        using Huffman tree _from previous Huffman-compressed literals block_.
477*01826a49SYabin Cui        `Huffman_Tree_Description` will be skipped.
478*01826a49SYabin Cui        Note: If this mode is triggered without any previous Huffman-table in the frame
479*01826a49SYabin Cui        (or [dictionary](#dictionary-format)), this should be treated as data corruption.
480*01826a49SYabin Cui
481*01826a49SYabin Cui__`Size_Format`__
482*01826a49SYabin Cui
483*01826a49SYabin Cui`Size_Format` is divided into 2 families :
484*01826a49SYabin Cui
485*01826a49SYabin Cui- For `Raw_Literals_Block` and `RLE_Literals_Block`,
486*01826a49SYabin Cui  it's only necessary to decode `Regenerated_Size`.
487*01826a49SYabin Cui  There is no `Compressed_Size` field.
488*01826a49SYabin Cui- For `Compressed_Block` and `Treeless_Literals_Block`,
489*01826a49SYabin Cui  it's required to decode both `Compressed_Size`
490*01826a49SYabin Cui  and `Regenerated_Size` (the decompressed size).
491*01826a49SYabin Cui  It's also necessary to decode the number of streams (1 or 4).
492*01826a49SYabin Cui
493*01826a49SYabin CuiFor values spanning several bytes, convention is __little-endian__.
494*01826a49SYabin Cui
495*01826a49SYabin Cui__`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ :
496*01826a49SYabin Cui
497*01826a49SYabin Cui`Size_Format` uses 1 _or_ 2 bits.
498*01826a49SYabin CuiIts value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3`
499*01826a49SYabin Cui
500*01826a49SYabin Cui- `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit.
501*01826a49SYabin Cui               `Regenerated_Size` uses 5 bits (0-31).
502*01826a49SYabin Cui               `Literals_Section_Header` uses 1 byte.
503*01826a49SYabin Cui               `Regenerated_Size = Literals_Section_Header[0]>>3`
504*01826a49SYabin Cui- `Size_Format` == 01 : `Size_Format` uses 2 bits.
505*01826a49SYabin Cui               `Regenerated_Size` uses 12 bits (0-4095).
506*01826a49SYabin Cui               `Literals_Section_Header` uses 2 bytes.
507*01826a49SYabin Cui               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)`
508*01826a49SYabin Cui- `Size_Format` == 11 : `Size_Format` uses 2 bits.
509*01826a49SYabin Cui               `Regenerated_Size` uses 20 bits (0-1048575).
510*01826a49SYabin Cui               `Literals_Section_Header` uses 3 bytes.
511*01826a49SYabin Cui               `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)`
512*01826a49SYabin Cui
513*01826a49SYabin CuiOnly Stream1 is present for these cases.
514*01826a49SYabin CuiNote : it's allowed to represent a short value (for example `27`)
515*01826a49SYabin Cuiusing a long format, even if it's less efficient.
516*01826a49SYabin Cui
517*01826a49SYabin Cui__`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ :
518*01826a49SYabin Cui
519*01826a49SYabin Cui`Size_Format` always uses 2 bits.
520*01826a49SYabin Cui
521*01826a49SYabin Cui- `Size_Format` == 00 : _A single stream_.
522*01826a49SYabin Cui               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023).
523*01826a49SYabin Cui               `Literals_Section_Header` uses 3 bytes.
524*01826a49SYabin Cui- `Size_Format` == 01 : 4 streams.
525*01826a49SYabin Cui               Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023).
526*01826a49SYabin Cui               `Literals_Section_Header` uses 3 bytes.
527*01826a49SYabin Cui- `Size_Format` == 10 : 4 streams.
528*01826a49SYabin Cui               Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383).
529*01826a49SYabin Cui               `Literals_Section_Header` uses 4 bytes.
530*01826a49SYabin Cui- `Size_Format` == 11 : 4 streams.
531*01826a49SYabin Cui               Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143).
532*01826a49SYabin Cui               `Literals_Section_Header` uses 5 bytes.
533*01826a49SYabin Cui
534*01826a49SYabin CuiBoth `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention.
535*01826a49SYabin CuiNote: `Compressed_Size` __includes__ the size of the Huffman Tree description
536*01826a49SYabin Cui_when_ it is present.
537*01826a49SYabin CuiNote 2: `Compressed_Size` can never be `==0`.
538*01826a49SYabin CuiEven in single-stream scenario, assuming an empty content, it must be `>=1`,
539*01826a49SYabin Cuisince it contains at least the final end bit flag.
540*01826a49SYabin CuiIn 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10`
541*01826a49SYabin Cui(6 bytes for the jump table, + 4x1 bytes for the 4 streams).
542*01826a49SYabin Cui
543*01826a49SYabin Cui4 streams is faster than 1 stream in decompression speed,
544*01826a49SYabin Cuiby exploiting instruction level parallelism.
545*01826a49SYabin CuiBut it's also more expensive,
546*01826a49SYabin Cuicosting on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table.
547*01826a49SYabin Cui
548*01826a49SYabin CuiIn general, use the 4 streams mode when there are more literals to decode,
549*01826a49SYabin Cuito favor higher decompression speeds.
550*01826a49SYabin CuiNote that beyond >1KB of literals, the 4 streams mode is compulsory.
551*01826a49SYabin Cui
552*01826a49SYabin CuiNote that a minimum of 6 bytes is required for the 4 streams mode.
553*01826a49SYabin CuiThat's a technical minimum, but it's not recommended to employ the 4 streams mode
554*01826a49SYabin Cuifor such a small quantity, that would be wasteful.
555*01826a49SYabin CuiA more practical lower bound would be around ~256 bytes.
556*01826a49SYabin Cui
557*01826a49SYabin Cui#### Raw Literals Block
558*01826a49SYabin CuiThe data in Stream1 is `Regenerated_Size` bytes long,
559*01826a49SYabin Cuiit contains the raw literals data to be used during [Sequence Execution].
560*01826a49SYabin Cui
561*01826a49SYabin Cui#### RLE Literals Block
562*01826a49SYabin CuiStream1 consists of a single byte which should be repeated `Regenerated_Size` times
563*01826a49SYabin Cuito generate the decoded literals.
564*01826a49SYabin Cui
565*01826a49SYabin Cui#### Compressed Literals Block and Treeless Literals Block
566*01826a49SYabin CuiBoth of these modes contain Huffman encoded data.
567*01826a49SYabin Cui
568*01826a49SYabin CuiFor `Treeless_Literals_Block`,
569*01826a49SYabin Cuithe Huffman table comes from previously compressed literals block,
570*01826a49SYabin Cuior from a dictionary.
571*01826a49SYabin Cui
572*01826a49SYabin Cui
573*01826a49SYabin Cui### `Huffman_Tree_Description`
574*01826a49SYabin CuiThis section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
575*01826a49SYabin CuiThe tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
576*01826a49SYabin CuiThe format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
577*01826a49SYabin CuiThe size of `Huffman_Tree_Description` is determined during decoding process,
578*01826a49SYabin Cuiit must be used to determine where streams begin.
579*01826a49SYabin Cui`Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`.
580*01826a49SYabin Cui
581*01826a49SYabin Cui
582*01826a49SYabin Cui### Jump Table
583*01826a49SYabin CuiThe Jump Table is only present when there are 4 Huffman-coded streams.
584*01826a49SYabin Cui
585*01826a49SYabin CuiReminder : Huffman compressed data consists of either 1 or 4 streams.
586*01826a49SYabin Cui
587*01826a49SYabin CuiIf only one stream is present, it is a single bitstream occupying the entire
588*01826a49SYabin Cuiremaining portion of the literals block, encoded as described in
589*01826a49SYabin Cui[Huffman-Coded Streams](#huffman-coded-streams).
590*01826a49SYabin Cui
591*01826a49SYabin CuiIf there are four streams, `Literals_Section_Header` only provided
592*01826a49SYabin Cuienough information to know the decompressed and compressed sizes
593*01826a49SYabin Cuiof all four streams _combined_.
594*01826a49SYabin CuiThe decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`,
595*01826a49SYabin Cuiexcept for the last stream which may be up to 3 bytes smaller,
596*01826a49SYabin Cuito reach a total decompressed size as specified in `Regenerated_Size`.
597*01826a49SYabin Cui
598*01826a49SYabin CuiThe compressed size of each stream is provided explicitly in the Jump Table.
599*01826a49SYabin CuiJump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields,
600*01826a49SYabin Cuidescribing the compressed sizes of the first three streams.
601*01826a49SYabin Cui`Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams:
602*01826a49SYabin Cui
603*01826a49SYabin Cui`Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`.
604*01826a49SYabin Cui
605*01826a49SYabin Cui`Stream4_Size` is necessarily `>= 1`. Therefore,
606*01826a49SYabin Cuiif `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`,
607*01826a49SYabin Cuidata is considered corrupted.
608*01826a49SYabin Cui
609*01826a49SYabin CuiEach of these 4 bitstreams is then decoded independently as a Huffman-Coded stream,
610*01826a49SYabin Cuias described in [Huffman-Coded Streams](#huffman-coded-streams)
611*01826a49SYabin Cui
612*01826a49SYabin Cui
613*01826a49SYabin CuiSequences Section
614*01826a49SYabin Cui-----------------
615*01826a49SYabin CuiA compressed block is a succession of _sequences_ .
616*01826a49SYabin CuiA sequence is a literal copy command, followed by a match copy command.
617*01826a49SYabin CuiA literal copy command specifies a length.
618*01826a49SYabin CuiIt is the number of bytes to be copied (or extracted) from the Literals Section.
619*01826a49SYabin CuiA match copy command specifies an offset and a length.
620*01826a49SYabin Cui
621*01826a49SYabin CuiWhen all _sequences_ are decoded,
622*01826a49SYabin Cuiif there are literals left in the _literals section_,
623*01826a49SYabin Cuithese bytes are added at the end of the block.
624*01826a49SYabin Cui
625*01826a49SYabin CuiThis is described in more detail in [Sequence Execution](#sequence-execution).
626*01826a49SYabin Cui
627*01826a49SYabin CuiThe `Sequences_Section` regroup all symbols required to decode commands.
628*01826a49SYabin CuiThere are 3 symbol types : literals lengths, offsets and match lengths.
629*01826a49SYabin CuiThey are encoded together, interleaved, in a single _bitstream_.
630*01826a49SYabin Cui
631*01826a49SYabin CuiThe `Sequences_Section` starts by a header,
632*01826a49SYabin Cuifollowed by optional probability tables for each symbol type,
633*01826a49SYabin Cuifollowed by the bitstream.
634*01826a49SYabin Cui
635*01826a49SYabin Cui| `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream |
636*01826a49SYabin Cui| -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- |
637*01826a49SYabin Cui
638*01826a49SYabin CuiTo decode the `Sequences_Section`, it's required to know its size.
639*01826a49SYabin CuiIts size is deduced from the size of `Literals_Section`:
640*01826a49SYabin Cui`Sequences_Section_Size = Block_Size - Literals_Section_Size`.
641*01826a49SYabin Cui
642*01826a49SYabin Cui
643*01826a49SYabin Cui#### `Sequences_Section_Header`
644*01826a49SYabin Cui
645*01826a49SYabin CuiConsists of 2 items:
646*01826a49SYabin Cui- `Number_of_Sequences`
647*01826a49SYabin Cui- Symbol compression modes
648*01826a49SYabin Cui
649*01826a49SYabin Cui__`Number_of_Sequences`__
650*01826a49SYabin Cui
651*01826a49SYabin CuiThis is a variable size field using between 1 and 3 bytes.
652*01826a49SYabin CuiLet's call its first byte `byte0`.
653*01826a49SYabin Cui- `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte.
654*01826a49SYabin Cui- `if (byte0 < 255)` : `Number_of_Sequences = ((byte0 - 0x80) << 8) + byte1`. Uses 2 bytes.
655*01826a49SYabin Cui            Note that the 2 bytes format fully overlaps the 1 byte format.
656*01826a49SYabin Cui- `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00`. Uses 3 bytes.
657*01826a49SYabin Cui
658*01826a49SYabin Cui`if (Number_of_Sequences == 0)` : there are no sequences.
659*01826a49SYabin Cui            The sequence section stops immediately,
660*01826a49SYabin Cui            FSE tables used in `Repeat_Mode` aren't updated.
661*01826a49SYabin Cui            Block's decompressed content is defined solely by the Literals Section content.
662*01826a49SYabin Cui
663*01826a49SYabin Cui__Symbol compression modes__
664*01826a49SYabin Cui
665*01826a49SYabin CuiThis is a single byte, defining the compression mode of each symbol type.
666*01826a49SYabin Cui
667*01826a49SYabin Cui|Bit number|          7-6            |      5-4       |        3-2           |     1-0    |
668*01826a49SYabin Cui| -------- | ----------------------- | -------------- | -------------------- | ---------- |
669*01826a49SYabin Cui|Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` |
670*01826a49SYabin Cui
671*01826a49SYabin CuiThe last field, `Reserved`, must be all-zeroes.
672*01826a49SYabin Cui
673*01826a49SYabin Cui`Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of
674*01826a49SYabin Cuiliterals lengths, offsets, and match lengths symbols respectively.
675*01826a49SYabin Cui
676*01826a49SYabin CuiThey follow the same enumeration :
677*01826a49SYabin Cui
678*01826a49SYabin Cui|        Value       |         0         |      1     |           2           |       3       |
679*01826a49SYabin Cui| ------------------ | ----------------- | ---------- | --------------------- | ------------- |
680*01826a49SYabin Cui| `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` |
681*01826a49SYabin Cui
682*01826a49SYabin Cui- `Predefined_Mode` : A predefined FSE distribution table is used, defined in
683*01826a49SYabin Cui          [default distributions](#default-distributions).
684*01826a49SYabin Cui          No distribution table will be present.
685*01826a49SYabin Cui- `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value.
686*01826a49SYabin Cui          This symbol will be used for all sequences.
687*01826a49SYabin Cui- `FSE_Compressed_Mode` : standard FSE compression.
688*01826a49SYabin Cui          A distribution table will be present.
689*01826a49SYabin Cui          The format of this distribution table is described in [FSE Table Description](#fse-table-description).
690*01826a49SYabin Cui          Note that the maximum allowed accuracy log for literals length and match length tables is 9,
691*01826a49SYabin Cui          and the maximum accuracy log for the offsets table is 8.
692*01826a49SYabin Cui          `FSE_Compressed_Mode` must not be used when only one symbol is present,
693*01826a49SYabin Cui          `RLE_Mode` should be used instead (although any other mode will work).
694*01826a49SYabin Cui- `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again,
695*01826a49SYabin Cui          or if this is the first block, table in the dictionary will be used.
696*01826a49SYabin Cui          Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated.
697*01826a49SYabin Cui          It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`.
698*01826a49SYabin Cui          No distribution table will be present.
699*01826a49SYabin Cui          If this mode is used without any previous sequence table in the frame
700*01826a49SYabin Cui          (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption.
701*01826a49SYabin Cui
702*01826a49SYabin Cui#### The codes for literals lengths, match lengths, and offsets.
703*01826a49SYabin Cui
704*01826a49SYabin CuiEach symbol is a _code_ in its own context,
705*01826a49SYabin Cuiwhich specifies `Baseline` and `Number_of_Bits` to add.
706*01826a49SYabin Cui_Codes_ are FSE compressed,
707*01826a49SYabin Cuiand interleaved with raw additional bits in the same bitstream.
708*01826a49SYabin Cui
709*01826a49SYabin Cui##### Literals length codes
710*01826a49SYabin Cui
711*01826a49SYabin CuiLiterals length codes are values ranging from `0` to `35` included.
712*01826a49SYabin CuiThey define lengths from 0 to 131071 bytes.
713*01826a49SYabin CuiThe literals length is equal to the decoded `Baseline` plus
714*01826a49SYabin Cuithe result of reading `Number_of_Bits` bits from the bitstream,
715*01826a49SYabin Cuias a __little-endian__ value.
716*01826a49SYabin Cui
717*01826a49SYabin Cui| `Literals_Length_Code` |         0-15           |
718*01826a49SYabin Cui| ---------------------- | ---------------------- |
719*01826a49SYabin Cui| length                 | `Literals_Length_Code` |
720*01826a49SYabin Cui| `Number_of_Bits`       |          0             |
721*01826a49SYabin Cui
722*01826a49SYabin Cui| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
723*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
724*01826a49SYabin Cui| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
725*01826a49SYabin Cui| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
726*01826a49SYabin Cui
727*01826a49SYabin Cui| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
728*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
729*01826a49SYabin Cui| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
730*01826a49SYabin Cui| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |
731*01826a49SYabin Cui
732*01826a49SYabin Cui| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
733*01826a49SYabin Cui| ---------------------- | ---- | ---- | ---- | ---- |
734*01826a49SYabin Cui| `Baseline`             | 8192 |16384 |32768 |65536 |
735*01826a49SYabin Cui| `Number_of_Bits`       |  13  |  14  |  15  |  16  |
736*01826a49SYabin Cui
737*01826a49SYabin Cui
738*01826a49SYabin Cui##### Match length codes
739*01826a49SYabin Cui
740*01826a49SYabin CuiMatch length codes are values ranging from `0` to `52` included.
741*01826a49SYabin CuiThey define lengths from 3 to 131074 bytes.
742*01826a49SYabin CuiThe match length is equal to the decoded `Baseline` plus
743*01826a49SYabin Cuithe result of reading `Number_of_Bits` bits from the bitstream,
744*01826a49SYabin Cuias a __little-endian__ value.
745*01826a49SYabin Cui
746*01826a49SYabin Cui| `Match_Length_Code` |         0-31            |
747*01826a49SYabin Cui| ------------------- | ----------------------- |
748*01826a49SYabin Cui| value               | `Match_Length_Code` + 3 |
749*01826a49SYabin Cui| `Number_of_Bits`    |          0              |
750*01826a49SYabin Cui
751*01826a49SYabin Cui| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
752*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
753*01826a49SYabin Cui| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
754*01826a49SYabin Cui| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |
755*01826a49SYabin Cui
756*01826a49SYabin Cui| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
757*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
758*01826a49SYabin Cui| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
759*01826a49SYabin Cui| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |
760*01826a49SYabin Cui
761*01826a49SYabin Cui| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
762*01826a49SYabin Cui| ------------------- | ---- | ---- | ---- | ---- | ---- |
763*01826a49SYabin Cui| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
764*01826a49SYabin Cui| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |
765*01826a49SYabin Cui
766*01826a49SYabin Cui##### Offset codes
767*01826a49SYabin Cui
768*01826a49SYabin CuiOffset codes are values ranging from `0` to `N`.
769*01826a49SYabin Cui
770*01826a49SYabin CuiA decoder is free to limit its maximum `N` supported.
771*01826a49SYabin CuiRecommendation is to support at least up to `22`.
772*01826a49SYabin CuiFor information, at the time of this writing.
773*01826a49SYabin Cuithe reference decoder supports a maximum `N` value of `31`.
774*01826a49SYabin Cui
775*01826a49SYabin CuiAn offset code is also the number of additional bits to read in __little-endian__ fashion,
776*01826a49SYabin Cuiand can be translated into an `Offset_Value` using the following formulas :
777*01826a49SYabin Cui
778*01826a49SYabin Cui```
779*01826a49SYabin CuiOffset_Value = (1 << offsetCode) + readNBits(offsetCode);
780*01826a49SYabin Cuiif (Offset_Value > 3) offset = Offset_Value - 3;
781*01826a49SYabin Cui```
782*01826a49SYabin CuiIt means that maximum `Offset_Value` is `(2^(N+1))-1`
783*01826a49SYabin Cuisupporting back-reference distances up to `(2^(N+1))-4`,
784*01826a49SYabin Cuibut is limited by [maximum back-reference distance](#window_descriptor).
785*01826a49SYabin Cui
786*01826a49SYabin Cui`Offset_Value` from 1 to 3 are special : they define "repeat codes".
787*01826a49SYabin CuiThis is described in more detail in [Repeat Offsets](#repeat-offsets).
788*01826a49SYabin Cui
789*01826a49SYabin Cui#### Decoding Sequences
790*01826a49SYabin CuiFSE bitstreams are read in reverse direction than written. In zstd,
791*01826a49SYabin Cuithe compressor writes bits forward into a block and the decompressor
792*01826a49SYabin Cuimust read the bitstream _backwards_.
793*01826a49SYabin Cui
794*01826a49SYabin CuiTo find the start of the bitstream it is therefore necessary to
795*01826a49SYabin Cuiknow the offset of the last byte of the block which can be found
796*01826a49SYabin Cuiby counting `Block_Size` bytes after the block header.
797*01826a49SYabin Cui
798*01826a49SYabin CuiAfter writing the last bit containing information, the compressor
799*01826a49SYabin Cuiwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
800*01826a49SYabin Cuipadding. The last byte of the compressed bitstream cannot be `0` for
801*01826a49SYabin Cuithat reason.
802*01826a49SYabin Cui
803*01826a49SYabin CuiWhen decompressing, the last byte containing the padding is the first
804*01826a49SYabin Cuibyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
805*01826a49SYabin Cuithe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
806*01826a49SYabin Cuibegins.
807*01826a49SYabin Cui
808*01826a49SYabin CuiFSE decoding requires a 'state' to be carried from symbol to symbol.
809*01826a49SYabin CuiFor more explanation on FSE decoding, see the [FSE section](#fse).
810*01826a49SYabin Cui
811*01826a49SYabin CuiFor sequence decoding, a separate state keeps track of each
812*01826a49SYabin Cuiliteral lengths, offsets, and match lengths symbols.
813*01826a49SYabin CuiSome FSE primitives are also used.
814*01826a49SYabin CuiFor more details on the operation of these primitives, see the [FSE section](#fse).
815*01826a49SYabin Cui
816*01826a49SYabin Cui##### Starting states
817*01826a49SYabin CuiThe bitstream starts with initial FSE state values,
818*01826a49SYabin Cuieach using the required number of bits in their respective _accuracy_,
819*01826a49SYabin Cuidecoded previously from their normalized distribution.
820*01826a49SYabin Cui
821*01826a49SYabin CuiIt starts by `Literals_Length_State`,
822*01826a49SYabin Cuifollowed by `Offset_State`,
823*01826a49SYabin Cuiand finally `Match_Length_State`.
824*01826a49SYabin Cui
825*01826a49SYabin CuiReminder : always keep in mind that all values are read _backward_,
826*01826a49SYabin Cuiso the 'start' of the bitstream is at the highest position in memory,
827*01826a49SYabin Cuiimmediately before the last `1`-bit for padding.
828*01826a49SYabin Cui
829*01826a49SYabin CuiAfter decoding the starting states, a single sequence is decoded
830*01826a49SYabin Cui`Number_Of_Sequences` times.
831*01826a49SYabin CuiThese sequences are decoded in order from first to last.
832*01826a49SYabin CuiSince the compressor writes the bitstream in the forward direction,
833*01826a49SYabin Cuithis means the compressor must encode the sequences starting with the last
834*01826a49SYabin Cuione and ending with the first.
835*01826a49SYabin Cui
836*01826a49SYabin Cui##### Decoding a sequence
837*01826a49SYabin CuiFor each of the symbol types, the FSE state can be used to determine the appropriate code.
838*01826a49SYabin CuiThe code then defines the `Baseline` and `Number_of_Bits` to read for each type.
839*01826a49SYabin CuiSee the [description of the codes] for how to determine these values.
840*01826a49SYabin Cui
841*01826a49SYabin Cui[description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets
842*01826a49SYabin Cui
843*01826a49SYabin CuiDecoding starts by reading the `Number_of_Bits` required to decode `Offset`.
844*01826a49SYabin CuiIt then does the same for `Match_Length`, and then for `Literals_Length`.
845*01826a49SYabin CuiThis sequence is then used for [sequence execution](#sequence-execution).
846*01826a49SYabin Cui
847*01826a49SYabin CuiIf it is not the last sequence in the block,
848*01826a49SYabin Cuithe next operation is to update states.
849*01826a49SYabin CuiUsing the rules pre-calculated in the decoding tables,
850*01826a49SYabin Cui`Literals_Length_State` is updated,
851*01826a49SYabin Cuifollowed by `Match_Length_State`,
852*01826a49SYabin Cuiand then `Offset_State`.
853*01826a49SYabin CuiSee the [FSE section](#fse) for details on how to update states from the bitstream.
854*01826a49SYabin Cui
855*01826a49SYabin CuiThis operation will be repeated `Number_of_Sequences` times.
856*01826a49SYabin CuiAt the end, the bitstream shall be entirely consumed,
857*01826a49SYabin Cuiotherwise the bitstream is considered corrupted.
858*01826a49SYabin Cui
859*01826a49SYabin Cui#### Default Distributions
860*01826a49SYabin CuiIf `Predefined_Mode` is selected for a symbol type,
861*01826a49SYabin Cuiits FSE decoding table is generated from a predefined distribution table defined here.
862*01826a49SYabin CuiFor details on how to convert this distribution into a decoding table, see the [FSE section].
863*01826a49SYabin Cui
864*01826a49SYabin Cui[FSE section]: #from-normalized-distribution-to-decoding-tables
865*01826a49SYabin Cui
866*01826a49SYabin Cui##### Literals Length
867*01826a49SYabin CuiThe decoding table uses an accuracy log of 6 bits (64 states).
868*01826a49SYabin Cui```
869*01826a49SYabin Cuishort literalsLength_defaultDistribution[36] =
870*01826a49SYabin Cui        { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
871*01826a49SYabin Cui          2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
872*01826a49SYabin Cui         -1,-1,-1,-1 };
873*01826a49SYabin Cui```
874*01826a49SYabin Cui
875*01826a49SYabin Cui##### Match Length
876*01826a49SYabin CuiThe decoding table uses an accuracy log of 6 bits (64 states).
877*01826a49SYabin Cui```
878*01826a49SYabin Cuishort matchLengths_defaultDistribution[53] =
879*01826a49SYabin Cui        { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
880*01826a49SYabin Cui          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
881*01826a49SYabin Cui          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
882*01826a49SYabin Cui         -1,-1,-1,-1,-1 };
883*01826a49SYabin Cui```
884*01826a49SYabin Cui
885*01826a49SYabin Cui##### Offset Codes
886*01826a49SYabin CuiThe decoding table uses an accuracy log of 5 bits (32 states),
887*01826a49SYabin Cuiand supports a maximum `N` value of 28, allowing offset values up to 536,870,908 .
888*01826a49SYabin Cui
889*01826a49SYabin CuiIf any sequence in the compressed block requires a larger offset than this,
890*01826a49SYabin Cuiit's not possible to use the default distribution to represent it.
891*01826a49SYabin Cui```
892*01826a49SYabin Cuishort offsetCodes_defaultDistribution[29] =
893*01826a49SYabin Cui        { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
894*01826a49SYabin Cui          1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 };
895*01826a49SYabin Cui```
896*01826a49SYabin Cui
897*01826a49SYabin Cui
898*01826a49SYabin CuiSequence Execution
899*01826a49SYabin Cui------------------
900*01826a49SYabin CuiOnce literals and sequences have been decoded,
901*01826a49SYabin Cuithey are combined to produce the decoded content of a block.
902*01826a49SYabin Cui
903*01826a49SYabin CuiEach sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`),
904*01826a49SYabin Cuidecoded as described in the [Sequences Section](#sequences-section).
905*01826a49SYabin CuiTo execute a sequence, first copy `literals_length` bytes
906*01826a49SYabin Cuifrom the decoded literals to the output.
907*01826a49SYabin Cui
908*01826a49SYabin CuiThen `match_length` bytes are copied from previous decoded data.
909*01826a49SYabin CuiThe offset to copy from is determined by `offset_value`:
910*01826a49SYabin Cuiif `offset_value > 3`, then the offset is `offset_value - 3`.
911*01826a49SYabin CuiIf `offset_value` is from 1-3, the offset is a special repeat offset value.
912*01826a49SYabin CuiSee the [repeat offset](#repeat-offsets) section for how the offset is determined
913*01826a49SYabin Cuiin this case.
914*01826a49SYabin Cui
915*01826a49SYabin CuiThe offset is defined as from the current position, so an offset of 6
916*01826a49SYabin Cuiand a match length of 3 means that 3 bytes should be copied from 6 bytes back.
917*01826a49SYabin CuiNote that all offsets leading to previously decoded data
918*01826a49SYabin Cuimust be smaller than `Window_Size` defined in `Frame_Header_Descriptor`.
919*01826a49SYabin Cui
920*01826a49SYabin Cui#### Repeat offsets
921*01826a49SYabin CuiAs seen in [Sequence Execution](#sequence-execution),
922*01826a49SYabin Cuithe first 3 values define a repeated offset and we will call them
923*01826a49SYabin Cui`Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`.
924*01826a49SYabin CuiThey are sorted in recency order, with `Repeated_Offset1` meaning "most recent one".
925*01826a49SYabin Cui
926*01826a49SYabin CuiIf `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc.
927*01826a49SYabin Cui
928*01826a49SYabin CuiThere is an exception though, when current sequence's `literals_length = 0`.
929*01826a49SYabin CuiIn this case, repeated offsets are shifted by one,
930*01826a49SYabin Cuiso an `offset_value` of 1 means `Repeated_Offset2`,
931*01826a49SYabin Cuian `offset_value` of 2 means `Repeated_Offset3`,
932*01826a49SYabin Cuiand an `offset_value` of 3 means `Repeated_Offset1 - 1`.
933*01826a49SYabin Cui
934*01826a49SYabin CuiIn the final case, if `Repeated_Offset1 - 1` evaluates to 0, then the
935*01826a49SYabin Cuidata is considered corrupted.
936*01826a49SYabin Cui
937*01826a49SYabin CuiFor the first block, the starting offset history is populated with following values :
938*01826a49SYabin Cui`Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8,
939*01826a49SYabin Cuiunless a dictionary is used, in which case they come from the dictionary.
940*01826a49SYabin Cui
941*01826a49SYabin CuiThen each block gets its starting offset history from the ending values of the most recent `Compressed_Block`.
942*01826a49SYabin CuiNote that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history.
943*01826a49SYabin Cui
944*01826a49SYabin Cui[Offset Codes]: #offset-codes
945*01826a49SYabin Cui
946*01826a49SYabin Cui###### Offset updates rules
947*01826a49SYabin Cui
948*01826a49SYabin CuiDuring the execution of the sequences of a `Compressed_Block`, the
949*01826a49SYabin Cui`Repeated_Offsets`' values are kept up to date, so that they always represent
950*01826a49SYabin Cuithe three most-recently used offsets. In order to achieve that, they are
951*01826a49SYabin Cuiupdated after executing each sequence in the following way:
952*01826a49SYabin Cui
953*01826a49SYabin CuiWhen the sequence's `offset_value` does not refer to one of the
954*01826a49SYabin Cui`Repeated_Offsets`--when it has value greater than 3, or when it has value 3
955*01826a49SYabin Cuiand the sequence's `literals_length` is zero--the `Repeated_Offsets`' values
956*01826a49SYabin Cuiare shifted back one, and `Repeated_Offset1` takes on the value of the
957*01826a49SYabin Cuijust-used offset.
958*01826a49SYabin Cui
959*01826a49SYabin CuiOtherwise, when the sequence's `offset_value` refers to one of the
960*01826a49SYabin Cui`Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the
961*01826a49SYabin Cuisequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered
962*01826a49SYabin Cuiso that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and
963*01826a49SYabin Cuithe existing values are pushed back from the first `Repeated_Offset` through to
964*01826a49SYabin Cuithe `Repeated_Offset` selected by the `offset_value`. This effectively performs
965*01826a49SYabin Cuia single-stepped wrapping rotation of the values of these offsets, so that
966*01826a49SYabin Cuitheir order again reflects the recency of their use.
967*01826a49SYabin Cui
968*01826a49SYabin CuiThe following table shows the values of the `Repeated_Offsets` as a series of
969*01826a49SYabin Cuisequences are applied to them:
970*01826a49SYabin Cui
971*01826a49SYabin Cui| `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment                 |
972*01826a49SYabin Cui|:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:|
973*01826a49SYabin Cui|                |                   |                  1 |                  4 |                  8 | starting values         |
974*01826a49SYabin Cui|           1114 |                11 |               1111 |                  1 |                  4 | non-repeat              |
975*01826a49SYabin Cui|              1 |                22 |               1111 |                  1 |                  4 | repeat 1: no change     |
976*01826a49SYabin Cui|           2225 |                22 |               2222 |               1111 |                  1 | non-repeat              |
977*01826a49SYabin Cui|           1114 |               111 |               1111 |               2222 |               1111 | non-repeat              |
978*01826a49SYabin Cui|           3336 |                33 |               3333 |               1111 |               2222 | non-repeat              |
979*01826a49SYabin Cui|              2 |                22 |               1111 |               3333 |               2222 | repeat 2: swap 1 & 2    |
980*01826a49SYabin Cui|              3 |                33 |               2222 |               1111 |               3333 | repeat 3: rotate 3 to 1 |
981*01826a49SYabin Cui|              3 |                 0 |               2221 |               2222 |               1111 | special case : insert `repeat1 - 1` |
982*01826a49SYabin Cui|              1 |                 0 |               2222 |               2221 |               1111 | == repeat 2             |
983*01826a49SYabin Cui
984*01826a49SYabin Cui
985*01826a49SYabin CuiSkippable Frames
986*01826a49SYabin Cui----------------
987*01826a49SYabin Cui
988*01826a49SYabin Cui| `Magic_Number` | `Frame_Size` | `User_Data` |
989*01826a49SYabin Cui|:--------------:|:------------:|:-----------:|
990*01826a49SYabin Cui|   4 bytes      |  4 bytes     |   n bytes   |
991*01826a49SYabin Cui
992*01826a49SYabin CuiSkippable frames allow the insertion of user-defined metadata
993*01826a49SYabin Cuiinto a flow of concatenated frames.
994*01826a49SYabin Cui
995*01826a49SYabin CuiSkippable frames defined in this specification are compatible with [LZ4] ones.
996*01826a49SYabin Cui
997*01826a49SYabin Cui[LZ4]:https://lz4.github.io/lz4/
998*01826a49SYabin Cui
999*01826a49SYabin CuiFrom a compliant decoder perspective, skippable frames need just be skipped,
1000*01826a49SYabin Cuiand their content ignored, resuming decoding after the skippable frame.
1001*01826a49SYabin Cui
1002*01826a49SYabin CuiIt can be noted that a skippable frame
1003*01826a49SYabin Cuican be used to watermark a stream of concatenated frames
1004*01826a49SYabin Cuiembedding any kind of tracking information (even just a UUID).
1005*01826a49SYabin CuiUsers wary of such possibility should scan the stream of concatenated frames
1006*01826a49SYabin Cuiin an attempt to detect such frame for analysis or removal.
1007*01826a49SYabin Cui
1008*01826a49SYabin Cui__`Magic_Number`__
1009*01826a49SYabin Cui
1010*01826a49SYabin Cui4 Bytes, __little-endian__ format.
1011*01826a49SYabin CuiValue : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F.
1012*01826a49SYabin CuiAll 16 values are valid to identify a skippable frame.
1013*01826a49SYabin CuiThis specification doesn't detail any specific tagging for skippable frames.
1014*01826a49SYabin Cui
1015*01826a49SYabin Cui__`Frame_Size`__
1016*01826a49SYabin Cui
1017*01826a49SYabin CuiThis is the size, in bytes, of the following `User_Data`
1018*01826a49SYabin Cui(without including the magic number nor the size field itself).
1019*01826a49SYabin CuiThis field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits.
1020*01826a49SYabin CuiThis means `User_Data` can’t be bigger than (2^32-1) bytes.
1021*01826a49SYabin Cui
1022*01826a49SYabin Cui__`User_Data`__
1023*01826a49SYabin Cui
1024*01826a49SYabin CuiThe `User_Data` can be anything. Data will just be skipped by the decoder.
1025*01826a49SYabin Cui
1026*01826a49SYabin Cui
1027*01826a49SYabin Cui
1028*01826a49SYabin CuiEntropy Encoding
1029*01826a49SYabin Cui----------------
1030*01826a49SYabin CuiTwo types of entropy encoding are used by the Zstandard format:
1031*01826a49SYabin CuiFSE, and Huffman coding.
1032*01826a49SYabin CuiHuffman is used to compress literals,
1033*01826a49SYabin Cuiwhile FSE is used for all other symbols
1034*01826a49SYabin Cui(`Literals_Length_Code`, `Match_Length_Code`, offset codes)
1035*01826a49SYabin Cuiand to compress Huffman headers.
1036*01826a49SYabin Cui
1037*01826a49SYabin Cui
1038*01826a49SYabin CuiFSE
1039*01826a49SYabin Cui---
1040*01826a49SYabin CuiFSE, short for Finite State Entropy, is an entropy codec based on [ANS].
1041*01826a49SYabin CuiFSE encoding/decoding involves a state that is carried over between symbols,
1042*01826a49SYabin Cuiso decoding must be done in the opposite direction as encoding.
1043*01826a49SYabin CuiTherefore, all FSE bitstreams are read from end to beginning.
1044*01826a49SYabin CuiNote that the order of the bits in the stream is not reversed,
1045*01826a49SYabin Cuiwe just read the elements in the reverse order they are written.
1046*01826a49SYabin Cui
1047*01826a49SYabin CuiFor additional details on FSE, see [Finite State Entropy].
1048*01826a49SYabin Cui
1049*01826a49SYabin Cui[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
1050*01826a49SYabin Cui
1051*01826a49SYabin CuiFSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
1052*01826a49SYabin Cui`Symbol`, `Num_Bits`, and `Baseline`.
1053*01826a49SYabin CuiThe `log2` of the table size is its `Accuracy_Log`.
1054*01826a49SYabin CuiAn FSE state value represents an index in this table.
1055*01826a49SYabin Cui
1056*01826a49SYabin CuiTo obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
1057*01826a49SYabin CuiThe next symbol in the stream is the `Symbol` indicated in the table for that state.
1058*01826a49SYabin CuiTo obtain the next state value,
1059*01826a49SYabin Cuithe decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
1060*01826a49SYabin Cui
1061*01826a49SYabin Cui[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
1062*01826a49SYabin Cui
1063*01826a49SYabin Cui### FSE Table Description
1064*01826a49SYabin CuiTo decode FSE streams, it is necessary to construct the decoding table.
1065*01826a49SYabin CuiThe Zstandard format encodes FSE table descriptions as follows:
1066*01826a49SYabin Cui
1067*01826a49SYabin CuiAn FSE distribution table describes the probabilities of all symbols
1068*01826a49SYabin Cuifrom `0` to the last present one (included)
1069*01826a49SYabin Cuion a normalized scale of `1 << Accuracy_Log` .
1070*01826a49SYabin CuiNote that there must be two or more symbols with nonzero probability.
1071*01826a49SYabin Cui
1072*01826a49SYabin CuiIt's a bitstream which is read forward, in __little-endian__ fashion.
1073*01826a49SYabin CuiIt's not necessary to know bitstream exact size,
1074*01826a49SYabin Cuiit will be discovered and reported by the decoding process.
1075*01826a49SYabin Cui
1076*01826a49SYabin CuiThe bitstream starts by reporting on which scale it operates.
1077*01826a49SYabin CuiLet's `low4Bits` designate the lowest 4 bits of the first byte :
1078*01826a49SYabin Cui`Accuracy_Log = low4bits + 5`.
1079*01826a49SYabin Cui
1080*01826a49SYabin CuiThen follows each symbol value, from `0` to last present one.
1081*01826a49SYabin CuiThe number of bits used by each field is variable.
1082*01826a49SYabin CuiIt depends on :
1083*01826a49SYabin Cui
1084*01826a49SYabin Cui- Remaining probabilities + 1 :
1085*01826a49SYabin Cui  __example__ :
1086*01826a49SYabin Cui  Presuming an `Accuracy_Log` of 8,
1087*01826a49SYabin Cui  and presuming 100 probabilities points have already been distributed,
1088*01826a49SYabin Cui  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
1089*01826a49SYabin Cui  Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
1090*01826a49SYabin Cui  is the smallest integer `T` that satisfies `(1 << T) > N`.
1091*01826a49SYabin Cui
1092*01826a49SYabin Cui- Value decoded : small values use 1 less bit :
1093*01826a49SYabin Cui  __example__ :
1094*01826a49SYabin Cui  Presuming values from 0 to 157 (inclusive) are possible,
1095*01826a49SYabin Cui  255-157 = 98 values are remaining in an 8-bits field.
1096*01826a49SYabin Cui  They are used this way :
1097*01826a49SYabin Cui  first 98 values (hence from 0 to 97) use only 7 bits,
1098*01826a49SYabin Cui  values from 98 to 157 use 8 bits.
1099*01826a49SYabin Cui  This is achieved through this scheme :
1100*01826a49SYabin Cui
1101*01826a49SYabin Cui  | Value read | Value decoded | Number of bits used |
1102*01826a49SYabin Cui  | ---------- | ------------- | ------------------- |
1103*01826a49SYabin Cui  |   0 -  97  |   0 -  97     |  7                  |
1104*01826a49SYabin Cui  |  98 - 127  |  98 - 127     |  8                  |
1105*01826a49SYabin Cui  | 128 - 225  |   0 -  97     |  7                  |
1106*01826a49SYabin Cui  | 226 - 255  | 128 - 157     |  8                  |
1107*01826a49SYabin Cui
1108*01826a49SYabin CuiSymbols probabilities are read one by one, in order.
1109*01826a49SYabin Cui
1110*01826a49SYabin CuiProbability is obtained from Value decoded by following formula :
1111*01826a49SYabin Cui`Proba = value - 1`
1112*01826a49SYabin Cui
1113*01826a49SYabin CuiIt means value `0` becomes negative probability `-1`.
1114*01826a49SYabin Cui`-1` is a special probability, which means "less than 1".
1115*01826a49SYabin CuiIts effect on distribution table is described in the [next section].
1116*01826a49SYabin CuiFor the purpose of calculating total allocated probability points, it counts as one.
1117*01826a49SYabin Cui
1118*01826a49SYabin Cui[next section]:#from-normalized-distribution-to-decoding-tables
1119*01826a49SYabin Cui
1120*01826a49SYabin CuiWhen a symbol has a __probability__ of `zero`,
1121*01826a49SYabin Cuiit is followed by a 2-bits repeat flag.
1122*01826a49SYabin CuiThis repeat flag tells how many probabilities of zeroes follow the current one.
1123*01826a49SYabin CuiIt provides a number ranging from 0 to 3.
1124*01826a49SYabin CuiIf it is a 3, another 2-bits repeat flag follows, and so on.
1125*01826a49SYabin Cui
1126*01826a49SYabin CuiWhen last symbol reaches cumulated total of `1 << Accuracy_Log`,
1127*01826a49SYabin Cuidecoding is complete.
1128*01826a49SYabin CuiIf the last symbol makes cumulated total go above `1 << Accuracy_Log`,
1129*01826a49SYabin Cuidistribution is considered corrupted.
1130*01826a49SYabin CuiIf this process results in a non-zero probability for a value outside of the
1131*01826a49SYabin Cuivalid range of values that the FSE table is defined for, even if that value is
1132*01826a49SYabin Cuinot used, then the data is considered corrupted.
1133*01826a49SYabin Cui
1134*01826a49SYabin CuiThen the decoder can tell how many bytes were used in this process,
1135*01826a49SYabin Cuiand how many symbols are present.
1136*01826a49SYabin CuiThe bitstream consumes a round number of bytes.
1137*01826a49SYabin CuiAny remaining bit within the last byte is just unused.
1138*01826a49SYabin Cui
1139*01826a49SYabin Cui#### From normalized distribution to decoding tables
1140*01826a49SYabin Cui
1141*01826a49SYabin CuiThe distribution of normalized probabilities is enough
1142*01826a49SYabin Cuito create a unique decoding table.
1143*01826a49SYabin Cui
1144*01826a49SYabin CuiIt follows the following build rule :
1145*01826a49SYabin Cui
1146*01826a49SYabin CuiThe table has a size of `Table_Size = 1 << Accuracy_Log`.
1147*01826a49SYabin CuiEach cell describes the symbol decoded,
1148*01826a49SYabin Cuiand instructions to get the next state (`Number_of_Bits` and `Baseline`).
1149*01826a49SYabin Cui
1150*01826a49SYabin CuiSymbols are scanned in their natural order for "less than 1" probabilities.
1151*01826a49SYabin CuiSymbols with this probability are being attributed a single cell,
1152*01826a49SYabin Cuistarting from the end of the table and retreating.
1153*01826a49SYabin CuiThese symbols define a full state reset, reading `Accuracy_Log` bits.
1154*01826a49SYabin Cui
1155*01826a49SYabin CuiThen, all remaining symbols, sorted in natural order, are allocated cells.
1156*01826a49SYabin CuiStarting from symbol `0` (if it exists), and table position `0`,
1157*01826a49SYabin Cuieach symbol gets allocated as many cells as its probability.
1158*01826a49SYabin CuiCell allocation is spread, not linear :
1159*01826a49SYabin Cuieach successor position follows this rule :
1160*01826a49SYabin Cui
1161*01826a49SYabin Cui```
1162*01826a49SYabin Cuiposition += (tableSize>>1) + (tableSize>>3) + 3;
1163*01826a49SYabin Cuiposition &= tableSize-1;
1164*01826a49SYabin Cui```
1165*01826a49SYabin Cui
1166*01826a49SYabin CuiA position is skipped if already occupied by a "less than 1" probability symbol.
1167*01826a49SYabin Cui`position` does not reset between symbols, it simply iterates through
1168*01826a49SYabin Cuieach position in the table, switching to the next symbol when enough
1169*01826a49SYabin Cuistates have been allocated to the current one.
1170*01826a49SYabin Cui
1171*01826a49SYabin CuiThe process guarantees that the table is entirely filled.
1172*01826a49SYabin CuiEach cell corresponds to a state value, which contains the symbol being decoded.
1173*01826a49SYabin Cui
1174*01826a49SYabin CuiTo add the `Number_of_Bits` and `Baseline` required to retrieve next state,
1175*01826a49SYabin Cuiit's first necessary to sort all occurrences of each symbol in state order.
1176*01826a49SYabin CuiLower states will need 1 more bit than higher ones.
1177*01826a49SYabin CuiThe process is repeated for each symbol.
1178*01826a49SYabin Cui
1179*01826a49SYabin Cui__Example__ :
1180*01826a49SYabin CuiPresuming a symbol has a probability of 5,
1181*01826a49SYabin Cuiit receives 5 cells, corresponding to 5 state values.
1182*01826a49SYabin CuiThese state values are then sorted in natural order.
1183*01826a49SYabin Cui
1184*01826a49SYabin CuiNext power of 2 after 5 is 8.
1185*01826a49SYabin CuiSpace of probabilities must be divided into 8 equal parts.
1186*01826a49SYabin CuiPresuming the `Accuracy_Log` is 7, it defines a space of 128 states.
1187*01826a49SYabin CuiDivided by 8, each share is 16 large.
1188*01826a49SYabin Cui
1189*01826a49SYabin CuiIn order to reach 8 shares, 8-5=3 lowest states will count "double",
1190*01826a49SYabin Cuidoubling their shares (32 in width), hence requiring one more bit.
1191*01826a49SYabin Cui
1192*01826a49SYabin CuiBaseline is assigned starting from the higher states using fewer bits,
1193*01826a49SYabin Cuiincreasing at each state, then resuming at the first state,
1194*01826a49SYabin Cuieach state takes its allocated width from Baseline.
1195*01826a49SYabin Cui
1196*01826a49SYabin Cui| state order      |   0   |   1   |    2   |   3  |    4   |
1197*01826a49SYabin Cui| ---------------- | ----- | ----- | ------ | ---- | ------ |
1198*01826a49SYabin Cui| state value      |   1   |  39   |   77   |  84  |  122   |
1199*01826a49SYabin Cui| width            |  32   |  32   |   32   |  16  |   16   |
1200*01826a49SYabin Cui| `Number_of_Bits` |   5   |   5   |    5   |   4  |    4   |
1201*01826a49SYabin Cui| range number     |   2   |   4   |    6   |   0  |    1   |
1202*01826a49SYabin Cui| `Baseline`       |  32   |  64   |   96   |   0  |   16   |
1203*01826a49SYabin Cui| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31  |
1204*01826a49SYabin Cui
1205*01826a49SYabin CuiDuring decoding, the next state value is determined from current state value,
1206*01826a49SYabin Cuiby reading the required `Number_of_Bits`, and adding the specified `Baseline`.
1207*01826a49SYabin Cui
1208*01826a49SYabin CuiSee [Appendix A] for the results of this process applied to the default distributions.
1209*01826a49SYabin Cui
1210*01826a49SYabin Cui[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
1211*01826a49SYabin Cui
1212*01826a49SYabin Cui
1213*01826a49SYabin CuiHuffman Coding
1214*01826a49SYabin Cui--------------
1215*01826a49SYabin CuiZstandard Huffman-coded streams are read backwards,
1216*01826a49SYabin Cuisimilar to the FSE bitstreams.
1217*01826a49SYabin CuiTherefore, to find the start of the bitstream, it is required to
1218*01826a49SYabin Cuiknow the offset of the last byte of the Huffman-coded stream.
1219*01826a49SYabin Cui
1220*01826a49SYabin CuiAfter writing the last bit containing information, the compressor
1221*01826a49SYabin Cuiwrites a single `1`-bit and then fills the byte with 0-7 `0` bits of
1222*01826a49SYabin Cuipadding. The last byte of the compressed bitstream cannot be `0` for
1223*01826a49SYabin Cuithat reason.
1224*01826a49SYabin Cui
1225*01826a49SYabin CuiWhen decompressing, the last byte containing the padding is the first
1226*01826a49SYabin Cuibyte to read. The decompressor needs to skip 0-7 initial `0`-bits and
1227*01826a49SYabin Cuithe first `1`-bit it occurs. Afterwards, the useful part of the bitstream
1228*01826a49SYabin Cuibegins.
1229*01826a49SYabin Cui
1230*01826a49SYabin CuiThe bitstream contains Huffman-coded symbols in __little-endian__ order,
1231*01826a49SYabin Cuiwith the codes defined by the method below.
1232*01826a49SYabin Cui
1233*01826a49SYabin Cui### Huffman Tree Description
1234*01826a49SYabin Cui
1235*01826a49SYabin CuiPrefix coding represents symbols from an a priori known alphabet
1236*01826a49SYabin Cuiby bit sequences (codewords), one codeword for each symbol,
1237*01826a49SYabin Cuiin a manner such that different symbols may be represented
1238*01826a49SYabin Cuiby bit sequences of different lengths,
1239*01826a49SYabin Cuibut a parser can always parse an encoded string
1240*01826a49SYabin Cuiunambiguously symbol-by-symbol.
1241*01826a49SYabin Cui
1242*01826a49SYabin CuiGiven an alphabet with known symbol frequencies,
1243*01826a49SYabin Cuithe Huffman algorithm allows the construction of an optimal prefix code
1244*01826a49SYabin Cuiusing the fewest bits of any possible prefix codes for that alphabet.
1245*01826a49SYabin Cui
1246*01826a49SYabin CuiPrefix code must not exceed a maximum code length.
1247*01826a49SYabin CuiMore bits improve accuracy but cost more header size,
1248*01826a49SYabin Cuiand require more memory or more complex decoding operations.
1249*01826a49SYabin CuiThis specification limits maximum code length to 11 bits.
1250*01826a49SYabin Cui
1251*01826a49SYabin Cui#### Representation
1252*01826a49SYabin Cui
1253*01826a49SYabin CuiAll literal values from zero (included) to last present one (excluded)
1254*01826a49SYabin Cuiare represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
1255*01826a49SYabin CuiTransformation from `Weight` to `Number_of_Bits` follows this formula :
1256*01826a49SYabin Cui```
1257*01826a49SYabin CuiNumber_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
1258*01826a49SYabin Cui```
1259*01826a49SYabin CuiWhen a literal value is not present, it receives a `Weight` of 0.
1260*01826a49SYabin CuiThe least frequent symbol receives a `Weight` of 1.
1261*01826a49SYabin CuiIf no literal has a `Weight` of 1, then the data is considered corrupted.
1262*01826a49SYabin CuiIf there are not at least two literals with non-zero `Weight`, then the data
1263*01826a49SYabin Cuiis considered corrupted.
1264*01826a49SYabin CuiThe most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
1265*01826a49SYabin CuiThe last symbol's `Weight` is deduced from previously retrieved Weights,
1266*01826a49SYabin Cuiby completing to the nearest power of 2. It's necessarily non 0.
1267*01826a49SYabin CuiIf it's not possible to reach a clean power of 2 with a single `Weight` value,
1268*01826a49SYabin Cuithe Huffman Tree Description is considered invalid.
1269*01826a49SYabin CuiThis final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
1270*01826a49SYabin Cui`Max_Number_of_Bits` must be <= 11,
1271*01826a49SYabin Cuiotherwise the representation is considered corrupted.
1272*01826a49SYabin Cui
1273*01826a49SYabin Cui__Example__ :
1274*01826a49SYabin CuiLet's presume the following Huffman tree must be described :
1275*01826a49SYabin Cui
1276*01826a49SYabin Cui|  literal value   |  0  |  1  |  2  |  3  |  4  |  5  |
1277*01826a49SYabin Cui| ---------------- | --- | --- | --- | --- | --- | --- |
1278*01826a49SYabin Cui| `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |
1279*01826a49SYabin Cui
1280*01826a49SYabin CuiThe tree depth is 4, since its longest elements uses 4 bits
1281*01826a49SYabin Cui(longest elements are the one with smallest frequency).
1282*01826a49SYabin CuiLiteral value `5` will not be listed, as it can be determined from previous values 0-4,
1283*01826a49SYabin Cuinor will values above `5` as they are all 0.
1284*01826a49SYabin CuiValues from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
1285*01826a49SYabin CuiWeight formula is :
1286*01826a49SYabin Cui```
1287*01826a49SYabin CuiWeight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
1288*01826a49SYabin Cui```
1289*01826a49SYabin CuiIt gives the following series of weights :
1290*01826a49SYabin Cui
1291*01826a49SYabin Cui| literal value |  0  |  1  |  2  |  3  |  4  |
1292*01826a49SYabin Cui| ------------- | --- | --- | --- | --- | --- |
1293*01826a49SYabin Cui|   `Weight`    |  4  |  3  |  2  |  0  |  1  |
1294*01826a49SYabin Cui
1295*01826a49SYabin CuiThe decoder will do the inverse operation :
1296*01826a49SYabin Cuihaving collected weights of literal symbols from `0` to `4`,
1297*01826a49SYabin Cuiit knows the last literal, `5`, is present with a non-zero `Weight`.
1298*01826a49SYabin CuiThe `Weight` of `5` can be determined by advancing to the next power of 2.
1299*01826a49SYabin CuiThe sum of `2^(Weight-1)` (excluding 0's) is :
1300*01826a49SYabin Cui`8 + 4 + 2 + 0 + 1 = 15`.
1301*01826a49SYabin CuiNearest larger power of 2 value is 16.
1302*01826a49SYabin CuiTherefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
1303*01826a49SYabin Cui
1304*01826a49SYabin Cui#### Huffman Tree header
1305*01826a49SYabin Cui
1306*01826a49SYabin CuiThis is a single byte value (0-255),
1307*01826a49SYabin Cuiwhich describes how the series of weights is encoded.
1308*01826a49SYabin Cui
1309*01826a49SYabin Cui- if `headerByte` < 128 :
1310*01826a49SYabin Cui  the series of weights is compressed using FSE (see below).
1311*01826a49SYabin Cui  The length of the FSE-compressed series is equal to `headerByte` (0-127).
1312*01826a49SYabin Cui
1313*01826a49SYabin Cui- if `headerByte` >= 128 :
1314*01826a49SYabin Cui  + the series of weights uses a direct representation,
1315*01826a49SYabin Cui    where each `Weight` is encoded directly as a 4 bits field (0-15).
1316*01826a49SYabin Cui  + They are encoded forward, 2 weights to a byte,
1317*01826a49SYabin Cui    first weight taking the top four bits and second one taking the bottom four.
1318*01826a49SYabin Cui    * e.g. the following operations could be used to read the weights:
1319*01826a49SYabin Cui      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
1320*01826a49SYabin Cui  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
1321*01826a49SYabin Cui    meaning it uses only full bytes even if `Number_of_Weights` is odd.
1322*01826a49SYabin Cui  + `Number_of_Weights = headerByte - 127`.
1323*01826a49SYabin Cui    * Note that maximum `Number_of_Weights` is 255-127 = 128,
1324*01826a49SYabin Cui      therefore, only up to 128 `Weight` can be encoded using direct representation.
1325*01826a49SYabin Cui    * Since the last non-zero `Weight` is _not_ encoded,
1326*01826a49SYabin Cui      this scheme is compatible with alphabet sizes of up to 129 symbols,
1327*01826a49SYabin Cui      hence including literal symbol 128.
1328*01826a49SYabin Cui    * If any literal symbol > 128 has a non-zero `Weight`,
1329*01826a49SYabin Cui      direct representation is not possible.
1330*01826a49SYabin Cui      In such case, it's necessary to use FSE compression.
1331*01826a49SYabin Cui
1332*01826a49SYabin Cui
1333*01826a49SYabin Cui#### Finite State Entropy (FSE) compression of Huffman weights
1334*01826a49SYabin Cui
1335*01826a49SYabin CuiIn this case, the series of Huffman weights is compressed using FSE compression.
1336*01826a49SYabin CuiIt's a single bitstream with 2 interleaved states,
1337*01826a49SYabin Cuisharing a single distribution table.
1338*01826a49SYabin Cui
1339*01826a49SYabin CuiTo decode an FSE bitstream, it is necessary to know its compressed size.
1340*01826a49SYabin CuiCompressed size is provided by `headerByte`.
1341*01826a49SYabin CuiIt's also necessary to know its _maximum possible_ decompressed size,
1342*01826a49SYabin Cuiwhich is `255`, since literal values span from `0` to `255`,
1343*01826a49SYabin Cuiand last symbol's `Weight` is not represented.
1344*01826a49SYabin Cui
1345*01826a49SYabin CuiAn FSE bitstream starts by a header, describing probabilities distribution.
1346*01826a49SYabin CuiIt will create a Decoding Table.
1347*01826a49SYabin CuiFor a list of Huffman weights, the maximum accuracy log is 6 bits.
1348*01826a49SYabin CuiFor more description see the [FSE header description](#fse-table-description)
1349*01826a49SYabin Cui
1350*01826a49SYabin CuiThe Huffman header compression uses 2 states,
1351*01826a49SYabin Cuiwhich share the same FSE distribution table.
1352*01826a49SYabin CuiThe first state (`State1`) encodes the even indexed symbols,
1353*01826a49SYabin Cuiand the second (`State2`) encodes the odd indexed symbols.
1354*01826a49SYabin Cui`State1` is initialized first, and then `State2`, and they take turns
1355*01826a49SYabin Cuidecoding a single symbol and updating their state.
1356*01826a49SYabin CuiFor more details on these FSE operations, see the [FSE section](#fse).
1357*01826a49SYabin Cui
1358*01826a49SYabin CuiThe number of symbols to decode is determined
1359*01826a49SYabin Cuiby tracking bitStream overflow condition:
1360*01826a49SYabin CuiIf updating state after decoding a symbol would require more bits than
1361*01826a49SYabin Cuiremain in the stream, it is assumed that extra bits are 0.  Then,
1362*01826a49SYabin Cuisymbols for each of the final states are decoded and the process is complete.
1363*01826a49SYabin Cui
1364*01826a49SYabin CuiIf this process would produce more weights than the maximum number of decoded
1365*01826a49SYabin Cuiweights (255), then the data is considered corrupted.
1366*01826a49SYabin Cui
1367*01826a49SYabin Cui#### Conversion from weights to Huffman prefix codes
1368*01826a49SYabin Cui
1369*01826a49SYabin CuiAll present symbols shall now have a `Weight` value.
1370*01826a49SYabin CuiIt is possible to transform weights into `Number_of_Bits`, using this formula:
1371*01826a49SYabin Cui```
1372*01826a49SYabin CuiNumber_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
1373*01826a49SYabin Cui```
1374*01826a49SYabin CuiSymbols are sorted by `Weight`.
1375*01826a49SYabin CuiWithin same `Weight`, symbols keep natural sequential order.
1376*01826a49SYabin CuiSymbols with a `Weight` of zero are removed.
1377*01826a49SYabin CuiThen, starting from lowest `Weight`, prefix codes are distributed in sequential order.
1378*01826a49SYabin Cui
1379*01826a49SYabin Cui__Example__ :
1380*01826a49SYabin CuiLet's presume the following list of weights has been decoded :
1381*01826a49SYabin Cui
1382*01826a49SYabin Cui| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
1383*01826a49SYabin Cui| -------- | --- | --- | --- | --- | --- | --- |
1384*01826a49SYabin Cui| `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |
1385*01826a49SYabin Cui
1386*01826a49SYabin CuiSorted by weight and then natural sequential order,
1387*01826a49SYabin Cuiit gives the following distribution :
1388*01826a49SYabin Cui
1389*01826a49SYabin Cui| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
1390*01826a49SYabin Cui| ---------------- | --- | --- | --- | --- | --- | ---- |
1391*01826a49SYabin Cui| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
1392*01826a49SYabin Cui| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
1393*01826a49SYabin Cui| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |
1394*01826a49SYabin Cui
1395*01826a49SYabin Cui### Huffman-coded Streams
1396*01826a49SYabin Cui
1397*01826a49SYabin CuiGiven a Huffman decoding table,
1398*01826a49SYabin Cuiit's possible to decode a Huffman-coded stream.
1399*01826a49SYabin Cui
1400*01826a49SYabin CuiEach bitstream must be read _backward_,
1401*01826a49SYabin Cuithat is starting from the end down to the beginning.
1402*01826a49SYabin CuiTherefore it's necessary to know the size of each bitstream.
1403*01826a49SYabin Cui
1404*01826a49SYabin CuiIt's also necessary to know exactly which _bit_ is the last one.
1405*01826a49SYabin CuiThis is detected by a final bit flag :
1406*01826a49SYabin Cuithe highest bit of latest byte is a final-bit-flag.
1407*01826a49SYabin CuiConsequently, a last byte of `0` is not possible.
1408*01826a49SYabin CuiAnd the final-bit-flag itself is not part of the useful bitstream.
1409*01826a49SYabin CuiHence, the last byte contains between 0 and 7 useful bits.
1410*01826a49SYabin Cui
1411*01826a49SYabin CuiStarting from the end,
1412*01826a49SYabin Cuiit's possible to read the bitstream in a __little-endian__ fashion,
1413*01826a49SYabin Cuikeeping track of already used bits. Since the bitstream is encoded in reverse
1414*01826a49SYabin Cuiorder, starting from the end read symbols in forward order.
1415*01826a49SYabin Cui
1416*01826a49SYabin CuiFor example, if the literal sequence "0145" was encoded using above prefix code,
1417*01826a49SYabin Cuiit would be encoded (in reverse order) as:
1418*01826a49SYabin Cui
1419*01826a49SYabin Cui|Symbol  |   5  |   4  |  1 | 0 | Padding |
1420*01826a49SYabin Cui|--------|------|------|----|---|---------|
1421*01826a49SYabin Cui|Encoding|`0000`|`0001`|`01`|`1`| `00001` |
1422*01826a49SYabin Cui
1423*01826a49SYabin CuiResulting in following 2-bytes bitstream :
1424*01826a49SYabin Cui```
1425*01826a49SYabin Cui00010000 00001101
1426*01826a49SYabin Cui```
1427*01826a49SYabin Cui
1428*01826a49SYabin CuiHere is an alternative representation with the symbol codes separated by underscore:
1429*01826a49SYabin Cui```
1430*01826a49SYabin Cui0001_0000 00001_1_01
1431*01826a49SYabin Cui```
1432*01826a49SYabin Cui
1433*01826a49SYabin CuiReading highest `Max_Number_of_Bits` bits,
1434*01826a49SYabin Cuiit's possible to compare extracted value to decoding table,
1435*01826a49SYabin Cuidetermining the symbol to decode and number of bits to discard.
1436*01826a49SYabin Cui
1437*01826a49SYabin CuiThe process continues up to reading the required number of symbols per stream.
1438*01826a49SYabin CuiIf a bitstream is not entirely and exactly consumed,
1439*01826a49SYabin Cuihence reaching exactly its beginning position with _all_ bits consumed,
1440*01826a49SYabin Cuithe decoding process is considered faulty.
1441*01826a49SYabin Cui
1442*01826a49SYabin Cui
1443*01826a49SYabin CuiDictionary Format
1444*01826a49SYabin Cui-----------------
1445*01826a49SYabin Cui
1446*01826a49SYabin CuiZstandard is compatible with "raw content" dictionaries,
1447*01826a49SYabin Cuifree of any format restriction, except that they must be at least 8 bytes.
1448*01826a49SYabin CuiThese dictionaries function as if they were just the `Content` part
1449*01826a49SYabin Cuiof a formatted dictionary.
1450*01826a49SYabin Cui
1451*01826a49SYabin CuiBut dictionaries created by `zstd --train` follow a format, described here.
1452*01826a49SYabin Cui
1453*01826a49SYabin Cui__Pre-requisites__ : a dictionary has a size,
1454*01826a49SYabin Cui                     defined either by a buffer limit, or a file size.
1455*01826a49SYabin Cui
1456*01826a49SYabin Cui| `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` |
1457*01826a49SYabin Cui| -------------- | --------------- | ---------------- | --------- |
1458*01826a49SYabin Cui
1459*01826a49SYabin Cui__`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format
1460*01826a49SYabin Cui
1461*01826a49SYabin Cui__`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format.
1462*01826a49SYabin Cui              `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`).
1463*01826a49SYabin Cui              It's used by decoders to check if they use the correct dictionary.
1464*01826a49SYabin Cui
1465*01826a49SYabin Cui_Reserved ranges :_
1466*01826a49SYabin CuiIf the dictionary is going to be distributed in a public environment,
1467*01826a49SYabin Cuithe following ranges of `Dictionary_ID` are reserved for some future registrar
1468*01826a49SYabin Cuiand shall not be used :
1469*01826a49SYabin Cui
1470*01826a49SYabin Cui    - low range  : <= 32767
1471*01826a49SYabin Cui    - high range : >= (2^31)
1472*01826a49SYabin Cui
1473*01826a49SYabin CuiOutside of these ranges, any value of `Dictionary_ID`
1474*01826a49SYabin Cuiwhich is both `>= 32768` and `< (1<<31)` can be used freely,
1475*01826a49SYabin Cuieven in public environment.
1476*01826a49SYabin Cui
1477*01826a49SYabin Cui
1478*01826a49SYabin Cui__`Entropy_Tables`__ : follow the same format as tables in [compressed blocks].
1479*01826a49SYabin Cui              See the relevant [FSE](#fse-table-description)
1480*01826a49SYabin Cui              and [Huffman](#huffman-tree-description) sections for how to decode these tables.
1481*01826a49SYabin Cui              They are stored in following order :
1482*01826a49SYabin Cui              Huffman tables for literals, FSE table for offsets,
1483*01826a49SYabin Cui              FSE table for match lengths, and FSE table for literals lengths.
1484*01826a49SYabin Cui              These tables populate the Repeat Stats literals mode and
1485*01826a49SYabin Cui              Repeat distribution mode for sequence decoding.
1486*01826a49SYabin Cui              It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`),
1487*01826a49SYabin Cui              stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes.
1488*01826a49SYabin Cui              Each recent offset must have a value <= dictionary content size, and cannot equal 0.
1489*01826a49SYabin Cui
1490*01826a49SYabin Cui__`Content`__ : The rest of the dictionary is its content.
1491*01826a49SYabin Cui              The content act as a "past" in front of data to compress or decompress,
1492*01826a49SYabin Cui              so it can be referenced in sequence commands.
1493*01826a49SYabin Cui              As long as the amount of data decoded from this frame is less than or
1494*01826a49SYabin Cui              equal to `Window_Size`, sequence commands may specify offsets longer
1495*01826a49SYabin Cui              than the total length of decoded output so far to reference back to the
1496*01826a49SYabin Cui              dictionary, even parts of the dictionary with offsets larger than `Window_Size`.
1497*01826a49SYabin Cui              After the total output has surpassed `Window_Size` however,
1498*01826a49SYabin Cui              this is no longer allowed and the dictionary is no longer accessible.
1499*01826a49SYabin Cui
1500*01826a49SYabin Cui[compressed blocks]: #the-format-of-compressed_block
1501*01826a49SYabin Cui
1502*01826a49SYabin CuiIf a dictionary is provided by an external source,
1503*01826a49SYabin Cuiit should be loaded with great care, its content considered untrusted.
1504*01826a49SYabin Cui
1505*01826a49SYabin Cui
1506*01826a49SYabin Cui
1507*01826a49SYabin CuiAppendix A - Decoding tables for predefined codes
1508*01826a49SYabin Cui-------------------------------------------------
1509*01826a49SYabin Cui
1510*01826a49SYabin CuiThis appendix contains FSE decoding tables
1511*01826a49SYabin Cuifor the predefined literal length, match length, and offset codes.
1512*01826a49SYabin CuiThe tables have been constructed using the algorithm as given above in chapter
1513*01826a49SYabin Cui"from normalized distribution to decoding tables".
1514*01826a49SYabin CuiThe tables here can be used as examples
1515*01826a49SYabin Cuito crosscheck that an implementation build its decoding tables correctly.
1516*01826a49SYabin Cui
1517*01826a49SYabin Cui#### Literal Length Code:
1518*01826a49SYabin Cui
1519*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base |
1520*01826a49SYabin Cui| ----- | ------ | -------------- | ---- |
1521*01826a49SYabin Cui|     0 |      0 |              4 |    0 |
1522*01826a49SYabin Cui|     1 |      0 |              4 |   16 |
1523*01826a49SYabin Cui|     2 |      1 |              5 |   32 |
1524*01826a49SYabin Cui|     3 |      3 |              5 |    0 |
1525*01826a49SYabin Cui|     4 |      4 |              5 |    0 |
1526*01826a49SYabin Cui|     5 |      6 |              5 |    0 |
1527*01826a49SYabin Cui|     6 |      7 |              5 |    0 |
1528*01826a49SYabin Cui|     7 |      9 |              5 |    0 |
1529*01826a49SYabin Cui|     8 |     10 |              5 |    0 |
1530*01826a49SYabin Cui|     9 |     12 |              5 |    0 |
1531*01826a49SYabin Cui|    10 |     14 |              6 |    0 |
1532*01826a49SYabin Cui|    11 |     16 |              5 |    0 |
1533*01826a49SYabin Cui|    12 |     18 |              5 |    0 |
1534*01826a49SYabin Cui|    13 |     19 |              5 |    0 |
1535*01826a49SYabin Cui|    14 |     21 |              5 |    0 |
1536*01826a49SYabin Cui|    15 |     22 |              5 |    0 |
1537*01826a49SYabin Cui|    16 |     24 |              5 |    0 |
1538*01826a49SYabin Cui|    17 |     25 |              5 |   32 |
1539*01826a49SYabin Cui|    18 |     26 |              5 |    0 |
1540*01826a49SYabin Cui|    19 |     27 |              6 |    0 |
1541*01826a49SYabin Cui|    20 |     29 |              6 |    0 |
1542*01826a49SYabin Cui|    21 |     31 |              6 |    0 |
1543*01826a49SYabin Cui|    22 |      0 |              4 |   32 |
1544*01826a49SYabin Cui|    23 |      1 |              4 |    0 |
1545*01826a49SYabin Cui|    24 |      2 |              5 |    0 |
1546*01826a49SYabin Cui|    25 |      4 |              5 |   32 |
1547*01826a49SYabin Cui|    26 |      5 |              5 |    0 |
1548*01826a49SYabin Cui|    27 |      7 |              5 |   32 |
1549*01826a49SYabin Cui|    28 |      8 |              5 |    0 |
1550*01826a49SYabin Cui|    29 |     10 |              5 |   32 |
1551*01826a49SYabin Cui|    30 |     11 |              5 |    0 |
1552*01826a49SYabin Cui|    31 |     13 |              6 |    0 |
1553*01826a49SYabin Cui|    32 |     16 |              5 |   32 |
1554*01826a49SYabin Cui|    33 |     17 |              5 |    0 |
1555*01826a49SYabin Cui|    34 |     19 |              5 |   32 |
1556*01826a49SYabin Cui|    35 |     20 |              5 |    0 |
1557*01826a49SYabin Cui|    36 |     22 |              5 |   32 |
1558*01826a49SYabin Cui|    37 |     23 |              5 |    0 |
1559*01826a49SYabin Cui|    38 |     25 |              4 |    0 |
1560*01826a49SYabin Cui|    39 |     25 |              4 |   16 |
1561*01826a49SYabin Cui|    40 |     26 |              5 |   32 |
1562*01826a49SYabin Cui|    41 |     28 |              6 |    0 |
1563*01826a49SYabin Cui|    42 |     30 |              6 |    0 |
1564*01826a49SYabin Cui|    43 |      0 |              4 |   48 |
1565*01826a49SYabin Cui|    44 |      1 |              4 |   16 |
1566*01826a49SYabin Cui|    45 |      2 |              5 |   32 |
1567*01826a49SYabin Cui|    46 |      3 |              5 |   32 |
1568*01826a49SYabin Cui|    47 |      5 |              5 |   32 |
1569*01826a49SYabin Cui|    48 |      6 |              5 |   32 |
1570*01826a49SYabin Cui|    49 |      8 |              5 |   32 |
1571*01826a49SYabin Cui|    50 |      9 |              5 |   32 |
1572*01826a49SYabin Cui|    51 |     11 |              5 |   32 |
1573*01826a49SYabin Cui|    52 |     12 |              5 |   32 |
1574*01826a49SYabin Cui|    53 |     15 |              6 |    0 |
1575*01826a49SYabin Cui|    54 |     17 |              5 |   32 |
1576*01826a49SYabin Cui|    55 |     18 |              5 |   32 |
1577*01826a49SYabin Cui|    56 |     20 |              5 |   32 |
1578*01826a49SYabin Cui|    57 |     21 |              5 |   32 |
1579*01826a49SYabin Cui|    58 |     23 |              5 |   32 |
1580*01826a49SYabin Cui|    59 |     24 |              5 |   32 |
1581*01826a49SYabin Cui|    60 |     35 |              6 |    0 |
1582*01826a49SYabin Cui|    61 |     34 |              6 |    0 |
1583*01826a49SYabin Cui|    62 |     33 |              6 |    0 |
1584*01826a49SYabin Cui|    63 |     32 |              6 |    0 |
1585*01826a49SYabin Cui
1586*01826a49SYabin Cui#### Match Length Code:
1587*01826a49SYabin Cui
1588*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base |
1589*01826a49SYabin Cui| ----- | ------ | -------------- | ---- |
1590*01826a49SYabin Cui|     0 |      0 |              6 |    0 |
1591*01826a49SYabin Cui|     1 |      1 |              4 |    0 |
1592*01826a49SYabin Cui|     2 |      2 |              5 |   32 |
1593*01826a49SYabin Cui|     3 |      3 |              5 |    0 |
1594*01826a49SYabin Cui|     4 |      5 |              5 |    0 |
1595*01826a49SYabin Cui|     5 |      6 |              5 |    0 |
1596*01826a49SYabin Cui|     6 |      8 |              5 |    0 |
1597*01826a49SYabin Cui|     7 |     10 |              6 |    0 |
1598*01826a49SYabin Cui|     8 |     13 |              6 |    0 |
1599*01826a49SYabin Cui|     9 |     16 |              6 |    0 |
1600*01826a49SYabin Cui|    10 |     19 |              6 |    0 |
1601*01826a49SYabin Cui|    11 |     22 |              6 |    0 |
1602*01826a49SYabin Cui|    12 |     25 |              6 |    0 |
1603*01826a49SYabin Cui|    13 |     28 |              6 |    0 |
1604*01826a49SYabin Cui|    14 |     31 |              6 |    0 |
1605*01826a49SYabin Cui|    15 |     33 |              6 |    0 |
1606*01826a49SYabin Cui|    16 |     35 |              6 |    0 |
1607*01826a49SYabin Cui|    17 |     37 |              6 |    0 |
1608*01826a49SYabin Cui|    18 |     39 |              6 |    0 |
1609*01826a49SYabin Cui|    19 |     41 |              6 |    0 |
1610*01826a49SYabin Cui|    20 |     43 |              6 |    0 |
1611*01826a49SYabin Cui|    21 |     45 |              6 |    0 |
1612*01826a49SYabin Cui|    22 |      1 |              4 |   16 |
1613*01826a49SYabin Cui|    23 |      2 |              4 |    0 |
1614*01826a49SYabin Cui|    24 |      3 |              5 |   32 |
1615*01826a49SYabin Cui|    25 |      4 |              5 |    0 |
1616*01826a49SYabin Cui|    26 |      6 |              5 |   32 |
1617*01826a49SYabin Cui|    27 |      7 |              5 |    0 |
1618*01826a49SYabin Cui|    28 |      9 |              6 |    0 |
1619*01826a49SYabin Cui|    29 |     12 |              6 |    0 |
1620*01826a49SYabin Cui|    30 |     15 |              6 |    0 |
1621*01826a49SYabin Cui|    31 |     18 |              6 |    0 |
1622*01826a49SYabin Cui|    32 |     21 |              6 |    0 |
1623*01826a49SYabin Cui|    33 |     24 |              6 |    0 |
1624*01826a49SYabin Cui|    34 |     27 |              6 |    0 |
1625*01826a49SYabin Cui|    35 |     30 |              6 |    0 |
1626*01826a49SYabin Cui|    36 |     32 |              6 |    0 |
1627*01826a49SYabin Cui|    37 |     34 |              6 |    0 |
1628*01826a49SYabin Cui|    38 |     36 |              6 |    0 |
1629*01826a49SYabin Cui|    39 |     38 |              6 |    0 |
1630*01826a49SYabin Cui|    40 |     40 |              6 |    0 |
1631*01826a49SYabin Cui|    41 |     42 |              6 |    0 |
1632*01826a49SYabin Cui|    42 |     44 |              6 |    0 |
1633*01826a49SYabin Cui|    43 |      1 |              4 |   32 |
1634*01826a49SYabin Cui|    44 |      1 |              4 |   48 |
1635*01826a49SYabin Cui|    45 |      2 |              4 |   16 |
1636*01826a49SYabin Cui|    46 |      4 |              5 |   32 |
1637*01826a49SYabin Cui|    47 |      5 |              5 |   32 |
1638*01826a49SYabin Cui|    48 |      7 |              5 |   32 |
1639*01826a49SYabin Cui|    49 |      8 |              5 |   32 |
1640*01826a49SYabin Cui|    50 |     11 |              6 |    0 |
1641*01826a49SYabin Cui|    51 |     14 |              6 |    0 |
1642*01826a49SYabin Cui|    52 |     17 |              6 |    0 |
1643*01826a49SYabin Cui|    53 |     20 |              6 |    0 |
1644*01826a49SYabin Cui|    54 |     23 |              6 |    0 |
1645*01826a49SYabin Cui|    55 |     26 |              6 |    0 |
1646*01826a49SYabin Cui|    56 |     29 |              6 |    0 |
1647*01826a49SYabin Cui|    57 |     52 |              6 |    0 |
1648*01826a49SYabin Cui|    58 |     51 |              6 |    0 |
1649*01826a49SYabin Cui|    59 |     50 |              6 |    0 |
1650*01826a49SYabin Cui|    60 |     49 |              6 |    0 |
1651*01826a49SYabin Cui|    61 |     48 |              6 |    0 |
1652*01826a49SYabin Cui|    62 |     47 |              6 |    0 |
1653*01826a49SYabin Cui|    63 |     46 |              6 |    0 |
1654*01826a49SYabin Cui
1655*01826a49SYabin Cui#### Offset Code:
1656*01826a49SYabin Cui
1657*01826a49SYabin Cui| State | Symbol | Number_Of_Bits | Base |
1658*01826a49SYabin Cui| ----- | ------ | -------------- | ---- |
1659*01826a49SYabin Cui|     0 |      0 |              5 |    0 |
1660*01826a49SYabin Cui|     1 |      6 |              4 |    0 |
1661*01826a49SYabin Cui|     2 |      9 |              5 |    0 |
1662*01826a49SYabin Cui|     3 |     15 |              5 |    0 |
1663*01826a49SYabin Cui|     4 |     21 |              5 |    0 |
1664*01826a49SYabin Cui|     5 |      3 |              5 |    0 |
1665*01826a49SYabin Cui|     6 |      7 |              4 |    0 |
1666*01826a49SYabin Cui|     7 |     12 |              5 |    0 |
1667*01826a49SYabin Cui|     8 |     18 |              5 |    0 |
1668*01826a49SYabin Cui|     9 |     23 |              5 |    0 |
1669*01826a49SYabin Cui|    10 |      5 |              5 |    0 |
1670*01826a49SYabin Cui|    11 |      8 |              4 |    0 |
1671*01826a49SYabin Cui|    12 |     14 |              5 |    0 |
1672*01826a49SYabin Cui|    13 |     20 |              5 |    0 |
1673*01826a49SYabin Cui|    14 |      2 |              5 |    0 |
1674*01826a49SYabin Cui|    15 |      7 |              4 |   16 |
1675*01826a49SYabin Cui|    16 |     11 |              5 |    0 |
1676*01826a49SYabin Cui|    17 |     17 |              5 |    0 |
1677*01826a49SYabin Cui|    18 |     22 |              5 |    0 |
1678*01826a49SYabin Cui|    19 |      4 |              5 |    0 |
1679*01826a49SYabin Cui|    20 |      8 |              4 |   16 |
1680*01826a49SYabin Cui|    21 |     13 |              5 |    0 |
1681*01826a49SYabin Cui|    22 |     19 |              5 |    0 |
1682*01826a49SYabin Cui|    23 |      1 |              5 |    0 |
1683*01826a49SYabin Cui|    24 |      6 |              4 |   16 |
1684*01826a49SYabin Cui|    25 |     10 |              5 |    0 |
1685*01826a49SYabin Cui|    26 |     16 |              5 |    0 |
1686*01826a49SYabin Cui|    27 |     28 |              5 |    0 |
1687*01826a49SYabin Cui|    28 |     27 |              5 |    0 |
1688*01826a49SYabin Cui|    29 |     26 |              5 |    0 |
1689*01826a49SYabin Cui|    30 |     25 |              5 |    0 |
1690*01826a49SYabin Cui|    31 |     24 |              5 |    0 |
1691*01826a49SYabin Cui
1692*01826a49SYabin Cui
1693*01826a49SYabin Cui
1694*01826a49SYabin CuiAppendix B - Resources for implementers
1695*01826a49SYabin Cui-------------------------------------------------
1696*01826a49SYabin Cui
1697*01826a49SYabin CuiAn open source reference implementation is available on :
1698*01826a49SYabin Cuihttps://github.com/facebook/zstd
1699*01826a49SYabin Cui
1700*01826a49SYabin CuiThe project contains a frame generator, called [decodeCorpus],
1701*01826a49SYabin Cuiwhich can be used by any 3rd-party implementation
1702*01826a49SYabin Cuito verify that a tested decoder is compliant with the specification.
1703*01826a49SYabin Cui
1704*01826a49SYabin Cui[decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing
1705*01826a49SYabin Cui
1706*01826a49SYabin Cui`decodeCorpus` generates random valid frames.
1707*01826a49SYabin CuiA compliant decoder should be able to decode them all,
1708*01826a49SYabin Cuior at least provide a meaningful error code explaining for which reason it cannot
1709*01826a49SYabin Cui(memory limit restrictions for example).
1710*01826a49SYabin Cui
1711*01826a49SYabin Cui
1712*01826a49SYabin CuiVersion changes
1713*01826a49SYabin Cui---------------
1714*01826a49SYabin Cui- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
1715*01826a49SYabin Cui- 0.3.9 : clarifications for Huffman-compressed literal sizes.
1716*01826a49SYabin Cui- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
1717*01826a49SYabin Cui- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
1718*01826a49SYabin Cui- 0.3.6 : clarifications for Dictionary_ID
1719*01826a49SYabin Cui- 0.3.5 : clarifications for Block_Maximum_Size
1720*01826a49SYabin Cui- 0.3.4 : clarifications for FSE decoding table
1721*01826a49SYabin Cui- 0.3.3 : clarifications for field Block_Size
1722*01826a49SYabin Cui- 0.3.2 : remove additional block size restriction on compressed blocks
1723*01826a49SYabin Cui- 0.3.1 : minor clarification regarding offset history update rules
1724*01826a49SYabin Cui- 0.3.0 : minor edits to match RFC8478
1725*01826a49SYabin Cui- 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz
1726*01826a49SYabin Cui- 0.2.8 : clarifications for IETF RFC discuss
1727*01826a49SYabin Cui- 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell
1728*01826a49SYabin Cui- 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz
1729*01826a49SYabin Cui- 0.2.5 : minor typos and clarifications
1730*01826a49SYabin Cui- 0.2.4 : section restructuring, by Sean Purcell
1731*01826a49SYabin Cui- 0.2.3 : clarified several details, by Sean Purcell
1732*01826a49SYabin Cui- 0.2.2 : added predefined codes, by Johannes Rudolph
1733*01826a49SYabin Cui- 0.2.1 : clarify field names, by Przemyslaw Skibinski
1734*01826a49SYabin Cui- 0.2.0 : numerous format adjustments for zstd v0.8+
1735*01826a49SYabin Cui- 0.1.2 : limit Huffman tree depth to 11 bits
1736*01826a49SYabin Cui- 0.1.1 : reserved dictID ranges
1737*01826a49SYabin Cui- 0.1.0 : initial release
1738