Gzip internals: How to implement gzip (de)compression
Gzip (typically files with .gz extension) is a file format that contains compressed data blocks. It is used in web protocol such as http to compress data being communicated which saves bandwidth, for example, compressing the message body (webpage) of an http response. The interesting thing about gzip is that gzip decompression can be implemented as a streaming algorithm which makes it a good candidate for compression in web protocols. For example, if you go to wikipedia and look at the GET http network call in your web browser, you will see that there is a response header “content-encoding” with the value “gzip”. This means your browser is decompressing the http response’s message body using gzip decompression at the same time it is receiving the response so that it can render the webpage. The browser doesn’t need to get the whole message body before it can start decompressing. The gzip file format specification is described in rfc 1952 and the compressed blocks specification (deflate) is described in rfc 1951.
To decompress, let’s simply think that we have read the .gz file into memory so that we have a buffer of “bytes” and we have a position marker that indicates which “byte” we are currently at (starting with 0). Essentially what we want to do is iterate through this buffer of bytes and interpret the bytes according to the gzip/deflate specification.
Note: All multi-byte numbers are written with the least significant byte first in the gzipped file. So if the spec talks about a 2 byte field the field’s number value is = byte0 + 2^8 * byte1. Similarly, a 4 byte field’s number value is = byte0 + 2^8 * byte1 + 2^16 * byte2 + 2^24 * byte3
Member header and trailer
Gzip file can contain multiple “members” (typically if you gzip a file it will contain only one member) and every member has the following format. It starts with some header fields which are described by this table (size in bytes in the parenthesis):
|ID1 (1)||ID2 (1)||CM (1)||FLG (1)||MTIME (4)||XFL (1)||OS (1)|
Every member will at least have 10 bytes of header.
- ID1 (Identification 1) – must have value 31 (0x1f)
- ID2 (Identification 2) – must have value 139 (0x8b).
These magic numbers specify the file as being gzip file format.
- CM (Compression Method) – This byte must have the value 8 as gzip only supports deflate as its compression method. So if we see that the third byte is not 8, we can error out saying we don’t recognize the compression method.
- FLG (Flag) – 1 byte of FLG (Flag) depending on its bits, we can have some extra header fields.
- MTIME (Modification Time) – We can just skip these 4 bytes
- XFL (Extra Flags) – We can skip this 1 byte
- OS (Operating System) – We can skip this 1 byte
Now FLG byte’s bits are described as below (starting with bit 0 on the left and bit 7 on the right total 8 bits):
If the RES (reserved) bits are set, we need to indicate error to be standard compliant.
If FTEXT is set, that means the file is probably ASCII text.
If FEXTRA bit is set to 1 ((FLG >> 2) & 1 == 1) we will have the following extra header bytes after the OS byte:
|XLEN (2)||XLEN bytes of extra fields|
So, if FEXTRA is set we will read the next 2 bytes from the buffer and whatever value we get from these two bytes as a number (byte order from lsb to msb – so XLEN = first byte + 256 * second byte) we need to read XLEN number of bytes from the buffer which we can simply skip
If the FNAME bit is set, we need to keep reading the next bytes until we reach a zero byte. This subset of bytes denotes the original filename which we can also just skip.
|Zero terminated filename…|
If the FCOMMENT bit is set, we also need to keep reading the next bytes until we reach a zero byte. This subset of bytes can denote some information about the file which we can also just skip.
|Zero terminated file comment…|
If FHCRC is set, we have to read the next 2 bytes as a CRC16 value which is the cyclic redundancy check value for the member header (upto and not including the CRC16 bytes) (which we can use for checking validity of the header bytes) but we will simply also skip them (we don’t need to verify the validity of the member header if CRC16 _is_ present to be standard compliant decompressor, so we will just skip)
|CRC16 (2 bytes)|
So we have now read through a member’s header bytes, after which the “compressed blocks” start
And after the compressed blocks, 8 bytes of member trailer bytes start at the next byte boundary (blocks can end at any bit position of a byte, not necessarily ending at a byte boundary).
|CRC32 (4 bytes)||ISIZE (4 bytes)|
So, after we decompress the blocks we _do_ need to read these 8 bytes just to make sure they are present (because standard compliant .gz file _will_ contain these bytes) but we don’t need to verify the CRC32 of the decompressed blocks or if the size % 2^32 == ISIZE to be standard compliant.
After the header bytes of a member, the compressed blocks start. Except the first compressed block, any other compressed block can actually start at any bit position (0 – 7) of any byte. So from now on we also need to keep a bit marker position along with the current byte position. Whenever the bit position reaches 7, we will increase the byte position by 1 and make the bit position 0 again
Note: all the bit sequences of any data element other than huffman codes are inserted with the least significant bit first. That means if we are at a byte boundary and we have a data element of 2 bits the 0th bit of the “data element” is put into the 0th bit of the byte and the 1st bit of the data element is put into the 1st bit of the byte. Huffman codes are inserted starting with the most significant bit first. That means if a huffman code is 110 (msb being 1 and lsb being 0) and we are at a byte boundary then the 0th bit of the byte will be 1, the 1st bit 1 and the 2nd bit 0.
The compressed blocks specification is described in rfc 1951 which describes the deflate compressed data format.
All compressed blocks start with 3 header bits. The first bit denotes BFINAL which if set (1) means we are at the final compressed block (meaning after the end of this block we have the member trailer bytes). So we need to keep decompressing blocks until we reach a block which has the BFINAL bit set. The next 2 bits represent what type of compressed block we are dealing with. (we need to remember the note: data elements other than huffman codes are inserted with least significant bit first). So the 2 BTYPE bits have the following meaning:
00 – no compression
01 – compressed using fixed huffman codes
10 – compressed using dynamic huffman codes
11 – error type
So we read the next 2 bits (from the current bit position of the current byte) and according to the above description, if the first bit is 0 and second bit is 0, we need to decompress block type 00, if the first bit is 1 and the second bit is 0, we need to decompress block type 01 and if the first bit is 0 and the second bit is 1, we need to decompress block type 10 and if both are 1 we need to indicate error.
Decompressing block type 00
If we encounter block type 00, we move to the next byte boundary skipping the unread bits from the previous bit position of the previous byte if any. Block type 00 looks like this:
|LEN (2 bytes)||NLEN (2 bytes)||LEN bytes of literal data|
Where the next 2 bytes determine the number of literal data bytes in this block. NLEN is the one’s complement of LEN. so LEN == (unsigned) ~NLEN should be true. If this equality doesn’t hold we can indicate error. After these 4 bytes in the buffer, we simply read the next LEN bytes from the buffer and can just write them to the output file because they are simply literal data bytes. This is the end of decompressing block type 00.
Before we start on the other 2 block types, we need to understand how deflate compresses data. Compressed data blocks use a combination of LZ77 and huffman coding. So if you have some input data bytes you want to compress, first you run the data bytes through LZ77 which then outputs literal bytes + length distance pairs. The length distance pair means that at that point length number of bytes should be copied starting at distance bytes before. So essentially you find matches of subset of bytes in the previous bytes (you can use any string matching algorithm for that) and instead of literal bytes, use length distance pairs to replace those matching bytes. Deflate limits length between 3 and 258 and distance from 1 to 32768. Then you create huffman codes for this alphabet (literal, length, distance) and write the huffman codes (bit sequences for the literal, length and distance data) as the compressed blocks.
Huffman codes are generated based on the frequency of “symbols” from the alphabet we are working with. The symbols with more frequency get a smaller bit length sequence and the characters with less frequency get a longer bit length sequence. For example, if you have the string “ABBAAC” and you want huffman codes for A, B, C first you count the frequency of the alphabet symbols
A = 3
B = 2
C = 1
Then you take the 2 lowest frequencies C(1) and B(2) and add a node on top of them with 0(left) and 1(right) as the path.
Then you insert CB (1 + 2 = 3) into the frequency table instead of the old two
A = 3
CB = 3
And you again take the 2 lowest frequencies A(3) and CB(3) and add a node on top of them. You keep doing this until all the symbols have been added to the tree. So you finally get
so huffman codes are
A = 0
B = 11
C = 10
Note that huffman codes for deflate must not exceed bit length of 15.
Now deflate imposes 2 restrictions when generating huffman codes:
- All codes of a given bit length have lexicographically consecutive values, in the same order as the symbols they represent which means in the above example B should have huffman code 10 and C should have 11
- Shorter codes lexicographically precede longer codes which means it shouldn’t happen in the above example that A has huffman code 1 and B has 01 and C has 00
Because of these restrictions, you can generate only one huffman tree from given alphabet frequencies and you can simply generate the huffman codes given the bit lengths of the symbols in the alphabet. Otherwise, you would need to provide the huffman codes themselves along with all the symbols which would take up more bits.
Given the huffman code lengths for the symbols, you can generate the huffman codes following this algorithm from rfc 1951 section 3.2.2
This generates the next code for a bit length (index in next_code) and you can just keep adding 1 when generating all the huffman codes for the symbols that have the same bit length for their huffman codes.
Why this works?: This algorithm works because of the 2 restrictions imposed on the huffman tree which makes the binary tree (assuming the left path of any node is 0 and the right path is 1) such that any leaf node that is on the “right”, must be at a height greater than or equal to the height of any leaf node on its “left” (because of the 2nd restriction). So, if we want to add the first leaf node at the next greater height, we will go to the next non-leaf node on the right and add a path 0 on that node and then add the leaf node there, which is where the << 1 comes from. Left shifting by 1 adds a ‘0’ bit at the end. And if there are multiple leaf nodes on the same height or level, that means they will be sequentially positioned (because leaf nodes with greater heights will be on the right if any) and to get the code of a node we are currently at, we can simply add 1 to the leaf node’s code on its immediate left, which is where the (code + bl_count[bits – 1]) is coming from. (code + bl_count[bits – 1]) is adding the total number of leaf nodes at a level to the code for the first leaf node at that level which is the code for the next non-leaf node on that level.
Now we know how to generate huffman codes for symbols if we are given the bit lengths of the huffman codes for those symbols
So Deflate compressed blocks contain 3 distinct alphabets which are literal bytes, length and distance and uses “codes” to determine if we are encountering a literal byte or length or distance. The codes for literal bytes are 0 – 255. The length codes follow the below table from rfc 1951:
This means if we encounter a code of 266 from the huffman tree when reading the bits of a compressed block the length value is either 13 or 14 depending on the next bit (1 extra bit for length code 266). If the next bit is 1 length is 14 otherwise 13. Same goes for all the codes given in this table.
Similarly, we have a table for distance codes:
After the length code and its extra bits, we will have to read the distance code (how the distance code is read depends on the block type 01 or 10). If the distance code is 8 we need to read the next 3 bits and calculate the actual distance value is 17 + value of next 3 bits. So the value for distance code 8 can range from 17 – 24.
Note that now we know the concept that there can be length distance pairs in the compressed blocks that means we need to keep a data structure of back references of the decompressed bytes of at least size 32768 (maximum distance) from the beginning when we start decompressing any block. And we need to keep updating this back reference as we keep decompressing bytes. So when we encounter a length distance pair we will simply copy length bytes from distance from the back references data structure to our output buffer for later writing to output file.
Decompressing block type 01
Block type 01 uses fixed huffman codes which are given in rfc 1951. The length of the literal and length codes are as follows:
We can simply use the bit lengths here to get the huffman codes for codes 0 to 287 according to the algorithm in section 3.2.2 (also given above). (although codes 286 and 287 will never appear in the compressed blocks we still need to form the huffman tree with them otherwise we will get wrong huffman codes for the literal length codes. Something to keep in mind). After we generate the huffman codes for these literal, length codes, we need to form the huffman tree from the huffman bit sequences of these huffman codes which is just a binary tree with paths either 0 or 1 and the leaf nodes of the tree denote a “code”. The tree must be formed with the huffman codes’ most significant bit first as stated earlier.
So after the header bits, we keep reading the bits from the input buffer and we keep walking the huffman tree starting from the root node. If we are at a node that is a leaf node, we see if it is a literal code or a length code. If it’s a literal code (0 – 255) we can simply copy this byte in our output buffer. If it’s a length code we need to read any extra bits if necessary and calculate the length value according to the length/extra-bits table. After the length code, we need to read 5 bits for the distance code. And after the 5 bits we need to read extra bits if any according to the distance/extra-bits table and calculate the distance value. For this length distance pair, we need to copy length bytes from distance from the back references data structure that we are maintaining. If we encounter the code 256 which is the block end marker, it means we have decompressed a block of type 01. So we need to keep reading bits and keep walking the huffman tree from root node until we reach the code 256 i.e., until the block ends.
Decompressing block type 10
Block types 10 use dynamic huffman codes as opposed to block type 01’s fixed huffman codes. These codes’ lengths (remember we know how to generate huffman codes from code lengths?) are specified after the header bits. Now these code lengths are not directly specified; rather these “code length alphabet” are themselves huffman coded and the lengths of the huffman codes of the “code length alphabet” are provided and then the literal length or distance codes’ lengths are given using these code length codes. The alphabet for the code lengths is as follows:
0 – 15 (remember huffman code lengths must not exceed 15 for deflate?): represents code lengths from 0 to 15
16: repeat the previous code length 3 – 6 times. The next 2 bits determine how many times to repeat the previous code (00 = 3 (0 + 3), …, 11 = 6 (3 + 3))
17: repeat a code length of 0 3 – 10 times. The next 3 bits determine how many times to repeat 0. (000 = 3(0 + 3), …, 111 = 10(7 + 3))
18: repeat a code length of 0 11 – 138 times. The next 7 bits determine how many times to repeat 0. (000 0000 = 11 (0 + 11), …, 111 1111 = 138(127 + 11))
Now the format of the compressed block type 10 after the 3 header bits is as follows:
|HLIT (5 bits)||HDIST (5 bits)||HCLEN (4 bits)|
HLIT = number of literal length codes – 257 (257 – 286)
HDIST = number of distance codes – 1 (1 – 32) (distance codes 30 – 31 will never appear in compressed blocks but will participate in building the distance codes huffman tree if present)
HCLEN = number of code lengths – 4 (4 – 19)
After the HLIT, HDIST and HCLEN fields, the block contains the lengths for the code length codes 3 bits each and they are in this order: 16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15
This means, after the HCLEN field, there are (HCLEN + 4) numbers of group of 3 bits that denote the lengths of the code length codes according to the above serial. So, the next 3 bits are the length of the code length code 16, the next 3 bits for 17…
So we need to read the next (HCLEN + 4) * 3 bits and generate the huffman codes for the given code lengths and form the huffman tree of these code length codes.
Then, there should be HLIT + 257 code lengths for the literal/length alphabet (0 – 285) given using the code length huffman codes. That means we need to keep reading bits and keep walking the code length codes huffman tree and if we hit a leaf node we need to remember the length for the “corresponding” (starting from 0 to HLIT + 256) literal/length alphabet symbol. If the leaf node has value 0 – 15 that means this is the length of that corresponding literal/length code, if its 16 we need to repeat the previous length for 3 – 6 times according to the value of the next 2 bits and if its 17 / 18 we need to repeat 0 as code lengths for the next literal/length symbols for “the corresponding extra bits value” number of times. So the first leaf node that we hit will indicate the length for literal code 0, the leaf node that we hit for 257th iteration will indicate the length for code 256 (block end marker) and similarly, the leaf node that we hit for 286th iteration (if HLIT + 257 == 286) will indicate the length for length code 285 which denotes the length value 258 (remember the length/extra-bits table?). Of course, there can be less than 286 code lengths (which will mean the latter length codes are not in use in this compressed block). After we get all the HLIT + 257 code lengths for the literal/length alphabet we need to generate the huffman codes from these lengths again using the rfc 1951 section 3.2.2 algorithm (also given above) and form the literal/length code huffman tree.
After the literal/length code lengths, there should be HDIST + 1 numbers of distance code lengths specified, again, using the code length codes. So we need to keep walking the code length code huffman tree and get the distance codes’ lengths the same way as we did for literal/length codes and generate the huffman tree for distance codes.
After these bits, the actual compressed data begins. So we need to keep reading bits and this time we need to keep walking the literal/length code tree. If we get a literal code (0 – 255) we just copy it into our output buffer. If we get a length code then we need to read extra bits according to the length/extra-bits table mentioned above and get the actual length value. After getting the length value, we then need to read the next bits and this time keep walking the distance code tree to determine the distance code (because after length always comes the distance) and when we get the distance code, we need to read the extra bits according to the distance/extra-bits table above to get the distance value. If we get proper length and distance values, we need to copy length bytes from distance from our back references data structure that we have been maintaining. The compressed block ends when we encounter the code 256 in the literal/length huffman tree. Thus block type 10 is decompressed.
So, to summarize, we need to read member header, keep decompressing the 3 block types till we reach a block that has BFINAL header bit set, write decompressed bytes to the output file, read the member trailer and keep doing this if there are multiple members present.
Rfc 1952 and 1951 must be read thoroughly if you are planning on implementing gzip as the rfcs contain all the quirks, which need to be kept in mind when developing. This post contains the things that I found harder to fully understand from the rfcs but this post is definitely not the full specification which the rfcs _are_. You can look at my gzip decompression implementation alongside for reference: https://github.com/dorjoy03/ungzip