# BZip3 Format Specification
Version 1
## Headers
The File and Frame formats share a similar structure, differing only in whether they include a
block count field.
### File Header
```
+----------------+------------------+--------------------+
| Header | Chunk 1 | Chunk 2 |
| (9 bytes) | (variable size) | (variable size) |
+----------------+------------------+--------------------+
```
This format is produced by the `bzip3` command-line tool.
### Frame Header
```
+----------------+------------------+--------------------+
| Header | Chunk 1 | Chunk 2 |
| (13 bytes) | (variable size) | (variable size) |
+----------------+------------------+--------------------+
```
This format is written by `bz3_compress` and read by `bz3_decompress`.
### Header Structure
| Field | Type | Description | File Header | Frame Header |
| -------------- | ------ | ------------------------------- | ----------- | ------------ |
| Signature | u8[5] | Fixed "BZ3v1" ASCII string | ✓ | ✓ |
| Max Block Size | u32_le | Maximum decompressed block size | ✓ | ✓ |
| Block Count | u32_le | Number of blocks in the stream | ✗ | ✓ |
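As an illustration of the table above, the following is a minimal sketch of serializing either header variant. The helper names `put_u32_le` and `write_bz3_header` are hypothetical and are not part of the BZip3 API; the sketch assumes the caller has already validated the field values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value as four little-endian bytes. */
static void put_u32_le(uint8_t *out, uint32_t v) {
    out[0] = (uint8_t)(v & 0xFF);
    out[1] = (uint8_t)((v >> 8) & 0xFF);
    out[2] = (uint8_t)((v >> 16) & 0xFF);
    out[3] = (uint8_t)((v >> 24) & 0xFF);
}

/* Hypothetical helper: serialize a header into `out` (at least 13 bytes).
 * Returns the number of bytes written: 9 for File, 13 for Frame. */
static size_t write_bz3_header(uint8_t *out, uint32_t max_block_size,
                               int is_frame_format, uint32_t block_count) {
    memcpy(out, "BZ3v1", 5);              /* Signature, no terminator */
    put_u32_le(out + 5, max_block_size);  /* Max Block Size */
    if (!is_frame_format)
        return 9;
    put_u32_le(out + 9, block_count);     /* Block Count, Frame only */
    return 13;
}
```

For example, a 1 MiB maximum block size (`0x00100000`) is stored as the bytes `00 00 10 00` immediately after the signature.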
### Validation Rules
1. **Signature**: Must exactly match "BZ3v1"
2. **Max Block Size**:
- Minimum: 65KiB (66,560 bytes)
- Maximum: 511MiB (535,822,336 bytes)
3. **Block Count** (Frame Format only):
- Must match the actual number of blocks in the stream
- Should be greater than 0
### Example Parser
```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t max_block_size;
    uint32_t block_count; // Frame Format only
} bzip3_header_t;

bool read_bzip3_header(FILE* fp, bzip3_header_t* header, bool is_frame_format) {
    char signature[5];

    // Read signature (5 bytes, not NUL-terminated in the stream)
    if (fread(signature, 1, 5, fp) != 5)
        return false;
    if (memcmp(signature, "BZ3v1", 5) != 0)
        return false;

    // Read max block size
    uint8_t size_bytes[4];
    if (fread(size_bytes, 1, 4, fp) != 4)
        return false;
    header->max_block_size = (uint32_t)read_neutral_s32(size_bytes);

    // Valid range: 65 KiB (66,560 bytes) to 511 MiB (535,822,336 bytes)
    if (header->max_block_size < 66560 ||
        header->max_block_size > 535822336)
        return false;

    // Read block count if Frame Format
    if (is_frame_format) {
        uint8_t count_bytes[4];
        if (fread(count_bytes, 1, 4, fp) != 4)
            return false;
        header->block_count = (uint32_t)read_neutral_s32(count_bytes);
        if (header->block_count == 0)
            return false;
    }
    return true;
}
```
All integers in BZip3 are stored unaligned, in little-endian byte order.
A portable implementation is below.
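```c
#include <stdint.h>

// Short integer aliases used by these helpers
typedef uint8_t u8;
typedef uint32_t u32;
typedef int32_t s32;

// Reading a 32-bit integer
static s32 read_neutral_s32(const u8 * data) {
    return (s32)(((u32)data[0]) |
                 (((u32)data[1]) << 8) |
                 (((u32)data[2]) << 16) |
                 (((u32)data[3]) << 24));
}

// Writing a 32-bit integer
static void write_neutral_s32(u8 * data, s32 value) {
    // Shift as unsigned: right-shifting a negative signed value is
    // implementation-defined in C.
    u32 v = (u32)value;
    data[0] = v & 0xFF;
    data[1] = (v >> 8) & 0xFF;
    data[2] = (v >> 16) & 0xFF;
    data[3] = (v >> 24) & 0xFF;
}
```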
## Block Format
After the header, both File and Frame formats contain a sequence of blocks that follow the Block
Format specification. Each block is encapsulated in a chunk structure that defines its size.
The blocks (***without chunk header***) can be encoded/decoded using the `bz3_encode_block`
and `bz3_decode_block` APIs.
### Chunk Structure
```c
// Main block structure
struct Chunk {
    u32_le compressedSize; // Size of compressed block
    u32_le origSize;       // Original uncompressed size
    if (origSize < 64) {
        SmallBlock block;
    } else {
        Block block;
    }
};
```
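Because every chunk starts with its own `compressedSize`, a reader can step from chunk to chunk without decoding any block. The sketch below assumes the stream header has already been skipped and that each chunk occupies exactly 8 bytes of chunk header plus `compressedSize` bytes of block data; `get_u32_le` and `count_chunks` are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Read four little-endian bytes as an unsigned 32-bit value. */
static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: walk the chunks in buf[0..len).
 * Returns the number of complete chunks, or -1 if one is truncated. */
static long count_chunks(const uint8_t *buf, size_t len) {
    size_t pos = 0;
    long count = 0;
    while (pos < len) {
        if (len - pos < 8)
            return -1;                    /* truncated chunk header */
        uint32_t compressed_size = get_u32_le(buf + pos);
        /* origSize lives at buf + pos + 4 and selects SmallBlock (< 64)
         * or Block; it is not needed just to skip over the chunk. */
        pos += 8;
        if (compressed_size > len - pos)
            return -1;                    /* truncated block data */
        pos += compressed_size;
        count++;
    }
    return count;
}
```

In the Frame format, the result could be compared against the header's Block Count field as a consistency check.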
### Small Block Format (< 64 bytes)
For blocks smaller than 64 bytes, no compression is attempted. The data is stored with just a checksum:
```c
struct SmallBlock {
    u32_le crc32;   // CRC32 checksum
    u32_le literal; // Always 0xFFFFFFFF for small blocks. This is basically an invalid `bwtIndex`
    u8 data[parent.compressedSize - 8]; // Uncompressed data
};
```
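A small block can be unpacked with a few bounds checks. The following sketch uses hypothetical helper names and assumes that the stored data length (`compressedSize - 8`) equals `origSize`, which follows from the data being stored uncompressed but is not stated explicitly above. Checksum verification is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Read four little-endian bytes as an unsigned 32-bit value. */
static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: locate the stored bytes of a small block.
 * `block` points just past the chunk header. On success, stores the
 * checksum in *crc_out and returns a pointer to the data; on any
 * inconsistency, returns NULL. */
static const uint8_t *small_block_data(const uint8_t *block,
                                       uint32_t compressed_size,
                                       uint32_t orig_size,
                                       uint32_t *crc_out) {
    if (orig_size >= 64 || compressed_size < 8)
        return NULL;                  /* not a small block */
    if (get_u32_le(block + 4) != 0xFFFFFFFFu)
        return NULL;                  /* marker field must be 0xFFFFFFFF */
    if (compressed_size - 8 != orig_size)
        return NULL;                  /* assumption: data is stored 1:1 */
    *crc_out = get_u32_le(block);     /* caller verifies the checksum */
    return block + 8;
}
```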
### Regular Block Format (≥ 64 bytes)
Larger blocks use a more complex format that supports multiple compression features:
```c
struct Block {
    u32_le crc32;    // CRC32 checksum of uncompressed data
    u32_le bwtIndex; // Burrows-Wheeler transform index
    u8 model;        // Compression model flags
    if ((model & 0x02) != 0)
        u32_le lzpSize; // Size after LZP compression
    if ((model & 0x04) != 0)
        u32_le rleSize; // Size after RLE compression
    u8 data[parent.compressedSize - (popcnt(model) * 4 + 9)];
};
```
#### Compression Model
The `model` byte in regular blocks indicates which compression features were used:
- `0x02`: LZP (Lempel Ziv Prediction) filter
- `0x04`: RLE (Run-Length Encoding) filter
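To show how the `model` byte determines the layout of a regular block, here is a minimal sketch that reads the fixed fields, then the optional sizes in the order given by the struct. The type `block_info_t`, the macros, and the function names are hypothetical. The sketch also assumes that bits other than `0x02` and `0x04` are invalid, which keeps the computed header size equal to `popcnt(model) * 4 + 9`.

```c
#include <stddef.h>
#include <stdint.h>

#define BZ3_MODEL_LZP 0x02u  /* hypothetical names for the two flags */
#define BZ3_MODEL_RLE 0x04u

typedef struct {
    uint32_t crc32;
    uint32_t bwt_index;
    uint8_t  model;
    uint32_t lzp_size;     /* 0 unless the LZP flag is set */
    uint32_t rle_size;     /* 0 unless the RLE flag is set */
    size_t   header_size;  /* 9, 13, or 17 bytes */
} block_info_t;

static uint32_t get_u32_le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 on success, 0 if the block is too short or the model byte
 * has bits set that this sketch does not recognise. */
static int parse_block_header(const uint8_t *block, uint32_t compressed_size,
                              block_info_t *out) {
    if (compressed_size < 9)
        return 0;
    out->crc32 = get_u32_le(block);
    out->bwt_index = get_u32_le(block + 4);
    out->model = block[8];
    out->lzp_size = 0;
    out->rle_size = 0;
    if (out->model & ~(BZ3_MODEL_LZP | BZ3_MODEL_RLE))
        return 0;                     /* assumption: unknown bits are invalid */
    size_t pos = 9;
    if (out->model & BZ3_MODEL_LZP) {
        if (compressed_size - pos < 4)
            return 0;
        out->lzp_size = get_u32_le(block + pos);
        pos += 4;
    }
    if (out->model & BZ3_MODEL_RLE) {
        if (compressed_size - pos < 4)
            return 0;
        out->rle_size = get_u32_le(block + pos);
        pos += 4;
    }
    out->header_size = pos;           /* data length = compressed_size - pos */
    return 1;
}
```

With both flags set (`model == 0x06`), the header occupies 17 bytes, and the remaining `compressedSize - 17` bytes are the entropy-coded data.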
## External Resources
- [BZip3 Pattern for ImHex](https://github.com/WerWolv/ImHex-Patterns/pull/329)