## Bitmask Compression Example ##

Bitmask compression stores sparse tensors efficiently on disk.

Instead of storing every zero element as an actual number, we use a bitmask to record which tensor entries are zero and store only the nonzero values. This approach is useful when a tensor consists mostly of zeros, because no space is spent storing those zeros explicitly.
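
To make the idea concrete, here is a minimal sketch in plain Python (no `torch`). It is not the `compressed-tensors` implementation, and the helper names `bitmask_compress` and `bitmask_decompress` are hypothetical. The mask keeps one boolean per entry, and only the nonzero values are stored.

```python
# Illustrative sketch only; the helper names are hypothetical, not library API.

def bitmask_compress(values):
    """Split a flat list into (mask, nonzero values)."""
    mask = [v != 0 for v in values]            # True where a value is kept
    nonzeros = [v for v in values if v != 0]   # zeros are dropped entirely
    return mask, nonzeros


def bitmask_decompress(mask, nonzeros):
    """Rebuild the dense list by re-inserting zeros where the mask is False."""
    remaining = iter(nonzeros)
    return [next(remaining) if keep else 0.0 for keep in mask]


dense = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 3.25]
mask, nonzeros = bitmask_compress(dense)

assert nonzeros == [1.5, -2.0, 3.25]
assert bitmask_decompress(mask, nonzeros) == dense
```

If the values are 32-bit and the mask is packed at one bit per entry, the dense list needs 8 × 32 = 256 bits, while the compressed form needs 3 × 32 bits of values plus 8 mask bits, 104 bits in total.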

The example below shows how to save and load sparse tensors using bitmask compression. It also demonstrates the benefits of bitmask compression over a "dense" representation and, finally, introduces the enhanced `safetensors` file format for storing sparse weights.

In [1]:
import torch
import os
from safetensors import safe_open
from safetensors.torch import save_model
from compressed_tensors import save_compressed_model, load_compressed, BitmaskConfig
from transformers import AutoModelForCausalLM

In [2]:
# load a tiny, pruned llama2 model
model_name = "neuralmagic/llama2.c-stories110M-pruned50"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 768)
    (layers): ModuleList(
      (0-11): 12 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=768, out_features=768, bias=False)
          (k_proj): Linear(in_features=768, out_features=768, bias=False)
          (v_proj): Linear(in_features=768, out_features=768, bias=False)
          (o_proj): Linear(in_features=768, out_features=768, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=768, out_features=2048, bias=False)
          (up_proj): Linear(in_features=768, out_features=2048, bias=False)
          (down_proj): Linear(in_features=2048, out_features=768, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((768,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((768,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((768,), eps=1e-05)
    (rotary_emb): LlamaRotaryEm

In [3]:
# most of the model's weights are pruned to 50% sparsity (except for a few layers such as lm_head or embeddings)
state_dict = model.state_dict()
state_dict.keys()
example_layer = "model.layers.0.self_attn.q_proj.weight"
print(f"The example layer {example_layer} has sparsity {100 * state_dict[example_layer].eq(0).sum().item() / state_dict[example_layer].numel():.0f}%")

The example layer model.layers.0.self_attn.q_proj.weight has sparsity 50%


In [4]:
# we can inspect the total sparsity of the state_dict
total_num_parameters = 0
total_num_zero_parameters = 0
for key in state_dict:
    total_num_parameters += state_dict[key].numel()
    total_num_zero_parameters += state_dict[key].eq(0).sum().item()
print(f"The model is {total_num_zero_parameters/total_num_parameters*100:.0f}% sparse overall")

The model is 32% sparse overall


In [5]:
# let's save the model on disk using safetensors and compressed-tensors and compare the size on disk

## save the model using safetensors ##
save_model(model, "model.safetensors")
size_on_disk_mb = os.path.getsize('model.safetensors') / 1024 / 1024

## save the model using compressed-tensors ##
save_compressed_model(model, "compressed_model.safetensors", compression_format="sparse-bitmask")
compressed_size_on_disk_mb = os.path.getsize('compressed_model.safetensors') / 1024 / 1024

print(f"Size of the model's weights on disk using safetensors: {size_on_disk_mb:.2f} MB")
print(f"Size of the model's weights on disk using compressed-tensors: {compressed_size_on_disk_mb:.2f} MB")
print("The compression ratio is x{:.2f}".format(size_on_disk_mb / compressed_size_on_disk_mb))

Compressing model: 100%|██████████| 111/111 [00:00<00:00, 313.39it/s]


Size of the model's weights on disk using safetensors: 417.83 MB
Size of the model's weights on disk using compressed-tensors: 366.82 MB
The compression ratio is x1.14


Storing weights with around 30% zero entries takes about 12% less disk space with `compressed-tensors` (366.82 MB versus 417.83 MB). The compression ratio improves further for sparser models.
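
As a rough guide to how the ratio scales with sparsity, the sketch below computes an idealized ratio. It rests on a simplifying assumption that is not taken from the library: each nonzero costs `bits_per_value` bits, each entry costs one mask bit, and all other overhead (row offsets, shape tensors, the file header, and any tensors stored without compression) is ignored. The function name `ideal_compression_ratio` is hypothetical. Real files, including the one measured above, compress less than this estimate.

```python
def ideal_compression_ratio(sparsity, bits_per_value=32):
    """Idealized dense/compressed size ratio for a given fraction of zeros.

    Assumes one mask bit per entry plus bits_per_value bits per nonzero,
    and ignores every other source of overhead.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    dense_bits = bits_per_value
    compressed_bits = (1.0 - sparsity) * bits_per_value + 1.0
    return dense_bits / compressed_bits


for sparsity in (0.0, 0.5, 0.9):
    print(f"sparsity {sparsity:.0%}: x{ideal_compression_ratio(sparsity):.2f}")
```

Under these assumptions, a fully dense tensor gets slightly larger, since the mask is pure overhead, while 50% sparsity with 32-bit values gives 32/17 ≈ 1.88.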

We can load the `state_dict` back from both the compressed and the uncompressed representation on disk and confirm that they represent the same tensors in memory.

In [6]:
# load the safetensor and the compressed-tensor and show that they have the same representation

## load the uncompressed safetensors to memory ##
state_dict_1 = {}
with safe_open('model.safetensors', framework="pt") as f:
    for key in f.keys():
        state_dict_1[key] = f.get_tensor(key)

## load the compressed-tensors to memory ##
config = BitmaskConfig() # we need to specify the method for decompression
state_dict_2 = dict(load_compressed("compressed_model.safetensors", config)) # load_compressed returns a generator, we convert it to a dict

tensors_equal = all(torch.equal(state_dict_1[key], state_dict_2[key]) for key in state_dict_1)

print(f"Once loaded, the state_dicts from safetensors and compressed-tensors are equal: {tensors_equal}")

Once loaded, the state_dicts from safetensors and compressed-tensors are equal: True


### SafeTensors File Format

The bitmask compression introduced above is efficient because the information about the compression is embedded in the header of the `.safetensors` file.
For each parameter in the uncompressed `state_dict`, we store the following attributes needed for decompression in the compressed `state_dict`:

* Compressed tensor
* Bitmask
* Uncompressed shape
* Row offsets

```bash
# Dense
{
    PARAM_NAME: uncompressed_tensor
}

# Compressed
{
    PARAM_NAME.compressed: compressed_tensor,  # 1d tensor
    PARAM_NAME.bitmask: value,  # 2d bitmask tensor (nrows x (ncols / 8))
    PARAM_NAME.shape: value,  # Uncompressed shape tensor
    PARAM_NAME.row_offsets: value  # 1d offsets tensor
}
```
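
The layout above can be sketched for a small 2D example in plain Python. This is an illustration under stated assumptions, not the library's implementation: the helper names are hypothetical, mask bits are packed least-significant bit first, and each row offset is taken to be the index in the compressed values at which that row's nonzeros begin.

```python
# Illustrative sketch; names, bit order, and offset convention are assumptions.

def pack_row(bits):
    """Pack a list of booleans into byte values, least-significant bit first."""
    packed = []
    for start in range(0, len(bits), 8):
        byte = 0
        for i, bit in enumerate(bits[start:start + 8]):
            if bit:
                byte |= 1 << i
        packed.append(byte)
    return packed


def compress(matrix):
    """Return a dict holding the four entries stored per parameter."""
    compressed, bitmask, row_offsets = [], [], []
    for row in matrix:
        row_offsets.append(len(compressed))          # where this row's values start
        bitmask.append(pack_row([v != 0 for v in row]))
        compressed.extend(v for v in row if v != 0)  # keep nonzeros only
    shape = [len(matrix), len(matrix[0]) if matrix else 0]
    return {"compressed": compressed, "bitmask": bitmask,
            "shape": shape, "row_offsets": row_offsets}


def decompress_row(entry, r):
    """Rebuild one dense row using only its mask bytes and its offset."""
    ncols = entry["shape"][1]
    pos = entry["row_offsets"][r]
    row = []
    for c in range(ncols):
        if (entry["bitmask"][r][c // 8] >> (c % 8)) & 1:
            row.append(entry["compressed"][pos])
            pos += 1
        else:
            row.append(0.0)
    return row


weights = [
    [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0],
    [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0],
]
entry = compress(weights)
assert [decompress_row(entry, r) for r in range(len(weights))] == weights
```

In this sketch each 10-column row needs 2 mask bytes (the column count rounded up to a multiple of 8), and `row_offsets` lets a single row be decoded without scanning the rows before it.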