This document presents step-by-step instructions for pruning Hugging Face models with Intel® Neural Compressor.
PyTorch 1.8 or higher is required, with the pytorch_fx backend.
```shell
pip install -r examples/pytorch/nlp/huggingface_models/text-classification/pruning/eager/requirements.txt
```
The dataset is downloaded automatically from the Hugging Face datasets Hub; see the Hugging Face documentation for more on loading datasets.
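For example, the MRPC subset of GLUE can be loaded with the `datasets` library (a minimal sketch; the example scripts handle this automatically):

```python
from datasets import load_dataset

# Download (and cache) the GLUE/MRPC dataset from the Hugging Face Hub.
raw_datasets = load_dataset("glue", "mrpc")

# Each example holds two sentences and a paraphrase label.
print(raw_datasets["train"][0])
```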
Several pruning examples are provided; they are trained on different datasets/tasks and use different sparsity patterns. We are working on sharing our sparse models on Hugging Face.
Pruning scripts are provided for MRPC and SST-2 sparse models (Bert-mini, Distilbert-base-uncased, etc.). Sparse models with different patterns ("4x1", "2:4", "1xchannel", etc.) can be obtained by modifying the "target_sparsity" and "pruning_pattern" parameters in the pruning scripts, as sketched below.
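For reference, those two knobs correspond to fields of Intel Neural Compressor's `WeightPruningConfig` (a minimal sketch; the example scripts expose them as command-line flags, and the chosen values here are illustrative):

```python
from neural_compressor.training import WeightPruningConfig

# "4x1" prunes 4x1 weight blocks; "2:4" keeps 2 of every 4 weights;
# "1xchannel" removes whole output channels.
config = WeightPruningConfig(target_sparsity=0.9, pattern="4x1")
```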
A dense model can also be fine-tuned on GLUE datasets by setting --do_prune to False (e.g., Bert-mini on MRPC).
To train a sparse model with mixed patterns (see the mixed-patterns example), a local pruning config can be set as follows:
```python
pruning_configs = [
    {
        "op_names": [".*output", ".*intermediate"],  # regular expressions matching the layer names this pruner should cover
        "pattern": "1x1",
        "pruning_scope": "local",  # the score map is computed from the corresponding layer's weight only
        "pruning_type": "snip_momentum",
        "sparsity_decay_type": "exp",
        "pruning_op_types": ["Linear"]
    },
    {
        "op_names": [".*query", ".*key", ".*value"],
        "pattern": "4x1",
        "pruning_scope": "global",  # the score map is computed over all covered parameters
        "pruning_type": "snip_momentum",
        "sparsity_decay_type": "exp",
        "max_sparsity_ratio_per_op": 0.98,  # maximum sparsity that can be reached per layer (iterative pruning)
        "min_sparsity_ratio_per_op": 0.5,   # minimum sparsity that must be reached per layer (iterative pruning)
        "pruning_op_types": ["Linear"]
    }
]
```
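Below is a minimal sketch of how such a config list can be driven through the `neural_compressor.training` API. The `prajjwal1/bert-mini` checkpoint and the random-batch loop are stand-ins for the real GLUE training loop in the example scripts:

```python
import torch
from transformers import AutoModelForSequenceClassification
from neural_compressor.training import WeightPruningConfig, prepare_compression

# A BERT-style model, so the op_names regexes above (query/key/value, etc.) match.
model = AutoModelForSequenceClassification.from_pretrained("prajjwal1/bert-mini", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Global settings (target_sparsity, schedule) apply wherever a local config
# in pruning_configs (defined above) does not override them.
config = WeightPruningConfig(pruning_configs, target_sparsity=0.9, start_step=0, end_step=100)

compression_manager = prepare_compression(model, config)
compression_manager.callbacks.on_train_begin()
for step in range(101):  # stand-in loop; use the real GLUE dataloader in practice
    compression_manager.callbacks.on_step_begin(step)
    batch = {"input_ids": torch.randint(0, model.config.vocab_size, (8, 32)),
             "labels": torch.randint(0, 2, (8,))}
    loss = model(**batch).loss
    loss.backward()
    compression_manager.callbacks.on_before_optimizer_step()
    optimizer.step()
    compression_manager.callbacks.on_after_optimizer_step()
    optimizer.zero_grad()
    compression_manager.callbacks.on_step_end()
compression_manager.callbacks.on_train_end()
```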
Please note that when a keyword appears in both the global and local settings, the local setting takes priority.
The snip-momentum pruning method is used by default, and pruning starts from a fine-tuned dense model.
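Roughly, snip_momentum scores each weight by its SNIP saliency and smooths the score with momentum across training steps; the lowest-scoring weights (or blocks) are pruned as the sparsity schedule progresses. A sketch of the per-weight score, where β is a momentum coefficient (the exact accumulation is implementation-specific):

```latex
s_t(w) = \beta \, s_{t-1}(w) + \left| w_t \cdot \frac{\partial \mathcal{L}}{\partial w_t} \right|
```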
| Model | Dataset | Sparsity pattern | Sparsity ratio (element-wise, over matmul/Gemm/conv) | Dense accuracy (mean/max) | Sparse accuracy (mean/max) | Relative drop |
|---|---|---|---|---|---|---|
| Bert-Mini | MRPC | 4x1 | 0.8804 | 0.8619/0.8752 | 0.8610/0.8722 | -0.34% |
| Bert-Mini | MRPC | 2:4 | 0.4795 | 0.8619/0.8752 | 0.8666/0.8689 | -0.72% |
| Bert-Mini | MRPC | per channel | 0.66 | 0.8619/0.8752 | 0.8629/0.8680 | -0.83% |
| Distilbert-base-uncased | MRPC | 4x1 | 0.8992 | 0.9026 | 0.8985 | -0.46% |
| Distilbert-base-uncased | MRPC | 2:4 | 0.5000 | 0.9026 | 0.9088 | +0.69% |
| Model | Dataset | Sparsity pattern | Sparsity ratio (element-wise, over matmul/Gemm/conv) | Dense accuracy (mean/max) | Sparse accuracy (mean/max) | Relative drop |
|---|---|---|---|---|---|---|
| Bert-Mini | SST-2 | 4x1 | 0.8815 | 0.8660/0.8761 | 0.8651/0.8692 | -0.79% |
| Bert-Mini | SST-2 | 2:4 | 0.4795 | 0.8660/0.8761 | 0.8731/0.8773 | +0.14% |
| Bert-Mini | SST-2 | per channel | 0.53 | 0.8660/0.8761 | 0.8651/0.8692 | -0.79% |