Presets
Presets include some pre-defined loss functions and strategies.
textbrewer.presets.ADAPTOR_KEYS (List)
Keys in the dict returned by the adaptor:
- 'logits', 'logits_mask', 'losses', 'inputs_mask', 'labels', 'hidden', 'attention'
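For reference, an adaptor returns a dict that uses a subset of these keys. A minimal sketch (the batch fields and model output attributes below are assumptions; adapt them to your own model and data loader):
def simple_adaptor(batch, model_outputs):
    # Hypothetical batch fields and model output attributes; only the returned keys matter.
    return {
        'logits': model_outputs.logits,           # (batch_size, num_labels)
        'hidden': model_outputs.hidden_states,    # list/tuple of (batch_size, len, hidden_size)
        'attention': model_outputs.attentions,    # list/tuple of (batch_size, num_heads, len, len)
        'inputs_mask': batch['attention_mask'],   # 1 for real tokens, 0 for padding
        'labels': batch['labels'],
    }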
textbrewer.presets.KD_LOSS_MAP (Dict)
Available kd_loss types:
- 'mse' : mean squared error
- 'ce' : cross-entropy loss
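As an illustration of what these two types compute, here is a simplified sketch on the student and teacher logits (not the library's exact implementation; the real losses also handle temperature scaling details and the optional logits_mask):
import torch.nn.functional as F

def kd_mse_loss_sketch(logits_S, logits_T, temperature=1.0):
    # 'mse': mean squared error between (temperature-scaled) logits
    return F.mse_loss(logits_S / temperature, logits_T / temperature)

def kd_ce_loss_sketch(logits_S, logits_T, temperature=1.0):
    # 'ce': cross-entropy between the teacher's soft distribution and the student's
    p_T = F.softmax(logits_T / temperature, dim=-1)
    logp_S = F.log_softmax(logits_S / temperature, dim=-1)
    return -(p_T * logp_S).sum(dim=-1).mean()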
textbrewer.presets.PROJ_MAP (Dict)
Projection layers used to match different dimensions of intermediate features:
- 'linear' : linear layer, no activation
- 'relu' : ReLU activation
- 'tanh' : Tanh activation
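These names are referenced through the 'proj' field of an intermediate match when the student and teacher feature sizes differ, for example (the layer indices and sizes here are only illustrative):
intermediate_matches = [
    # project the student's 384-dim hidden states to the teacher's 768 dims with a linear layer
    {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse',
     'weight': 1, 'proj': ['linear', 384, 768]},
    ...]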
textbrewer.presets.MATCH_LOSS_MAP (Dict)
Intermediate feature matching loss functions:
- Includes 'attention_mse_sum', 'attention_mse', 'attention_ce_mean', 'attention_ce', 'hidden_mse', 'cos', 'pkd', 'fsp', 'nst'. See the Intermediate Losses section below for details.
textbrewer.presets.WEIGHT_SCHEDULER (Dict)
Schedulers used to dynamically adjust the kd_loss weight and the hard_label_loss weight:
- 'linear_decay' : decays from 1 to 0 during training
- 'linear_growth' : grows from 0 to 1 during training
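For example, to decay the kd_loss weight from 1 to 0 over the course of training:
distill_config = DistillationConfig(
    kd_loss_weight_scheduler = 'linear_decay',
    ...)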
textbrewer.presets.TEMPERATURE_SCHEDULER (DynamicDict)
Schedulers used to dynamically adjust the distillation temperature:
- 'constant' : constant temperature (for testing)
- 'flsw' : see Preparing Lessons: Improve Knowledge Distillation with Better Supervision. Requires the parameters beta and gamma.
- 'cwsm' : see Preparing Lessons: Improve Knowledge Distillation with Better Supervision. Requires the parameter beta.
Unlike the other modules, 'flsw' and 'cwsm' take extra parameters, which are provided as a list, for example:
# flsw
distill_config = DistillationConfig(
    temperature_scheduler = ['flsw', 1, 1],
    ...)
# cwsm
distill_config = DistillationConfig(
    temperature_scheduler = ['cwsm', 1],
    ...)
If the pre-defined modules do not satisfy your requirements, you can add your own modules to the dicts above, for example:
MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
WEIGHT_SCHEDULER['my_weight_scheduler'] = my_weight_scheduler
Usage in DistillationConfig:
distill_config = DistillationConfig(
    kd_loss_weight_scheduler = 'my_weight_scheduler',
    intermediate_matches = [{'layer_T':0, 'layer_S':0, 'feature':'hidden', 'loss': 'my_L1_loss', 'weight' : 1}],
    ...)
See the source code for more details (this will be explained in more detail in the next version of the documentation).
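A sketch of what such user-defined modules might look like. The exact signatures the distiller expects for match losses and weight schedulers should be checked against the source; the signatures below are assumptions:
import torch.nn.functional as F

# Assumed signature: a match loss takes student/teacher features and an optional mask.
def my_L1_loss(feature_S, feature_T, mask=None):
    return F.l1_loss(feature_S, feature_T)

# Assumed signature: a weight scheduler maps (current step, total steps) to a loss weight.
def my_weight_scheduler(global_step, t_total):
    return max(0.1, 1 - global_step / t_total)

MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
WEIGHT_SCHEDULER['my_weight_scheduler'] = my_weight_scheduler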
Intermediate Losses

attention_mse
- Takes in two matrices with the shape (batch_size, num_heads, len, len), computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
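A simplified sketch of this computation, assuming the mask has shape (batch_size, len) (not the library's exact code):
import torch.nn.functional as F

# attention_S, attention_T: (batch_size, num_heads, len, len)
def attention_mse_sketch(attention_S, attention_T, mask=None):
    if mask is None:
        return F.mse_loss(attention_S, attention_T)
    valid = mask.unsqueeze(1).unsqueeze(2).float()        # (batch_size, 1, 1, len), broadcasts over heads/queries
    sq_err = (attention_S - attention_T) ** 2 * valid     # zero out padded key positions
    return sq_err.sum() / valid.expand_as(sq_err).sum()   # average over unmasked entries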
attention_mse_sum
- Takes in two matrices with the shape (batch_size, len, len), computes the mse loss between the two matrices; if the shape is (batch_size, num_heads, len, len), sums along the num_heads dimension and then computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
attention_ce
- Takes in two matrices with the shape (batch_size, num_heads, len, len), applies softmax on dim=-1, computes the cross-entropy loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
attention_ce_mean
- Takes in two matrices. If the shape is (batch_size, len, len), computes the cross-entropy loss between the two matrices; if the shape is (batch_size, num_heads, len, len), averages over the num_heads dimension and then computes the cross-entropy loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
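Similarly, a minimal sketch of the cross-entropy variants, treating the teacher's softmaxed attention as the target distribution (masking omitted for brevity; not the exact implementation):
import torch.nn.functional as F

# attention_S, attention_T: (batch_size, num_heads, len, len) attention scores
def attention_ce_sketch(attention_S, attention_T):
    p_T = F.softmax(attention_T, dim=-1)            # teacher distribution over key positions
    logp_S = F.log_softmax(attention_S, dim=-1)     # student log-distribution
    return -(p_T * logp_S).sum(dim=-1).mean()       # cross-entropy per query position, averaged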
hidden_mse
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
cos
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes their cosine similarity loss.
- From DistilBERT.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
pkd
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes the normalized vector mse loss at position 0 along the len dimension.
- From Patient Knowledge Distillation for BERT Model Compression.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
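A minimal sketch of this loss, assuming position 0 along the len dimension is the [CLS] vector (not the exact implementation):
import torch.nn.functional as F

# hidden_S, hidden_T: (batch_size, len, hidden_size)
def pkd_sketch(hidden_S, hidden_T):
    cls_S = F.normalize(hidden_S[:, 0], dim=-1)   # normalized vector at position 0
    cls_T = F.normalize(hidden_T[:, 0], dim=-1)
    return F.mse_loss(cls_S, cls_T)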
nst (mmd)
- Takes in two lists of matrices A and B. Each list contains two matrices with the shape (batch_size, len, hidden_size). The hidden_size of the matrices in A does not need to be the same as that of B. Computes the mse loss between the similarity matrix of the two matrices in A and that of the two matrices in B (both have the shape (batch_size, len, len)).
- See: Like What You Like: Knowledge Distill via Neuron Selectivity Transfer.
- If the inputs_mask is provided, masks the positions where input_mask==0.
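A simplified sketch of the similarity matrices involved (masking and any normalization details omitted; not the exact implementation):
import torch
import torch.nn.functional as F

# A = [A_1, A_2], B = [B_1, B_2]; each matrix: (batch_size, len, hidden_size)
def nst_sketch(A, B):
    sim_A = torch.bmm(A[0], A[1].transpose(1, 2))   # (batch_size, len, len)
    sim_B = torch.bmm(B[0], B[1].transpose(1, 2))   # (batch_size, len, len)
    return F.mse_loss(sim_A, sim_B)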
fsp (gram)
- Takes in two lists of matrices A and B. Each list contains two matrices with the shape (batch_size, len, hidden_size). Computes the similarity matrix between the two matrices in A ((batch_size, hidden_size, hidden_size)) and that of the two matrices in B ((batch_size, hidden_size, hidden_size)), then computes the mse loss between the two similarity matrices.
- See: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions, for example:
intermediate_matches = [
    {'layer_T':[0,0], 'layer_S':[0,0], 'feature':'hidden', 'loss': 'fsp', 'weight' : 1, 'proj':['linear',384,768]},
    ...]
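A corresponding sketch of the fsp (gram) similarity matrices (as with the nst sketch above, masking and normalization details are omitted; not the exact implementation):
import torch
import torch.nn.functional as F

# A = [A_1, A_2], B = [B_1, B_2]; each matrix: (batch_size, len, hidden_size)
def fsp_sketch(A, B):
    gram_A = torch.bmm(A[0].transpose(1, 2), A[1])   # (batch_size, hidden_size, hidden_size)
    gram_B = torch.bmm(B[0].transpose(1, 2), B[1])   # (batch_size, hidden_size, hidden_size)
    return F.mse_loss(gram_A, gram_B)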