Presets
Presets include some pre-defined loss functions and strategies.
textbrewer.presets.ADAPTOR_KEYS (List)
Keys in the dict returned by the adaptor:
- 'logits', 'logits_mask', 'losses', 'inputs_mask', 'labels', 'hidden', 'attention'
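For reference, an adaptor returns a dict that uses a subset of these keys. A minimal sketch (the batch fields and model output attributes below are assumptions; adapt them to your own model and data loader):
def simple_adaptor(batch, model_outputs):
    # Hypothetical batch fields and model output attributes; only the returned keys matter.
    return {
        'logits': model_outputs.logits,           # (batch_size, num_labels)
        'hidden': model_outputs.hidden_states,    # list/tuple of (batch_size, len, hidden_size)
        'attention': model_outputs.attentions,    # list/tuple of (batch_size, num_heads, len, len)
        'inputs_mask': batch['attention_mask'],   # 1 for real tokens, 0 for padding
        'labels': batch['labels'],
    }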
textbrewer.presets.KD_LOSS_MAP (Dict)
Available kd_loss types:
- 'mse' : mean squared error
- 'ce' : cross-entropy loss
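As an illustration of what these two types compute, here is a simplified sketch on the student and teacher logits (not the library's exact implementation; the real losses also handle temperature scaling details and the optional logits_mask):
import torch.nn.functional as F

def kd_mse_loss_sketch(logits_S, logits_T, temperature=1.0):
    # 'mse': mean squared error between (temperature-scaled) logits
    return F.mse_loss(logits_S / temperature, logits_T / temperature)

def kd_ce_loss_sketch(logits_S, logits_T, temperature=1.0):
    # 'ce': cross-entropy between the teacher's soft distribution and the student's
    p_T = F.softmax(logits_T / temperature, dim=-1)
    logp_S = F.log_softmax(logits_S / temperature, dim=-1)
    return -(p_T * logp_S).sum(dim=-1).mean()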
textbrewer.presets.PROJ_MAP (Dict)
Projection layers used to match different dimensions of intermediate features:
- 'linear' : linear layer, no activation
- 'relu' : ReLU activation
- 'tanh' : Tanh activation
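These names are referenced through the 'proj' field of an intermediate match when the student and teacher feature sizes differ, for example (the layer indices and sizes here are only illustrative):
intermediate_matches = [
    # project the student's 384-dim hidden states to the teacher's 768 dims with a linear layer
    {'layer_T': 0, 'layer_S': 0, 'feature': 'hidden', 'loss': 'hidden_mse',
     'weight': 1, 'proj': ['linear', 384, 768]},
    ...]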
textbrewer.presets.MATCH_LOSS_MAP (Dict)
Intermediate feature matching loss functions:
- Includes 'attention_mse_sum', 'attention_mse', 'attention_ce_mean', 'attention_ce', 'hidden_mse', 'cos', 'pkd', 'fsp', 'nst'. See the Intermediate Losses section below for details.
textbrewer.presets.WEIGHT_SCHEDULER (Dict)
Schedulers used to dynamically adjust the kd_loss weight and the hard_label_loss weight:
- 'linear_decay' : decays from 1 to 0 during training
- 'linear_growth' : grows from 0 to 1 during training
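For example, to decay the kd_loss weight from 1 to 0 over the course of training:
distill_config = DistillationConfig(
    kd_loss_weight_scheduler = 'linear_decay',
    ...)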
textbrewer.presets.TEMPERATURE_SCHEDULER (DynamicDict)
Schedulers used to dynamically adjust the distillation temperature:
- 'constant' : constant temperature (for testing)
- 'flsw' : see Preparing Lessons: Improve Knowledge Distillation with Better Supervision. Requires the parameters beta and gamma.
- 'cwsm' : see Preparing Lessons: Improve Knowledge Distillation with Better Supervision. Requires the parameter beta.
Unlike the other modules, 'flsw' and 'cwsm' take extra parameters, which are provided as a list, for example:
# flsw
distill_config = DistillationConfig(
    temperature_scheduler = ['flsw', 1, 1],
    ...)
# cwsm
distill_config = DistillationConfig(
    temperature_scheduler = ['cwsm', 1],
    ...)
If the pre-defined modules do not satisfy your requirements, you can add your own modules to the dicts above, for example:
MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
WEIGHT_SCHEDULER['my_weight_scheduler'] = my_weight_scheduler
Usage in DistillationConfig:
distill_config = DistillationConfig(
    kd_loss_weight_scheduler = 'my_weight_scheduler',
    intermediate_matches = [{'layer_T':0, 'layer_S':0, 'feature':'hidden', 'loss': 'my_L1_loss', 'weight' : 1}],
    ...)
See the source code for more details (this will be explained in more detail in the next version of the documentation).
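A sketch of what such user-defined modules might look like. The exact signatures the distiller expects for match losses and weight schedulers should be checked against the source; the signatures below are assumptions:
import torch.nn.functional as F

# Assumed signature: a match loss takes student/teacher features and an optional mask.
def my_L1_loss(feature_S, feature_T, mask=None):
    return F.l1_loss(feature_S, feature_T)

# Assumed signature: a weight scheduler maps (current step, total steps) to a loss weight.
def my_weight_scheduler(global_step, t_total):
    return max(0.1, 1 - global_step / t_total)

MATCH_LOSS_MAP['my_L1_loss'] = my_L1_loss
WEIGHT_SCHEDULER['my_weight_scheduler'] = my_weight_scheduler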
Intermediate Losses

attention_mse
- Takes in two matrices with the shape (batch_size, num_heads, len, len), computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
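A simplified sketch of this computation, assuming the mask has shape (batch_size, len) (not the library's exact code):
import torch.nn.functional as F

# attention_S, attention_T: (batch_size, num_heads, len, len)
def attention_mse_sketch(attention_S, attention_T, mask=None):
    if mask is None:
        return F.mse_loss(attention_S, attention_T)
    valid = mask.unsqueeze(1).unsqueeze(2).float()        # (batch_size, 1, 1, len), broadcasts over heads/queries
    sq_err = (attention_S - attention_T) ** 2 * valid     # zero out padded key positions
    return sq_err.sum() / valid.expand_as(sq_err).sum()   # average over unmasked entries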
attention_mse_sum
- Takes in two matrices with the shape (batch_size, len, len), computes the mse loss between the two matrices; if the shape is (batch_size, num_heads, len, len), sums along the num_heads dimension and then computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
attention_ce
- Takes in two matrices with the shape (batch_size, num_heads, len, len), applies softmax on dim=-1, computes the cross-entropy loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
attention_ce_mean
- Takes in two matrices. If the shape is (batch_size, len, len), computes the cross-entropy loss between the two matrices; if the shape is (batch_size, num_heads, len, len), averages over the num_heads dimension and then computes the cross-entropy loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
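Similarly, a minimal sketch of the cross-entropy variants, treating the teacher's softmaxed attention as the target distribution (masking omitted for brevity; not the exact implementation):
import torch.nn.functional as F

# attention_S, attention_T: (batch_size, num_heads, len, len) attention scores
def attention_ce_sketch(attention_S, attention_T):
    p_T = F.softmax(attention_T, dim=-1)            # teacher distribution over key positions
    logp_S = F.log_softmax(attention_S, dim=-1)     # student log-distribution
    return -(p_T * logp_S).sum(dim=-1).mean()       # cross-entropy per query position, averaged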
hidden_mse
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes the mse loss between the two matrices.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
cos
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes their cosine similarity loss.
- From DistilBERT.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
pkd
- Takes in two matrices with the shape (batch_size, len, hidden_size), computes the normalized vector mse loss at position 0 along the len dimension.
- From Patient Knowledge Distillation for BERT Model Compression.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions.
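A minimal sketch of this loss, assuming position 0 along the len dimension is the [CLS] vector (not the exact implementation):
import torch.nn.functional as F

# hidden_S, hidden_T: (batch_size, len, hidden_size)
def pkd_sketch(hidden_S, hidden_T):
    cls_S = F.normalize(hidden_S[:, 0], dim=-1)   # normalized vector at position 0
    cls_T = F.normalize(hidden_T[:, 0], dim=-1)
    return F.mse_loss(cls_S, cls_T)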
nst (mmd)
- Takes in two lists of matrices A and B. Each list contains two matrices with the shape (batch_size, len, hidden_size). The hidden_size of the matrices in A does not need to be the same as that of B. Computes the mse loss between the similarity matrix of the two matrices in A and that of the two matrices in B (both have the shape (batch_size, len, len)).
- See: Like What You Like: Knowledge Distill via Neuron Selectivity Transfer.
- If the inputs_mask is provided, masks the positions where input_mask==0.
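A simplified sketch of the similarity matrices involved (masking and any normalization details omitted; not the exact implementation):
import torch
import torch.nn.functional as F

# A = [A_1, A_2], B = [B_1, B_2]; each matrix: (batch_size, len, hidden_size)
def nst_sketch(A, B):
    sim_A = torch.bmm(A[0], A[1].transpose(1, 2))   # (batch_size, len, len)
    sim_B = torch.bmm(B[0], B[1].transpose(1, 2))   # (batch_size, len, len)
    return F.mse_loss(sim_A, sim_B)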
fsp (gram)
- Takes in two lists of matrices A and B. Each list contains two matrices with the shape (batch_size, len, hidden_size). Computes the similarity matrix between the two matrices in A ((batch_size, hidden_size, hidden_size)) and that of the two matrices in B ((batch_size, hidden_size, hidden_size)), then computes the mse loss between the two similarity matrices.
- See: A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning.
- If the inputs_mask is provided, masks the positions where input_mask==0.
- If the hidden sizes of the student and the teacher are different, the 'proj' option is needed in intermediate_matches to match the dimensions, for example:
intermediate_matches = [
    {'layer_T':[0,0], 'layer_S':[0,0], 'feature':'hidden', 'loss': 'fsp', 'weight' : 1, 'proj':['linear',384,768]},
    ...]
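A corresponding sketch of the fsp (gram) similarity matrices (as with the nst sketch above, masking and normalization details are omitted; not the exact implementation):
import torch
import torch.nn.functional as F

# A = [A_1, A_2], B = [B_1, B_2]; each matrix: (batch_size, len, hidden_size)
def fsp_sketch(A, B):
    gram_A = torch.bmm(A[0].transpose(1, 2), A[1])   # (batch_size, hidden_size, hidden_size)
    gram_B = torch.bmm(B[0].transpose(1, 2), B[1])   # (batch_size, hidden_size, hidden_size)
    return F.mse_loss(gram_A, gram_B)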