"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 25, 2022

Model Optimization / Performance Key Notes

  • Knowledge distillation is a method in which a small model (student) is trained to mimic a larger model or an ensemble of models (teacher); a minimal loss sketch follows this list.
  • DistilBERT reduces the size of a BERT model by 40% while retaining 97% of its language-understanding capabilities and running 60% faster.
  • Pruning was originally a method used for decision trees, where you remove sections of the tree that are non-critical or redundant for classification.
  • For neural networks, the more common meaning is to find the parameters least useful to predictions and set them to 0.
  • Quantization reduces a model’s size by using fewer bits to represent its parameters. 
  • By default, most software packages use 32 bits to represent a float number (single-precision floating point). If a model has 100M parameters, each requiring 32 bits to store, it will take up 400MB. If we use 16 bits to represent a number, we reduce the memory footprint by half. Using 16 bits to represent a float is called half precision; a quick check of this arithmetic appears after the list.
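A minimal sketch of the distillation loss described above, assuming PyTorch; the teacher, student, loader, and optimizer names are placeholders and the temperature/weighting values are arbitrary illustration choices, not DistilBERT's actual recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep the same magnitude as at T=1
    # Hard-label term: the usual cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training-loop skeleton: the teacher stays frozen, only the smaller student learns.
# for x, y in loader:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits, y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()

The memory arithmetic in the half-precision bullet checks out directly:

n_params = 100_000_000                     # 100M parameters
print(n_params * 4 / 1e6, "MB at 32-bit")  # 400.0 MB
print(n_params * 2 / 1e6, "MB at 16-bit")  # 200.0 MB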

Ref2 - Link 


  • Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
  • Reduce the desired computation complexity by lowering the number of FLOPs needed;
  • Reduce the inference latency and make things run faster.
  • Post-Training Quantization (PTQ): a model is first trained to convergence, and then its weights are converted to lower precision without any further training.
  • Unstructured pruning is allowed to drop any weight or connection, so it does not retain the original network architecture.
  • Structured pruning aims to maintain the dense matrix multiplication form by zeroing weights in whole blocks (e.g. rows, columns, or heads), so standard dense kernels can still be used; both pruning styles and PTQ are sketched below.
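A rough sketch of how the pruning and PTQ bullets map onto PyTorch's built-in pruning and dynamic-quantization utilities; the toy model, layer sizes, and sparsity amounts below are arbitrary illustrations, not a tuned recipe.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; sizes chosen only for illustration.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured pruning: zero the 30% of weights with the smallest L1 magnitude,
# anywhere in the matrix; the original dense structure is not preserved.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Structured pruning: zero entire rows (output neurons) of the last layer,
# so the zeros follow a pattern that dense matmul kernels can still exploit.
prune.ln_structured(model[2], name="weight", amount=0.2, n=2, dim=0)

# Fold the pruning masks into the weights permanently.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")

# Post-training quantization (dynamic): after training, store the Linear
# weights in int8 and quantize activations on the fly, with no retraining.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)

Dynamic quantization is just one flavor of PTQ; static PTQ additionally calibrates activation ranges on representative data before converting.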
Keep Exploring!!!
