"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

December 25, 2022

Model Optimization / Performance Key Notes

  • Knowledge distillation is a method in which a small model (student) is trained to mimic a larger model or an ensemble of models (teacher); a minimal loss sketch follows this list.
  • DistilBERT reduces the size of a BERT model by 40% while retaining 97% of its language-understanding capabilities and running 60% faster.
  • Pruning was originally a method used for decision trees, where you remove sections of the tree that are non-critical or redundant for classification.
  • For neural networks, the more common meaning is to find the parameters least useful to predictions and set them to 0.
  • Quantization reduces a model’s size by using fewer bits to represent its parameters. 
  • By default, most software packages use 32 bits to represent a float number (single-precision floating point). If a model has 100M parameters, each requiring 32 bits to store, it will take up 400MB. If we use 16 bits to represent a number, we reduce the memory footprint by half. Using 16 bits to represent a float is called half precision; a quick check of this arithmetic appears after the list.
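A minimal sketch of the distillation loss described above, assuming PyTorch; the teacher, student, loader, and optimizer names are placeholders and the temperature/weighting values are arbitrary illustration choices, not DistilBERT's actual recipe.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep the same magnitude as at T=1
    # Hard-label term: the usual cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Training-loop skeleton: the teacher stays frozen, only the smaller student learns.
# for x, y in loader:
#     with torch.no_grad():
#         t_logits = teacher(x)
#     loss = distillation_loss(student(x), t_logits, y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()

The memory arithmetic in the half-precision bullet checks out directly:

n_params = 100_000_000                     # 100M parameters
print(n_params * 4 / 1e6, "MB at 32-bit")  # 400.0 MB
print(n_params * 2 / 1e6, "MB at 16-bit")  # 200.0 MB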

Ref2 - Link 


  • Reduce the memory footprint of the model by using fewer GPU devices and less GPU memory;
  • Reduce the desired computation complexity by lowering the number of FLOPs needed;
  • Reduce the inference latency and make things run faster.
  • Post-Training Quantization (PTQ): a model is first trained to convergence, and then its weights are converted to lower precision without any further training.
  • Unstructured pruning is allowed to drop any weight or connection, so it does not retain the original network architecture.
  • Structured pruning aims to maintain the dense matrix multiplication form by zeroing weights in whole blocks (e.g. rows, columns, or heads), so standard dense kernels can still be used; both pruning styles and PTQ are sketched below.
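A rough sketch of how the pruning and PTQ bullets map onto PyTorch's built-in pruning and dynamic-quantization utilities; the toy model, layer sizes, and sparsity amounts below are arbitrary illustrations, not a tuned recipe.

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy model; sizes chosen only for illustration.
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Unstructured pruning: zero the 30% of weights with the smallest L1 magnitude,
# anywhere in the matrix; the original dense structure is not preserved.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Structured pruning: zero entire rows (output neurons) of the last layer,
# so the zeros follow a pattern that dense matmul kernels can still exploit.
prune.ln_structured(model[2], name="weight", amount=0.2, n=2, dim=0)

# Fold the pruning masks into the weights permanently.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")

# Post-training quantization (dynamic): after training, store the Linear
# weights in int8 and quantize activations on the fly, with no retraining.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)

Dynamic quantization is just one flavor of PTQ; static PTQ additionally calibrates activation ranges on representative data before converting.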
Keep Exploring!!!
