- Knowledge distillation is a method in which a small model (student) is trained to mimic a larger model or ensemble of models (teacher).
- DistilBERT reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.
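As a rough illustration of how the student mimics the teacher, here is a minimal PyTorch sketch of a distillation loss: a temperature-softened KL term against the teacher's logits blended with the usual hard-label cross-entropy. The function name, `temperature`, and `alpha` are illustrative choices, not values from the post or the DistilBERT paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft KL term (student mimics teacher) with hard-label cross-entropy.
    `temperature` and `alpha` are illustrative hyperparameters."""
    # Soften both distributions; scale KL by T^2 to keep gradient magnitudes comparable.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: batch of 4 examples, 10 classes.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # teacher outputs, no gradient needed
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```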
- Pruning was originally used for decision trees, where you remove sections of the tree that are non-critical or redundant for classification.
- For neural networks, the more common meaning is to find the parameters that contribute least to predictions and set them to 0.
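A minimal sketch of that idea, using simple magnitude pruning in PyTorch: the fraction of weights with the smallest absolute value is treated as "least useful" and zeroed out. The helper name and the 50% sparsity level are just for illustration.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(sparsity * weight.numel())
    if k == 0:
        return weight
    # Threshold = k-th smallest absolute value; everything at or below it is dropped.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(4, 4)
pruned = magnitude_prune(w, sparsity=0.5)
print((pruned == 0).float().mean())   # roughly half the entries are now zero
```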
- Quantization reduces a model’s size by using fewer bits to represent its parameters.
- By default, most software packages use 32 bits to represent a float number (single-precision floating point). If a model has 100M parameters and each requires 32 bits to store, it will take up 400 MB. If we use 16 bits to represent a number, we reduce the memory footprint by half. Using 16 bits to represent a float is called half precision.
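A quick NumPy sketch reproducing the arithmetic above; the 100M parameter count comes from the text, the tensor shapes are arbitrary:

```python
import numpy as np

n_params = 100_000_000                                   # 100M parameters

fp32_bytes = n_params * np.dtype(np.float32).itemsize    # 4 bytes per parameter
fp16_bytes = n_params * np.dtype(np.float16).itemsize    # 2 bytes per parameter

print(f"float32: {fp32_bytes / 1e6:.0f} MB")             # ~400 MB, as in the text
print(f"float16: {fp16_bytes / 1e6:.0f} MB")             # ~200 MB, half the footprint

# Casting an actual weight matrix to half precision:
w = np.random.randn(1024, 1024).astype(np.float32)
w_half = w.astype(np.float16)
print(w.nbytes, w_half.nbytes)                           # 4194304 vs 2097152 bytes
```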
- Reduce the model's memory footprint so it needs fewer GPU devices and less GPU memory;
- Reduce computational complexity by lowering the number of FLOPs needed;
- Reduce inference latency and make things run faster.
- Post-Training Quantization (PTQ): a model is first trained to convergence, and then its weights are converted to lower precision without any further training.
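One common PTQ flavor is dynamic quantization, where the weights of selected layers are stored in int8 and dequantized on the fly at inference time. A minimal sketch, assuming PyTorch's `torch.quantization.quantize_dynamic`; the tiny `nn.Sequential` model is just a placeholder for "a model trained to convergence":

```python
import torch
import torch.nn as nn

# A small already-trained model stands in for a model trained to convergence.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as int8 and
# dequantized on the fly; no retraining or calibration data is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)          # same interface, smaller weight storage
```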
- Unstructured pruning is allowed to drop any weight or connection, so it does not retain the original network architecture.
- Structured pruning aims to maintain the dense matrix multiplication form, zeroing out weights in regular patterns (e.g., entire rows or channels) rather than arbitrary individual elements.
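A minimal sketch contrasting the two, assuming PyTorch's `torch.nn.utils.prune` helpers; the layer sizes and 50% pruning amount are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer_u = nn.Linear(8, 8)
layer_s = nn.Linear(8, 8)

# Unstructured: drop the 50% of individual weights with smallest |w|,
# scattered anywhere in the weight matrix.
prune.l1_unstructured(layer_u, name="weight", amount=0.5)

# Structured: zero out entire rows (dim=0, i.e. whole output units) by L2 norm,
# so the matrix keeps a regular, hardware-friendly pattern.
prune.ln_structured(layer_s, name="weight", amount=0.5, n=2, dim=0)

print((layer_u.weight == 0).float().mean())   # ~0.5, zeros scattered everywhere
print((layer_s.weight == 0).all(dim=1))       # half the rows are entirely zero
```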
Keep Exploring!!!