Notes
- Scenarios that call for custom-training an LLM:
- Privacy, IP, and customization requirements
- Smaller, more efficient models
- Restricting the information shared with external LLM providers
- Example: Replit's code completion model
Stack
- Databricks for the data pipeline
- Hugging Face for tokenizers / inference tooling for code
- MosaicML for GPUs and model training
- Architecture for training the LLM
- Trained on an extensive code base from GitHub / Stack Overflow
- Data preprocessing (see the distributed-filtering sketch after this list)
- All preprocessing done in a distributed fashion
- Much of the work done in notebooks
- Auto-generated code removed from the training set
- Data anonymized to strip PII
- Code that does not compile removed
- Python 2 code removed, keeping only a single language version
- A maximum line length cutoff enforced
- Custom vocabulary creation
- Custom tokenizer trained on the domain-specific dataset (see the tokenizer sketch below)
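Since the talk only lists the filters, here is a minimal PySpark sketch of what the distributed cleanup could look like on a Databricks pipeline. The paths, column names (`path`, `content`), the auto-generation marker string, and the length cutoff are all illustrative assumptions, not the actual pipeline.

```python
# Minimal PySpark sketch of the distributed filters above; paths, column
# names ("path", "content"), and thresholds are illustrative assumptions.
import ast

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType, IntegerType

spark = SparkSession.builder.appName("code-dataset-cleanup").getOrCreate()
df = spark.read.parquet("s3://bucket/raw-code/")  # hypothetical input location

MAX_LINE_LENGTH = 1000  # assumed cutoff; very long lines usually signal minified or generated code

@F.udf(IntegerType())
def longest_line(src):
    return max((len(line) for line in src.split("\n")), default=0)

@F.udf(BooleanType())
def parses_as_python3(src):
    # ast.parse targets Python 3, so this check also drops most Python 2
    # sources (e.g. bare print statements), matching the single-version rule.
    try:
        ast.parse(src)
        return True
    except (SyntaxError, ValueError):
        return False

cleaned = (
    df
    # drop auto-generated files via a simple marker heuristic
    .filter(~F.col("content").contains("Code generated by"))
    # enforce the maximum-line-length filter
    .filter(longest_line("content") <= MAX_LINE_LENGTH)
    # keep only Python files that actually parse ("compile") under Python 3
    .filter(~F.col("path").endswith(".py") | parses_as_python3("content"))
    # crude PII scrub: mask email addresses (real pipelines go much further)
    .withColumn("content", F.regexp_replace("content", r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>"))
)
cleaned.write.mode("overwrite").parquet("s3://bucket/clean-code/")
```

Each filter runs as a plain DataFrame transformation, so Spark distributes the work across the cluster, which matches the "all preprocessing in a distributed fashion" point.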
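Likewise, a short sketch of training a custom byte-level BPE tokenizer on the cleaned code corpus with the Hugging Face `tokenizers` library; the vocabulary size, special tokens, and corpus shard name are assumptions for illustration, not Replit's actual settings.

```python
# Sketch: train a byte-level BPE tokenizer on a domain-specific code corpus.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,                                      # assumed size
    special_tokens=["<|endoftext|>", "<|pad|>"],            # assumed tokens
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),   # full byte alphabet
)
tokenizer.train(files=["clean_code_shard_0.txt"], trainer=trainer)  # hypothetical shard
tokenizer.save("code_tokenizer.json")

# Code-heavy vocabularies tokenize source more compactly than prose-trained ones.
print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)
```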
MosaicML for training
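A minimal sketch of kicking off a training run with MosaicML's Composer library wrapping a Hugging Face causal LM; the stand-in `gpt2` architecture, the toy dataset, and the hyperparameters are placeholders, not Replit's setup.

```python
# Sketch: training a causal LM with MosaicML's Composer; all specifics are placeholders.
import torch
from composer import Trainer
from composer.models import HuggingFaceModel
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in architecture
tokenizer.pad_token = tokenizer.eos_token
model = HuggingFaceModel(AutoModelForCausalLM.from_pretrained("gpt2"), tokenizer=tokenizer)

# Toy tokenized examples; a real run would stream the cleaned corpus.
texts = ["def add(a, b):\n    return a + b", "print('hello world')"]
enc = tokenizer(texts, padding=True, return_tensors="pt")

def make_example(ids, mask):
    labels = ids.clone()
    labels[mask == 0] = -100  # ignore padding positions in the loss
    return {"input_ids": ids, "attention_mask": mask, "labels": labels}

dataset = [make_example(i, m) for i, m in zip(enc["input_ids"], enc["attention_mask"])]
train_dataloader = DataLoader(dataset, batch_size=2)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",  # one epoch, purely for illustration
    optimizers=torch.optim.AdamW(model.parameters(), lr=3e-4),
    device="gpu" if torch.cuda.is_available() else "cpu",
)
trainer.fit()
```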
Future
- Optimal / smaller LLMs
- Customized LLMs
- LLMs with reasoning
Keep Exploring!!!