"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 11, 2023

How to train your own LLM - Copilot-type LLMs

Notes

  • Scenarios that call for custom training
  • Privacy, IP protection, customization
  • Smaller, more efficient models
  • Restricting information shared with third-party LLMs

  • Case study: code completion model by Replit

Stack

  • Databricks for the data pipeline
  • Hugging Face for tokenizers and code inference tools
  • MosaicML for GPUs and model training

  • LLM training architecture

  • Training data: extensive code corpus from GitHub / Stack Overflow

  • Data preprocessing
  • All preprocessing runs in a distributed fashion
  • Much of the work happens in notebooks
  • Auto-generated code removed from the training set
  • Data anonymized to strip PII
  • Code that does not compile removed
  • Python 2 code removed so only one language version remains
  • Maximum line length enforced
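The filters above can be sketched as simple predicates over each source file. This is a hypothetical illustration, not Replit's actual pipeline (theirs ran distributed on Databricks); `ast.parse` is used as a cheap stand-in for "does not compile", and conveniently it also rejects Python 2-only syntax. The marker strings and length limit are assumptions.

```python
import ast

MAX_LINE_LENGTH = 1000  # assumed threshold, not Replit's actual value

def is_valid_python(source: str) -> bool:
    """Cheap proxy for 'compiles': does the file parse as Python 3?
    Python 2-only syntax (e.g. print statements) also fails here."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def looks_auto_generated(source: str) -> bool:
    """Heuristic: common auto-generation markers near the top of the file."""
    markers = ("auto-generated", "autogenerated", "do not edit")
    head = source[:500].lower()
    return any(m in head for m in markers)

def within_line_length(source: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Drop files with extremely long lines (often minified or generated)."""
    return all(len(line) <= limit for line in source.splitlines())

def keep(source: str) -> bool:
    """Combine the filters: keep only clean, human-written, parseable code."""
    return (is_valid_python(source)
            and not looks_auto_generated(source)
            and within_line_length(source))
```

In a distributed setting each predicate would run per file inside a map stage, so the same logic scales from a notebook to the full corpus.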

  • Custom vocabulary creation
  • Custom tokenizer for the domain-specific dataset

MosaicML for training
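The notes record nothing about the MosaicML setup itself, so nothing below should be read as their API. As a generic illustration of what any training loop does at its core, here is a toy SGD fit of a linear model in pure Python; the learning rate, epoch count, and squared-error loss are all assumed, and real LLM training optimizes a transformer, not this.

```python
def train_linear(data, lr=0.1, epochs=200):
    """Toy SGD loop: fit y = w*x + b by minimizing 0.5*(pred - y)^2 per sample."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # gradient of the loss w.r.t. the prediction
            w -= lr * err * x       # chain rule: dLoss/dw = err * x
            b -= lr * err           # dLoss/db = err
    return w, b
```

Platforms like MosaicML wrap this same loop with distributed data loading, checkpointing, and GPU orchestration, which is what makes full-scale LLM training practical.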




Future

  • Optimal / smaller LLMs
  • Customized LLMs
  • LLMs with reasoning

Keep Exploring!!!
