"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

June 11, 2023

How to train your own LLM - Copilot-type LLMs

Notes

  • Scenarios that call for custom training
  • Privacy, IP protection, customization
  • Smaller, more efficient models
  • Restricting information shared with third-party LLMs

  • Case study: code completion model by Replit

Stack

  • Databricks for the data pipeline
  • Hugging Face for tokenizers and code inference tools
  • MosaicML for GPUs and model training

  • LLM training architecture

  • Training data: extensive code corpus from GitHub / Stack Overflow

  • Data preprocessing
  • All preprocessing runs in a distributed fashion
  • Much of the work happens in notebooks
  • Auto-generated code removed from the training set
  • Data anonymized to strip PII
  • Code that does not compile removed
  • Python 2 code removed so only one language version remains
  • Maximum line length enforced
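The filters above can be sketched as simple predicates over each source file. This is a hypothetical illustration, not Replit's actual pipeline (theirs ran distributed on Databricks); `ast.parse` is used as a cheap stand-in for "does not compile", and conveniently it also rejects Python 2-only syntax. The marker strings and length limit are assumptions.

```python
import ast

MAX_LINE_LENGTH = 1000  # assumed threshold, not Replit's actual value

def is_valid_python(source: str) -> bool:
    """Cheap proxy for 'compiles': does the file parse as Python 3?
    Python 2-only syntax (e.g. print statements) also fails here."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

def looks_auto_generated(source: str) -> bool:
    """Heuristic: common auto-generation markers near the top of the file."""
    markers = ("auto-generated", "autogenerated", "do not edit")
    head = source[:500].lower()
    return any(m in head for m in markers)

def within_line_length(source: str, limit: int = MAX_LINE_LENGTH) -> bool:
    """Drop files with extremely long lines (often minified or generated)."""
    return all(len(line) <= limit for line in source.splitlines())

def keep(source: str) -> bool:
    """Combine the filters: keep only clean, human-written, parseable code."""
    return (is_valid_python(source)
            and not looks_auto_generated(source)
            and within_line_length(source))
```

In a distributed setting each predicate would run per file inside a map stage, so the same logic scales from a notebook to the full corpus.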

  • Custom vocabulary creation
  • Custom tokenizer for the domain-specific dataset

MosaicML for training
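The notes record nothing about the MosaicML setup itself, so nothing below should be read as their API. As a generic illustration of what any training loop does at its core, here is a toy SGD fit of a linear model in pure Python; the learning rate, epoch count, and squared-error loss are all assumed, and real LLM training optimizes a transformer, not this.

```python
def train_linear(data, lr=0.1, epochs=200):
    """Toy SGD loop: fit y = w*x + b by minimizing 0.5*(pred - y)^2 per sample."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y   # gradient of the loss w.r.t. the prediction
            w -= lr * err * x       # chain rule: dLoss/dw = err * x
            b -= lr * err           # dLoss/db = err
    return w, b
```

Platforms like MosaicML wrap this same loop with distributed data loading, checkpointing, and GPU orchestration, which is what makes full-scale LLM training practical.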




Future

  • Optimal / smaller LLMs
  • Customized LLMs
  • LLMs with reasoning

Keep Exploring!!!
