"No one is harder on a talented person than the person themselves" - Linda Wilkinson ; "Trust your guts and don't follow the herd" ; "Validate direction not destination" ;

January 08, 2023

ChatGPT Notes - Tech Perspectives

Paper - Link

Ref Link 

  • Label outputs of GPT
  • Apply human feedback (RLHF)
  • SFT - supervised fine-tuning of the model on labeler demonstrations
  • Rank outputs to train a reward model (a minimal sketch of the ranking loss follows this list)
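A minimal numpy sketch of how ranked outputs become a training signal, assuming the pairwise log-sigmoid ranking loss described in the InstructGPT paper; the reward scores are made-up numbers for illustration.

import numpy as np

def pairwise_ranking_loss(r_chosen: float, r_rejected: float) -> float:
    """Ranking loss for a reward model trained on human-ranked outputs:
    push the score of the preferred response above the rejected one."""
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Hypothetical reward-model scores for two responses to the same prompt.
print(pairwise_ranking_loss(r_chosen=1.8, r_rejected=0.3))  # small loss, ranking respected
print(pairwise_ranking_loss(r_chosen=0.2, r_rejected=1.5))  # large loss, ranking violated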

Label Approach

  • Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.
  • Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.
  • User-based: We had a number of use cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.

  • A sequence of token indices feeds into the Transformer.
  • A probability distribution over the next index in the sequence comes out.
  • Batching is done cleverly (both across examples and over sequence length) for efficiency (a shape-level sketch follows this list).
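A shape-level sketch of that contract, with random logits standing in for a real Transformer (an assumption purely for illustration); only the tensor shapes and the softmax over the vocabulary reflect the actual model.

import numpy as np

vocab_size, batch, seq_len = 50257, 2, 8       # GPT-2 style vocabulary, toy batch

# Input: integer token indices, shape (batch, seq_len)
token_ids = np.random.randint(0, vocab_size, size=(batch, seq_len))

# Stand-in for the Transformer: real models produce logits of shape
# (batch, seq_len, vocab_size); here they are random for illustration.
logits = np.random.randn(batch, seq_len, vocab_size)

# Output: a probability distribution over the next index at every position.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

next_token = probs[:, -1, :].argmax(axis=-1)   # greedy pick at the last position
print(token_ids.shape, probs.shape, next_token)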
Key params
  • GPT-1-like: 12 layers, 12 heads, d_model = 768 (~125M parameters)
  • GPT-3: 96 layers, 96 heads, d_model = 12,288 (~175B parameters)
  • Feed-forward layer is four times the size of the bottleneck layer: d_ff = 4 * d_model (a rough parameter-count check is sketched below)
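A rough sanity check of those parameter counts, assuming the standard decoder block breakdown (about 4*d_model^2 weights for attention and 8*d_model^2 for the feed-forward layer with d_ff = 4*d_model) plus the token embedding matrix; biases, layer norms and positional embeddings are ignored, so the totals are only approximate.

def approx_params(n_layers: int, d_model: int, vocab_size: int = 50257) -> float:
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    feed_forward = 2 * d_model * (4 * d_model)  # d_model -> 4*d_model -> d_model
    blocks = n_layers * (attention + feed_forward)
    embeddings = vocab_size * d_model           # token embedding matrix
    return blocks + embeddings

print(approx_params(12, 768) / 1e6)      # ~124M, close to the quoted 125M
print(approx_params(96, 12288) / 1e9)    # ~174.5B, close to the quoted 175B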
  • GPT refers to a family of models which includes ChatGPT. text-davinci-003 is trained almost the same way as ChatGPT, except it is not tuned for dialogue. We can use text-davinci-003 through the OpenAI API today; there is a usage-based cost we need to factor in when building applications (a minimal API sketch follows this list).
  • This could be a combination of factors: the attention mechanism helps the model focus on the appropriate parts of the context, and the inclusion of code in the training corpus may play a role in its ability to remember tokens mentioned at the beginning of a dialog when answering a question that comes much later. For instance, code completion requires the model to close a brace that marks the end of a code block, or to remember a global variable that was declared earlier.
  • There is no middleware: the entire context is present in the input token sequence, which can be as long as 2048–4096 tokens (the sketch below packs a dialog into that window).
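A minimal sketch of calling text-davinci-003 while packing the whole dialog into the prompt, assuming the openai Python package of this period (openai.Completion.create); the 4-characters-per-token estimate, the 4096-token budget and the sample history are rough illustrative assumptions, and every call is billed per token.

import openai

openai.api_key = "YOUR_API_KEY"   # usage is billed per token, so budget accordingly

MAX_CONTEXT_TOKENS = 4096         # approximate text-davinci-003 context window
CHARS_PER_TOKEN = 4               # rough rule of thumb, not an exact tokenizer

def build_prompt(history: list[str], question: str) -> str:
    """There is no middleware: any 'memory' must be packed into the prompt
    itself, so keep appending turns and drop the oldest when over budget."""
    turns = history + [f"User: {question}", "Assistant:"]
    while len(turns) > 2 and len("\n".join(turns)) // CHARS_PER_TOKEN > MAX_CONTEXT_TOKENS - 256:
        turns.pop(0)              # drop the oldest turn to stay inside the window
    return "\n".join(turns)

history = ["User: My project uses a global variable called retry_count.",
           "Assistant: Noted, retry_count is defined at module scope."]

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=build_prompt(history, "Which variable tracks retries?"),
    max_tokens=256,
    temperature=0.2,
)
print(response["choices"][0]["text"])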
Ref - Link



  • The encoder takes the input and encodes it into a fixed-length vector. 
  • The decoder takes that vector and decodes it into the output sequence. 
  • Transformers use multi-headed attention, which is a parallel computation of a specific attention function called scaled dot-product attention (a minimal sketch follows).
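A minimal numpy sketch of scaled dot-product attention, with the unnormalized attention weights and the normalized attention scores marked in comments; the toy shapes are assumptions for illustration.

import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    weights = Q @ K.T / np.sqrt(d_k)   # unnormalized attention weights
    scores = softmax(weights)          # attention scores (each row sums to 1)
    return scores @ V                  # weighted sum of the values

seq_len, d_k, d_v = 5, 16, 16          # toy sizes for illustration
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)
print(scaled_dot_product_attention(Q, K, V).shape)   # (5, 16)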

  • GPT Competitors:
  • PEER by Meta AI
  • LaMDA by Google AI
  • PaLM by Google AI

  • Computing the Unnormalized Attention Weights
  • Computing the Attention Scores
  • Multi-Head Attention (a minimal sketch follows)
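A minimal sketch of multi-head attention as the parallel application of that attention function, one head per slice of the model dimension; the sizes and random projection matrices are toy assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    weights = Q @ K.T / np.sqrt(Q.shape[-1])   # unnormalized attention weights
    return softmax(weights) @ V                # attention scores times values

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Run scaled dot-product attention in parallel over n_heads slices
    of the model dimension, then concatenate and project back."""
    d_model = x.shape[-1]
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = [attention(Q[:, i*d_head:(i+1)*d_head],
                       K[:, i*d_head:(i+1)*d_head],
                       V[:, i*d_head:(i+1)*d_head]) for i in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ W_o

seq_len, d_model, n_heads = 5, 64, 4           # toy sizes for illustration
x = np.random.randn(seq_len, d_model)
W_q, W_k, W_v, W_o = (np.random.randn(d_model, d_model) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads).shape)   # (5, 64)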
  • Generative: A GPT generates text.
  • Pre-trained: A GPT is trained on lots of text from books, the internet, etc ...
  • Transformer: A GPT is a decoder-only transformer neural network
At a high level, the GPT architecture has three sections:
  • Text + positional embeddings
  • A transformer decoder stack
  • A projection to vocab step (a structural sketch follows)
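A structural sketch of those three sections, with the decoder blocks left as pass-through placeholders so only the embedding lookup and the projection to vocab are real; the sizes match the GPT-1-like row above, everything else is an assumption.

import numpy as np

vocab_size, max_len, d_model, n_layers = 50257, 1024, 768, 12   # GPT-1-like sizes

wte = np.random.randn(vocab_size, d_model) * 0.02   # token embedding table
wpe = np.random.randn(max_len, d_model) * 0.02      # positional embedding table

def decoder_block(x):
    # Placeholder for masked multi-head self-attention + feed-forward;
    # a real block would transform x, here it is passed through unchanged.
    return x

def gpt(token_ids):
    # 1) Text + positional embeddings
    x = wte[token_ids] + wpe[np.arange(len(token_ids))]
    # 2) Transformer decoder stack
    for _ in range(n_layers):
        x = decoder_block(x)
    # 3) Projection to vocab (weights tied with the token embeddings)
    return x @ wte.T                                 # logits, shape (seq_len, vocab_size)

print(gpt(np.array([464, 3290, 318])).shape)         # (3, 50257)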
  • Take a bunch of text blocks and feed them to the OpenAI embeddings API
  • Find the most similar vectors among my notes (a minimal sketch follows)
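A minimal sketch of that workflow, assuming the openai Python package of this period (openai.Embedding.create with the text-embedding-ada-002 model); the notes and the query are made-up examples.

import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"

notes = ["GPT-3 has 96 layers and d_model of 12288.",
         "Transformers use multi-headed attention.",
         "The feed-forward layer is four times the model width."]

def embed(texts):
    resp = openai.Embedding.create(input=texts, model="text-embedding-ada-002")
    return np.array([item["embedding"] for item in resp["data"]])

note_vectors = embed(notes)
query_vector = embed(["How wide is the feed-forward layer?"])[0]

# Cosine similarity between the query and every note, highest first.
scores = note_vectors @ query_vector / (
    np.linalg.norm(note_vectors, axis=1) * np.linalg.norm(query_vector))
for idx in np.argsort(-scores):
    print(round(float(scores[idx]), 3), notes[idx])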
Good Read - GAN 
  • Improve productivity and process efficiency.
  • Optimize value chains.
  • Redefine the entire ecosystem.
Keep Exploring!!!
