Paper - Link
Ref Link
- Label outputs of GPT
- Apply human feedback (RLHF)
- SFT - supervised fine-tuning of the model
- Rank outputs to train a reward model (see the loss sketch after this list)
- Plain: We simply ask the labelers to come up with an arbitrary task, while ensuring the tasks had sufficient diversity.
- Few-shot: We ask the labelers to come up with an instruction, and multiple query/response pairs for that instruction.
- User-based: We had a number of use cases stated in waitlist applications to the OpenAI API. We asked labelers to come up with prompts corresponding to these use cases.
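The ranked outputs feed a reward model trained with a pairwise comparison loss; the InstructGPT paper uses -log sigmoid(r(x, y_w) - r(x, y_l)). A minimal PyTorch sketch, where the `reward_model` call in the usage comment is hypothetical:

```python
# Pairwise ranking loss used to fit a reward model on labeler comparisons:
# push the preferred completion's score above the rejected completion's score.
import torch
import torch.nn.functional as F

def reward_ranking_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over comparison pairs
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical usage: reward_model scores each (prompt, completion) pair.
# score_w = reward_model(prompts, preferred)   # shape (batch,)
# score_l = reward_model(prompts, rejected)    # shape (batch,)
# loss = reward_ranking_loss(score_w, score_l)
```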
- sequence of indices feeds into a Transformer
- probability distribution over the next index in the sequence comes out
- be clever with batching (both across examples and over sequence length) for efficiency.
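A minimal sketch of that interface, assuming the Hugging Face `transformers` package and the public "gpt2" checkpoint are available (any decoder-only LM has the same shape contract):

```python
# Token indices in, next-token distribution out.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids  # (1, seq_len)

with torch.no_grad():
    logits = model(input_ids).logits            # (batch, seq_len, vocab_size)

# The model scores every position in parallel, which is why batching works
# both across examples and over sequence length.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
print(tokenizer.decode([int(next_token_probs.argmax())]))   # most likely next token
```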
Key params
- GPT-1-like: 12 layers, 12 heads, d_model 768 (125M)
- GPT-3: 96 layers, 96 heads, with d_model of 12,288 (175B parameters).
- feedforward layer four times the size of the bottleneck layer, d_ff = 4 * d_model
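A quick sanity check on those numbers, using the rough estimate of ~12 * n_layer * d_model^2 parameters for the transformer blocks plus the token embedding matrix (an approximation, not an exact count):

```python
# Rough parameter count: ~4*d_model^2 per block for the attention projections
# plus ~8*d_model^2 for the feed-forward layer (d_ff = 4*d_model), plus the
# token embedding matrix. Biases and layer norms are ignored.
def approx_params(n_layer: int, d_model: int, vocab_size: int = 50257) -> int:
    return n_layer * 12 * d_model ** 2 + vocab_size * d_model

print(f"GPT-1-like: {approx_params(12, 768) / 1e6:.0f}M parameters")     # ~124M
print(f"GPT-3:      {approx_params(96, 12288) / 1e9:.0f}B parameters")   # ~175B
```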
- GPT refers to a family of models that includes ChatGPT. text-davinci-003 is trained in almost the same way as ChatGPT, except it is not tuned for dialogue. We can use text-davinci-003 through the OpenAI API today; there is a usage-based cost we need to factor in when building applications.
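A minimal sketch of calling text-davinci-003 with the legacy (pre-1.0) `openai` Python client; model availability and pricing change over time, so treat the model name and the prompt as illustrative:

```python
# Sketch using the legacy openai client's Completions endpoint.
# Requires `pip install openai` and an OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Explain RLHF in one sentence.",
    max_tokens=64,
    temperature=0.2,
)
print(response["choices"][0]["text"].strip())

# Usage is billed per token (prompt + completion), which is the cost
# to factor in when building applications on top of the API.
print(response["usage"]["total_tokens"])
```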
- This could be a combination of factors: the attention mechanism helps the model focus on the appropriate parts of the input for context, and the inclusion of code in the training corpus may play a role in its ability to remember tokens mentioned at the beginning of a dialogue when answering a question that comes much later. For instance, code completion requires the model to close a brace that marks the end of a code block, or to remember a global variable declared earlier.
- There is no middleware - the entire context is present in the input token sequence, which can be as long as 2048–4096 tokens
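Since everything must fit in that window, it helps to count tokens before sending a prompt. A sketch with `tiktoken`, assuming the p50k_base encoding used by the davinci-era models:

```python
# Keep prompt + expected completion inside the 2048-4096 token window.
# Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("p50k_base")

def fits_in_context(prompt: str, max_completion_tokens: int, context_window: int = 4096) -> bool:
    prompt_tokens = len(enc.encode(prompt))
    return prompt_tokens + max_completion_tokens <= context_window

print(fits_in_context("Summarise the following notes: ...", max_completion_tokens=256))
```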
Ref - Link
- The encoder takes the input and encodes it into a fixed-length vector.
- The decoder takes that vector and decodes it into the output sequence.
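A minimal sketch of that classic encoder-decoder pattern with GRUs (illustrative dimensions; the encoder's final hidden state is the fixed-length vector):

```python
# Encoder compresses the input into a fixed-length vector; the decoder
# generates the output sequence conditioned on it.
import torch
import torch.nn as nn

d_in, d_hidden, d_out = 32, 64, 32
encoder = nn.GRU(d_in, d_hidden, batch_first=True)
decoder = nn.GRU(d_out, d_hidden, batch_first=True)

src = torch.randn(8, 15, d_in)           # (batch, src_len, d_in)
_, fixed_vector = encoder(src)           # (1, batch, d_hidden): the fixed-length encoding

tgt = torch.randn(8, 10, d_out)          # decoder inputs, e.g. shifted targets
decoded, _ = decoder(tgt, fixed_vector)  # decoder starts from the fixed vector
print(decoded.shape)                     # torch.Size([8, 10, 64])
```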
- Transformers use multi-headed attention, which is a parallel computation of a specific attention function called scaled dot-product attention.
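A minimal sketch of scaled dot-product attention itself (single head, no masking):

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
import math
import torch

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    weights = torch.softmax(scores, dim=-1)              # each row sums to 1
    return weights @ v                                   # (..., seq_q, d_v)

q = torch.randn(2, 5, 64)   # (batch, seq_len, d_k)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
print(scaled_dot_product_attention(q, k, v).shape)       # torch.Size([2, 5, 64])
```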
- GPT Competitors
- PEER by Meta AI
- LaMDA by Google AI
- PaLM by Google AI
- Computing the Unnormalized Attention Weights
- Computing the Attention Scores
- Multi-Head Attention
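A sketch that walks through those three steps for one sequence: project to Q/K/V, compute the unnormalized weights q·kᵀ, normalize them into attention scores with softmax, and run several heads in parallel (dimensions are illustrative):

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)                 # embedded input sequence
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

# Project to queries/keys/values and split each into n_heads chunks of size d_head.
q = (x @ W_q).view(seq_len, n_heads, d_head).transpose(0, 1)   # (heads, seq, d_head)
k = (x @ W_k).view(seq_len, n_heads, d_head).transpose(0, 1)
v = (x @ W_v).view(seq_len, n_heads, d_head).transpose(0, 1)

unnormalized = q @ k.transpose(-2, -1)                              # unnormalized attention weights
scores = torch.softmax(unnormalized / math.sqrt(d_head), dim=-1)    # attention scores, rows sum to 1
context = scores @ v                                                # (heads, seq, d_head)

# Concatenate the heads back into d_model-sized vectors.
out = context.transpose(0, 1).reshape(seq_len, d_model)
print(out.shape)                                                    # torch.Size([6, 32])
```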
- Generative: A GPT generates text.
- Pre-trained: A GPT is trained on lots of text from books, the internet, etc.
- Transformer: A GPT is a decoder-only transformer neural network.
At a high level, the GPT architecture has three sections:
- Text + positional embeddings
- A transformer decoder stack
- A projection to vocab step
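A shape-level skeleton of those three sections in PyTorch; a sketch only, with the stock nn.TransformerEncoderLayer plus a causal mask standing in for a real GPT decoder block:

```python
# Embeddings -> decoder stack -> projection to vocab.
import torch
import torch.nn as nn

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=50257, n_ctx=1024, d_model=768, n_layer=12, n_head=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)       # text embeddings
        self.pos_emb = nn.Embedding(n_ctx, d_model)            # positional embeddings
        self.blocks = nn.ModuleList([                          # transformer decoder stack
            nn.TransformerEncoderLayer(d_model, n_head, 4 * d_model, batch_first=True)
            for _ in range(n_layer)
        ])
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)  # projection to vocab

    def forward(self, idx):                                    # idx: (batch, seq_len) token indices
        positions = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(positions)
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1))
        for block in self.blocks:
            x = block(x, src_mask=mask)                        # causal mask = decoder-only behaviour
        return self.lm_head(x)                                 # (batch, seq_len, vocab_size) logits

logits = TinyGPT()(torch.randint(0, 50257, (2, 16)))
print(logits.shape)                                            # torch.Size([2, 16, 50257])
```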
- Take a bunch of text blocks and feed them to the OpenAI embeddings API
- Find the most similar vectors among my notes
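A sketch of that flow with the legacy (pre-1.0) `openai` Embedding endpoint and cosine similarity; the note texts and the text-embedding-ada-002 model name are illustrative assumptions:

```python
# Embed note chunks once, embed the query, rank notes by cosine similarity.
import os
import numpy as np
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

notes = ["GPT is a decoder-only transformer.", "GANs pit a generator against a discriminator."]
note_vectors = embed(notes)
query_vector = embed(["How does GPT generate text?"])[0]

# Cosine similarity = dot product of L2-normalised vectors.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(v, query_vector) for v in note_vectors]
print(notes[int(np.argmax(scores))])   # most similar note
```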
Good Read - GAN
- Improve productivity and process efficiency.
- Optimize value chains.
- Redefine the entire ecosystem.
Keep Exploring!!!