Intro Unit – “Foundational Large Language Models & Text Generation”
Podcast
Large Language Models
Transformer (Neural Network) Architecture - Foundation for most modern LLMs
- Has an Encoder (processes the input) and a Decoder (generates output tokens)
- Transformers process all the tokens at the same time (in parallel), so positional encoding is needed to preserve the order in which the tokens appear
- “The tiger jumped out of a tree to get a drink because it was thirsty”. We want the model to figure out that “it” refers to the tiger
- The model compares all the tokens with each other (self-attention) to get a similarity score between each pair; from these scores it infers that “it” is most similar to “tiger” (see the sketch after this list)
- Input text → Tokens → Vector Embeddings → Output
- Many newer LLMs are decoder-only models
- The encoder's main role is to create a representation of the whole input sequence upfront, whereas the decoder generates the output token by token
- Can be used for Generating Text
- MoE (Mixture of Experts) - A set of expert sub-models; for any given input, only the required experts get called. Offers better efficiency
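A minimal sketch of the self-attention idea above (NumPy only; the toy embeddings are made up, and real Transformers use learned query/key/value projections and multiple attention heads):

```python
# Toy scaled dot-product self-attention. The embeddings are invented for
# illustration; real models learn Q/K/V projections and use many heads.
import numpy as np

def self_attention(X):
    """X: (seq_len, d) token embeddings; every token attends to every token."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # pairwise similarity of all tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over each row
    return weights, weights @ X                        # attention map, contextualised vectors

tokens = ["tiger", "jumped", "it"]
X = np.array([[1.0, 0.9, 0.1],    # "tiger"
              [0.2, 0.1, 1.0],    # "jumped"
              [0.9, 1.0, 0.2]])   # "it" - deliberately close to "tiger" in this toy space
weights, _ = self_attention(X)
print(dict(zip(tokens, weights[2])))  # "it" puts far more weight on "tiger" than on "jumped"
```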
The Timeline
- GPT-1 - OpenAI - 2018 - Decoder only
- They did unsupervised pre-training on the data, then fine-tuned it for specific tasks
- Limitation: It would get stuck repeating the same text and wasn't good for long conversations
- BERT - Google - 2018 - Encoder only
- Focused on understanding language, not generating it
- GPT-2 - OpenAI - 2019
- Trained on the WebText corpus
- Better coherence; could perform new tasks without being trained specifically for them (zero-shot learning)
- GPT-3
- 175 billion parameters
- Got better at few-shot learning
- GPT-3.5
- Can understand and write code
- GPT-4
- Game changer. True Multimodal model
- LaMDA - Google - 2021
- Google's conversational AI. More about dialogue
- Gopher - DeepMind
- Decoder only model
- Used for knowledge intensive tasks
- GLaM - Google
- Better performance with less computing power
- Chinchilla - DeepMind - 2022
- A 70-billion-parameter model trained on much more data (a better-scaled data-to-parameter ratio), which in turn gave better results
- PaLM, PaLM 2 - Google - 2022, 2023
- Was better at reasoning, coding, math. Foundation of most of Google’s current GenAI
- Gemini - Google
- Multimodal from start, pushing boundaries
- Designed to handle images, videos, and audio. Runs on Google TPUs (Tensor Processing Units)
- Open Source Models
- Gemma, Gemma 2 - Google
- Has a huge vocabulary, lightweight
- Llama - Meta
- Reasoning, coding, safety enhancements
- Mistral - Mistral AI
- Math, coding, multilingual tasks
- o1 - OpenAI
- DeepSeek
- Group Relative Policy Optimization (GRPO)
- Grok 3 - xAI
Fine-Tuning (Cost x Performance x Efficiency)
All the above foundational models are built using the Transformer Architecture, but they need to be tailored for specific tasks
The Training process:
- Pre-training: Feed the model tons of raw text; this helps it learn the grammar and vocabulary of the language
- Fine-tuning: Take the pre-trained model with its general knowledge and train it further on a smaller, more specific dataset from a particular domain
- SFT (Supervised Fine-Tuning): The main technique for training models on labelled examples. Helps the model learn to perform the specific task - teaching it how to behave
- RLHF (Reinforcement Learning with Human Feedback): Teaching the model to generate responses that humans prefer. A reward model is trained on human preference data, and this reward model is then used to fine-tune the LLM with reinforcement learning
Fine-tuning can be very expensive ($). To overcome this, PEFT (Parameter-Efficient Fine-Tuning): train only a small part of the model, leaving most of the pre-trained weights frozen
- Adapter-based fine-tuning: Add small modules called adapters into the model, and only train the parameters within those adapters
- Low-Rank Adaptation (LoRA): Use low-rank matrices to approximate the changes full fine-tuning would make. Drastically reduces the number of parameters we need to train (sketched after this list)
- Soft Prompting: A small set of trainable vectors (the soft prompt) added to the input, which we optimize without changing the original model weights
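A rough numerical sketch of the LoRA idea, not a training implementation; the dimensions, rank, and initialisation below are illustrative assumptions:

```python
# LoRA sketch: keep the pre-trained weight W frozen and learn only a
# low-rank update B @ A. B starts at zero so the model is unchanged
# before training. This is an illustration, not a training loop.
import numpy as np

d_in, d_out, r = 1024, 1024, 8               # r is the low rank (r << d)
W = np.random.randn(d_out, d_in) * 0.02      # frozen pre-trained weight
A = np.random.randn(r, d_in) * 0.01          # trainable
B = np.zeros((d_out, r))                     # trainable, zero-init so B @ A = 0 at start

def lora_forward(x):
    return W @ x + B @ (A @ x)               # original path + low-rank correction

full_params = d_out * d_in                   # ~1.05M weights to update in full fine-tuning
lora_params = r * d_in + d_out * r           # ~16K trainable weights (~1.6% of full)
print(full_params, lora_params)
```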
Prompt-Engineering
Once we have a fine-tuned model, we need prompt engineering to use it fully
- Prompt Engineering is about designing the input that we give to the model
- It makes a huge difference in the response of the model
- Zero-shot Prompting: Give the model a direct instruction or question without any examples
- Few-shot Prompting: Give the model a few examples to show what kind of response you are looking for
- Chain-of-Thought Prompting: For more complex reasoning tasks, show the model how to think step by step
Sampling Techniques - the way the model picks tokens while generating text
- Greedy Search: The model always picks the most likely next token. Fast, but may lead to repetitive outputs
- Random Sampling: More random; can lead to more creative output. Higher temperature - more randomness (see the sketch after this list)
- Top-K: Limits the model's choices to the top K most likely tokens, to control the output
- Top-P (Nucleus Sampling): Uses a dynamic threshold based on the cumulative probabilities of the tokens
- Best of N: Generates multiple responses and picks the best one based on some criteria
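A small sketch contrasting greedy search with temperature sampling; the vocabulary and logits are invented for illustration:

```python
# Greedy search vs. temperature sampling over a made-up next-token distribution.
import numpy as np

vocab = ["the", "a", "tiger", "tree"]
logits = np.array([3.0, 2.5, 1.0, 0.2])          # model scores for each candidate token

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def greedy(logits):
    return vocab[int(np.argmax(logits))]          # always the single most likely token

def sample(logits, temperature=1.0):
    probs = softmax(logits / temperature)         # higher temperature -> flatter distribution
    return vocab[np.random.choice(len(vocab), p=probs)]

print(greedy(logits))                             # deterministic
print(sample(logits, temperature=0.2))            # almost always the top token
print(sample(logits, temperature=1.5))            # much more varied output
```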
Evaluating LLMs
In the context of text generation (something open-ended), we cannot use traditional metrics like accuracy, F1 score, etc.
Requirements for a good Evaluation framework for LLM’s:
- Must be multifaceted; we need data specifically designed for the task we're evaluating. Should use real user interaction data and synthetic data to cover all kinds of situations
- Cannot evaluate the model in isolation; should consider the whole system the model is a part of. For example, for an agent, consider the entire agent's use case
- Define what good actually means for our specific use case (Like, accuracy, creativity etc.)
Methods used :
- Traditional metrics like BLEU or ROUGE can still be used, but they don't always capture the nuances of language - sometimes a creative or unexpected response is just as good as, or better than, the expected one (see the sketch after this list)
- Human evaluation is good but time-consuming, so people are using LLM-powered autoraters. We give the autorater model the task, the evaluation criteria, and the response generated by the model; it then gives a score with a reason for its judgement. We need to calibrate the autorater regularly
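A rough sketch of a ROUGE-1-style unigram-recall score (simplified; the real metric clips counts and has several variants), showing how a reasonable paraphrase can still score poorly:

```python
# Simplified ROUGE-1 recall: fraction of reference unigrams that appear in
# the candidate. Illustrates why overlap metrics can undervalue a good paraphrase.
def rouge1_recall(candidate: str, reference: str) -> float:
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    return len(ref & cand) / len(ref)

reference = "the tiger climbed down from the tree because it was thirsty"
paraphrase = "feeling thirsty the big cat came down for a drink"
print(rouge1_recall(paraphrase, reference))   # low score despite being a reasonable answer
```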
Accelerating the Inference of models
- Balance : Balance the quality of output with the speed and cost of generating it
- Output Approximating Methods : Changing the output slightly to gain efficiency.
- Quantization : Reducing the numerical precision of the model's weights and activations - for example, using 8-bit or 4-bit integers instead of 32-bit floating-point values (sketched after this list)
- Distillation : Train a smaller (Student - Faster and efficient) model to mimic a larger (Teacher) one
- Data distillation
- Knowledge distillation
- On Policy distillation
- Output Preserving Methods: Keep the output exactly the same, but optimize the computations
- Flash Attention : Designed to optimize the self-attention calculations within the Transformer. Minimizes the amount of data movement needed during those calculations, which can be a bottleneck. Doesn't change the result of the attention computation, just the way it's done
- Prefix Caching: Especially useful in conversational applications. Saves time when parts of the input repeat. Cache the results of attention calculations for those parts so they need not be recomputed
- Speculative Decoding: Use a smaller, faster drafter model to predict a bunch of future tokens. The main model checks the predictions; where the drafter is right, we accept those tokens and skip the calculations for them, which speeds up decoding
- Batching: Process multiple requests at the same time instead of doing it one by one
- Parallelization: Splitting up the computation across multiple processors and devices
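A minimal sketch of symmetric 8-bit quantization of a weight tensor; real inference stacks use dedicated int8/int4 kernels, so this only illustrates the precision reduction itself:

```python
# Symmetric int8 quantization sketch: map float weights to [-127, 127]
# with a single scale, then recover an approximation by dequantizing.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                    # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                # approximate original weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())          # small approximation error
```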
LLM Use Case in practice:
Code and Math
- Code generation (AlphaCode - Google)
- Completing code
- Refactoring, debugging, documentation
- Understanding large codebases
Other use cases: Machine Translation, Text Summarization, Question Answering systems using RAG, Chatbot systems, Content Creation, Natural Language Inference, Sentiment Analysis, Text Classification (news, spam), Autoraters, and Multimodal LLMs (text + audio + video) for education, business, and more
Prompt Engineering
Output Length
The number of tokens the model generates directly impacts cost and processing time
- Need to engineer our prompt to be more targeted to get a concise response
Sampling
Temperature
- Dial that controls the randomness
- Low Temperature : Predictable, precise, deterministic output
- High Temperature : Introduces more randomness, giving more creative, unexpected results
Top K
- We consider only the top K most likely options for the output
- Higher K : More variety in the output
- Lower K : More focused
- K = 1 : Deterministic best guess
Top P (Nucleus Sampling)
- Uses probability instead of a fixed number: it selects the smallest set of words whose probabilities add up to at least the value of P
- Lower P : Makes output more restrictive
- P closer to 1 : Considers almost all possible words
Case where we combine all three
- Temperature + Top P + Top K : The model first looks for words that meet both the Top P and Top K criteria, then the Temperature setting influences the final choice among those filtered words (sketched below)
- Top P + Top K : Just filters by whichever one is active; with no temperature setting, the model randomly chooses from the tokens that pass the filter
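A sketch of the combined order described above (top-K and top-P filtering first, then temperature over the survivors); the logits stand in for a real model's output:

```python
# Combined top-K + top-P + temperature sampling sketch.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_next(logits, top_k=40, top_p=0.95, temperature=0.9):
    logits = np.asarray(logits, dtype=float)
    probs = softmax(logits)
    order = np.argsort(probs)[::-1][:top_k]                  # top-K: keep the K most likely tokens
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]                                 # top-P: smallest set summing to >= P
    final = softmax(logits[nucleus] / temperature)           # temperature over the survivors
    return int(np.random.choice(nucleus, p=final))           # index of the chosen token

logits = [2.0, 1.5, 0.3, 0.1, -1.0]
print(sample_next(logits, top_k=3, top_p=0.9, temperature=0.7))
```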
TIP
| Value | Benchmark Recommendation | More Variety | Single Answer |
|---|---|---|---|
| Temperature | 0.2 | 0.9 | 0 |
| Top P | 0.95 | 0.99 | |
| Top K | 30 | 40 | |
Prompting Techniques
Zero shot prompting (or) General Prompting
- We give the model a task description and input without any examples
One shot (or) Few shot prompting
- We provide examples within the prompt to guide the model (see the example after this list)
- The model learns from the examples and gets a much clearer understanding
- Quality and relevance of the examples are very important (GIGO)
- Also, can include examples of edge cases to make the solutions more robust
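An illustrative few-shot prompt (the reviews and labels are made up) showing how the examples fix the expected format:

```
Classify the sentiment of the review as POSITIVE, NEGATIVE or NEUTRAL.

Review: "The battery lasts all day, I love it."
Sentiment: POSITIVE

Review: "It stopped working after a week."
Sentiment: NEGATIVE

Review: "The screen is bright but the speakers are weak."
Sentiment:
```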
System Prompting
- Can be used to specify output requirements like “Output all predictions as a JSON object with specific keys” (example below)
- Setting the overall context and purpose of the prompt
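An illustrative system prompt (the task and JSON keys are made up) that constrains the output format:

```
System: You are a movie review classifier. Return only a JSON object with the
keys "sentiment" (POSITIVE, NEGATIVE or NEUTRAL) and "confidence" (a number from 0 to 1).

User: "A disturbing masterpiece - I couldn't look away."
```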
Role Prompting
- Giving the LLM a specific persona or identity
- Can prompt the model to be a Technical Writer to get a specific document
Contextual Prompting
- Providing more specific background information that's relevant to the task at hand
- Ex: To debug a code, we provide the code snippet, where the bug is, what libraries we are using
- The more context we provide, the more relevant the model's behaviour will be
Step Back Prompting
- First ask LLM to answer a broader question before diving into the detail of our specific question
Chain Of Thought Prompting
- Idea is to prompt the model to explicitly generate the intermediate reasoning steps before giving the final answer or code solution
- Instead of just getting the solution, we also see the model's thought process and how it arrived at the solution
- It uses more tokens → Higher cost
- Self-Consistency : Make the model generate multiple reasoning paths for the same prompt and then choose the most consistent answer across those paths (worth it for high-stakes tasks; see the sketch after this list)
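A sketch of self-consistency; `generate` is a stand-in for whatever LLM call is available (an assumption, not a real API), and responses are assumed to end with a line of the form “Answer: <value>”:

```python
# Self-consistency sketch: sample several chain-of-thought completions for the
# same prompt and keep the most common final answer (majority vote).
from collections import Counter

def extract_final_answer(text: str) -> str:
    # assumes the model was asked to finish with a line like "Answer: <value>"
    return text.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(prompt, generate, n=5, temperature=0.7):
    answers = [extract_final_answer(generate(prompt, temperature=temperature))
               for _ in range(n)]                     # n independent reasoning paths
    return Counter(answers).most_common(1)[0][0]      # most consistent answer wins
```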
Tree of Thoughts (ToT)
- A generalization of CoT: the LLM explores multiple reasoning paths simultaneously rather than a single chain. The model branches out
- More creative problem solving
ReAct (Reason And Act)
- Combining the LLM's reasoning capabilities with the ability to use external tools like search engines, code interpreters, or even APIs
- Example: the model uses a web search to gather some information and makes decisions accordingly (illustrative trace after this list)
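An illustrative ReAct-style trace (the question, the tool name `search`, and the observations are made up) showing the thought/action/observation loop:

```
Question: What is the capital of the country that hosted the 2016 Olympics?
Thought: I need to find which country hosted the 2016 Olympics.
Action: search("2016 Olympics host country")
Observation: The 2016 Summer Olympics were held in Rio de Janeiro, Brazil.
Thought: Now I need the capital of Brazil.
Action: search("capital of Brazil")
Observation: Brasília is the capital of Brazil.
Thought: I have the answer.
Final Answer: Brasília
```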
APE (Automatic Prompt Engineering)
- Automating the creation of effective prompts
- Ask the model to generate various prompt variations for a task, then evaluate those variations based on their performance
Code Prompting
- Prompts for writing particular code
- Prompts for explaining code
- Translating code from one language to another
- Prompts for debugging and reviewing code
Multimodal Prompting
- Inputs like images and audio
Best Practices
- Provide Examples
- Design with simplicity - Clear, simple, easy to understand
- Be specific about the desired output
- Use instructions over constraints (Focus on positive instructions)
- Control max. token length
- Use variables in prompts (dynamic prompts that we can reuse for various use cases)
- Use schema if you are working with specific data formatting
- Document the various prompt attempts - allows us to track our progress