Large Language Models Simplified: Part 1 – Five Formulas That Define Their Functionality
Large Language Models (LLMs) are extraordinary feats of engineering, powering everything from chatbots to scientific research. But how do these models actually work? Welcome to the first part of our series, “Large Language Models Simplified”, where we break down their mechanisms one step at a time.
This first installment explores the five foundational formulas that define LLM functionality: generation, memory, efficiency, scaling, and reasoning. Future parts in this series will delve deeper into optimization and fine-tuning, applications, and the cutting-edge advancements shaping the AI world.
Explaining LLMs to a Five-Year-Old
Before diving into the technical details, let’s try to explain Large Language Models (LLMs) in a way even a five-year-old could understand:
LLMs are like super-smart robot friends who love words. Imagine a robot that’s read millions of books, stories, and jokes and uses all that knowledge to guess what comes next in a sentence or answer your questions.
For example:
- You: “Why did the chicken cross the road?”
- Robot: “To get to the other side!”
The robot doesn’t “know” things like we do—it’s simply really good at guessing. It works like this:
- Playing a Word Game: The robot tries to guess the next word in a sentence, like a puzzle. The better it gets at guessing, the smarter it becomes!
- Remembering Stories: If you say, “Once upon a time, there was a…,” the robot remembers that words like “princess” or “castle” often come next.
- Paying Attention: For longer stories, the robot focuses on the most important parts. For example, in “Sarah lost her dog, but she found it later,” the robot knows “dog” is a key idea.
This simple explanation highlights the main ideas behind LLMs—patterns, memory, and attention. Now that we’ve met our clever robot friend, let’s explore the formulas that make it all possible!
1. Generation: Modeling Text Using Probabilities
At their core, LLMs are probabilistic models that predict the likelihood of sequences of words. They take in a sequence of tokens (words or subwords) and calculate the probability of the next token; the chain rule of probability then ties these per-token predictions into the probability of the whole sequence:
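p(w_1, w_2, …, w_T) = ∏_{t=1}^{T} p(w_t | w_1, …, w_{t−1})

Each factor p(w_t | w_1, …, w_{t−1}) is exactly what the model learns to estimate: the probability of the next token given everything that came before it.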
Example: Text Prediction
Imagine this sentence:
“The cat sat on the ___.”
The model calculates the probabilities of possible next words like this:
- p(“mat” | “The cat sat on the”) = 0.65
- p(“dog” | “The cat sat on the”) = 0.05
- p(“roof” | “The cat sat on the”) = 0.15
The word “mat” has the highest probability and is selected.
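As a minimal sketch of this selection step (the numbers are the illustrative probabilities above, not real model output):

```python
# Illustrative next-token probabilities for "The cat sat on the ___"
next_token_probs = {"mat": 0.65, "roof": 0.15, "dog": 0.05}

# Greedy decoding: pick the single most probable token.
next_token = max(next_token_probs, key=next_token_probs.get)
print(next_token)  # -> mat
```

Real systems often sample from this distribution rather than always taking the top choice, which keeps generated text from becoming repetitive.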
Why Perplexity Matters
Perplexity quantifies how well the model predicts the sequence. Lower perplexity means the model is better at predicting text. For instance:
- A model with a perplexity of 20 performs better than one with a perplexity of 100 for the same dataset.
- A perplexity of 30 on the sentence above would mean the model is, on average, about as uncertain as if it were choosing among 30 equally likely next tokens at each step: a usable model, but with clear room for improvement.
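To make the number concrete, here is a small sketch of how perplexity is computed from per-token probabilities (the probabilities below are made up for illustration):

```python
import math

# Hypothetical probabilities a model assigned to each token of a 5-token sentence.
token_probs = [0.40, 0.10, 0.25, 0.05, 0.30]

# Perplexity = exp( -(1/N) * sum(log p_t) ): the exponential of the
# average negative log-likelihood per token.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(round(perplexity, 1))  # ~5.8: on average as uncertain as picking among ~6 options
```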
Practical Application
In applications like autocomplete, LLMs use this predictive power to suggest words or phrases as you type. If you’ve used Google Docs’ Smart Compose, you’ve seen this in action.
2. Memory: Attention and Context
Traditional models struggled to maintain long-term context, but attention mechanisms revolutionized this. Attention enables LLMs to “focus” on relevant parts of the input sequence.
The Attention Formula
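In its standard scaled dot-product form, attention is computed as

Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V

where: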
- Queries (Q): What we are trying to find.
- Keys (K): Where to look.
- Values (V): The information to retrieve.
- d_k: The dimensionality of the keys; dividing by √d_k keeps the scores numerically stable.
Example: Understanding Context
Suppose you have this sentence: “John threw the ball to Peter, and he caught it.”
To determine who “he” refers to:
- Q: “he”
- K: All previous words in the sentence.
The attention mechanism calculates similarity scores for “he” with every other word:
- score(“he”, “John”) = 0.2
- score(“he”, “Peter”) = 0.8
After normalization via softmax, most of the attention weight falls on “Peter”, so the model links “he” to Peter and correctly understands who caught the ball.
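Here is a tiny sketch of that normalization step, reusing the illustrative scores above:

```python
import math

# Illustrative raw attention scores for the query "he".
scores = {"John": 0.2, "Peter": 0.8}

# Softmax turns raw scores into positive weights that sum to 1.
exps = {word: math.exp(s) for word, s in scores.items()}
total = sum(exps.values())
weights = {word: e / total for word, e in exps.items()}
print(weights)  # {'John': ~0.35, 'Peter': ~0.65} -> most of the weight goes to "Peter"
```

With real vectors, these weights would then mix the value vectors V into the updated representation of “he”.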
3. Efficiency: Leveraging GPUs
Training LLMs involves billions of parameters and trillions of operations. GPUs make this feasible by efficiently handling the GEMM (General Matrix Multiplication) operation:
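In its basic form, multiplying A (m × k) by B (k × n) produces C (m × n) with entries

C[i][j] = Σ_k A[i][k] · B[k][j]

Attention projections, attention scores, and feed-forward layers all reduce to matrix products of this form.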
Memory Hierarchy Optimization
Instead of fetching data repeatedly from slower global memory, GPUs:
- Load smaller chunks into faster shared memory.
- Perform computations on these chunks.
- Repeat until the entire task is completed.
Example
Imagine you are computing attention over a sequence of 10,000 tokens. Without GPU optimization, you’d repeatedly fetch the same data from slower memory. GPUs optimize this by:
- Loading smaller blocks of data into shared memory (faster).
- Performing computations on these smaller, manageable pieces.
For instance:
- Naive Method: Compute directly from global memory.
- Optimized Method: Load a 32 x 32 block of the matrix into shared memory, compute, and repeat.
This optimization reduces training time dramatically, enabling modern LLMs like GPT-4 to scale.
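Here is a simplified NumPy sketch of the blocking idea. Real GPU kernels express this in CUDA with explicit shared-memory loads; this version only illustrates the access pattern:

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Block-tiled matrix multiply: walk over the output in tile x tile chunks,
    mimicking how a GPU kernel stages small blocks in fast shared memory."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # On a GPU, these small blocks would sit in fast on-chip memory.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```

Each block is reused many times once loaded, which is exactly what makes staging it in fast memory pay off.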
4. Scaling: Bigger Isn’t Always Better
Scaling laws reveal that performance improves predictably with model size and training data. But indiscriminate scaling wastes resources.
Chinchilla Scaling Law
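The Chinchilla paper models the loss with a parametric form along these lines:

L(N, D) = E + A / N^α + B / D^β

where: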
- L(N, D): Loss as a function of model size (N) and dataset size (D).
- A, B, E: Empirical constants.
- α, β: Scaling exponents.
Example: Optimal Scaling
Imagine you have a fixed compute budget. Should you:
- Double model parameters (N)?
- Double the dataset size (D)?
Chinchilla suggests scaling both in equal proportion: splitting a fixed compute budget evenly between model size and dataset size yields better performance than heavily favoring one over the other.
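A toy comparison, with symmetric and purely illustrative constants (not the fitted values from the Chinchilla paper), shows the shape of this trade-off:

```python
# Toy Chinchilla-style loss: L(N, D) = E + A / N**alpha + B / D**beta.
# The constants below are symmetric and purely illustrative.
E, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.3, 0.3

def loss(N, D):
    return E + A / N**alpha + B / D**beta

budget = 1e18  # toy compute proxy: only allocations with N * D == budget are allowed

print(loss(4e9, budget / 4e9))      # favor parameters -> ~3.44
print(loss(2.5e8, budget / 2.5e8))  # favor data       -> ~3.44
print(loss(1e9, budget / 1e9))      # balanced split   -> ~3.30 (lowest of the three)
```

Skewing in either direction spends part of the budget on a term that has already hit diminishing returns.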
5. Reasoning: Simulating Algorithms
LLMs don’t just predict text; they can also simulate reasoning tasks, as formalized by frameworks like RASP (Restricted Access Sequence Processing). RASP demonstrates how LLMs can replicate algorithmic behavior.
Example: Associative Memory
RASP can determine whether a token has appeared earlier in the sequence.
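Informally (this is the idea of the check, not literal RASP syntax), the computation at each position i is:

seen_before(i) = 1 if there exists j < i with token_j = token_i, otherwise 0

Roughly, this corresponds to RASP’s select over earlier positions holding a matching token, followed by an aggregate.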
For the input “apple, banana, apple, cherry”, this check gives:
- “apple”: the second occurrence (index 2) finds the earlier one at index 0.
- “banana” and “cherry”: no prior occurrences are found.
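A tiny Python stand-in for this behavior (just the logic, not an actual RASP program):

```python
def seen_before(tokens):
    """For each position, report the earliest earlier index holding the same
    token (None if the token has not appeared before)."""
    result = []
    for i, tok in enumerate(tokens):
        earlier = [j for j in range(i) if tokens[j] == tok]
        result.append((tok, earlier[0] if earlier else None))
    return result

print(seen_before(["apple", "banana", "apple", "cherry"]))
# [('apple', None), ('banana', None), ('apple', 0), ('cherry', None)]
```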
Practical Summary
To bring this closer to reality:
- Generation: Think of a chatbot that predicts and responds with the next logical sentence.
- Attention: Imagine translating a paragraph, where attention ensures every word’s meaning is preserved.
- Efficiency: Training GPT-3 in weeks rather than years thanks to GPU optimizations.
- Scaling: Allocating compute wisely so that models are both powerful and cost-effective.
- Reasoning: LLMs mimic basic algorithms, opening doors to more complex capabilities.
As LLMs evolve, understanding their inner workings is vital for leveraging them responsibly and creatively.
What’s Next?
This introduction lays the foundation for understanding LLMs through five essential formulas. In coming blogs, we’ll explore:
- Optimization and Fine-Tuning Techniques: How to train faster, cheaper, and better.
- Applications: Real-world use cases for LLMs, from chatbots to code generation.
- Emerging Trends: New architectures, interpretability, and ethical considerations.
Stay tuned for the next installment of “Large Language Models Simplified”, where we’ll go even deeper into the world of LLMs.
Acknowledgment of Inspiration
The structure of this blog is inspired by a fantastic YouTube tutorial by Sasha Rush, originally developed for an invited session at the Harvard Data Science Initiative. The tutorial breaks down the complexity of LLMs into five key areas: generation, memory, efficiency, scaling, and reasoning, presenting these concepts in a high-level and intuitive manner.
This blog aims to build on that framework, presenting the ideas in a simplified and detailed way, with practical examples and technical explanations to help readers gain a deeper understanding of LLMs.
For further exploration, we encourage you to check out the original tutorial:
- Video: Building Intuition About LLMs
- Slides: Excalidraw