Andrej Karpathy — Deep Dive into LLMs like ChatGPT

LLM
This three-and-a-half-hour video by Andrej Karpathy is a wonderful overview of how LLMs work and where the field currently stands. It took me a couple of days to watch and study it in depth.
Published

May 21, 2025

https://www.youtube.com/watch?v=7xTGNNLPyMI

Personal takeaways

  • Stages of creating an LLM:
    • Pre-training — like creating a large lossy snapshot of the web. Knowledge acquisition.
    • Supervised Fine-Tuning (SFT) — Learning how to chat and acquisition of a personality. Human driven learning. The creation of a human simulator. Huge numbers of example prompts and responses.
    • Reinforcement Learning (RL) — the latest and most exciting stage. Can be done iteratively and continuously. Uses questions and answers, with correct answers reinforced. Potentially can create LLMs that can have completely different thinking strategies to humans. Could surpass human understanding — which happened with AlphaGo (which used RL) and Move 37 in 2016. But can it work with stuff where you can’t easily rank fitness? e.g. write a joke/poem.
    • Reinforcement Learning from Human Feedback (RLHF) — A way to rank fitness for jokes/poem type responses. Real humans put responses in order (e.g. one to five) and a completely separate reward model is trained on that data. Then you use the reward model for further training of the main model. Works but only to an extent because the LLM can find ways to game the reward model, and it’s not possible to fix that because it’s impossible to make a reward model that can’t be gamed.
  • LLMs are amazing but make dumb and weird mistakes in an unpredictable way.
  • It’s a mistake to think that they work like human brains—they operate in a completely different way.
  • Potentially they can become much better than us at tasks where a correct answer can be quantified, e.g. math and programming. But creative stuff? We don't know.

ONE: Pre-training

FineWeb: decanting the web for the finest text data at scale

Step 1 - download and preprocess the internet

FineWeb

  • Curated dataset for LLM pre-training.
    • 15-trillion tokens.
    • 44TB. Not massive, but this is just text which has been aggressively filtered.
  • Made from Common Crawl snapshots.
    • Over 250 billion pages spanning 18 years.
    • Common Crawl is raw, unfiltered.
  • Block-lists are applied (porn, malware, racist websites etc).
  • HTML is converted to markdown/text-only.
  • Deduplication is applied (but not overdone).
  • Personally identifiable information is automatically removed (as far as possible).
  • (Companies like Google, OpenAI, Anthropic, crawl the web themselves and make their own datasets).
  • The aim is a diverse dataset of high-quality documents.
  • FineWeb is focused on English.
  • Anyone can download it.

The source data is getting contaminated by content generated by LLMs:

“while for relatively small trainings this data does not seem to harm performance (and might actually improve it), it is not clear that this holds for much larger trainings.”

HuggingFaceFW/fineweb · Datasets at Hugging Face

  • Interesting to look at the raw data — a very random snapshot of humanity!?

  • All of this text is just concatenated together into a massive wall of text: essentially a single, one-dimensional string.

  • The text data has patterns, so the next stage is to identify those patterns so that the model can generate text using them.

Step 2 - tokenization

  • View it as bits. It’s a very long sequence of 0s and 1s.
  • View as bytes - 8 times shorter.
  • Next run a byte-pair encoding algorithm
    • Look for consecutive pairs of bytes that are very common.
    • So for example 116, 32 is very common (the byte for "t" followed by a space).
    • You can do this multiple times to create more symbols (see the sketch after this list).
    • Turns out about 100k symbols is good to aim for.
    • These are called tokens.
    • ”Tokenization”.
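A minimal sketch of the byte-pair idea (mine, not Karpathy's code), using a toy string rather than web-scale data; the merge helpers and the new token id 256 are illustrative:

```python
from collections import Counter

def most_common_pair(ids):
    """Count consecutive pairs of symbols and return the most frequent one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` with a single new symbol `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "not a cat, not a hat, not a rat"
ids = list(text.encode("utf-8"))   # start from raw bytes (values 0-255)
pair = most_common_pair(ids)       # the most frequent adjacent byte pair
ids = merge(ids, pair, 256)        # mint a new symbol for it
print("merged pair:", pair, "| length:", len(text.encode("utf-8")), "->", len(ids))
```

Repeating the merge step grows the vocabulary of symbols; stopping at around 100k of them gives the token set described above.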

Tiktokenizer

  • Good for demonstrating tokenization.
  • Select cl100k_base
    • GPT-4 base model tokenizer.
    • Notice it is case sensitive. Case matters.
    • So there are about 100k different tokens (quick check below).
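If you have the tiktoken package installed you can check this directly; a quick sketch (cl100k_base is the real encoding name, the example strings are mine):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # the GPT-4 base model tokenizer

print(enc.n_vocab)                               # roughly 100k distinct tokens
print(enc.encode("Hello world"))                 # a short list of token ids
print(enc.encode("hello world"))                 # different ids: case matters
print([enc.decode([t]) for t in enc.encode("Hello world")])  # the text chunk behind each id
```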

Step 3 - Neural network training

  • Take your huge dump of tokens and more or less randomly extract “windows” of them - individual little sequences.
  • From zero up to some maximum size; in practice up to 8k tokens.
  • Then try to predict the token that comes next in the sequence.
  • The selected tokens are the context.
  • So for a particular sequence there are 100k possible next tokens, and therefore 100k probabilities for possible next values.
  • At the very beginning, the probabilities (“weights”) are randomized.
  • We know what comes next, so we can adjust the values, increasing the probability for the “correct” answer and decreasing others.

  • At the start predictions are random.
  • Over time the weights are adjusted, so the predictions become less random (see the sketch below).
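A minimal sketch of this step (my simplification, assuming PyTorch), with a deliberately tiny stand-in model rather than a real transformer; the point is just the loop of "take a window, predict the next token, nudge the probabilities toward the correct one":

```python
import torch
import torch.nn as nn

vocab_size, context_len = 1000, 8                  # toy numbers; real models use ~100k tokens, up to 8k context
data = torch.randint(0, vocab_size, (10_000,))     # stand-in for the huge dump of tokens

# Toy next-token predictor: embed the window, flatten it, produce one score per possible next token.
model = nn.Sequential(
    nn.Embedding(vocab_size, 64),
    nn.Flatten(),
    nn.Linear(64 * context_len, vocab_size),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(100):
    # Randomly extract windows of tokens; the token that follows each window is the target.
    ix = torch.randint(0, len(data) - context_len - 1, (32,)).tolist()
    x = torch.stack([data[i : i + context_len] for i in ix])   # (batch, context)
    y = torch.stack([data[i + context_len] for i in ix])       # the "correct" next token
    logits = model(x)                                          # (batch, vocab): scores over all possible next tokens
    loss = nn.functional.cross_entropy(logits, y)              # nudge up P(correct), nudge down the rest
    opt.zero_grad()
    loss.backward()
    opt.step()
```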

LLM Visualization

This is a transformer:

  • The transformer has no memory.
  • It is very different to how biological neurons work (they are much more complex).
  • The details aren’t that important.
    • It transforms inputs into outputs via mathematical processes.

Step 4: Inference

  • In inference, we generate new data from the model.
  • Give it a token, and it responds with a token chosen randomly but weighted by the predicted probabilities, so some responses are more probable than others.
  • We can now feed our two tokens in, and get a third, and so on (sketched below).
  • It’s important to keep in mind that this process is stochastic, not deterministic, so you will get different results from the same input.
    • Occasionally you can get a black swan token.
    • That can get fed back into the context and so you can get weird results.
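A minimal sketch of the inference loop for the toy model above (again my simplification, not from the video), showing the stochastic sampling and the feedback of each new token into the context:

```python
import torch

def generate(model, context, n_new_tokens, context_len=8, temperature=1.0):
    """Sample tokens one at a time; each draw is random but weighted by the model's probabilities."""
    tokens = list(context)                 # assumes the prompt already has at least context_len tokens
    for _ in range(n_new_tokens):
        window = torch.tensor(tokens[-context_len:]).unsqueeze(0)    # the last context_len tokens
        logits = model(window)[0] / temperature
        probs = torch.softmax(logits, dim=-1)                        # probabilities over the vocabulary
        next_token = torch.multinomial(probs, num_samples=1).item()  # stochastic: occasionally a "black swan" token
        tokens.append(next_token)                                    # it enters the context for the next step
    return tokens

# e.g. generate(model, list(range(8)), n_new_tokens=5) with the toy model from the training sketch
```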

Example: GPT-2

  • Released 2019
  • 1.6 billion params.
  • Max context length 1024 tokens.
  • Trained on about 100 billion tokens.
  • Training cost about 40k USD
    • would be much cheaper today
      • Higher quality datasets
      • Both hardware and software have improved a lot.
      • Could do it for about $100.

Models tend to run on GPUs like this:

GPUs are very optimized for the kind of matrix multiplication needed for creating models.

Lots of them are brought together in data centers:

Largest American companies by market capitalization

This process just creates a “next token producer”, which actually isn’t what we want. We want something that will answer questions / respond to prompts.

So more steps are needed.

When a base model is released, it consists of two things:

  1. A program (often a Python script) that does the processing (often just a couple of hundred lines of code).
  2. The parameters/weights — this is a bunch of numbers.

Llama 3

  • from Meta, 2024
  • 405 billion params, trained on 15 trillion tokens.
  • They released two versions:
    • Llama 3.1 405B (the base model)
    • Llama 3.1 405B Instruct (the trained model, that chats)

Hyperbolic GPU Marketplace: On-Demand NVIDIA GPU Rentals Starting at 0.16/hr

  • Using Llama 3.1 405B
    • ask a question, or put the start of a statement.
    • it “dreams the internet”.
    • It is stochastic — it will be different every time.
    • It can still be useful.
      • the 405B params are a kind of lossy compression of the whole internet.
      • Some prompts can result in a useful response. e.g. 
        • Here is my top 10 list of landmarks to see in Paris:
        • Note that we can’t totally trust details in the response.
      • Give it a multi-shot prompt and it may give useful responses, e.g. give it pairs of words with their translations, and it will give the correct translation for the final English word.
        • They have simple “in-context learning” (which isn’t really learning but just emergent from the data)
        • i.e. can follow patterns
    • Start with text from a Wikipedia article. It will continue it.
    • Try the first lines of "Let It Go".
      • Anthropic specifically gives this as an example in its system prompt of how not to respond…
    • High quality sources (like Wikipedia) are preferentially sampled from, which means the model is more likely to have compressed essentially the whole text.
    • Give it a text that it definitely hasn’t seen. E.g. start of today’s news article.
    • Ask it who the current president is (it has a cutoff date before Trump won.)
    • You can mimic chat by clever prompt engineering:
      • Make a few-shot prompt of a conversation between two people.
      • Only problem is that it will hallucinate the responses of both participants.

Karpathy’s notes:

Base model then goes to:

TWO: Post-training (SFT, Supervised Fine-tuning)

  • Cheaper than pre-training.
  • Making the model respond in a conversational manner.
  • Give it “personality”.
  • Make it refuse if necessary, and prevent unethical responses.

How is this done? With hundreds of thousands of example conversations: the ideal assistant responses to different prompts.

  • we replace the original training data with a new set of data, the example conversations.

  • The model adjusts its weights and "learns" how to respond differently than it did before.

  • Pre-training can take many months, post-training can be done in just hours.

  • The training dataset is much smaller than the original dataset.

  • First step - tokenization of conversations.

    • A little bit different than original training.
    • conversations are tokenized in a more structured way.

  • In this case, IM stands for Imaginary Monologue!
  • Here we see special new tokens.
  • So when ChatGPT sends the user's prompt to the model, it will end with something like this:

<|im_start|>assistant<|im_sep|>

  • And so that is what triggers the AI to respond (a rough rendering sketch follows).
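A rough sketch of how a conversation might be rendered into this format before tokenization (the special strings follow the example above; real chat templates differ between models):

```python
def render_conversation(messages):
    """Flatten role/content messages into the special-token chat format, ending with an open
    assistant turn, so that "continue the sequence" means "write the assistant's reply"."""
    rendered = ""
    for message in messages:
        rendered += f"<|im_start|>{message['role']}<|im_sep|>{message['content']}<|im_end|>"
    return rendered + "<|im_start|>assistant<|im_sep|>"

print(render_conversation([{"role": "user", "content": "What is 2 + 2?"}]))
# <|im_start|>user<|im_sep|>What is 2 + 2?<|im_end|><|im_start|>assistant<|im_sep|>
```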

One of the first papers about this method is from 2022:

3.4 Human data collection To produce our demonstration and comparison data, and to conduct our main evaluations, we hired a team of about 40 contractors on Upwork and through ScaleAI. … Our aim was to select a group of labelers who were sensitive to the preferences of different demographic groups, and who were good at identifying outputs that were potentially harmful. … During training we prioritize helpfulness to the user … in our final evaluations we asked labelers to prioritize truthfulness and harmlessness (since this is what we really care about) … we hire a separate set of labelers who do not produce any of the training data.

  • The humans who do the data generation have to study a huge document (hundreds of pages) outlining how they should do it.
  • A very laborious manual process.
  • Note that OpenAI never released their data, but there are some open training sets for this type of thing.
  • The model takes on the persona of the training data.
  • Called “programming by example”.

Example of this type of training data: OpenAssistant/oasst1 · Datasets at Hugging Face

  • Now there are new methods to do this.
  • The LLMs themselves can be used to create the datasets.
  • So for example an LLM will come up with an answer, and then the human trainer will edit the answer to make it better.

A newer dataset, made with the help of LLMs:

https://github.com/thunlp/UltraChat?tab=readme-ov-file

These datasets can now be pretty huge:

ultrachat | Atlas Map

  • This dispels some of the magic of talking to an AI.
    • What is returned is statistically aligned to the training set.
    • The training set is produced by humans following instructions (hundreds of pages).
    • This isn't magical intelligence; it is statistically imitating human language, and so in effect imitating the instructions that were given to those humans.
    • It is a simulation of a human labeler.
    • What would a human labeler say?
    • So if you manage to find a question construct that it hasn’t been trained on, it will probably do badly.
      • e.g. “What does ‘as green as an elephant’ mean?”
    • The human labelers are often experts in their field.
    • You are speaking to a simulation of those people.
    • If the question you ask is in the training set, the answer you get will be very similar to the answer the expert gave.
    • If not, the answer will come from the pre-training data combined with the post-training data (or it will search).

LLM Psychology

What are emergent effects of this training?

Hallucinations

  • Ask a question about someone (or some place or whatever) who doesn’t exist.
  • Ask about a quote or expression that doesn’t exist.
  • Ask rare questions.

Gives a confident answer.

The training set does not contain lots of questions to which the answer is “I don’t know” so it doesn’t answer like that. In the training set an answer is always given.

Try it in the inference playground: Inference Playground - a Hugging Face Space by huggingface. Try multiple times with the same model (Andrej uses falcon-7b-instruct). It samples from the probabilities: it "makes stuff up".

In ChatGPT say “Do not use any tools” to prevent it from using search.

How can we prevent this?

Mitigation number one

In this paper Meta show how they train the model to answer "I don't know" in cases where it doesn't know the answer. They did this by:

  • taking samples from the training data.
  • getting an LLM to generate say three questions for that sample.
  • Ask the model the questions. Again using an LLM, compare each answer to the correct answer from the previous step. Do this multiple times (say three) for each question.
  • For the questions where the model does not know the answer, new question-and-answer training data is created with the answer "I don't know".
  • Train the model in this way and eventually it will be able to answer "I don't know" (a rough sketch of the pipeline follows this list).
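A rough sketch of that pipeline; generate_qa, ask and same_answer are hypothetical stand-ins for LLM calls, and this is my paraphrase of the steps, not Meta's code:

```python
def build_idk_examples(documents, generate_qa, ask, same_answer,
                       questions_per_doc=3, attempts=3):
    """Collect (question, "I don't know") training pairs for facts the model cannot reliably recall."""
    new_training_data = []
    for doc in documents:
        # 1. Use an LLM to generate factual question/answer pairs from the document.
        for question, reference_answer in generate_qa(doc, questions_per_doc):
            # 2. Ask the model the question several times; judge each attempt with an LLM.
            correct = [same_answer(ask(question), reference_answer) for _ in range(attempts)]
            # 3. If the model never gets it right, add an "I don't know" training example.
            if not any(correct):
                new_training_data.append((question, "I don't know."))
    return new_training_data
```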

Mitigation number two

  • use tools.
  • The model has special tokens for tool use:
    • <search_start>search<search_end>
    • The returned result from the tool enters the context for the model.
    • The context window is like the working memory of the model.
    • So the model has to be trained with examples of using these new tokens… (see the sketch after this list).
    • You can add to your prompt “use web search to make sure”.
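A rough sketch of what the inference harness might do with those tokens; the <search_result> wrapper and the web_search function are my assumptions, only the idea that the tool output enters the context comes from the video:

```python
import re

def run_with_search(model_generate, prompt, web_search):
    """If the model emits a <search_start>query<search_end> span, run the search,
    append the result to the context, and let the model continue from there."""
    context = prompt
    while True:
        output = model_generate(context)
        context += output
        match = re.search(r"<search_start>(.*?)<search_end>", output, re.DOTALL)
        if match is None:
            return context                      # no tool call: the answer is complete
        result = web_search(match.group(1))     # the tool runs outside the model
        # The result enters the context window, i.e. the model's working memory.
        context += f"<search_result>{result}<search_result_end>"
```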

Better prompting by giving more info in the context

If you want the LLM to talk about specific things, it is better to provide data about that specific thing in the context window rather than relying on its vague memory. e.g. if you want it to summarize a chapter of a book, actually provide it with that chapter in the context.
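For example, a minimal sketch of the idea (the delimiters and wording are mine): paste the actual chapter into the prompt instead of asking the model to recall it.

```python
# Put the real text into the context window instead of relying on the model's vague memory.
chapter = "...the full text of the chapter, pasted in here..."

prompt = (
    "Summarize the following chapter in five bullet points.\n\n"
    "--- CHAPTER START ---\n"
    f"{chapter}\n"
    "--- CHAPTER END ---"
)
```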

Do some tests of this.

LLMs knowledge of self

  • People ask things like “What models are you? Who built you?”
  • The LLM won’t know unless it has this in its training.
  • If it doesn’t know, you may get a hallucination.
  • It's possible to "hard code" such answers into models by training on the same question/answer pair multiple times, or by putting them into a "system message", which is part of the context but not visible to the user.

Computational abilities: “Models use tokens to think”

  • Models are probabilistic not deterministic, so not good at math by default.
  • Since models generate text one token at a time, left to right, anything that comes after an answer is just post-hoc justification.

The answer is $3. This is because 2 oranges at $2 are $4 total. So the 3 apples cost $9, and therefore each apple is 9/3 = $3

Everything after the $3 was generated afterwards. The $3 was generated in a single pass through the model. It then enters the context for the rest of the answer. So it’s not actually saying how it calculated the answer.

A better answer gives the calculation first. So it’s better not to train models to give an answer immediately:

“The total cost of the oranges is $4. 13 - 4 = 9, the cost of the 3 apples is $9. 9/3 = 3, so each apple costs $3. The answer is $3”.

  • This has essentially spent a lot more compute on the answer.
  • Some people think it’s good to say “just give the answer” but there is a reason why LLMs give verbose answers.
  • Karpathy gives an example with GPT-4o where it gives the wrong result to a fairly simple math question if told just to answer immediately, but the correct answer if given time to work it out:

  • But actually, models are not good at calculations; it is better if they use tools (e.g. "use code"; see the example after this list).
  • They are also very bad at counting (how many “r”s in strawberry).
  • They are not good at spelling. They don’t see letters, they see tokens, which can be multiple letters. So if you ask it to e.g. print every third character, it will screw up.
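For example, instead of doing the arithmetic or the counting "in tokens", the model can emit a small snippet like this and let a code tool run it (my illustration, reusing the apples-and-oranges numbers from above):

```python
# Arithmetic from the worked example: $13 total, 2 oranges at $2 each, 3 apples.
oranges_cost = 2 * 2
apples_cost = 13 - oranges_cost
print(apples_cost / 3)              # cost of each apple: 3.0

# Counting and spelling are easy in code but hard token-by-token.
print("strawberry".count("r"))      # 3
print("strawberry"[::3])            # every third character: "saey"
```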

There are other areas where models don’t work well, and it’s not well understood why.

This error may be the result of Bible verses in the training data!

THREE — Post-training (RL, Reinforcement Learning)

“Taking LLMs to school”

  • Exposition (background knowledge) == pre-training
  • Worked problems (examples) == supervised fine-tuning
  • Practice problems == reinforcement learning

  • For a single question, there may be multiple correct answers.
  • An answer with a good explanation can be better than just the answer.
  • Example of multiple correct answers:

Emily buys 3 apples and 2 oranges. Each orange costs $2. The total cost of all the fruit is $13. What is the cost of each apple?

Set up a system of equations. x = price of an apple. 3x + 2*2 = 13, 3x + 4 = 13, 3x = 9, x = 3.

The oranges cost 2*2 = 4. So the apples cost 13 - 4 = 9. There are 3 apples. So each apple costs 9/3 = 3.

13 - 4 = 9, 9/3 = 3.

(13 - 4)/3 = 3.

Answer: $3

  • Which is the best solution to give to the LLM? We don’t really know.
  • So ideally, rather than giving it a “right” answer, we want the LLM to find out which is the best answer for it.

  • Take the best (or correct) solutions, four in this case, and train on those (a rough sketch of this loop follows the list).
  • Parameters are updated slightly.
  • RL is a new stage that has only recently been applied to LLMs.
  • It is more complex than shown here.
  • It is why the paper from DeepSeek was such a big deal, as it went into detail about things that other companies (like OpenAI) were doing in secret:
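Coming back to the basic loop itself, here is a rough sketch of the "generate many attempts, keep the ones that reach the right answer, train on those" idea; generate, is_correct and train_on are hypothetical helpers, and real setups like DeepSeek-R1's are far more involved:

```python
def rl_on_verifiable_problems(problems, generate, is_correct, train_on, attempts=16):
    """One round of the simplest RL-style loop: sample many solutions per problem,
    keep the ones that reach the verified answer, and reinforce those."""
    keep = []
    for problem, answer in problems:
        solutions = [generate(problem) for _ in range(attempts)]   # e.g. 16 attempts per problem
        good = [s for s in solutions if is_correct(s, answer)]     # e.g. 4 of them reach the right answer
        keep.extend((problem, s) for s in good)                    # the model's own best solutions
    train_on(keep)                                                 # parameters are updated slightly
    return keep
```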

The types of math problems used in the training are here:

Art of Problem Solving

The length of the response increases as accuracy goes up:

Responses get longer because the model learns to backtrack and reframe problems, effectively going over its workings multiple times. This is emergent, not programmed; it simply works better for the model. These are "chains of thought": the model discovers cognitive strategies.

https://chat.deepseek.com/

The AI Acceleration Cloud - Fast Inference, Fine-Tuning & Training

  • GPT-4o is an SFT model (more or less)
  • OpenAI o3 is an RL model
    • OpenAI does not show the full chain-of-thought, it shows summaries of them. Perhaps because they don’t want companies using them for RL.
  • Gemini Flash is also a thinking model.
  • RL was used when training AlphaGo (which plays Go). So not invented by DeepSeek, they just did it publicly for LLMs.
  • AlphaGo shows that RL can be better than humans.
    • If we scale RL, can we go beyond human reasoning? Perhaps.
    • It could drift from the distribution of its training data and use unique solutions, for example coming up with its own language or methods.
  • What about problems where it is very hard to score solutions? e.g.
    • Write a joke about pelicans.
    • Write a poem/story whatever.
  • What could be an automatic strategy for doing this?

RLHF (Reinforcement Learning from Human Feedback)

From this paper:

  • For doing RL in unverifiable domains.

The reward model just returns scores.
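A toy sketch of how such a reward model could be trained from the human orderings (the responses here are stand-in feature vectors and the pairwise loss is a standard Bradley-Terry-style choice; real reward models are full transformers that read the prompt and response):

```python
import torch
import torch.nn as nn

# Stand-in reward model: maps a response representation to a single scalar score.
reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

def ranking_loss(ranked_features):
    """Given responses ordered best-to-worst by a human, push earlier scores above later ones."""
    scores = reward_model(ranked_features).squeeze(-1)             # one score per response
    loss = torch.tensor(0.0)
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # Pairwise term: the better-ranked response should get the higher score.
            loss = loss - nn.functional.logsigmoid(scores[i] - scores[j])
    return loss

ranked_features = torch.randn(5, 128)   # e.g. 5 candidate poems, already put in order by a human labeler
loss = ranking_loss(ranked_features)
opt.zero_grad()
loss.backward()
opt.step()
```

The main model is then trained to produce responses that this reward model scores highly, which is exactly where the "gaming" problem described below comes from.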

RLHF upsides

  • We can run RL in arbitrary domains, even unverifiable ones.
  • This improves the performance of the model, possibly due to the "discriminator-generator gap" (Karpathy's idea)
    • In many cases it is much easier to discriminate than to generate.
    • e.g. “Write a poem” vs. “Which of these 5 poems is best?”

RLHF downsides

  • We are doing RL with respect to a lossy simulation of humans. The lossy simulation might be misleading and not actually reflect a human with a brain.
  • RL is very good at discovering ways to "game" the reward model, which can result in nonsensical output. RL is just trying to find the best-scoring of, say, five candidate responses, so all five could be rubbish, yet the reward model still gives one of them a good score. These are called "adversarial examples". There is effectively an infinite number of them hiding in the reward model, and they can't be "trained out".
  • At first the process might improve the main model, but after a while you start to get nonsensical results and the improvements drop dramatically.
  • So you can’t overuse this method. You have to just run it a limited number of times.
  • You can run RL indefinitely and keep getting improvements. That is not true of RLHF.
  • So RLHF is really just a bit of fine-tuning: it buys a little more performance, but it's not a magic bullet in the way "real" RL is.

Use them as tools. Use them for first drafts. Use them for inspiration. But don’t completely trust them, and be ultimately responsible for the product of your work.

The future

  • Multimodal (not just text but audio, images, video, natural conversations).
  • tasks -> agents (long, coherent, error-correcting contexts)
  • Pervasive, invisible.
  • Computer-using.
  • test-time training? etc.

(He doesn’t mention diffusion based LLMs).

How to keep up to date

Chatbot Arena

Seems to be gamed a bit, not as good as it used to be.

[Me] Compare to openrouter.ai LLM Rankings

Discover, download, and run local LLMs

Summary

  • Your query is tokenized.
  • It goes into the conversation format, and the model responds according to its training.
  • “Learning facts” comes from pre-training. It’s like compressing the whole internet into a fuzzy/lossy file.
  • “Personality” comes from Supervised Fine-Tuning, where the original model is refined to respond in a conversational manner, based on training with a huge number of example prompt/responses. Fundamentally a human training process, by humans teaching by example.
  • LLMs do not work in the same way as human brains. They work in a very different way. What is easy or hard for them is very different to what is easy or hard for a human.
  • An LLM is a lossy simulation of a human.
  • They are like Swiss cheese — bad at certain things in an unpredictable way. There are holes in what they can do.
  • The thinking models do change things a bit.
    • They have undergone RL, and so have “thinking strategies”
    • These could potentially be better than human strategies.
    • They have emergent properties from simulated thinking, so that is new and exciting. In principle, these models can do stuff that no human has thought of before.
    • But can RL be done for creative tasks? We don’t know. Early days yet.