Day 1 - AI Engineering Journey - How Large Language Models (LLMs) Actually Work


How Large Language Models (LLMs) Actually Work — Explained Simply (But Correctly)

This is Day 1 of my AI engineering series. In this post I will mostly cover how LLMs actually work.


Most people say things like “LLMs understand language” or “they think like humans.”
That sounds nice — but it’s not how they actually work.

Let’s break it down properly.


The Core Idea

At its heart, an LLM is not thinking, reasoning, or understanding.

It is a system that does one thing extremely well:

It predicts the next token given the previous tokens.

That’s it.

Everything else — conversations, code, reasoning — emerges from this single capability.


Step 1: Text → Tokens

Before processing, your input is converted into tokens.

Tokens are not exactly words. They can be:

  • Full words → cat
  • Parts of words → un, believ, able
  • Symbols → +, =, ;

Example (illustrative; the exact split depends on the tokenizer):

"unbelievable" → ["un", "believ", "able"]

This matters because:

  • API pricing is usually based on token counts
  • Models have token limits (context window)
  • Text that splits into many small tokens (rare words, unusual formatting) can hurt output quality
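
To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `tokenize` are made up for illustration; real tokenizers (for example, BPE-based ones) learn their vocabulary from data and will split text differently.

```python
# Toy greedy longest-match tokenizer.
# VOCAB is a hypothetical vocabulary; real ones have tens of thousands of entries.
VOCAB = {"un", "believ", "able", "cat", "+", "=", ";", " "}

def tokenize(text, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(len(tokenize("cat = cat + cat;")), "tokens")
```

Counting tokens this way, rather than counting words, is what cost and context-limit estimates should be based on.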






Step 2: Transformer Processes the Input

Once tokenized, the input is passed into a transformer model.

The transformer:

  • Processes all tokens in the sequence in parallel, rather than one at a time
  • Uses attention to weigh how relevant the other tokens are to each token
  • Builds a context-aware representation of every token in the input

Example:

“The animal didn’t cross the road because it was tired.”

With attention, the model can link:

  • “it” to animal, not road
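
As a rough sketch of what attention computes, the snippet below implements scaled dot-product attention in plain Python, with a causal restriction so each position only sees itself and earlier positions. The 2-dimensional vectors are invented for illustration; in a real model the queries, keys, and values are learned projections with hundreds or thousands of dimensions.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention; position i attends to positions 0..i."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        weights = softmax(scores)
        # Each output is a weighted average of the visible value vectors.
        out = [sum(w * values[j][k] for j, w in enumerate(weights))
               for k in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical vectors for three tokens from the example sentence.
tokens = ["animal", "road", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]  # the "it" query points toward "animal"
values = keys
outputs, weights = causal_self_attention(queries, keys, values)
for tok, w in zip(tokens, weights[2]):
    print(f"it -> {tok}: {w:.2f}")
```

With these made-up vectors, "it" puts more weight on "animal" than on "road". A trained model arrives at weights like this through learning, not hand-picked numbers.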




Step 3: Predicting the Next Token

The model does NOT generate full sentences in one go.

Instead, it calculates:

What is the probability of each possible next token?

Example (illustrative numbers; the rest of the probability is spread over all other tokens):

"The sky is ___"

blue → 0.7  
green → 0.1  
falling → 0.05  
pizza → 0.001  
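
Under the hood, the model produces a raw score (a logit) for every token in its vocabulary, and a softmax turns those scores into probabilities. Here is a minimal sketch; the logit values and the four-token vocabulary are invented for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the prompt "The sky is ___"
logits = {"blue": 5.0, "green": 3.0, "falling": 2.3, "pizza": -1.5}
probs = softmax(logits)
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok:8s} {p:.3f}")
```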

Step 4: Sampling (Why Outputs Change)

The model doesn’t always pick the highest-probability token.

Instead, it samples based on parameters like:

  • Temperature
    • Low → safer, close to deterministic
    • High → creative, risky
  • Top-k / Top-p
    • Restrict which tokens can be chosen

This is why:

The same prompt can produce different outputs
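
Here is a minimal sketch of how these parameters could fit together. The function name `sample_next_token`, its arguments, and the logit values are assumptions for illustration; real inference libraries differ in details such as the order of the filters and whether probabilities are renormalized between them.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    # Temperature 0 is treated as greedy decoding: always take the top token.
    if temperature <= 0:
        return max(logits, key=logits.get)
    # Dividing by temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((t, e / total) for t, e in exps.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices normalizes the weights, so they need not sum to 1.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for "The sky is ___"
logits = {"blue": 5.0, "green": 3.0, "falling": 2.3, "pizza": -1.5}
print([sample_next_token(logits, temperature=1.5) for _ in range(5)])
print([sample_next_token(logits, temperature=0) for _ in range(5)])
```

Running the first line several times will usually give different lists, while the second always gives the same one.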

Step 5: Repeat the Loop

Once a token is selected:

  1. It gets appended to the sequence
  2. The model runs again
  3. Predicts the next token

This loop continues until the model emits a special end-of-sequence token or reaches a maximum output length.
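
The loop can be sketched with a toy stand-in for the model. `TOY_MODEL` below is a hypothetical lookup table that only looks at the last token; a real LLM conditions on the whole sequence, but the append-and-repeat structure is the same.

```python
import random

# Hypothetical "model": maps the last token to next-token probabilities.
TOY_MODEL = {
    "the": {"sky": 0.6, "road": 0.4},
    "sky": {"is": 1.0},
    "road": {"is": 1.0},
    "is": {"blue": 0.7, "long": 0.3},
    "blue": {"<end>": 1.0},
    "long": {"<end>": 1.0},
}

def generate(prompt_tokens, max_new_tokens=10, rng=random):
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = TOY_MODEL[sequence[-1]]           # 1. run the model
        next_token = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if next_token == "<end>":                 # 2. stop on end-of-sequence
            break
        sequence.append(next_token)               # 3. append and go again
    return sequence

print(" ".join(generate(["the"])))
```

The `max_new_tokens` cap matters in practice: without it, a model that never emits the end token would generate forever.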


Why LLMs Hallucinate

LLMs are not optimized for truth.

They are optimized for:

Generating the most probable continuation

So if something sounds right, the model may generate it — even if it’s wrong.

Reasons include:

  • No real-world grounding
  • Imperfect training data
  • No built-in verification system

Context Window Limitation

LLMs can only process a limited number of tokens at once.

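If the input is too large, the application has to cut it down (or the request is rejected). When it is cut down:

  • Older parts usually get truncated
  • Important context can be lost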

Even within limits:

  • Too much information → relevant details are harder for the model to pick out → poorer answers
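
One simple way an application might handle truncation is to keep only the most recent tokens that fit. The helper `fit_to_context` and its parameters are hypothetical; real systems often use smarter strategies such as summarizing or retrieving only the relevant parts.

```python
def fit_to_context(tokens, max_tokens, reserve_for_output=0):
    """Keep the most recent tokens that fit, leaving room for the reply."""
    budget = max_tokens - reserve_for_output
    if budget <= 0:
        raise ValueError("no room left for input tokens")
    if len(tokens) <= budget:
        return list(tokens)
    # Drop the oldest tokens; whatever they contained is invisible to the model.
    return list(tokens[-budget:])

history = list(range(10))  # stand-in for 10 token IDs
print(fit_to_context(history, max_tokens=8, reserve_for_output=3))  # [5, 6, 7, 8, 9]
```

Reserving part of the window for the output is the design choice to notice here: the prompt and the generated reply share the same token limit.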

Final Mental Model

If you remember just one thing, remember this:

An LLM converts text into tokens, processes them using a transformer, predicts the probability of each possible next token, selects one using sampling, and repeats this process step by step to generate output.

Why This Matters

Understanding this unlocks:

  • Better prompt engineering
  • Building RAG systems
  • Designing AI agents
  • Debugging hallucinations

Final Thought

LLMs don’t “know” things.

They are incredibly powerful pattern predictors.

And once you understand that — you stop using them blindly, and start using them like an engineer.



If you're learning AI engineering, this is your foundation. Everything else builds on top of this.



