Day 13 - AI Engineering - Building a RAG Pipeline

I Built a RAG Pipeline From Scratch. The LLM Was the Least Important Part.

AI Engineering Journey - Day 13



After 10+ years of building software, I thought I had a decent mental model of how systems work. Databases store data. APIs serve it. Frontends render it. Clean separation of concerns.

Then I started learning AI engineering, and for the first few days, my mental model was embarrassingly simple:

User types question -> LLM thinks hard -> Answer appears

That's it. That was my entire understanding of how ChatGPT-like systems work.

Today I built an actual RAG pipeline from scratch - embeddings, vector search, retrieval, prompt construction, LLM generation - everything wired together. And the thing that hit me hardest?

The LLM is the dumbest part of the system. It just generates text from whatever you hand it. All the intelligence is in what you choose to hand it.

Let me walk you through what I built, what broke, and why I think differently now.


What I Actually Built

No frameworks. No LangChain. No abstractions. Just raw Python wiring these pieces together:

Component     What I Used                                Why
----------    ---------------------------------------    ----------------------------------
Embeddings    SentenceTransformers (all-MiniLM-L6-v2)    Small, fast, runs on CPU
Vector DB     FAISS                                      Local, no setup, good for learning
LLM           HuggingFace flan-t5-base                   Easy to use, good for learning

Everything runs locally. No API keys. No cloud. Just Python and about 1GB of downloaded models.

The full pipeline:

User Query
  |
SentenceTransformer encodes query -> 384-dim vector
  |
FAISS searches index -> finds nearest chunk vectors
  |
Top-k chunks retrieved -> injected into prompt
  |
flan-t5-base generates answer from that context
  |
Grounded answer returned

If you've built microservices before, this should feel familiar. It's just a pipeline. Data flows through stages. Each stage transforms it. The only new concept is that "search" happens in vector space instead of a SQL WHERE clause.
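
Here is a minimal sketch of that staged flow, not the actual rag_pipeline.py. Each stage is a plain function, and the model-backed stages are passed in as callables. Everything named here (`rag_answer`, `l2_search`, `toy_embed`, `toy_generate`) is hypothetical, and the prompt layout is an assumption. In the real script, SentenceTransformer, FAISS, and flan-t5-base would fill those slots; the toy stand-ins only exist so the sketch runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def l2_search(index: List[Vector], query: Vector, k: int) -> List[Tuple[float, int]]:
    """Exhaustive nearest-neighbour search: score every stored vector, keep the k closest.

    Returns (squared L2 distance, position) pairs, nearest first.
    """
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), pos)
        for pos, vec in enumerate(index)
    ]
    return sorted(scored)[:k]


def rag_answer(
    query: str,
    documents: List[str],
    index: List[Vector],
    embed: Callable[[str], Vector],
    generate: Callable[[str], str],
    k: int = 2,
) -> Tuple[str, List[str]]:
    """Run the stages in order and return (answer, retrieved chunks) for tracing."""
    query_vec = embed(query)                       # stage 1: encode the query
    hits = l2_search(index, query_vec, k)          # stage 2: nearest chunk vectors
    chunks = [documents[pos] for _, pos in hits]   # stage 3: top-k chunks
    prompt = (                                     # stage 4: prompt construction
        "Answer the question only using the provided context.\n"
        "Context:\n" + "\n".join(chunks) + f"\nQuestion: {query}"
    )
    return generate(prompt), chunks                # stage 5: generation


# --- Toy stand-ins (assumptions, not real models) ---
def toy_embed(text: str) -> Vector:
    # Hypothetical 2-dim "embedding": [mentions refund/return, mentions login/password]
    t = text.lower()
    return [
        float(any(w in t for w in ("refund", "return"))),
        float(any(w in t for w in ("login", "password"))),
    ]


def toy_generate(prompt: str) -> str:
    # Not an LLM: it just echoes the first context line it was handed.
    return prompt.split("Context:\n")[1].split("\n")[0]


documents = [
    "Customers can return products within 7 days with receipt.",
    "Users must verify email before login.",
    "Refund is processed within 5 business days after return.",
    "Password must be at least 8 characters long.",
]
index = [toy_embed(d) for d in documents]
answer, retrieved = rag_answer(
    "Can I get a refund?", documents, index, toy_embed, toy_generate
)
```

Returning the retrieved chunks alongside the answer is a deliberate choice in this sketch: it keeps every retrieval decision visible, which matters once you start debugging.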


The Knowledge Base (Deliberately Small)

I used 8 simple chunks - support policy stuff:

"Customers can return products within 7 days with receipt."
"Orders are shipped within 3 business days."
"Users must verify email before login."
"Orders can be cancelled before shipping."
"Premium members get free shipping on all orders."
"Refund is processed within 5 business days after return."
"Customer support is available 24/7 via chat."
"Password must be at least 8 characters long."

Tiny. On purpose. When you're debugging a system, you want to see exactly which chunk gets retrieved and why. With 10,000 documents you can't eyeball that. With 8, you can trace every decision.

Same principle as testing with small data before scaling - something I've done a hundred times in backend work, but had to relearn in this AI context.


It Worked. Then I Broke It. That's Where the Learning Was.

The happy path was boring - ask "Can I get a refund?", get the refund chunk, LLM says "Yes". Great. Works as expected.

The interesting part was the three failure experiments I ran. This is where my software engineering instincts kicked in - you don't understand a system until you've seen it fail.


Failure #1 - I Fed the LLM the Wrong Chunks

I asked "Can I get a refund?" but forced the retrieval to return account-related chunks instead:

Context given to LLM:
  "Users must verify email before login."
  "Password must be at least 8 characters long."

Question: "Can I get a refund?"

LLM Answer: "no"

Just "no". Flat out wrong. The refund policy exists - the LLM just never saw it.

This is the AI equivalent of a bug I've seen a hundred times in traditional systems: the query hits the wrong table. Except here, there's no SQL error, no stack trace. The system just confidently returns garbage. That's scarier than a 500 error, honestly.

Garbage in, garbage out - except the garbage comes with perfect grammar and total confidence.

Failure #2 - I Removed the Grounding Instruction

My prompt had this line:

"Answer the question only using the provided context."

I removed it and asked the same refund question. Same retrieval. Same chunks. Only difference: no instruction to stay grounded.

WITH grounding:    "Yes"
WITHOUT grounding: "No, I don't want to be charged for the service."

Read that second answer again. It's not just wrong - it's bizarre. The model started generating from its training data instead of the context I gave it. It hallucinated a completely made-up response.

One line of instruction in the prompt was the difference between a usable system and a hallucinating one. If you've worked with input validation in web apps, this feels similar - you can't trust the system to do the right thing by default. You have to explicitly constrain it.
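
To make the "one line" concrete, here is a small sketch of a prompt builder with the grounding instruction as a switch. The instruction text is the one quoted above; the function name `build_prompt`, the `grounded` flag, and the overall template layout are assumptions for illustration, not the exact prompt from the experiment.

```python
from typing import List

GROUNDING_INSTRUCTION = "Answer the question only using the provided context."


def build_prompt(chunks: List[str], question: str, grounded: bool = True) -> str:
    """Assemble the text handed to the model (assumed layout)."""
    parts = []
    if grounded:
        # The single line that tells the model to stay inside the context.
        parts.append(GROUNDING_INSTRUCTION)
    parts.append("Context:")
    parts.extend(f"- {chunk}" for chunk in chunks)
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n".join(parts)


# Illustrative chunks taken from the knowledge base above.
chunks = [
    "Customers can return products within 7 days with receipt.",
    "Refund is processed within 5 business days after return.",
]
with_grounding = build_prompt(chunks, "Can I get a refund?")
without_grounding = build_prompt(chunks, "Can I get a refund?", grounded=False)
```

The two prompts differ by exactly one line, which makes this an easy A/B test: same retrieval, same chunks, one toggled instruction.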


Failure #3 - Plain LLM vs RAG

This one sealed the deal for me.

Question: "Can I get a refund?"

Plain LLM (no context):  "Yes, I will refund you the money."
RAG System (grounded):   "Yes"

The plain LLM confidently promises a refund - even though it has zero knowledge of any refund policy. It has no idea what your company's rules are. It just generates plausible-sounding text.

The RAG system says "Yes" because it actually retrieved the chunk that says "Customers can return products within 7 days with receipt." It's grounded in real data.

Think about deploying that plain LLM in a customer-facing product. It just made a promise your company might not keep. That's a liability, not a feature.


The Architecture - How This Maps to Production

As someone who's built a fair number of systems, I immediately started thinking about how this tiny script maps to a real architecture. Here's what I see:

My script (rag_pipeline.py)         Production equivalent
----------------------------        ----------------------

documents = [...]               ->  Ingestion service (PDFs, DBs, APIs)
                                    + Chunking pipeline
                                    + Regular DB for raw text

SentenceTransformer.encode()    ->  Embedding microservice (GPU, batched)

faiss.IndexFlatL2.add()         ->  Vector DB (Pinecone, Weaviate, Chroma)
                                    with ANN indexes for scale

index.search(query, k=2)        ->  Retrieval service + Re-ranker
                                    + Metadata filtering

prompt = f"Context:..."         ->  Prompt template engine
                                    + Guard rails + Output formatting

t5_model.generate()             ->  LLM API call (GPT-4, Claude, Llama)
                                    with streaming, fallbacks, caching

print(answer)                   ->  API response -> Frontend

What's interesting is that the retrieval layer - the part between the query and the LLM - is where most of the complexity and most of the important decisions live. The LLM call is basically an API request at the end.


Where the Decisions Actually Are

Coming from a software engineering background, I instinctively look for "where are the decision points in this system?" Here's what I found:

Decision                              Options                                   Impact
------------------------------------  ----------------------------------------  ------------------------------------------------------------
How to chunk documents?               Fixed size, semantic, recursive, overlap  On Day 9 I found that this alone swings retrieval scores by 16%
Which embedding model?                MiniLM, BGE, OpenAI, Cohere               Determines vector quality and dimensionality
How many chunks to retrieve (top-k)?  1? 3? 10?                                 Too few = miss info. Too many = noise drowns the signal.
Grounding instruction?                Strict vs loose                           Difference between factual answers and hallucinated nonsense
Distance threshold?                   Accept all? Filter low-confidence?        Low-quality retrievals poison the LLM context

Notice something? Most of these decisions happen before the LLM even runs. The model just receives whatever you hand it and generates text from that. If the upstream pipeline is broken, the model will generate a perfect-looking wrong answer.

Reminds me of a pattern I've seen many times in backend systems: the hardest bugs are never in the rendering layer. They're in the data pipeline that feeds it.
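
One of those upstream decisions, the distance threshold, is easy to sketch. The idea is to drop retrieved chunks that are too far from the query and to signal "nothing usable" instead of handing the model noise. The function name `select_context`, the `(distance, position)` hit format, and the cutoff values are all assumptions; a sensible cutoff depends on the embedding model and distance metric and has to be tuned.

```python
from typing import List, Optional, Tuple

Hit = Tuple[float, int]  # (distance, chunk position); smaller distance = closer


def select_context(
    hits: List[Hit],
    documents: List[str],
    max_distance: float,
) -> Optional[List[str]]:
    """Return chunks within the cutoff, or None when nothing is close enough."""
    kept = [documents[pos] for dist, pos in hits if dist <= max_distance]
    return kept or None


documents = [
    "Refund is processed within 5 business days after return.",
    "Password must be at least 8 characters long.",
]
# Made-up distances for illustration only, not measured values.
hits = [(0.4, 0), (1.6, 1)]

close_enough = select_context(hits, documents, max_distance=1.0)
nothing_usable = select_context(hits, documents, max_distance=0.1)
```

When the result is `None`, the caller can answer "I don't know" without calling the model at all, which turns the silent failure from Failure #1 into an explicit one.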


Precision vs Recall - The Tradeoff That Runs Everything

This was a concept I vaguely knew from search engineering, but it clicked here in a way it never had before.

Concept           In RAG terms
--------------    ----------------------------------------------------------
High Precision    Every chunk you retrieve is actually relevant. No noise.
High Recall       You find every relevant chunk that exists. Nothing missed.

You can rarely max out both with top-k alone. A high top-k improves recall but hurts precision (more noise). A low top-k improves precision but risks missing something important.

In my tiny 8-chunk experiment, top-k=2 was the sweet spot. In a production system with 100K chunks, this becomes a real engineering problem - and that's where re-ranking comes in (next post).
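
The tradeoff is easy to see with numbers. Below is a small sketch of precision@k and recall@k. The ranking and the set of "relevant" chunk positions are hypothetical, chosen to illustrate the effect rather than taken from the experiment. Positions are 0-based into the 8-chunk list, so 0 is the return policy and 5 is the refund policy.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(
    ranking: List[int], relevant: Set[int], k: int
) -> Tuple[float, float]:
    """Precision and recall over the first k retrieved chunk positions."""
    top_k = ranking[:k]
    hits = sum(1 for pos in top_k if pos in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical retrieval order for "Can I get a refund?" (nearest first).
# An irrelevant chunk (3, order cancellation) sits between the two relevant ones.
ranking = [5, 3, 0, 1, 4, 2, 6, 7]
relevant = {0, 5}  # assumed ground truth: return policy + refund policy

p1, r1 = precision_recall_at_k(ranking, relevant, k=1)  # precise, but misses a chunk
p3, r3 = precision_recall_at_k(ranking, relevant, k=3)  # full recall, one noisy chunk
p8, r8 = precision_recall_at_k(ranking, relevant, k=8)  # full recall, mostly noise
```

With this ranking, k=1 gives perfect precision but only half the relevant chunks, while k=8 finds everything and buries it in noise.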


What Connected for Me Today

I've been learning these pieces one by one over the past few days. Today they snapped together:

Bad chunking
  -> Bad embeddings (the vector represents broken meaning)
    -> Bad retrieval (wrong chunks come back)
      -> Bad context (LLM sees irrelevant info)
        -> Bad answer (hallucination or wrong facts)

Every layer depends on the layer above it. There's no way to fix a downstream failure if the upstream data is broken. You can't out-model bad retrieval. You can't out-retrieve bad chunking.

As software engineers, we know this pattern well - it's just the data pipeline principle applied to AI systems.
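
To show how chunking alone can break meaning at the top of that chain, here is a minimal sketch of fixed-size, word-based chunking with optional overlap. This is an illustration under assumptions, not the chunker from the earlier posts: the function name `chunk_words` and the sizes are made up, and real pipelines often split on characters, tokens, or sentences instead.

```python
from typing import List


def chunk_words(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of `size` words; each shares `overlap` words with the previous."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


policy = (
    "Customers can return products within 7 days with receipt. "
    "Refund is processed within 5 business days after return."
)
no_overlap = chunk_words(policy, size=6)
with_overlap = chunk_words(policy, size=6, overlap=3)
```

Without overlap, the first chunk ends at "within 7" and the next one starts at "days", so no single chunk contains the actual return window. With an overlap of 3 words, one chunk carries "within 7 days" intact, at the cost of more chunks to embed and store.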


Code

Full implementation (single file, no frameworks):

Rag Pipeline

Includes the working pipeline + all three failure experiments. Total setup: pip install sentence-transformers faiss-cpu transformers numpy and you're running.


What's Next

The pipeline works, but retrieval is naive - it just picks the nearest vectors and hopes for the best. In production, that's not good enough.

Next up: re-ranking - a second pass that scores retrieved chunks more carefully before handing them to the LLM. Think of it like how a database optimizer rewrites your query plan - same data, better execution.


What I Think Now That I Didn't Think Before

When I started this journey, I assumed building AI products was mostly about picking the right model. GPT-4 vs Claude vs Llama - that's the decision that matters, right?

Now I think the model choice is maybe 20% of the problem. The other 80% is:

  • How you chunk your documents
  • How you embed and index them
  • How you retrieve the right context
  • How you construct the prompt
  • How you handle edge cases when retrieval fails

That's just systems engineering. With vectors instead of SQL. And that's actually good news for people like us - because we already know how to think about systems, data pipelines, and failure modes. The AI part is learnable. The engineering part is the foundation.

RAG systems are not AI systems with some engineering. They're engineering systems with some AI.

Day 13. This one felt like a milestone.
