Day 5 — Evaluating LLM Outputs (The “Looks Good” Trap)
AI Engineering — Day by Day
My journey to becoming an AI Engineer
So far in my journey, I've mostly focused on:
- How LLMs work
- How to write better prompts
And honestly, I thought I was making progress.
But then I realized something uncomfortable:
I had no real way to measure if my outputs were actually good.
I was just looking at responses and thinking:
“Yeah… this looks fine.”
And that’s where the problem started.
⚠️ The Problem with “Looks Good”
At first glance, this seems harmless. But it’s actually dangerous.
Here’s why:
- It’s subjective → depends on mood and perspective
- It’s inconsistent → same output may feel different later
- It hides failures → edge cases go unnoticed
What I understood:
If you can’t measure it, you can’t improve it.
🧠 What Does “Good Output” Even Mean?
This question sounds simple… but it’s actually the core of everything.
I started breaking it down into dimensions:
| Metric | What it means |
|---|---|
| Correctness | Is the answer factually right? |
| Relevance | Does it answer the actual question? |
| Clarity | Is it easy to understand? |
| Completeness | Is anything important missing? |
| Format | Does it follow instructions? |
And then I realized something important:
“Good” is not universal — it depends on what you are building.
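One way to make these dimensions concrete is to write the rubric down as data. Here's a minimal sketch: the metric names and descriptions come from the table above, but the weights are hypothetical — a financial assistant might weight correctness far higher than a creative writing tool would.

```python
# Rubric sketch: metric names from the table above; weights are
# hypothetical and should be tuned per use case.
RUBRIC = {
    "correctness":  {"question": "Is the answer factually right?",      "weight": 0.30},
    "relevance":    {"question": "Does it answer the actual question?", "weight": 0.25},
    "clarity":      {"question": "Is it easy to understand?",           "weight": 0.20},
    "completeness": {"question": "Is anything important missing?",      "weight": 0.15},
    "format":       {"question": "Does it follow instructions?",        "weight": 0.10},
}

def weighted_score(scores: dict) -> float:
    """Combine per-metric scores (0-10) into one weighted number."""
    return sum(RUBRIC[m]["weight"] * s for m, s in scores.items())
```

Because the weights live in one place, changing what "good" means for a new use case is a one-line edit, not a rewrite of the evaluation.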
🔁 The Biggest Realization — One Output Means Nothing
Earlier, I used to test like this:
Prompt → Output → Done
Now I test like this:
Prompt → Output 1
Prompt → Output 2
Prompt → Output 3
Why?
Because LLMs are:
- Non-deterministic → the same prompt can produce different outputs
- Probabilistic → each response is a sample from a distribution
Which means:
One output is just one sample — not the system behavior.
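In code, that shift is just "call the model N times instead of once." A minimal sketch below — `generate` is a hypothetical stand-in for a real LLM API call, simulated here with `random.choice` to mimic non-determinism:

```python
import random

random.seed(0)  # only so this sketch is reproducible

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a non-deterministic LLM call."""
    variants = [
        "Stocks are shares of company ownership.",
        "The stock market lets people trade company shares.",
        "A stock market is where shares are bought and sold.",
    ]
    return random.choice(variants)

def sample_outputs(prompt: str, n: int = 3) -> list[str]:
    """Collect n samples -- evaluate the distribution, not one output."""
    return [generate(prompt) for _ in range(n)]

outputs = sample_outputs("Explain stock market in 3 bullet points")
```

With a list of samples in hand, you can score each one and look at the spread, instead of trusting whichever single output you happened to see first.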
🛠️ My First Practical Evaluation Attempt
I tried something simple.
Prompt:
Explain stock market in 3 bullet points
What I noticed about the output:
- Correct but too complex
- Not beginner-friendly
- Format slightly off
So instead of saying “looks okay”, I scored it:
| Metric | Score (out of 10) |
|---|---|
| Correctness | 8 |
| Clarity | 5 |
| Format | 4 |
Now I had something actionable.
👉 The problem is NOT correctness
👉 The problem is clarity + format
This completely changed how I improve prompts.
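That "find the weak spot" step is trivial once the scores are data. A tiny sketch using the numbers from the table above:

```python
# Scores from the table above (out of 10).
scores = {"correctness": 8, "clarity": 5, "format": 4}

# The weakest dimension is where the next prompt fix should go.
weakest = min(scores, key=scores.get)
print(weakest)  # -> format
```

Instead of vaguely rewriting the whole prompt, you now know exactly which instruction to strengthen.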
🤔 Questions I Had While Learning
❓ Why is “looks good” dangerous?
Because it is subjective and non-measurable. It hides inconsistencies and makes it impossible to improve the system reliably.
❓ Why test multiple outputs?
Because LLMs are non-deterministic. A single output doesn’t represent the full behavior of the model.
❓ Why is evaluation use-case dependent?
Because different applications require different qualities. A creative tool and a financial assistant cannot be evaluated the same way.
🔄 The Real Workflow I Learned
- Define what “good” means
- Generate outputs
- Evaluate using metrics
- Identify weak areas
- Improve prompt/system
- Repeat
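The six steps above can be sketched as a loop. Everything here is a simplified illustration: `generate`, `score_output`, and `improve_prompt` are hypothetical stubs (a real scorer might be a human reviewer or an LLM judge), but the control flow is the workflow itself.

```python
def generate(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call."""
    return f"Response to: {prompt}"

def score_output(output: str) -> dict:
    """Hypothetical scorer; returns per-metric scores out of 10."""
    return {"correctness": 8, "clarity": 5, "format": 4}

def improve_prompt(prompt: str, weakest: str) -> str:
    """Naive prompt tweak targeting the weakest metric -- illustrative only."""
    hints = {
        "correctness": "Double-check every factual claim.",
        "clarity": "Explain it so a beginner can follow.",
        "format": "Follow the requested format exactly.",
    }
    return f"{prompt} {hints.get(weakest, '')}".strip()

def eval_loop(prompt: str, threshold: int = 7, max_iters: int = 3) -> str:
    """Define good (threshold) -> generate -> score -> fix weakest -> repeat."""
    for _ in range(max_iters):
        output = generate(prompt)
        scores = score_output(output)
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break  # every metric meets the bar
        prompt = improve_prompt(prompt, weakest)
    return prompt

final_prompt = eval_loop("Explain stock market in 3 bullet points")
```

The point isn't the stubs — it's that once "good" is a number with a threshold, improvement becomes a loop you can run, not a feeling you chase.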
This loop is where real progress happens.
🚀 What Changed for Me
Before:
- I judged outputs casually
- I changed prompts randomly
Now:
- I define evaluation criteria
- I test multiple outputs
- I improve systematically
💭 Final Thought
LLMs don’t become better just because you “feel” they are better.
They become better when:
You measure → understand → improve → repeat
This was Day 5 of my AI engineering journey —
and honestly, this felt like a major shift from experimenting… to engineering.