
Tuesday, April 28, 2026


Day 5 — Evaluating LLM Outputs (The “Looks Good” Trap)

AI Engineering — Day by Day
My journey to becoming an AI Engineer

So far in my journey, I had mostly focused on:

  • How LLMs work
  • How to write better prompts

And honestly, I thought I was making progress.

But then I realized something uncomfortable:

I had no real way to measure if my outputs were actually good.

I was just looking at responses and thinking:

“Yeah… this looks fine.”

And that’s where the problem started.


⚠️ The Problem with “Looks Good”

At first glance, this seems harmless. But it’s actually dangerous.

Here’s why:

  • It’s subjective → depends on mood and perspective
  • It’s inconsistent → same output may feel different later
  • It hides failures → edge cases go unnoticed

What I understood:

If you can’t measure it, you can’t improve it.

🧠 What Does “Good Output” Even Mean?

This question sounds simple… but it’s actually the core of everything.

I started breaking it down into dimensions:

  • Correctness → Is the answer factually right?
  • Relevance → Does it answer the actual question?
  • Clarity → Is it easy to understand?
  • Completeness → Is anything important missing?
  • Format → Does it follow instructions?
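
To keep myself honest, I sketched this rubric as a tiny JavaScript structure. This is just my own way of writing it down; the shape and names are my choice, not any standard:

```javascript
// My evaluation rubric as data: one question per dimension.
// Purely illustrative: the dimensions are the ones listed above.
const rubric = {
  correctness:  "Is the answer factually right?",
  relevance:    "Does it answer the actual question?",
  clarity:      "Is it easy to understand?",
  completeness: "Is anything important missing?",
  format:       "Does it follow instructions?",
};
```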

And then I realized something important:

“Good” is not universal — it depends on what you are building.

🔁 The Biggest Realization — One Output Means Nothing

Earlier, I used to test like this:

Prompt → Output → Done

Now I test like this:

Prompt → Output 1  
Prompt → Output 2  
Prompt → Output 3

Why?

Because LLMs are:

  • Non-deterministic
  • Probabilistic

Which means:

One output is just one sample — not the system behavior.
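
Here's a minimal sketch of what that looks like in code, assuming a hypothetical callLLM(prompt) function that wraps whatever model API you actually use:

```javascript
// callLLM(prompt) is hypothetical: swap in your real API client.
async function sampleOutputs(prompt, n = 3) {
  const outputs = [];
  for (let i = 0; i < n; i++) {
    // Same prompt every time; each call is a fresh sample from the model.
    outputs.push(await callLLM(prompt));
  }
  return outputs;
}

// Judge the set, not a single response:
// const samples = await sampleOutputs("Explain the stock market in 3 bullet points");
```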

🛠️ My First Practical Evaluation Attempt

I tried something simple.

Prompt:

Explain the stock market in 3 bullet points

The output I got:

  • Correct but too complex
  • Not beginner-friendly
  • Format slightly off

So instead of saying “looks okay”, I scored it:

  • Correctness → 8/10
  • Clarity → 5/10
  • Format → 4/10

Now I had something actionable.
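
In code terms (again, just a sketch), the scores are manual judgments I record by hand, and a threshold I picked myself flags the weak dimensions:

```javascript
// My hand-assigned scores for this output (0-10 per metric).
const scores = { correctness: 8, clarity: 5, format: 4 };

// 7 is an arbitrary bar I chose; anything below it needs work.
const weakAreas = Object.entries(scores)
  .filter(([, score]) => score < 7)
  .map(([metric]) => metric);

console.log(weakAreas); // ["clarity", "format"]
```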

👉 The problem is NOT correctness
👉 The problem is clarity + format

This completely changed how I improve prompts.


🤔 Questions I Had While Learning

❓ Why is “looks good” dangerous?

Because it is subjective and non-measurable. It hides inconsistencies and makes it impossible to improve the system reliably.

❓ Why test multiple outputs?

Because LLMs are non-deterministic. A single output doesn’t represent the full behavior of the model.

❓ Why is evaluation use-case dependent?

Because different applications require different qualities. A creative tool and a financial assistant cannot be evaluated the same way.


🔄 The Real Workflow I Learned

  1. Define what “good” means
  2. Generate outputs
  3. Evaluate using metrics
  4. Identify weak areas
  5. Improve prompt/system
  6. Repeat

This loop is where real progress happens.
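
As a JavaScript sketch, the whole loop might look like this. callLLM, scoreOutput, and revisePrompt are all hypothetical placeholders for your API call, your rubric scoring, and your (often manual) prompt rewrite:

```javascript
// Hypothetical helpers:
//   callLLM(prompt)        -> model output (your API client)
//   scoreOutput(output)    -> { metric: score } per the rubric
//   revisePrompt(p, weak)  -> improved prompt (often a manual step)
async function evaluationLoop(prompt, { rounds = 3, samples = 3, bar = 7 } = {}) {
  for (let round = 1; round <= rounds; round++) {
    // Step 2: generate several outputs, not just one.
    const outputs = [];
    for (let i = 0; i < samples; i++) outputs.push(await callLLM(prompt));

    // Step 3: evaluate each output with the rubric.
    const scored = outputs.map((output) => ({ output, scores: scoreOutput(output) }));

    // Step 4: collect the metrics that fall below the bar anywhere in the batch.
    const weak = new Set();
    for (const { scores } of scored) {
      for (const [metric, score] of Object.entries(scores)) {
        if (score < bar) weak.add(metric);
      }
    }

    if (weak.size === 0) return { prompt, scored }; // good enough: stop

    // Step 5: improve the prompt and repeat.
    console.log(`Round ${round}: weak on ${[...weak].join(", ")}`);
    prompt = revisePrompt(prompt, [...weak]);
  }
  return null; // ran out of rounds without hitting the bar
}
```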


🚀 What Changed for Me

Before:

  • I judged outputs casually
  • I changed prompts randomly

Now:

  • I define evaluation criteria
  • I test multiple outputs
  • I improve systematically

💭 Final Thought

LLMs don’t become better just because you “feel” they are better.

They become better when:

You measure → understand → improve → repeat

This was Day 5 of my AI engineering journey —
and honestly, this felt like a major shift from experimenting… to engineering.
