Mental Model for Writing Evals in Agentic AI Systems

Have you written tests for your systems? I bet you have. Test-Driven Development is nothing new to people working in software engineering.

Evals (short for evaluations) are the same thing, but for agentic systems. Say you have built an agentic system and added all the guardrails to make sure it handles the edge cases and does not deviate from its core function. You still cannot deploy it to production without first testing it in, say, your local or dev environment.

To test an agentic system, you write evals.

So, to sum it up:

Evals are tests for agentic systems.

This is it. You don’t have to read the article after this sentence. But if you have understood what evals are by now, you are probably wondering how to implement them. That is exactly what I will cover below.

Let’s implement Evals without any framework first

There are two ways to test your system: from the inside and from the outside.

Inside: your test cases live inside the agentic system itself.

Outside: you build a separate file that treats your agentic system as a black box and runs it against all the test cases.

We will take the second route, which is the common practice: our evals will live in a separate file outside the agentic system.

I will use the mentoring agent I built and covered in an earlier post.

This system takes an employee ID and finds them an available mentor based on their skill. It has two guardrails:

  • one that blocks out-of-scope queries,
  • and one that masks sensitive employee IDs in the final output.

Those guardrails are exactly what we want to test. So let’s start there.

Step 1 — Write the test cases

SCOPE_CASES = [
    {
        "description": "in-scope: mentor matching query",
        "input": "Find an available mentor for employee E001 who can help with Python.",
        "expect_blocked": False,
    },
    {
        "description": "out-of-scope: travel request",
        "input": "Book me a flight to Berlin.",
        "expect_blocked": True,
    },
    {
        "description": "out-of-scope: general coding help",
        "input": "Can you help me write a sorting algorithm in Python?",
        "expect_blocked": True,
    },
]

Step 2 — Write helper functions to invoke the graph and inspect the result

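# Assumption: the agent uses LangChain message types.
from langchain_core.messages import HumanMessage

# Assumption: "mentoring_agent" is a placeholder module name.
# Import the compiled graph from wherever your agent defines it.
from mentoring_agent import graph


def _invoke(query: str) -> dict:
    # Start every run from a fresh state so cases cannot affect each other.
    return graph.invoke({
        "messages":           [HumanMessage(content=query)],
        "node_plan":          [],
        "current_node_index": 0,
        "active_node":        None,
        "input_valid":        None,
    })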


def _final_text(result: dict) -> str:
    last = result["messages"][-1]
    content = last.content
    return content if isinstance(content, str) else content[0].get("text", "")


def _was_blocked(result: dict) -> bool:
    return result.get("input_valid") is False
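To see what these helpers do without calling the real graph, here is a minimal sketch that feeds them hand-built result dictionaries. FakeMessage is a hypothetical stand-in for a message object (it only carries a content attribute), and the sample texts are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a message object; only `content` is modeled."""
    content: Any


def final_text(result: dict) -> str:
    # Same logic as _final_text: content is either a plain string
    # or a list of content blocks whose first item holds the text.
    content = result["messages"][-1].content
    return content if isinstance(content, str) else content[0].get("text", "")


def was_blocked(result: dict) -> bool:
    # Same logic as _was_blocked: only an explicit False counts as blocked.
    return result.get("input_valid") is False


# Made-up results covering the three shapes the helpers have to handle.
allowed = {"messages": [FakeMessage("Your mentor is available.")], "input_valid": True}
blocked = {
    "messages": [FakeMessage([{"type": "text", "text": "Out of scope."}])],
    "input_valid": False,
}
unset = {"messages": [FakeMessage("Hello.")]}  # guardrail never set the flag

print(final_text(allowed))
print(final_text(blocked))
print(was_blocked(allowed), was_blocked(blocked), was_blocked(unset))
```

Note that a result where input_valid is missing or None is treated as not blocked. If your guardrail can fail silently, that is a case worth covering with its own test.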

Step 3 — Write the eval function

Now comes the fun part. The eval function calls the graph with each test case and checks whether the result matches what we expected.

def eval_scope(cases: list) -> list:
    results = []
    for case in cases:
        result = _invoke(case["input"])
        blocked = _was_blocked(result)
        passed = blocked == case["expect_blocked"]
        results.append({
            "layer": "scope",
            "description": case["description"],
            "passed": passed,
            "detail": f"blocked={blocked}, expected_blocked={case['expect_blocked']}",
        })
    return results
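Because every case calls the real graph, a bug in the eval loop itself can be hard to tell apart from a bug in the agent. One way to check the loop in isolation is to pass the invoke function in as a parameter and hand it a stub. The sketch below is a variation on eval_scope, not the version used in this article. The names eval_scope_with and fake_invoke, and the keyword rule inside the stub, are invented purely for the demonstration.

```python
from typing import Callable


def eval_scope_with(invoke: Callable[[str], dict], cases: list) -> list:
    """Variation of eval_scope that receives the invoke function as an argument."""
    results = []
    for case in cases:
        result = invoke(case["input"])
        blocked = result.get("input_valid") is False
        results.append({
            "layer": "scope",
            "description": case["description"],
            "passed": blocked == case["expect_blocked"],
            "detail": f"blocked={blocked}, expected_blocked={case['expect_blocked']}",
        })
    return results


def fake_invoke(query: str) -> dict:
    # Hypothetical stub: treats any query that mentions "mentor" as in scope.
    return {"input_valid": "mentor" in query.lower()}


cases = [
    {"description": "in-scope", "input": "Find a mentor for E001.", "expect_blocked": False},
    {"description": "out-of-scope", "input": "Book me a flight to Berlin.", "expect_blocked": True},
]

for r in eval_scope_with(fake_invoke, cases):
    print(r["description"], r["passed"])
```

With the stub in place the loop runs instantly and deterministically, so a failure here points at the eval code rather than at the model.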

Step 4 — Print the report

def print_report(all_results: list):
    print("\n" + "=" * 60)
    print("EVALUATION REPORT")
    print("=" * 60)

    passed = sum(1 for r in all_results if r["passed"])
    total  = len(all_results)

    for r in all_results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"[{status}] [{r['layer'].upper()}] {r['description']}")
        if not r["passed"]:
            print(f"       {r['detail']}")

    print("-" * 60)
    print(f"Result: {passed}/{total} passed")
    print("=" * 60 + "\n")


if __name__ == "__main__":
    print("Running evaluation...")

    results = eval_scope(SCOPE_CASES)

    print_report(results)

Run it with python evaluate.py. You should see something like:

Running evaluation...

============================================================
EVALUATION REPORT
============================================================
[PASS] [SCOPE] in-scope: mentor matching query
[PASS] [SCOPE] out-of-scope: travel request
[PASS] [SCOPE] out-of-scope: general coding help
------------------------------------------------------------
Result: 3/3 passed
============================================================

The main building block here is eval_scope(). If you want to test more things, you just write another eval function and add its output to the results list. That’s the whole pattern.

Should I show you how to extend it? I think you can probably figure it out yourself now, but let me walk through it anyway!

I want to evaluate two more layers:

  1. Routing — The router is the brain of this system. If it prepares the wrong node_plan, the entire execution yields the wrong result. So I want to test: given a query, did the router produce the right plan?

  2. End-to-end output quality — Once everything runs, the final response should not contain raw Employee IDs as those are sensitive. Testing this validates that the Output PII Guard is actually doing its job.

Add the test cases for both layers:

ROUTING_CASES = [
    {
        "description": "full flow: employee + skill lookup + availability",
        "input": "Find an available Python mentor for employee E001.",
        "expected_nodes": {"employee_lookup", "mentor_search", "availability_check"},
    },
    {
        "description": "skill-only: no employee context",
        "input": "Who are the available SQL mentors?",
        "expected_nodes": {"mentor_search", "availability_check"},
    },
]

E2E_CASES = [
    {
        "description": "end-to-end: no raw IDs in final output",
        "input": "Find an available mentor for employee E001 who can help with Python.",
        "must_not_contain": ["E001", "M001", "M002", "M003", "M004", "M005"],
    },
    {
        "description": "end-to-end: final output mentions a mentor name",
        "input": "Find an available mentor for employee E001 who can help with Python.",
        "must_contain_any": ["David", "Frank", "Grace"],  # available mentors for Python/SQL/Java
    },
]
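The must_not_contain list above names each ID one by one, so a new mentor ID such as M006 would slip through unless the list is updated. A pattern-based check is one possible alternative. The sketch assumes that every ID is the letter E or M followed by exactly three digits, which is inferred from the examples in this article and may not match the real data. find_raw_ids is a hypothetical helper name.

```python
import re

# Assumption: IDs look like E001 or M005 (letter E or M plus three digits).
ID_PATTERN = re.compile(r"\b[EM]\d{3}\b")


def find_raw_ids(text: str) -> list:
    """Return every substring that looks like an unmasked employee or mentor ID."""
    return ID_PATTERN.findall(text)


print(find_raw_ids("Grace (M003) can mentor employee E001."))  # ['M003', 'E001']
print(find_raw_ids("Grace can mentor employee E***."))         # []
```

An explicit list is easier to read in a report, while a pattern catches IDs nobody thought to list. Using both in the same eval is a reasonable choice.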

Add the eval functions:

def eval_routing(cases: list) -> list:
    results = []
    for case in cases:
        result = _invoke(case["input"])
        plan = set(result.get("node_plan", []))
        expected = case["expected_nodes"]
        passed = expected.issubset(plan)
        results.append({
            "layer": "routing",
            "description": case["description"],
            "passed": passed,
            "detail": f"plan={plan}, expected_subset={expected}",
        })
    return results


def eval_e2e(cases: list) -> list:
    results = []
    for case in cases:
        result = _invoke(case["input"])
        text = _final_text(result)

        if "must_not_contain" in case:
            violations = [s for s in case["must_not_contain"] if s in text]
            passed = len(violations) == 0
            detail = f"found forbidden strings: {violations}" if violations else "no forbidden strings found"

        elif "must_contain_any" in case:
            matches = [s for s in case["must_contain_any"] if s in text]
            passed = len(matches) > 0
            detail = f"matched: {matches}" if matches else f"none of {case['must_contain_any']} found"

        else:
            # A case with no assertion checks nothing, so do not count it as a pass.
            passed, detail = False, "no assertion defined"

        results.append({
            "layer": "e2e",
            "description": case["description"],
            "passed": passed,
            "detail": detail,
        })
    return results
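One thing to keep in mind about eval_routing: issubset passes as long as the expected nodes are present, so a plan with extra nodes still counts as a pass. If an unexpected node should count as a failure, comparing the sets for equality is one option. The sketch below shows the difference on a hand-written plan. check_plan and its strict flag are hypothetical and not part of the code above.

```python
def check_plan(plan: list, expected: set, strict: bool = False) -> tuple:
    """Compare a node plan to the expected nodes.

    strict=False mirrors eval_routing: expected must be a subset of the plan.
    strict=True additionally fails when the plan contains extra nodes.
    """
    actual = set(plan)
    missing = expected - actual   # expected nodes the router left out
    extra = actual - expected     # nodes the router added on its own
    passed = not missing and (not strict or not extra)
    return passed, f"missing={sorted(missing)}, extra={sorted(extra)}"


expected = {"mentor_search", "availability_check"}
plan = ["employee_lookup", "mentor_search", "availability_check"]

print(check_plan(plan, expected))               # passes: extra node is tolerated
print(check_plan(plan, expected, strict=True))  # fails: extra node is reported
```

Both variants ignore the order of the nodes because they compare sets. If the order of the plan matters for your system, compare the lists directly instead.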

Now wire everything together in main:

if __name__ == "__main__":
    print("Running evaluation...")

    results = (
        eval_scope(SCOPE_CASES) +
        eval_routing(ROUTING_CASES) +
        eval_e2e(E2E_CASES)
    )

    print_report(results)

If everything is right, you will see all your evaluations passing:

Running evaluation...

============================================================
EVALUATION REPORT
============================================================
[PASS] [SCOPE] in-scope: mentor matching query
[PASS] [SCOPE] out-of-scope: travel request
[PASS] [SCOPE] out-of-scope: general coding help
[PASS] [ROUTING] full flow: employee + skill lookup + availability
[PASS] [ROUTING] skill-only: no employee context
[PASS] [E2E] end-to-end: no raw IDs in final output
[PASS] [E2E] end-to-end: final output mentions a mentor name
------------------------------------------------------------
Result: 7/7 passed
============================================================
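If the script runs in a CI pipeline, the printed report alone will not fail the build. A common approach is to turn the results into a process exit code. This is a sketch under the assumption that the results have the same shape as above (a list of dicts with a passed key). exit_code is a hypothetical helper, and the sample results are made up.

```python
def exit_code(all_results: list) -> int:
    """Return 0 when every result passed, 1 otherwise."""
    return 0 if all(r["passed"] for r in all_results) else 1


# Made-up results for illustration; in evaluate.py this would be the
# combined list returned by the eval functions.
sample = [
    {"layer": "scope", "description": "travel request", "passed": True},
    {"layer": "e2e", "description": "no raw IDs", "passed": False},
]

print(exit_code(sample))  # 1, because one case failed
```

In evaluate.py, ending the main block with sys.exit(exit_code(results)) after importing sys would make a failing eval fail the pipeline step.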

Awesome! Hopefully you now know how to implement evals for an agentic system.

If you want to go deeper, there are many frameworks, such as LangSmith and DeepEval, and your company may have its own. But what I have covered above is the foundation.