Stop writing prompts to classify text: make evaluation declarative
Originally published on DEV Community by Ayoola Solomon. Read on the original site
I've built the same thing more than once: a step that reads an inbound message — a lead form, a support ticket, a DM — and decides what to do with it. Qualify it, escalate it, route it, drop it.
Every time, the implementation had the same shape: hand-write a prompt that asks an LLM to return JSON, parse the JSON, branch on it. And every time it rotted the same way:
- The prompt was untestable. "Looks right" was the only QA.
- It drifted. A model upgrade or a one-word prompt tweak silently changed the output and nobody noticed until something got misrouted.
- The JSON lied. The model would confidently return a category that wasn't in my allowed set, or a number outside the range I expected, and my downstream `switch` would happily act on garbage.
- "Confidence" was theater. `if confidence > 0.7` is a magic number that means nothing across different inputs.
Eventually I stopped writing prompts. This is what I built instead and what I learned.
The core idea: declare what to detect, not how to ask
Instead of a prompt, you declare typed detectors. Two kinds:
- presence: "is this signal in the text?" → returns `found`
- classification: "put this into exactly one of a fixed set of categories" → returns the category, enum-validated
```json
{
  "name": "Partnership routing",
  "template": "custom",
  "custom_detectors": [
    {
      "name": "competitor_mentioned",
      "type": "presence",
      "examples": ["we're currently using Acme", "switching off a competitor"],
      "non_examples": ["love your product"]
    },
    {
      "name": "partnership_inquiry",
      "type": "classification",
      "categories": ["reseller", "affiliate", "strategic", "none"],
      "examples": [
        "interested in your reseller program",
        "want to co-sell with you",
        "just a support question"
      ]
    }
  ]
}
```
You never see the prompt — it's compiled from the declaration. That part isn't the interesting bit; anyone can template a prompt. The interesting bit is everything that becomes possible because a detector is a typed object instead of a string.
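To make "typed object instead of a string" concrete, here is a minimal TypeScript sketch of how the declaration could be modeled as a discriminated union. The type names and the `declarationErrors` helper are hypothetical, and the structural rules it checks are assumptions; only the field names come from the JSON above.

```typescript
// Hypothetical type names; the field names mirror the JSON declaration above.
interface PresenceDetector {
  name: string;
  type: "presence";
  examples: string[];      // positive examples
  non_examples?: string[]; // only meaningful for presence detectors
}

interface ClassificationDetector {
  name: string;
  type: "classification";
  categories: string[];
  examples: string[];
}

type Detector = PresenceDetector | ClassificationDetector;

interface EvaluationDeclaration {
  name: string;
  template: string;
  custom_detectors: Detector[];
}

// Structural checks that need no model call (the exact rules are assumptions).
function declarationErrors(decl: EvaluationDeclaration): string[] {
  const errors: string[] = [];
  for (const d of decl.custom_detectors) {
    if (d.examples.length === 0) {
      errors.push(`${d.name}: needs at least one positive example`);
    }
    if (d.type === "classification" && d.categories.length === 0) {
      errors.push(`${d.name}: needs at least one category`);
    }
  }
  return errors;
}

const partnershipRouting: EvaluationDeclaration = {
  name: "Partnership routing",
  template: "custom",
  custom_detectors: [
    {
      name: "competitor_mentioned",
      type: "presence",
      examples: ["we're currently using Acme", "switching off a competitor"],
      non_examples: ["love your product"],
    },
    {
      name: "partnership_inquiry",
      type: "classification",
      categories: ["reseller", "affiliate", "strategic", "none"],
      examples: [
        "interested in your reseller program",
        "want to co-sell with you",
        "just a support question",
      ],
    },
  ],
};
```

Because `type` is the discriminant, the compiler rejects a presence detector that declares `categories`, which is a check a prompt string can never give you.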
1. Detectors are tested at create-time, not in prod
Every detector requires at least one positive example. When you create the
evaluation, those examples run as a smoke test: each positive must actually
match, and a classification example must land inside its declared
categories. If it doesn't, creation fails.
This is the part I wish I'd had years ago. A prompt can be syntactically fine
and semantically broken, and you find out in production. Here, a broken detector
can't ship — the assertion runs before it's ever live. (non_examples are
presence-only, because a classification detector always lands somewhere, so
there's no "not found" state to assert.)
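The create-time check described above could look roughly like this. It is a sketch under assumptions: `RunDetector` is a hypothetical stand-in for the compiled model call (synchronous here to keep the example short, where a real call would be async), `smokeTest` and `createDetector` are made-up names, and treating a matching non-example as a failure is my reading of what `non_examples` assert.

```typescript
type Detector =
  | { name: string; type: "presence"; examples: string[]; non_examples?: string[] }
  | { name: string; type: "classification"; categories: string[]; examples: string[] };

interface DetectorResult {
  found: boolean;
  value?: string; // the category, for classification detectors
}

// Hypothetical stand-in for the compiled prompt plus model call.
type RunDetector = (detector: Detector, text: string) => DetectorResult;

function smokeTest(detector: Detector, run: RunDetector): string[] {
  const failures: string[] = [];
  for (const text of detector.examples) {
    const result = run(detector, text);
    if (detector.type === "presence") {
      // Every positive example must actually match.
      if (!result.found) failures.push(`positive example did not match: "${text}"`);
    } else if (result.value === undefined || !detector.categories.includes(result.value)) {
      // A classification example must land inside the declared categories.
      failures.push(`example landed outside declared categories: "${text}"`);
    }
  }
  if (detector.type === "presence") {
    // Assumption: a non-example that matches also counts as a failure.
    for (const text of detector.non_examples ?? []) {
      if (run(detector, text).found) failures.push(`non-example matched: "${text}"`);
    }
  }
  return failures;
}

// Creation fails loudly instead of shipping a detector that cannot pass its own examples.
function createDetector(detector: Detector, run: RunDetector): Detector {
  const failures = smokeTest(detector, run);
  if (failures.length > 0) {
    throw new Error(`detector "${detector.name}" failed its smoke test:\n${failures.join("\n")}`);
  }
  return detector;
}
```

Passing the runner in as a parameter also means the smoke test itself can be unit-tested with a fake runner, without calling a model.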
2. The output is validated deterministically, not trusted
The LLM proposes; deterministic code disposes. The validators are boring on
purpose: `present`, `range:0-100`, `enum:yes,no`. An out-of-set classification
doesn't get to pass: it's coerced to not-found and surfaced in an
`invalid_fields` list so you can see the model misbehaved instead of silently
acting on it.
This is the line I'd defend hardest: structured outputs / function calling get you a valid shape. They don't get you a checked value. A schema says "this is a string from a set"; it doesn't run your range check or tell you the model went off-menu.
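Here is a small sketch of what that deterministic pass might look like. The three validator strings and the `invalid_fields` name come from the text above; the parsing rules, the `FieldRule` shape, the function names, and the choice to treat a missing value as plain not-found (without flagging it) are all assumptions.

```typescript
interface FieldRule {
  field: string;
  validator: string; // "present" | "range:<min>-<max>" | "enum:<a>,<b>,..."
}

interface ValidatedSignals {
  signals: Record<string, unknown>; // null means not-found
  invalid_fields: string[];
}

// Checks a value that is known to be present against one validator string.
function passes(validator: string, value: unknown): boolean {
  if (validator === "present") return true; // absence is handled by the caller
  const range = /^range:(\d+)-(\d+)$/.exec(validator); // non-negative integer bounds only
  if (range) {
    const min = Number(range[1]);
    const max = Number(range[2]);
    return typeof value === "number" && value >= min && value <= max;
  }
  const oneOf = /^enum:(.+)$/.exec(validator);
  if (oneOf) {
    return typeof value === "string" && oneOf[1].split(",").includes(value);
  }
  throw new Error(`unknown validator: ${validator}`);
}

function validateSignals(raw: Record<string, unknown>, rules: FieldRule[]): ValidatedSignals {
  const signals: Record<string, unknown> = {};
  const invalid_fields: string[] = [];
  for (const { field, validator } of rules) {
    const value = raw[field];
    const missing = value === undefined || value === null || value === "";
    if (missing) {
      signals[field] = null; // plain not-found: the model returned nothing
    } else if (passes(validator, value)) {
      signals[field] = value;
    } else {
      signals[field] = null;      // coerced to not-found...
      invalid_fields.push(field); // ...and surfaced so the misbehavior is visible
    }
  }
  return { signals, invalid_fields };
}
```

With these rules, a model answer of `"franchise"` for a detector whose categories are `reseller,affiliate,strategic,none` ends up as `null` in `signals` and as an entry in `invalid_fields`, so downstream code never branches on it.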
3. Escalation is rules, not a confidence threshold
Escalation is separate from the model's self-reported confidence. You write triggers on extracted values:
- a classification trigger fires when the value is in a declared set
- a presence trigger fires when the detector is found
- `required: true` triggers are ANDed; `required: false` triggers are ORed
So "escalate if it's a strategic partnership AND a competitor is mentioned" is expressible and deterministic. No magic 0.7.
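The trigger rules above can be sketched as a pure function over extracted signals. The `Trigger` and `Signal` shapes are hypothetical. How the ANDed and ORed groups combine is not spelled out, so this sketch assumes one reading: every required trigger must fire, and if optional triggers exist, at least one of them must fire too.

```typescript
interface Signal {
  found: boolean;
  value?: string; // the category, for classification detectors
}

type Signals = Record<string, Signal | undefined>;

// Hypothetical trigger shapes; only the `required` flag is named in the post.
type Trigger =
  | { detector: string; kind: "presence"; required: boolean }
  | { detector: string; kind: "classification"; values: string[]; required: boolean };

function triggerFires(trigger: Trigger, signals: Signals): boolean {
  const signal = signals[trigger.detector];
  if (signal === undefined || !signal.found) return false;
  if (trigger.kind === "presence") return true; // fires when the detector is found
  // Classification: fires when the value is in the declared set.
  return signal.value !== undefined && trigger.values.includes(signal.value);
}

function shouldEscalate(triggers: Trigger[], signals: Signals): boolean {
  if (triggers.length === 0) return false;
  const required = triggers.filter((t) => t.required);
  const optional = triggers.filter((t) => !t.required);
  // every() is true for an empty list, so "no required triggers" does not block.
  const allRequired = required.every((t) => triggerFires(t, signals));
  // Assumption: with no optional triggers, the OR group is not consulted.
  const anyOptional = optional.length === 0 || optional.some((t) => triggerFires(t, signals));
  return allRequired && anyOptional;
}

// "Escalate if it's a strategic partnership AND a competitor is mentioned."
const strategicWithCompetitor: Trigger[] = [
  { detector: "partnership_inquiry", kind: "classification", values: ["strategic"], required: true },
  { detector: "competitor_mentioned", kind: "presence", required: true },
];
```

Nothing here reads a confidence score: the same extracted values always produce the same escalation decision.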
4. One call, structured decision out
POST /v1/evaluate → { status, extracted_signals, next_action }
status is one of QUALIFIED / PARTIAL / FAILED / ESCALATE. That's the whole
point: the thing my switch branches on is a small closed enum, not free text I
have to parse and pray over. It drops straight into n8n/Zapier/Make. You can
also POST real outcomes back later (converted? deal value? days to close?) so
the rubric can be measured against reality instead of vibes.
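On the consuming side, branching on a closed enum might look like the sketch below. The four status values and the three response field names come from the text above; the type of `next_action`, the `parseStatus` guard, and the destination names returned by `route` are hypothetical.

```typescript
const STATUSES = ["QUALIFIED", "PARTIAL", "FAILED", "ESCALATE"] as const;
type Status = (typeof STATUSES)[number];

interface EvaluateResponse {
  status: Status;
  extracted_signals: Record<string, unknown>;
  next_action: string; // assumption: the post does not specify this field's shape
}

// Guard at the boundary: reject anything outside the closed set before branching.
function parseStatus(value: unknown): Status {
  const match = STATUSES.find((s) => s === value);
  if (match === undefined) throw new Error(`unexpected status: ${String(value)}`);
  return match;
}

// Destination names are made up for illustration.
function route(response: EvaluateResponse): string {
  switch (response.status) {
    case "QUALIFIED":
      return "sales-queue";
    case "PARTIAL":
      return "needs-more-info";
    case "FAILED":
      return "archive";
    case "ESCALATE":
      return "human-review";
    default: {
      // Compile-time exhaustiveness: adding a fifth status breaks the build here.
      const unreachable: never = response.status;
      throw new Error(`unhandled status: ${String(unreachable)}`);
    }
  }
}
```

The `never` assignment in the default branch is the payoff of a small closed enum: the compiler, not a reviewer, notices when a new status goes unhandled.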
What I got wrong / what's still ugly
Being honest, because these are real:
- Everything hits the LLM today. Even an obvious keyword goes through a model call. The plan is a pre-LLM pattern extractor so deterministic signals never pay for inference — not built yet. So cost/latency is "one batched LLM call per eval": fine for inbound webhooks, not for high-QPS streams.
- It's synchronous. Long transcripts are slow; I prepend a structured header (severity/tier) instead of dumping a 5k-token thread.
- Batching detectors into one prompt keeps cost down but lets one detector's phrasing bleed into another's extraction. Isolating them costs N calls. I chose cost; not sure it's right.
- Multi-turn is naive. Re-evaluating a growing conversation re-sends the whole thing. Delta prompts are on the list.
The question I actually have
Is "declare + validate + smoke-test" the right altitude? Or do people doing this seriously want prompt-level control and would find the abstraction a cage the first time they hit an edge case?
My bet: for the 80% case — lead qual, ticket triage, intent on inbound — nobody should be hand-maintaining a classification prompt, the same way nobody hand-writes a query planner. But I've been wrong about abstractions before. Curious where this breaks for you.
I packaged this up as the evaluation API behind EchoStack — you can run an evaluation on your own text in the demo (no signup) or skim the API quickstart.