Stop writing prompts to classify text: make evaluation declarative


Ayoola Solomon · Jun 2, 2026 · 5 min read

Originally published on DEV Community by Ayoola Solomon. Read on the original site



I've built the same thing more than once: a step that reads an inbound message — a lead form, a support ticket, a DM — and decides what to do with it. Qualify it, escalate it, route it, drop it.

Every time, the implementation had the same shape: hand-write a prompt that asks an LLM to return JSON, parse the JSON, branch on it. And every time it rotted the same way:

  • The prompt was untestable. "Looks right" was the only QA.
  • It drifted. A model upgrade or a one-word prompt tweak silently changed the output and nobody noticed until something got misrouted.
  • The JSON lied. The model would confidently return a category that wasn't in my allowed set, or a number outside the range I expected, and my downstream switch would happily act on garbage.
  • "Confidence" was theater. A check like if confidence > 0.7 hinges on a magic number that doesn't mean the same thing across different inputs.

Eventually I stopped writing prompts. This is what I built instead and what I learned.

The core idea: declare what to detect, not how to ask

Instead of a prompt, you declare typed detectors. Two kinds:

  • presence — "is this signal in the text?" → returns found
  • classification — "put this into exactly one of a fixed set of categories" → returns the category, enum-validated

A declaration for routing partnership inquiries looks like this:

{
  "name": "Partnership routing",
  "template": "custom",
  "custom_detectors": [
    {
      "name": "competitor_mentioned",
      "type": "presence",
      "examples": ["we're currently using Acme", "switching off a competitor"],
      "non_examples": ["love your product"]
    },
    {
      "name": "partnership_inquiry",
      "type": "classification",
      "categories": ["reseller", "affiliate", "strategic", "none"],
      "examples": [
        "interested in your reseller program",
        "want to co-sell with you",
        "just a support question"
      ]
    }
  ]
}

You never see the prompt — it's compiled from the declaration. That part isn't the interesting bit; anyone can template a prompt. The interesting bit is everything that becomes possible because a detector is a typed object instead of a string.
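To make "a typed object instead of a string" concrete, here is a minimal TypeScript sketch of the declaration above and of one way it could be compiled into prompt text. The type names, the `compileDetector` and `compilePrompt` helpers, and the wording of the generated instructions are all assumptions for illustration, not the actual EchoStack compiler.

```typescript
// Hypothetical types mirroring the JSON declaration; the real schema may differ.
type PresenceDetector = {
  name: string;
  type: "presence";
  examples: string[];
  non_examples?: string[];
};

type ClassificationDetector = {
  name: string;
  type: "classification";
  categories: string[];
  examples: string[];
};

type Detector = PresenceDetector | ClassificationDetector;

// Assumed compilation step: one typed detector becomes one instruction line.
function compileDetector(d: Detector): string {
  const examples = d.examples.map((e) => JSON.stringify(e)).join(", ");
  if (d.type === "presence") {
    const negatives = (d.non_examples ?? [])
      .map((e) => JSON.stringify(e))
      .join(", ");
    return (
      `${d.name}: answer "found" or "not_found". Positive examples: ${examples}.` +
      (negatives ? ` Negative examples: ${negatives}.` : "")
    );
  }
  // After the presence branch, d is narrowed to ClassificationDetector.
  return `${d.name}: answer exactly one of [${d.categories.join(", ")}]. Examples: ${examples}.`;
}

// All detectors are batched into a single prompt, one line each.
function compilePrompt(detectors: Detector[]): string {
  return detectors.map(compileDetector).join("\n");
}

const detectors: Detector[] = [
  {
    name: "competitor_mentioned",
    type: "presence",
    examples: ["we're currently using Acme", "switching off a competitor"],
    non_examples: ["love your product"],
  },
  {
    name: "partnership_inquiry",
    type: "classification",
    categories: ["reseller", "affiliate", "strategic", "none"],
    examples: [
      "interested in your reseller program",
      "want to co-sell with you",
      "just a support question",
    ],
  },
];

const prompt = compilePrompt(detectors);
console.log(prompt);
```

Because `Detector` is a discriminated union, the compiler rejects a classification detector without `categories` before anything reaches a model.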

1. Detectors are tested at create-time, not in prod

Every detector requires at least one positive example. When you create the evaluation, those examples run as a smoke test: each positive must actually match, and a classification example must land inside its declared categories. If it doesn't, creation fails.

This is the part I wish I'd had years ago. A prompt can be syntactically fine and semantically broken, and you find out in production. Here, a broken detector can't ship — the assertion runs before it's ever live. (non_examples are presence-only, because a classification detector always lands somewhere, so there's no "not found" state to assert.)
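A create-time smoke test along these lines can be sketched in a few lines of TypeScript. The `Runner` type stands in for the model call, and `smokeTest`, `createEvaluation`, and the keyword-based `fakeRun` stub are hypothetical names used only to show the control flow. Treating a matching non_example as a failure is an assumption drawn from the parenthetical above.

```typescript
type Detector =
  | { name: string; type: "presence"; examples: string[]; non_examples?: string[] }
  | { name: string; type: "classification"; categories: string[]; examples: string[] };

// Stand-in for the model call: returns "found" / "not_found" for presence,
// or a category label for classification.
type Runner = (detector: Detector, text: string) => string;

// Returns a list of human-readable failures; an empty list means the detector passed.
function smokeTest(detector: Detector, run: Runner): string[] {
  const failures: string[] = [];
  if (detector.examples.length === 0) {
    failures.push(`${detector.name}: at least one positive example is required`);
  }
  for (const text of detector.examples) {
    const out = run(detector, text);
    if (detector.type === "presence" && out !== "found") {
      failures.push(`${detector.name}: positive example did not match: "${text}"`);
    }
    if (detector.type === "classification" && detector.categories.indexOf(out) === -1) {
      failures.push(`${detector.name}: "${text}" landed outside the categories: ${out}`);
    }
  }
  if (detector.type === "presence") {
    // Assumption: a non_example that matches counts as a failure.
    for (const text of detector.non_examples ?? []) {
      if (run(detector, text) === "found") {
        failures.push(`${detector.name}: non-example matched: "${text}"`);
      }
    }
  }
  return failures;
}

// Creation fails loudly if any detector fails its own examples.
function createEvaluation(detectors: Detector[], run: Runner): Detector[] {
  const failures: string[] = [];
  for (const d of detectors) {
    failures.push(...smokeTest(d, run));
  }
  if (failures.length > 0) {
    throw new Error(`creation failed:\n${failures.join("\n")}`);
  }
  return detectors;
}

// Keyword stub so the sketch runs without a model.
const fakeRun: Runner = (d, text) => {
  if (d.type === "presence") {
    return /acme|competitor/i.test(text) ? "found" : "not_found";
  }
  if (/reseller/i.test(text)) return "reseller";
  if (/co-sell/i.test(text)) return "strategic";
  return "none";
};

const competitorMentioned: Detector = {
  name: "competitor_mentioned",
  type: "presence",
  examples: ["we're currently using Acme", "switching off a competitor"],
  non_examples: ["love your product"],
};

const created = createEvaluation([competitorMentioned], fakeRun);
```

Swapping `fakeRun` for a real model call is the only change needed to turn this into an actual gate.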

2. The output is validated deterministically, not trusted

The LLM proposes; deterministic code disposes. The validators are boring on purpose: present, range:0-100, enum:yes,no. An out-of-set classification doesn't get to pass — it's coerced to not-found and surfaced in an invalid_fields list so you can see the model misbehaved instead of silently acting on it.

This is the line I'd defend hardest: structured outputs / function calling get you a valid shape. They don't get you a checked value. A schema says "this is a string from a set"; it doesn't run your range check or tell you the model went off-menu.
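As a rough illustration of "the LLM proposes; deterministic code disposes", the sketch below parses the three rule strings mentioned above and applies them to a model's raw output. The `validate` and `checkOutput` helpers, the use of `null` to represent not-found, and the `lead_score` field are assumptions for the example. Only the rule notation and the invalid_fields name come from the article.

```typescript
// Applies one rule in the article's notation: "present", "range:0-100", "enum:yes,no".
// This simple parser assumes non-negative range bounds.
function validate(rule: string, raw: unknown): boolean {
  if (rule === "present") {
    return raw !== null && raw !== undefined && raw !== "";
  }
  if (rule.indexOf("range:") === 0) {
    const [lo, hi] = rule.slice("range:".length).split("-").map(Number);
    return typeof raw === "number" && raw >= lo && raw <= hi;
  }
  if (rule.indexOf("enum:") === 0) {
    const allowed = rule.slice("enum:".length).split(",");
    return typeof raw === "string" && allowed.indexOf(raw) !== -1;
  }
  throw new Error(`unknown validator: ${rule}`);
}

type Checked = {
  fields: Record<string, unknown>;
  invalid_fields: string[];
};

// Keeps values that pass their rule; coerces the rest to null (not-found)
// and records the field name so the misbehavior is visible.
function checkOutput(
  rules: Record<string, string>,
  output: Record<string, unknown>
): Checked {
  const fields: Record<string, unknown> = {};
  const invalid_fields: string[] = [];
  for (const name of Object.keys(rules)) {
    if (validate(rules[name], output[name])) {
      fields[name] = output[name];
    } else {
      fields[name] = null;
      invalid_fields.push(name);
    }
  }
  return { fields, invalid_fields };
}

// The model went off-menu on the category but stayed in range on the score.
const checked = checkOutput(
  {
    partnership_inquiry: "enum:reseller,affiliate,strategic,none",
    lead_score: "range:0-100", // hypothetical numeric field
  },
  { partnership_inquiry: "distributor", lead_score: 82 }
);
console.log(checked);
```

Here "distributor" would pass a loose string schema, but it fails the enum check, so downstream code sees a not-found value plus an entry in `invalid_fields` rather than a category it never declared.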

3. Escalation is rules, not a confidence threshold

Escalation is separate from the model's self-reported confidence. You write triggers on extracted values:

  • a classification trigger fires when the value is in a declared set
  • a presence trigger fires when the detector is found
  • triggers marked required: true are ANDed together; triggers marked required: false are ORed

So "escalate if it's a strategic partnership AND a competitor is mentioned" is expressible and deterministic. No magic 0.7.
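The rule above can be sketched as plain boolean logic over extracted values. The `Trigger` and `Signals` shapes are hypothetical, and so is the way the two groups combine. This sketch assumes escalation happens when every required trigger fires (and at least one exists) or when any optional trigger fires. The article doesn't spell out that combination.

```typescript
// Extracted values: presence detectors yield booleans, classifications yield a label.
type Signals = Record<string, string | boolean>;

type Trigger =
  | { detector: string; type: "presence"; required: boolean }
  | { detector: string; type: "classification"; values: string[]; required: boolean };

// A presence trigger fires when the detector was found;
// a classification trigger fires when the value is in the declared set.
function fires(t: Trigger, signals: Signals): boolean {
  const value = signals[t.detector];
  if (t.type === "presence") {
    return value === true;
  }
  return typeof value === "string" && t.values.indexOf(value) !== -1;
}

// Assumed combination: (all required triggers fire) OR (any optional trigger fires).
function shouldEscalate(triggers: Trigger[], signals: Signals): boolean {
  const required = triggers.filter((t) => t.required);
  const optional = triggers.filter((t) => !t.required);
  const allRequired =
    required.length > 0 && required.every((t) => fires(t, signals));
  const anyOptional = optional.some((t) => fires(t, signals));
  return allRequired || anyOptional;
}

// "Escalate if it's a strategic partnership AND a competitor is mentioned."
const triggers: Trigger[] = [
  { detector: "partnership_inquiry", type: "classification", values: ["strategic"], required: true },
  { detector: "competitor_mentioned", type: "presence", required: true },
];

const escalate = shouldEscalate(triggers, {
  partnership_inquiry: "strategic",
  competitor_mentioned: true,
});
console.log(escalate);
```

The same inputs always produce the same decision, and the model's self-reported confidence never enters into it.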

4. One call, structured decision out

POST /v1/evaluate → { status, extracted_signals, next_action }

status is one of QUALIFIED / PARTIAL / FAILED / ESCALATE. That's the whole point: the thing my switch branches on is a small closed enum, not free text I have to parse and pray over. It drops straight into n8n/Zapier/Make. You can also POST real outcomes back later (converted? deal value? days to close?) so the rubric can be measured against reality instead of vibes.
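On the consuming side, a closed enum means the branch can be checked exhaustively by the compiler. The sketch below models only what the article states: the three response fields and the four status values. The shapes of extracted_signals and next_action, the `parseStatus` guard, and the route names are assumptions.

```typescript
const STATUSES = ["QUALIFIED", "PARTIAL", "FAILED", "ESCALATE"] as const;
type Status = (typeof STATUSES)[number];

type EvaluateResponse = {
  status: Status;
  extracted_signals: Record<string, unknown>; // shape assumed
  next_action: unknown; // shape not described in the article; left opaque
};

// Guards untyped JSON: anything outside the closed set is rejected up front.
function parseStatus(raw: unknown): Status {
  const match = STATUSES.filter((s) => s === raw)[0];
  if (match === undefined) {
    throw new Error(`unexpected status: ${String(raw)}`);
  }
  return match;
}

// Hypothetical routing targets. The `never` check makes the compiler
// complain if a new status is added without a matching case.
function route(status: Status): string {
  switch (status) {
    case "QUALIFIED":
      return "send-to-sales";
    case "PARTIAL":
      return "request-more-info";
    case "FAILED":
      return "archive";
    case "ESCALATE":
      return "page-a-human";
    default: {
      const unreachable: never = status;
      throw new Error(`unhandled status: ${unreachable}`);
    }
  }
}

function handle(response: EvaluateResponse): string {
  return route(response.status);
}

const example: EvaluateResponse = {
  status: parseStatus("ESCALATE"),
  extracted_signals: { competitor_mentioned: true },
  next_action: null,
};
console.log(handle(example));
```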

What I got wrong / what's still ugly

Being honest, because these are real:

  • Everything hits the LLM today. Even an obvious keyword goes through a model call. The plan is a pre-LLM pattern extractor so deterministic signals never pay for inference — not built yet. So cost/latency is "one batched LLM call per eval": fine for inbound webhooks, not for high-QPS streams.
  • It's synchronous. Long transcripts are slow; I prepend a structured header (severity/tier) instead of dumping a 5k-token thread.
  • Batching detectors into one prompt keeps cost down but lets one detector's phrasing bleed into another's extraction. Isolating them costs N calls. I chose cost; not sure it's right.
  • Multi-turn is naive. Re-evaluating a growing conversation re-sends the whole thing. Delta prompts are on the list.
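For the first item, a pre-LLM pass could look roughly like the sketch below. It is not the author's implementation, which doesn't exist yet. `PatternRule`, `preExtract`, and the Acme regex are invented for illustration. The sketch also assumes a regex can only confirm a signal, never rule it out, so unmatched detectors still go to the model.

```typescript
// A deterministic rule that can settle a presence detector without a model call.
// Patterns should not use the "g" flag, since RegExp.test is stateful with it.
type PatternRule = { detector: string; pattern: RegExp };

type PreExtraction = {
  found: Record<string, true>; // detectors resolved by pattern alone
  needsModel: string[]; // detectors that still require inference
};

function preExtract(
  text: string,
  detectorNames: string[],
  rules: PatternRule[]
): PreExtraction {
  const found: Record<string, true> = {};
  for (const rule of rules) {
    if (rule.pattern.test(text)) {
      found[rule.detector] = true;
    }
  }
  // A miss is inconclusive, so only positive matches skip the model.
  const needsModel = detectorNames.filter((name) => !found[name]);
  return { found, needsModel };
}

const pre = preExtract(
  "We're currently using Acme and want to co-sell with you",
  ["competitor_mentioned", "partnership_inquiry"],
  [{ detector: "competitor_mentioned", pattern: /\bacme\b/i }]
);
console.log(pre);
```

When `needsModel` comes back empty the LLM call can be skipped entirely. Otherwise only the remaining detectors need to be batched into the prompt.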

The question I actually have

Is "declare + validate + smoke-test" the right altitude? Or do people doing this seriously want prompt-level control and would find the abstraction a cage the first time they hit an edge case?

My bet: for the 80% case — lead qual, ticket triage, intent on inbound — nobody should be hand-maintaining a classification prompt, the same way nobody hand-writes a query planner. But I've been wrong about abstractions before. Curious where this breaks for you.


I packaged this up as the evaluation API behind EchoStack — you can run an evaluation on your own text in the demo (no signup) or skim the API quickstart.
