Lara praneeth kondeti

Posted on Jun 24

Spec-First Engineering with Specmatic: Contract-Testing a Multi-Agent AI Assistant

#specmatic #testing #api #python

When I started this challenge, I thought of an API specification as documentation — something you write after the code works, to tell other people how to call it. By the end, I had completely inverted that view. The OpenAPI specification became the single source of truth that my code had to honour, the script my tests ran from, and the definition my mocks had to stay faithful to. This post walks through how I applied Specmatic's spec-first approach to TRIO, my multi-agent AI assistant, and the things I learned — often the hard way — across several rounds of review.

The project: TRIO

TRIO is a local-first, multi-agent AI assistant I built. The backend is a FastAPI service; the frontend is React. It has a Jarvis-style voice overlay, speech-to-text using Faster-Whisper, text-to-speech via Piper, a ChromaDB-backed memory, and a set of agents coordinated by an agent manager. The backend exposes a focused HTTP API: health and system-info endpoints, agent listing, full CRUD for conversations, a chat endpoint that routes a user message through the agent manager to a local LLM (Ollama), and a voice endpoint that converts text to speech and returns WAV audio.
That chat endpoint — and its dependency on a live LLM — turned out to be the heart of the interesting testing problems, but I'm getting ahead of myself.

Why contract testing, and why spec-first

The premise of contract testing is simple but powerful: instead of writing tests that poke at your API and assert on whatever it happens to return, you write a precise specification of how the API should behave, and then a tool verifies that the running service actually conforms to that specification. The specification is executable. If the code drifts from the contract, the tests fail; if the contract is wrong, that surfaces too.
I wrote TRIO's contract as an OpenAPI 3.0 document, trio_api.yaml, describing every endpoint, every request body, every response shape and status code. Specmatic reads this file, spins up tests against the running backend, and reports whether each documented operation behaves as specified. From the very first run, this caught things I would never have written a manual test for — response fields that didn't quite match, status codes that were subtly wrong, edge cases I hadn't considered.

Resiliency testing: letting the tool attack my API

Beyond checking the happy path, Specmatic can run schema-based resiliency tests. Rather than only sending the well-formed requests I documented, it generates mutated requests — wrong data types, missing required fields, nulls where strings belong — and checks that my backend rejects them gracefully with the correct validation error, instead of crashing or, worse, silently accepting garbage.
This is where my contract first met reality. FastAPI returns a 422 Unprocessable Entity for validation failures, with a fairly specific body structure. My specification's error schema was imprecise, and Specmatic, which validates the entire response and not just the status code, flagged a cascade of mismatches. Each one taught me something about how FastAPI actually behaves. The loc array in a validation error contains a mix of strings and integers (field names interleaved with array indices), so its items had to be modelled as oneOf: [string, integer]. FastAPI echoes the offending value back in an input field that can be literally any type, so the schema had to leave that field's type unconstrained. And conversation_id legitimately accepts null for a brand-new chat that doesn't have an id yet, so the spec had to mark it nullable; otherwise the resiliency run treated null as invalid input and expected a rejection that the backend, correctly, never sent. None of these were the code being wrong; they were the contract being corrected to describe the implementation's real behaviour precisely. That precision is the entire point.
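To make those three findings concrete, here is a minimal sketch of a checker for the shape of a FastAPI 422 body. It is not part of TRIO or Specmatic; the function name is hypothetical, and it only encodes the rules described above: loc items are strings or integers, and input is left unconstrained.

```python
from typing import Any


def is_validation_error_body(body: Any) -> bool:
    """Return True if `body` has the shape of a FastAPI 422 payload.

    Hypothetical helper for illustration only; it mirrors the schema rules
    described in the text, not Specmatic's own validation logic.
    """
    if not isinstance(body, dict) or not isinstance(body.get("detail"), list):
        return False
    for item in body["detail"]:
        if not isinstance(item, dict):
            return False
        loc = item.get("loc")
        if not isinstance(loc, list):
            return False
        # loc mixes field names (str) and array indices (int).
        # bool is a subclass of int in Python, so exclude it explicitly.
        for part in loc:
            if isinstance(part, bool) or not isinstance(part, (str, int)):
                return False
        if not isinstance(item.get("msg"), str):
            return False
        if not isinstance(item.get("type"), str):
            return False
        # "input" is deliberately not checked: it may be any JSON type.
    return True
```

The deliberate absence of a check on input is the point: a schema that pins that field to a single type will reject responses the backend is entitled to send.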

The hardest problem: testing an endpoint that calls an LLM

My /api/chat/ endpoint sends the user's message to a local large language model through Ollama and returns the generated reply. This is wonderful for users and miserable for testing, for three reasons. The model is slow — every call takes seconds, so a test suite that hits it dozens of times grinds to a halt. The model may not be running at all, in which case the request hangs or errors. And the model is non-deterministic — it returns a different answer every time, so there's no fixed response you can assert on.
My first instinct, which I'm glad a reviewer talked me out of, was to hardcode a fixed reply inside the application when running under test. That is an anti-pattern: it fakes the answer inside the very code you're trying to test, and it means your tests are no longer exercising the real request-handling path. You end up proving that your fake works, not that your application does.
The correct technique is service virtualization — and Specmatic does this too. Instead of faking the LLM inside my code, I stood up a mock of the LLM outside my code, driven by its own specification. I wrote an OpenAPI spec describing the subset of the Ollama API that TRIO actually calls — /api/chat, /api/generate, and /api/tags — and ran Specmatic in stub mode against it. That produces a fast, deterministic fake Ollama listening on a port, returning contract-valid responses instantly. I then pointed my backend at the mock by setting the OLLAMA_BASE_URL environment variable, which TRIO already used for configuration, so no application code changed at all.
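A small sketch of why no application code had to change: if the backend resolves the Ollama address from OLLAMA_BASE_URL at call time, swapping in the mock is purely a matter of environment. The helper name and the mock port below are assumptions for illustration; only the variable name comes from TRIO, and 11434 is Ollama's usual default port.

```python
import os
from typing import Mapping

# Ollama's usual default address; used when no override is configured.
DEFAULT_OLLAMA_URL = "http://localhost:11434"


def ollama_endpoint(path: str, env: Mapping[str, str] = os.environ) -> str:
    """Build the full URL for an Ollama API path.

    Hypothetical helper: reads OLLAMA_BASE_URL from `env` so that tests can
    point the backend at a mock without touching application code.
    """
    base = env.get("OLLAMA_BASE_URL", DEFAULT_OLLAMA_URL).rstrip("/")
    return f"{base}/{path.lstrip('/')}"


if __name__ == "__main__":
    # With the variable set to the mock's address (port chosen for
    # illustration), every call lands on the mock instead of the real model.
    mock_env = {"OLLAMA_BASE_URL": "http://localhost:9000"}
    print(ollama_endpoint("/api/chat", mock_env))
```

Passing the environment in as a parameter, rather than reading os.environ deep inside the function, also keeps the helper itself easy to test.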
The result is that testing /api/chat/ now exercises the entire real pipeline — Specmatic drives TRIO, TRIO handles the request and routes it through the agent manager, the agent manager calls what it believes is Ollama, that call lands on the Specmatic mock, and a deterministic response flows all the way back. The only thing faked is the LLM itself, and it is faked outside the application by a spec-driven server, not hardcoded inside it. The chat endpoint that used to hang now responds instantly and predictably, so it can be included in the full test run with nothing filtered out, and the suite reaches complete coverage with chat fully tested. The difference between this and hardcoding is the difference between proving the AI flow works and merely claiming it does.

Upgrading the configuration to v3

Partway through, I migrated my Specmatic configuration from the older v2 format to the v3 format, which wires the system-under-test, the spec sources, and the run options together explicitly. The property nesting in v3 is strict, and at first the parser errors were frustrating — until I realised the errors themselves were the documentation. Each one names exactly which properties are valid at that level, so the migration became a methodical exercise in following the errors down to the correct structure. It was a good reminder that a strict tool with clear error messages is teaching you its model, not just rejecting your input.

Making the mock faithful, not just convenient

A later round of review delivered a lesson that stuck with me. My Ollama mock worked, but in building it I had quietly renamed some schema components and changed which fields were required, to suit my own convenience. The reviewer pushed back, and rightly: a mock of a real service should mirror that service's actual contract, or it stops being a trustworthy stand-in. If my mock's ChatResponse requires different fields than real Ollama's, then passing tests against my mock prove nothing about whether TRIO would work against the real thing.
So I reworked the mock to map faithfully onto the official Ollama OpenAPI specification — using its real component names (ChatRequest, ChatMessage, ChatResponse, GenerateRequest, GenerateResponse, ListResponse, ModelSummary, ModelOptions) and matching its required fields exactly. The /api/tags response, for instance, is modelled with the official ListResponse containing ModelSummary items, not a custom-named type I'd invented. I kept only the deviations that don't compromise fidelity — including just the endpoints TRIO calls, omitting unused fields, and adding concrete examples so the mock can serve responses — and I documented every one of those deviations in a notes file in the repository so the reasoning is explicit and auditable. The principle: trimming a contract down to what you use is fine; quietly rewriting its shape is not.
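One way to keep that principle from eroding is a small fidelity check that compares the trimmed mock spec against the official one. The sketch below is an assumption about how such a check could look, not something from the TRIO repository: it takes both specs as already-parsed dictionaries, flags component names that do not exist upstream, and flags any shared component whose required list differs.

```python
from typing import Any, Dict


def spec_fidelity_report(official: Dict[str, Any], mock: Dict[str, Any]) -> Dict[str, Any]:
    """Compare a trimmed mock spec with the official spec it is derived from.

    Hypothetical helper. Omitting components is allowed (trimming is fine);
    renaming them or changing their required fields is reported.
    """
    official_schemas = official.get("components", {}).get("schemas", {})
    mock_schemas = mock.get("components", {}).get("schemas", {})

    unknown = []    # components the official spec does not define
    mismatch = {}   # components whose required fields differ

    for name, schema in mock_schemas.items():
        if name not in official_schemas:
            unknown.append(name)
            continue
        want = sorted(official_schemas[name].get("required", []))
        got = sorted(schema.get("required", []))
        if want != got:
            mismatch[name] = {"official": want, "mock": got}

    return {"unknown_components": sorted(unknown), "required_mismatch": mismatch}


if __name__ == "__main__":
    # Toy data only: these required lists are invented for the example and
    # are not taken from the real Ollama specification.
    official = {"components": {"schemas": {"ChatResponse": {"required": ["model", "message"]}}}}
    mock = {"components": {"schemas": {"ChatResponse": {"required": ["message"]}, "MyReply": {}}}}
    print(spec_fidelity_report(official, mock))
```

An empty report means the mock only trims the official contract; anything listed is a place where its shape has been rewritten.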

Contract versus resiliency: two runs, two reports

I had been treating a single Specmatic run as both my contract report and my resiliency report, which meant the two were identical — and a reviewer noticed. They are conceptually different things. The contract run verifies the documented happy paths: does each endpoint, given a valid request, return the specified response? The resiliency run goes further, generating negative and mutated requests to check the API degrades gracefully under bad input. I separated them cleanly — the contract run uses the standard tests, and the resiliency run enables generation through the SPECMATIC_GENERATIVE_TESTS environment variable — and wired both into CI as distinct steps that publish two separate report artifacts. Seeing the two reports genuinely differ in size and content is itself a useful signal that the two suites are doing different jobs.
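The separation comes down to running the same test command under two different environments. Here is a minimal sketch of that, assuming the variable is switched on with the value "true"; the helper name is hypothetical, and the command used to launch Specmatic is left out because it depends on how it is installed.

```python
import os
from typing import Dict, Mapping, Optional


def specmatic_env(generative: bool, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment for one kind of Specmatic run.

    Hypothetical helper. The contract run leaves generation off; the
    resiliency run turns it on through SPECMATIC_GENERATIVE_TESTS
    (the value "true" is an assumption).
    """
    env = dict(os.environ if base is None else base)
    if generative:
        env["SPECMATIC_GENERATIVE_TESTS"] = "true"
    else:
        # Make sure a stray value in the parent shell cannot leak in and
        # turn the contract run into a second resiliency run.
        env.pop("SPECMATIC_GENERATIVE_TESTS", None)
    return env
```

Each dictionary can then be handed to subprocess.run(command, env=...) for its own step, with each step writing its report to a separate directory.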

Covering the negative paths explicitly

Even after separating the reports, the contract report showed a handful of skipped tests: the validation (422) paths weren't being exercised in the pure contract run, because the only examples I had provided were happy-path ones. The resiliency run covered them through generation, but the contract itself stayed silent on them. The fix was to add paired negative examples — for each affected endpoint, a request that intentionally violates the schema (a wrong-typed required field) tied to the corresponding 422 response body. With those in place, the contract run executes the error paths too and the report shows no skipped tests. The deeper point is that documenting the negatives as examples puts the expected error behaviour into the contract, where it belongs, rather than leaving it to be discovered only by generation.
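A paired negative example can be sketched as a request and its expected 422 response in one object. The snippet below builds one as a Python dictionary using the http-request and http-response layout of Specmatic's external example files as I understand it; the helper name, the message field, and the endpoint's exact request shape are assumptions for illustration, while the error type and message are what Pydantic v2 reports for a non-string value.

```python
import json
from typing import Any, Dict, List, Union


def negative_example(path: str, bad_body: Dict[str, Any],
                     loc: List[Union[str, int]], bad_value: Any) -> Dict[str, Any]:
    """Pair a schema-violating request with the 422 body it should produce.

    Hypothetical helper; the key names follow Specmatic's external example
    layout, but treat the exact file format as an assumption.
    """
    return {
        "http-request": {
            "method": "POST",
            "path": path,
            "headers": {"Content-Type": "application/json"},
            "body": bad_body,
        },
        "http-response": {
            "status": 422,
            "body": {
                "detail": [{
                    "type": "string_type",
                    "loc": loc,
                    "msg": "Input should be a valid string",
                    "input": bad_value,  # FastAPI echoes the offending value
                }]
            },
        },
    }


if __name__ == "__main__":
    # "message" is an assumed field name; 123 is the wrong type for a string.
    example = negative_example("/api/chat/", {"message": 123}, ["body", "message"], 123)
    print(json.dumps(example, indent=2))
```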

The voice endpoint and honest graceful degradation

The voice endpoint gave me my most stubborn problem. It returns WAV audio produced by a text-to-speech engine, and that engine simply isn't available everywhere — not on a headless CI runner, and not on every developer machine. So the endpoint kept returning a 503 in environments without TTS, which failed the contract test that expected a 200.
I resisted two tempting shortcuts. One was to hardcode a fake audio response under test — the same anti-pattern I'd already rejected for the LLM. The other was to simply document the 503 as acceptable, which would have left the endpoint less than fully covered. Instead I made the TTS layer degrade honestly: it tries Piper first, then espeak, and if no engine is available it returns a valid, well-formed silent WAV file — a real audio file with a proper RIFF/WAVE header, just containing silence. This isn't faking a response; it's a legitimate product behaviour. When no speech engine is present, the endpoint still returns valid audio/wav, deterministically, in every environment. That fixed the coverage cleanly and honestly, and it also fixed a real bug I found along the way: the endpoint had been catching its own intended 503 in a generic exception handler and re-raising it as a 500, which I corrected so the deliberate status code propagates.
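The fallback chain can be sketched with the standard library alone. This is not TRIO's actual code: the function names, the half-second duration, and the 16 kHz mono 16-bit format are assumptions, and the real Piper and espeak calls are represented by plain callables.

```python
import io
import wave
from typing import Callable, Iterable


def silent_wav(duration_s: float = 0.5, sample_rate: int = 16000) -> bytes:
    """Return a well-formed WAV file (mono, 16-bit PCM) containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(duration_s * sample_rate))
    return buffer.getvalue()


def synthesize(text: str, engines: Iterable[Callable[[str], bytes]]) -> bytes:
    """Try each engine in order; fall back to silence if none produces audio.

    Each engine is a callable taking text and returning WAV bytes, standing
    in for Piper, espeak, and so on.
    """
    for engine in engines:
        try:
            audio = engine(text)
        except Exception:
            continue  # engine missing or failed: try the next one
        if audio:
            return audio
    return silent_wav()
```

Because the fallback goes through the wave module, the result carries a genuine RIFF/WAVE header and any audio client can open it, which is what makes this a product behaviour instead of a test fixture.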

Continuous integration

Finally, all of this runs automatically. The GitHub Actions workflow, on every push, sets up Python, installs the backend and the audio dependencies, starts the Specmatic mock of Ollama, starts the TRIO backend pointed at that mock, runs the contract suite and then the resiliency suite as separate steps, and uploads both reports as artifacts. Because the mock starts first, the chat and voice endpoints are exercised in CI exactly as they are locally — nothing skipped, nothing filtered. Every change to the backend is now checked against the contract before it can quietly break anything.
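The ordering only helps if each service is actually listening before the next step begins. A small readiness check along these lines is one way to enforce that; it is a sketch, not a step from the actual workflow, and the function name and timings are assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.2) -> bool:
    """Poll until a TCP port accepts connections, or give up after timeout_s.

    Hypothetical helper: returns True as soon as a connection succeeds,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Not listening yet (refused or timed out): wait and retry.
            time.sleep(interval_s)
    return False
```

Calling it once for the mock's port before starting the backend, and once for the backend's port before starting the tests, turns a startup race into a clear failure.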

What I took away

The throughline across every round of this work was the same idea, arriving from different directions: the specification is not paperwork bolted on after the code. It is the contract the code must honour, the script the tests run from, and the definition the mocks must stay true to. Service virtualization turned an untestable, non-deterministic AI endpoint into a fully covered one without a GPU, without a running model, and without faking anything inside my own application. Strict schemas forced my contract to describe reality precisely. And faithful mocks kept my tests honest. Each piece of reviewer feedback was, in effect, a push to make the specification carry more of that weight — and the system is far more trustworthy for it.

DEV Community