Issue #8: The Rise of Harness Engineering

AI started out as a perceived threat to software engineers’ jobs. The models of yesteryear still required humans in the loop. Today’s models can build complex software rapidly, albeit with an engineer still overseeing things. So, eventually we won’t need software engineers, right? Wrong!

Let’s examine what it actually takes to build complex software using AI.

Imagine you want to build a complex UX around project management, something like what Linear does. Just prompting this into an LLM is usually a recipe for disaster because a simple prompt is grossly underspecified. And when prompts are underspecified, the output regresses to code that looks like one-off scripts.

True software engineering is not additive; it is closer to constraint solving, usually without knowing all the constraints upfront. For example, you don’t just need an auth library; you need one that can easily be extended to support another provider should you switch providers for cost reasons. You may also want libraries that are actively maintained or that follow good security practices. In many cases, the popular choice is not the right choice.

The constraints that humans work with are complex and sometimes cannot be adequately expressed using language alone. So we try to break a single rule down into tens or hundreds of sub-rules that can be expressed, and we hope they add up to what we want. In the agentic era, the question becomes: how do we expose those constraints to an AI agent?

Enter: Harness Engineering

There are two primary levers that a software engineer has to get LLMs to build software:

The prompt and its subtypes, such as skill files and documentation.
The harness.

If you’re calling an LLM inside a loop till the task is done, like we do in a coding agent, then everything that surrounds that loop is called the harness. The harness includes preparing context for the LLM, loading skills, executing the tools it has requested, enforcing security constraints, exposing exploratory utilities to read files and documents, and even verifying the results of the LLM’s actions.
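To make the split between the loop and the harness concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `ToolCall`, `Harness`, `run_agent`, and `scripted_model` are hypothetical names, and the scripted function stands in for a real LLM call. It covers only three of the responsibilities listed above: preparing context, executing requested tools, and enforcing a permission check.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Union


@dataclass
class ToolCall:
    """A tool request emitted by the model (hypothetical shape)."""
    name: str
    argument: str


@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]
    allowed: set[str]  # security constraint: tools the agent may run
    transcript: list[str] = field(default_factory=list)

    def prepare_context(self, task: str) -> list[str]:
        # Context is the task plus every tool result observed so far.
        return [f"task: {task}"] + self.transcript

    def execute(self, call: ToolCall) -> str:
        # Refuse anything outside the allow-list instead of running it.
        if call.name not in self.allowed:
            return f"denied: {call.name}"
        return self.tools[call.name](call.argument)


Model = Callable[[list[str]], Union[ToolCall, str]]


def run_agent(model: Model, harness: Harness, task: str,
              max_steps: int = 5) -> Optional[str]:
    """Call the model in a loop until it returns a final answer (a str)."""
    for _ in range(max_steps):
        action = model(harness.prepare_context(task))
        if isinstance(action, str):
            return action
        result = harness.execute(action)
        harness.transcript.append(f"{action.name}({action.argument}) -> {result}")
    return None  # step budget exhausted without a final answer


def scripted_model(context: list[str]) -> Union[ToolCall, str]:
    # Stand-in for an LLM: picks an action from how many results it has seen.
    seen = len(context) - 1
    if seen == 0:
        return ToolCall("read_file", "README.md")
    if seen == 1:
        return ToolCall("delete_file", "README.md")
    return f"done after {seen} tool calls"


harness = Harness(
    tools={
        "read_file": lambda path: f"<contents of {path}>",
        "delete_file": lambda path: "deleted",
    },
    allowed={"read_file"},
)
answer = run_agent(scripted_model, harness, "summarize the README")
```

The model only ever proposes actions. Whether `delete_file` actually runs is decided by the harness, which is where the engineering judgment lives.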

Earlier this year, OpenAI published a post titled “Harness engineering: leveraging Codex in an agent-first world.” It primarily covers the following decisions and techniques, which I’ll list before discussing the pattern behind them:

Having a way for the engineer to ask for a review from “specific agent reviewers.”
Repository-level tools (like npm scripts) that the agents can execute directly.
Programmatic access to the browser and dev tools, and the necessary skills to understand them.
Programmatic and ephemeral access to logs and metrics.
Documentation with a proper hierarchy instead of stuffing everything inside AGENTS.md (I’d add that better documentation structures and doc-search tools might also prove valuable).
Past design decisions committed to the repository for the agents to review while building future features.

A few of these line items add to the harness. Some of them can be improved by exposing them as tools rather than files.
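As one illustration of exposing a file-based item as a tool, the sketch below turns committed design decisions into a small search function an agent could call. This is not how the OpenAI post implements anything; `DecisionRecord`, `search_decisions`, and the sample records are hypothetical, and the scoring is plain keyword overlap chosen for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRecord:
    """One past design decision (hypothetical shape)."""
    title: str
    body: str


def search_decisions(records: list[DecisionRecord], query: str,
                     limit: int = 3) -> list[str]:
    """Return titles of records sharing the most words with the query."""
    terms = set(query.lower().split())
    scored = []
    for record in records:
        words = set(f"{record.title} {record.body}".lower().split())
        score = len(terms & words)
        if score:
            scored.append((score, record.title))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:limit]]


# Made-up records, for illustration only.
RECORDS = [
    DecisionRecord(
        "auth providers are pluggable",
        "each auth provider implements one interface so we can switch for cost reasons",
    ),
    DecisionRecord(
        "logs are ephemeral",
        "agents query logs through a tool and nothing is persisted",
    ),
    DecisionRecord(
        "npm scripts as entry points",
        "every repository task has an npm script",
    ),
]

auth_hits = search_decisions(RECORDS, "auth provider")
log_hits = search_decisions(RECORDS, "logs tool")
```

With a tool like this, the agent pulls in only the decisions relevant to the feature at hand instead of reading every file up front.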

LLMs are primarily used as a reasoning and generative layer. They also reason via the generative layer, which often causes confusion when people try to separate the two.

Any reasoning layer needs something to actuate. If the LLM is like an algorithm, then the rest of the harness is like the robot it controls: the LLM uses the harness to both sense and act. Harnesses are not something a non-programmer can trivially design, let alone build. LLMs have not swallowed the complexities of software engineering, and engineers are still needed.

It’s just that the complexity of software has been pushed into harnesses. In software, complexity can be swept under an abstraction, but never truly eliminated.

How Harnesses Enable the Building of “Complex” Software

Let’s take a seemingly trivial use case: writing an email using just AI, a task that most people think requires only a “prompt” and an MCP server.

Just asking the LLM to draft an email will make you sound like everyone else. Humans are good at pattern recognition and will often sense that something is subtly off about your writing. And that would be very off-putting.

The real problem is making the text sound like you while keeping it factually accurate and useful, and that is not as easy as it looks. You need the following to even approach a reliable system that almost always works, rather than one that only works sometimes.

A solid structure to specify the input. What would the user have to provide as the absolute bare minimum details?
The general system prompt that activates the right context in the LLM, or effectively fights against the LLM’s preferences that you don’t really care for. Instead of writing “instructions,” this is more like molding the probability distribution of the output.
An annotated set of writing samples. It’s not enough to provide a sample; they need to be annotated with the thought process that existed before they were written and a human critique of the writing itself.
An enrichment of those annotations using an LLM. We would add things like the kinds of emails each sample applies to. (We’ll get to why we’re doing this in a second.)
Exposing a hybrid search tool on those enriched, annotated samples. The enrichment was to improve the recall of the search tool.
A set of tools to fetch pertinent data, like recent customer tickets and their resolutions, before writing to the customer. This is where MCP comes in.
A tool that critiques the writing using the review agents specified. We won’t get the writing reviewed by all the agents all the time. Individual reviewers could check for factual accuracy, professionalism, emotional content, and the tonality of the writing. The outputs themselves can be JSON objects that reference the exact line and the reason for flagging, so that subsequent tool calls can pick this up.
The final presentation of the email, along with reasons for each decision, as well as citations. The citations themselves could be written in a programmatic JSON format that other tools can verify, because LLMs can still hallucinate citations.
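The last two items in the list above lend themselves to a small sketch: reviewer output as JSON flags that point at exact lines, and citations that a tool can check mechanically. The field names (`line`, `reviewer`, `reason`, `source_id`, `quote`), the helper functions, and the sample ticket are all assumptions made for illustration, and verification here means nothing more than an exact substring match against the cited source.

```python
import json

# Hypothetical draft and source data, for illustration only.
DRAFT_LINES = [
    "Hi Sam,",
    "Your login thing should be sorted now.",
    "We reset the SSO config on our side.",
]
SOURCES = {
    "ticket-101": "Customer reported login failures; resolved by resetting SSO config.",
}


def parse_flags(raw: str, draft_lines: list[str]) -> list[dict]:
    """Parse reviewer JSON, keeping only flags that point at a real line."""
    flags = json.loads(raw)
    return [
        flag for flag in flags
        if 1 <= flag["line"] <= len(draft_lines) and flag["reason"]
    ]


def verify_citations(citations: list[dict], sources: dict[str, str]) -> list[dict]:
    """Return citations whose quote is empty or absent from the cited source."""
    failed = []
    for citation in citations:
        source_text = sources.get(citation["source_id"], "")
        if not citation["quote"] or citation["quote"] not in source_text:
            failed.append(citation)
    return failed


reviewer_output = json.dumps([
    {"line": 2, "reviewer": "tone", "reason": "too casual for a customer email"},
    {"line": 9, "reviewer": "facts", "reason": "points at a line that does not exist"},
])
citations = [
    {"source_id": "ticket-101", "quote": "resolved by resetting SSO config"},
    {"source_id": "ticket-999", "quote": "refund issued"},
]

valid_flags = parse_flags(reviewer_output, DRAFT_LINES)
bad_citations = verify_citations(citations, SOURCES)
```

Because both outputs are structured, a later tool call can rewrite only the flagged line, and a citation that fails the check never reaches the final email.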

The above could be considered a harness for writing an email. That harness can be exposed as an API or sold as a SaaS too. We’ve not even gotten to the task of writing serious software yet.

Even with the above steps for writing an email, I’ve overlooked many important details. But the point I wanted to make was that even the most basic task that an LLM does could require serious engineering. And it’s always possible to engineer a better system that delivers far better, more consistent output.

It’s Not the Code

Software engineering was never just about the code. We’ve gone through assembly, low-level languages, high-level languages, frameworks and libraries, and whatnot. Each evolution fundamentally changed the last-mile activity of a software engineer, but never the fundamentals.

When LLMs started writing code, everyone focused on the LLM and kept looking for the complexity that was supposedly gone. Well, it’s in the harness; it just got pushed onto another layer. And that’s what keeps happening in software engineering: we abstract away complexity from one area and have to forcefully shove it into another.

The art of engineering is everywhere, even if what you engineer keeps changing. And when was the last time a software engineer wanted to do the same thing over and over again anyway?

See you in Issue 9!
