
AI Reading Notes #2

It’s been about 3 months since I posted “AI articles and reading”, and I think I need to do it more regularly. Here’s a bunch of quotes from the best things I’ve been reading / watching / listening to that are informing how my thinking about AI and the software industry is developing.

On harness engineering

An introduction to “Harness Engineering”

Birgitta Böckeler has written a great intro to “Harness Engineering” (a follow-up from an earlier post I linked to). Some great new terminology in this piece:

  • “Guides” / “feedforward”: context you supply up front to nudge the agent in the right direction. For example AGENTS.md, package documentation, specifications and plans, examples to copy, skills.
  • “Sensors” / “feedback”: tools or scripts the agent can run to get feedback and self-correct. For example linters, unit tests, browser dev tools.
  • Computational / Inferential: terms to differentiate between deterministic code that is fast and always returns the same result, and code relying on LLMs, which is slow and non-deterministic.
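
To make the terminology concrete, here’s a minimal sketch of a harness loop (my own illustration, not from Birgitta’s post; `ask_agent` is a hypothetical stand-in for whatever agent you drive, and the `ruff`/`pytest` sensors are just examples):

```python
from pathlib import Path
import subprocess

def ask_agent(prompt: str) -> None:
    """Hypothetical stand-in for your coding agent -- the inferential part:
    slow and non-deterministic."""
    raise NotImplementedError

def run_sensors() -> str:
    """Sensors / feedback -- the computational part: fast and deterministic.
    Returns an empty string when everything passes, else the failure output."""
    result = subprocess.run(
        ["sh", "-c", "ruff check . && pytest -q"],
        capture_output=True,
        text=True,
    )
    return "" if result.returncode == 0 else result.stdout + result.stderr

def harness_loop(task: str, max_iterations: int = 5) -> None:
    # Guides / feedforward: context supplied up front to nudge the agent.
    prompt = Path("AGENTS.md").read_text() + f"\n\nTask: {task}"
    for _ in range(max_iterations):
        ask_agent(prompt)
        feedback = run_sensors()
        if not feedback:
            return  # all sensors green
        # Feed the sensor output back so the agent can self-correct.
        prompt = f"Fix these issues:\n\n{feedback}"
    raise RuntimeError("sensors still failing - time for human input")
```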

She describes how the engineer’s job changes in this world too – you spend more time tweaking the system that writes the code than tweaking the code itself:

The human’s job in this is to steer the agent by iterating on the harness. Whenever an issue happens multiple times, the feedforward and feedback controls should be improved to make the issue less probable to occur in the future, or even prevent it.

Birgitta Böckeler in Harness Engineering on the Martin Fowler blog

I like this point:

A good harness should not necessarily aim to fully eliminate human input, but to direct it to where our input is most important.

Birgitta Böckeler in Harness Engineering on the Martin Fowler blog

“Every layer of review makes you 10x slower”

Avery Pennarun (CEO at Tailscale) on PR reviews being the bottleneck, and the kind of quality culture you need to adopt if you’re going to start removing or reducing them:

But, the call of AI coding is strong. That first, fast step in the pipeline is so fast! It really does feel like having super powers. I want more super powers. What are we going to do about it?

Maybe we finally have a compelling enough excuse to fix the 20 years of problems hidden by code review culture, and replace it with a real culture of quality.

I think the optimists have half of the right idea. Reducing review stages, even to an uncomfortable degree, is going to be needed. But you can’t just reduce review stages without something to replace them. That way lies the Ford Pinto or any recent Boeing aircraft.

The complete package, the table flip, was what Deming brought to manufacturing. You can’t half-adopt a “total quality” system. You need to eliminate the reviews and obsolete them, in one step.

Every layer of review makes you 10x slower – Avery Pennarun

He names code review as the new bottleneck – that’s my experience too. I love that he then looks at other industrial processes and thinks about how we could build quality into every step, rather than trying to catch everything in review (or worse, missing things when we skimp on reviews).

The spec driven development triangle

Drew Breunig (who I’ve been following since his excellent How Long Contexts Fail post) has been experimenting with spec driven development, including publishing an open source “library” that is just a spec, leaving it up to an agent to implement. In this post he describes how the spec, the tests, and the code form a feedback loop:

This isn’t a one-way equation. It’s a feedback loop. The act of writing code improves the spec, and it improves the tests. Just like software doesn’t really work until it meets the real world, a spec doesn’t really work until it’s implemented.

So instead of an equation, I propose a triangle. The spec defines what tests need to be written, and what code needs to be written. Tests validate the code. That’s essentially what we had before, just in a different shape.

But the act of implementing code generates new decisions. Those decisions inform the spec. And when the spec updates, new tests need to be written. And sometimes it’s not new decisions — it’s just dependencies or subtle choices. New code surfaces new behaviors that need to be tested.

I call this: the Spec-Driven Development Triangle.

As each node moves forward, our job — and our tooling’s job — is to keep those nodes in sync. That’s the job. If we improve the code, we must improve the spec.

Learnings from a No-Code Library: Keeping the Spec Driven Development Triangle in Sync – Drew Breunig

Trying to compress the remaining 80%

“There’s a lot of noise around how “AI generates buggy software”. The truth is writing the code was always 20% of the time. Testing and refining was always 80%. But now as that 20% (coding) compresses 10x, we want to compress the 80% (testing and validation) as well. That is the mistake… I find that actually AI is more thoughtful in the initial 20% than most (great) engineers are. But discovering real edge cases doesn’t happen at the initial code writing time.”

Balint Orosz (founder of Craft Docs). Quoted in Are AI agents actually slowing us down? on The Pragmatic Engineer

On company culture in the AI transition

“AI will not make up for that”

This video from Laura Tacho (CTO at DX) points out that many of the biggest problems software companies have aren’t about writing code… but perhaps we could use AI to tackle some of those problems too:

AI time savings is not going to make up for bad meeting culture and lots of interruptions and developers who are constantly being pulled out of their work: unplanned work, interruptions, outages, those kinds of things. AI will not make up for that. We can use AI to help solve that problem, but AI in and of itself is not going to make up for it.

Then when we look in the bottom half: build and test wait time, toil, and dev environment – and we put all that together, we realise that just the time savings from coding task speedup isn’t going to get us very far. But what will get us far is when we can take AI and point it at those problems. Can we use AI to help reduce meeting frequency? Can we use AI to improve CI wait time? Can we use AI to reduce dev environment toil? That is what winning organisations are doing right now. They’re putting Dev Ex at the centre of their universe and using AI as a tool to fix systems level problems.

Data vs Hype: How Orgs Actually Win with AI – The Pragmatic Summit
Laura Tacho, CTO at DX.

Main quests and side quests

It’s so tempting with agents to spend time rewriting SaaS products that don’t quite fit the workflow we want. This comment from Karri Saarinen (CEO of Linear) hit home:

Internally, we always talked about main quests and side quests. Everyone should focus on the main quest, and moderately – or not at all – on side quests. Both quest lines feel productive, but only one of them advances the main mission of the company.

Karri Saarinen. Quoted in Are AI agents actually slowing us down? on The Pragmatic Engineer

And this follow up comment from Gergely Orosz:

I love this framing from Karri: it speaks to why I see more engineers do “side quests” like rebuilding a SaaS vendor using AI and bragging about it. Yes, a new ticketing system is impressive, but it won’t help the company generate more revenue, and probably won’t save costs as it lacks features that a mature solution has, and maintaining it will take up precious focus and time.

Gergely Orosz, Are AI agents actually slowing us down? on The Pragmatic Engineer

On the industry as a whole

When everyone can code

I loved this simple parallel from Steve Yegge:

Everyone’s going to be forking. I think that’s a natural consequence of everybody writing code.
Just like everyone can take a picture now. That didn’t used to be true.

Steve Yegge – From The Pragmatic Engineer Podcast: From IDEs to AI Agents with Steve Yegge

Things definitely changed with digital photography, and then modern smartphone photography. And I appreciate that there’s still a healthy industry of professional photographers even though everyone can “do it themselves”.

“Education is solved because we now have computers”

The Pragmatic Engineer interview with Mario Zechner (creator of Pi) and Armin Ronacher (creator of Flask) was excellent. I liked that it was a conversation between people who are very AI-forward while still being relatively low on hype and high on caring about code quality.

This interaction on the hubris of software engineers thinking every person’s job will now be replaced by AI and software felt grounding compared to the hype:

Mario: I think one thing we software engineers or IT people underestimate is just how freaking complex the world is. And how much human squishiness is in each little nook and cranny and corner, right?

Now we can automate everything, like every bit of knowledge work. But we as software engineers are so bad at becoming domain experts that we don’t see all the non-machine parts that go into a workflow. And we are running into the same fallacy here again.

We are seeing models doing incredible things. I’m not disputing that. This is for me like whoa, basically all my research in the 2000s is null and void because transformers can do all the things.

But we are overextending that to everything, like we always do in software, like we did in ed-tech. Yeah, we have tablets in classrooms now. I’m sure now it’s solved. Education is solved because we now have computers.

Gergely: Well, in fact, I’ve heard, I don’t know which country it was, but they’re now rolling back too – Sweden. They’re taking the tablets out from the classroom.

Mario: It turns out if you do some scientific investigations into the tactics and effects on pupils, if you do just throw a bunch of tablets into a classroom, close it and hope for the best, turns out the best is terrible. So yeah, for me, I think the biggest takeaway in the past two to three years is the hype is terrible because it dehumanizes everything. And I want to not be part of that circus.

Mario Zechner and Armin Ronacher in Building Pi, and what makes self-modifying software so fascinating (The Pragmatic Engineer podcast)

There’s still room for expertise. And humanity.

I was working in schools when the hype about iPads was at fever-pitch, and he’s right that the human parts of the problem are far far bigger than the hype acknowledged.

“Maybe you’ll notice another historical pattern”

Tanya Verma naming similarities between AI and colonisation:

There is something special about training a model on all of humanity’s data and then locking it up for the benefit of a few well-connected organizations that you have relationships with. Maybe you’ll notice another historical pattern here. Extract value from a population that can’t meaningfully consent, concentrate the returns within a small inner circle, and then offer some version of charity to the people you extracted from as moral cover for the arrangement. The pattern repeats itself with labs promising post-AGI UBI or encouraging EA philanthropy while continuing to concentrate frontier capability. Not saying the intent is malicious, I think many are trying to do the best they can, I’m simply noticing.

Closing of the Frontier by Tanya Verma

Work slop and the loss of trust

This blog post painted a relatable picture of “workslop” (vibe-creating something with AI in a way that is fast for you but wastes your co-worker’s time), and of what happens when you can no longer confidently assert that you know the thing works. You lose trust, and the trust is what people are paying for:

For firms, the competitive advantage of a firm whose work can be trusted has not disappeared; it has, if anything, appreciated, because so many of the firm’s competitors are quietly converting themselves into content-generation pipelines and counting on the client not to notice.
This is already coming to a head. Deloitte has already refunded part of a $440,000 fee over an AI-hallucinated government report. It could be a production system built on a hallucinated specification, or a senior engineer who realizes they have spent the last year nominally reviewing work they could no longer competently review. The reckoning will not be subtle. The firms still doing the work properly will be in a position to charge for it. The firms that have hollowed themselves out will discover that what they hollowed out was the thing the client was paying for.

Appearing Productive in The Workplace on the blog No One’s Happy

Related:

Five months in, I think I’ve decided that I don’t want to vibecode — I want professionally managed software companies to use AI coding assistance to make more/better/cheaper software products that they sell to me for money.

Matthew Yglesias, seen via Simon Willison’s blog


AI articles and reading

I want to start capturing my notes from important articles I’m reading about the impact of AI.

AI Doesn’t Reduce Work—It Intensifies It

by Aruna Ranganathan and Xingqi Maggie Ye

While this may sound like a dream come true for leaders, the changes brought about by enthusiastic AI adoption can be unsustainable, causing problems down the line. Once the excitement of experimenting fades, workers can find that their workload has quietly grown and feel stretched from juggling everything that’s suddenly on their plate. That workload creep can in turn lead to cognitive fatigue, burnout, and weakened decision-making.

They name three ways AI intensifies work:

  • “Task expansion” – people take on tasks that wouldn’t have been in their role scope before, e.g. engineers drafting communications, or designers writing code. This creates more work for them, but also for those who have to assist with or review the out-of-scope thing they’re doing.
    • “Engineers increasingly found themselves coaching colleagues who were “vibe-coding” and finishing partially complete pull requests. This oversight often surfaced informally—in Slack threads or quick desk-side consultations—adding to engineers’ workloads.”
  • “Blurred boundaries between work and non-work” – “Many prompted AI during lunch, in meetings, or while waiting for a file to load. Some described sending a “quick last prompt” right before leaving their desk so that the AI could work while they stepped away.” Being able to direct an agent from your phone blurs the lines significantly. 100% people can now work from their toilet breaks 😂😭
  • “More multitasking” – you’re more tempted to start several tasks and switch between them, giving you a sense of powering through your backlog. But the extreme context switching has a cost.
    • “While this sense of having a ‘partner’ enabled a feeling of momentum, the reality was a continual switching of attention, frequent checking of AI outputs, and a growing number of open tasks. This created cognitive load and a sense of always juggling, even as the work felt productive.”
    • The words “felt productive” remind me of this study. I suspect there is more of an AI speedup now, but the mind-blowing thing there was that engineers predicted they’d be more productive, and reported that they were more productive, even when the observed result was that they were less productive. Our “feeling” of productivity isn’t a good one to trust.

For Culture Amp especially, we care about these kinds of problems. A lot of our research is into how to have both a “high performance” and a “high engagement” culture (with wellbeing and sustainability being a key part of the engagement measure). If you get both right it’s incredible for the company. If AI helps you be high performing, but not in a sustainable way, then you become strained, and it’s not a good long term win for the company.

How Generative and Agentic AI Shift Concern from Technical Debt to Cognitive Debt

By Margaret-Anne Storey

Even if AI agents produce code that could be easy to understand, the humans involved may have simply lost the plot and may not understand what the program is supposed to do, how their intentions were implemented, or how to possibly change it.

[Image: side-by-side comparison. Technical debt – legacy code, quick fixes, buggy logic: messy code & complexity. Cognitive debt – lost understanding, knowledge gaps, team confusion: overwhelmed developers.]

She gives a simple example from a student team losing understanding of what they’ve built:

But by weeks 7 or 8, one team hit a wall. They could no longer make even simple changes without breaking something unexpected. When I met with them, the team initially blamed technical debt: messy code, poor architecture, hurried implementations. But as we dug deeper, the real problem emerged: no one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. The code might have been messy, but the bigger issue was that the theory of the system, their shared understanding, had fragmented or disappeared entirely. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.

Our team has talked about how architectural decisions made while vibe coding are usually not good long-term decisions. We’ve talked about a playbook that looks like this:

  • Vibe code a v1 to learn as much as we can (about the problem, technical approach, UX issues etc)
  • Don’t waste time polishing the v1 and making the code nice, we’ll throw it out
  • For v2, plan your architecture and abstractions really carefully based on what you learned in the vibe-coded v1.
  • Write out key types and function signatures yourself (see the sketch after this list)
  • Document edge cases, business logic decisions, things we want tests for.
  • Get review from another engineer on this plan
  • Then get an agent to help build out v2
  • And this time do the work to clean the code up and make it nice. But it’ll be building within the abstractions you planned.
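
As a sketch of what those handwritten types and signatures might look like (the invoicing domain here is invented for illustration): the human pins down the abstractions and records the v1 lessons, and the agent implements against them.

```python
from dataclasses import dataclass
from datetime import date

# Handwritten scaffolding for v2: the agent implements against these types
# and signatures instead of inventing its own abstractions like it did in v1.

@dataclass(frozen=True)
class LineItem:
    description: str
    amount_cents: int  # integer cents: a v1 lesson, floats caused rounding bugs

@dataclass(frozen=True)
class Invoice:
    id: str
    customer_id: str
    issued_on: date
    line_items: tuple[LineItem, ...]

def total_cents(invoice: Invoice) -> int:
    """Sum of all line items. Edge case to test: an invoice with no items."""
    raise NotImplementedError  # agent fills this in

def overdue(invoices: list[Invoice], today: date, net_days: int = 30) -> list[Invoice]:
    """Invoices issued more than net_days before today. Business decision
    documented up front: an invoice on its due date is not yet overdue."""
    raise NotImplementedError  # agent fills this in
```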

One other idea that comes to mind is keeping a documented understanding of the project in the README (or elsewhere in the repo) that includes:

  • What it is and what it does
  • The problems and constraints it solves for
  • The architecture and key abstractions chosen
  • The approach to automated testing, and what manual testing is required

You could then make your agent coding process include reading this context at the start of a session, and writing it up at the end of a session.
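
Here’s a rough sketch of what that could look like (the `run_agent` function is a hypothetical stand-in for whatever agent CLI or API you use):

```python
from pathlib import Path

UNDERSTANDING = Path("README.md")  # the documented understanding lives here

def run_agent(prompt: str) -> str:
    """Hypothetical stand-in: call your coding agent and return its output."""
    raise NotImplementedError

def start_session(task: str) -> str:
    """Read the documented understanding first, so the agent starts with
    the project's purpose, constraints, and architecture in context."""
    context = UNDERSTANDING.read_text()
    return run_agent(f"{context}\n\nTask: {task}")

def end_session(changes_summary: str) -> None:
    """Write the understanding back up at the end of the session."""
    updated = run_agent(
        "Update this document to reflect the changes below, keeping its "
        "sections: what it is, problems and constraints, architecture and "
        "key abstractions, testing approach.\n\n"
        f"{UNDERSTANDING.read_text()}\n\nChanges:\n{changes_summary}"
    )
    UNDERSTANDING.write_text(updated)
```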

Anyway – these are ideas for solutions, but I think this article is interesting for naming the problem: cognitive debt, where it doesn’t even matter whether the source code is in good shape, because there’s no understanding of what the thing is or how it’s supposed to work.

Harness Engineering

Birgitta Böckeler

It was very interesting to read OpenAI’s recent write-up on “Harness engineering” which describes how a team used “no manually typed code at all” as a forcing function to build a harness for maintaining a large application with AI agents. After 5 months, they’ve built a real product that’s now over 1 million lines of code.

I hear about people doing this more and more, and I think it’s going to require a fair bit of supporting structure (a “harness”, they’re calling it) to make sure the agent builds in a sustainable way and the codebase stays workable as it grows.

The categories she pulls from their post are interesting:

The OpenAI team’s harness components mix deterministic and LLM-based approaches across 3 categories (grouping based on my interpretation):

  1. Context engineering: Continuously enhanced knowledge base in the codebase, plus agent access to dynamic context like observability data and browser navigation
  2. Architectural constraints: Monitored not only by the LLM-based agents, but also deterministic custom linters and structural tests
  3. “Garbage collection”: Agents that run periodically to find inconsistencies in documentation or violations of architectural constraints, fighting entropy and decay

I’ve put most of my AI thinking into context engineering so far, and have started pondering which architectures are going to be more successful. For example, offering abstractions that keep code modularised so agents can navigate it effectively within their context windows, or providing better feedback loops so an agent can iterate towards a solution rather than write code and hope. The third one – “garbage collection” – I’ve hardly started to consider.
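
That said, “garbage collection” looks like the most immediately automatable of the three. A minimal sketch of one deterministic check in that spirit (my own illustration, not from OpenAI’s write-up) might scan docs for references to files that no longer exist:

```python
import re
import sys
from pathlib import Path

def stale_doc_references(repo: Path) -> list[str]:
    """A deterministic 'garbage collection' check: find backticked file
    paths in markdown docs that no longer exist in the repo."""
    pattern = re.compile(r"`([\w./-]+\.(?:py|ts|md))`")
    problems = []
    for doc in repo.rglob("*.md"):
        for referenced in pattern.findall(doc.read_text(errors="ignore")):
            if not (repo / referenced).exists():
                problems.append(f"{doc}: references missing file {referenced}")
    return problems

if __name__ == "__main__":
    issues = stale_doc_references(Path("."))
    print("\n".join(issues))
    sys.exit(1 if issues else 0)  # non-zero exit lets CI or an agent react
```

Run on a schedule, anything it flags could be handed to an agent to clean up – roughly the entropy-fighting loop the write-up describes.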