Essay · May 2026

Still in the Driver's Seat.

A pass through Stash's email routing with Lattice.

The setup

Stash has many capture surfaces — voice memos, share sheets, plain text, MCP, document uploads, email forwards. The system design of the email capture alone has complexity that a happy-path tester or a swarm of QA-minded agents would overlook. I did and they did.

I'd described in painstaking detail how I wanted email forwards to work — capture@stash.bar, anything I forwarded extracted and enriched like every other rant I dump into Stash. Claude built it.

Twenty-four tests

Claude and I had codified 24 tests, each focused on one failure mode I'd anticipated or discovered while building out email ingestion. They covered every obvious scenario. The suite lighting up green was falsely reassuring.

I thought that meant I'd exhausted the inputs. I had — one axis at a time. The bugs I missed weren't on these axes — they were hidden behind their interactions.

Handing it off

I handed Claude Lattice, reminded it of the problem space, and sat back.

The agent enumerated the problem space well. Which email address was the email sent to? What's the status of the DMARC result? How was the Stash address involved? Who else was involved? Can we trust the source of the email? And if it was a handle address — does the handle even exist?

The model

Six concepts, distilled down into a schema:

Dimension                 Values                                Type
recipient_pattern         capture · handle · ignore · unknown   base
dmarc_result              pass · fail · missing                 base
stash_address_in          to · cc · bcc                         base
other_recipients_present  true · false                          base
email_source_state        verified · unverified · absent        conditional on capture
handle_state              exists · missing                      conditional on handle

Four dimensions plus two conditionals. 864 possible combinations. 22 pairwise tests.
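One way to reproduce the 864 figure, offered as a sketch and not as Lattice's actual method: count each conditional dimension with one extra "n/a" value for rows where it does not apply, then multiply the value counts. The SCHEMA constant, the "n/a" convention, and the greedy generator below are assumptions made for illustration. Lattice's own algorithm may differ, so the suite this produces need not land on exactly 22 rows.

```ruby
# Illustrative only: the schema from the table, with a hypothetical "n/a"
# value standing in for "this conditional dimension does not apply".
SCHEMA = {
  recipient_pattern:        %w[capture handle ignore unknown],
  dmarc_result:             %w[pass fail missing],
  stash_address_in:         %w[to cc bcc],
  other_recipients_present: %w[true false],
  email_source_state:       %w[verified unverified absent n/a],
  handle_state:             %w[exists missing n/a]
}.freeze

# Raw product of value counts: 4 * 3 * 3 * 2 * 4 * 3
RAW_COMBINATIONS = SCHEMA.values.map(&:size).reduce(:*) # => 864

# A conditional dimension carries a real value only when its parent matches.
def valid_row?(row)
  (row[:email_source_state] != "n/a") == (row[:recipient_pattern] == "capture") &&
    (row[:handle_state] != "n/a") == (row[:recipient_pattern] == "handle")
end

# Full Cartesian product, one hash per row.
def all_rows(schema)
  first, *rest = schema.values
  first.product(*rest).map { |values| schema.keys.zip(values).to_h }
end

# Every two-dimension value pairing that a single row exercises.
def pairs_of(row)
  row.to_a.combination(2).to_a
end

# Greedy all-pairs: keep picking the row that covers the most uncovered pairs.
def pairwise_suite(rows)
  uncovered = rows.flat_map { |row| pairs_of(row) }.uniq
  suite = []
  until uncovered.empty?
    best = rows.max_by { |row| (pairs_of(row) & uncovered).size }
    suite << best
    uncovered -= pairs_of(best)
  end
  suite
end

VALID_ROWS = all_rows(SCHEMA).select { |row| valid_row?(row) }
SUITE = pairwise_suite(VALID_ROWS)
```

The point of the sketch is the ratio: every pairing of two values that can legally occur shows up in at least one row of SUITE, yet the suite is a small fraction of the full product.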

What didn't make it

What the agent didn't model is just as important as what it did. SPF and DKIM get parsed by the auth checker — but a deeper look at the implementation shows they don't gate anything. trusted_sender? is literally dmarc_pass?. Attachments, body format, subject line, sender reputation — real variables in email, but they shape content processing downstream, not the routing decision being tested.
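A minimal sketch of what "parsed but not gating" can look like, assuming the verdicts arrive in an Authentication-Results style header. AuthVerdict and its parse method are hypothetical names for illustration; only trusted_sender? and dmarc_pass? come from the codebase described here, and their real implementations may differ.

```ruby
# Hypothetical value object: SPF and DKIM are recorded, but only DMARC
# decides whether the sender is trusted.
AuthVerdict = Struct.new(:spf, :dkim, :dmarc, keyword_init: true) do
  # Pull "spf=...", "dkim=...", "dmarc=..." verdicts out of a header string.
  def self.parse(header)
    results = header.to_s.scan(/\b(spf|dkim|dmarc)=(\w+)/i)
                    .to_h { |name, verdict| [name.downcase.to_sym, verdict.downcase] }
    new(spf: results[:spf], dkim: results[:dkim], dmarc: results[:dmarc])
  end

  def dmarc_pass?
    dmarc == "pass"
  end

  # The only gate: SPF and DKIM never enter the decision.
  def trusted_sender?
    dmarc_pass?
  end
end

verdict = AuthVerdict.parse("mx.example.com; spf=fail; dkim=fail; dmarc=pass")
verdict.trusted_sender? # => true, even though SPF and DKIM both failed
```

Because SPF and DKIM cannot change the outcome in a shape like this, adding them as dimensions would multiply rows without exercising any new routing behavior.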

The model should describe what changes behavior and what could plausibly change behavior, and leave out the rest.

What passed

I ran the test cases. Not all of them were green. Two of the twenty-two failed. Both were Bcc-related — a story I'll come back to.

The interesting finding was what passed. One row sent an email to my handle address — tyler@stash.bar — with no DMARC header at all and confirmed that the system would create a record. The test passed because that's what the code did. HandleMailbox didn't check DMARC and the schema had written the gap down as a fact.

The agent had written assertions matching the code, not the policy I wanted. Tests shouldn't mirror the code; they should mirror expected behavior. The policy was mine. So I flipped the assertion. From "create a record" to "reject," and the test went red. The bug now had a name and a failing test pointing at it.

The fix was a five-line change. HandleMailbox got the same before_processing :verify_authentication callback CaptureMailbox already had. The existing handle tests, which had been sending unauthenticated mail and expecting records, broke — and rightly. I updated them to include auth headers and added one new test asserting that DMARC-failed mail to a real handle gets rejected.
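The shape of that fix, sketched in plain Ruby so it runs without Rails. SketchMailbox is a hypothetical stand-in for the small slice of Action Mailbox callback behavior involved here: a before_processing callback that bounces the email stops processing. The AuthenticationGate module, the headers hash, and the "dmarc" key are assumptions for illustration, not the app's real code.

```ruby
# Hypothetical stand-in for the relevant slice of ActionMailbox::Base.
class SketchMailbox
  def self.before_processing(method_name)
    callbacks << method_name
  end

  def self.callbacks
    @callbacks ||= []
  end

  attr_reader :headers, :status

  def initialize(headers)
    @headers = headers
    @status = :pending
  end

  # Run callbacks in order; a bounce halts before process is reached.
  def perform_processing
    self.class.callbacks.each do |callback|
      send(callback)
      return status if status == :bounced
    end
    process
    @status = :delivered
  end

  def bounced!
    @status = :bounced
  end
end

# Assumed shared gate: anything other than an explicit DMARC pass is rejected.
module AuthenticationGate
  def verify_authentication
    bounced! unless headers["dmarc"] == "pass"
  end
end

class CaptureMailboxSketch < SketchMailbox
  include AuthenticationGate
  before_processing :verify_authentication

  def process; end
end

class HandleMailboxSketch < SketchMailbox
  include AuthenticationGate
  before_processing :verify_authentication # the line that was missing

  def process; end
end

HandleMailboxSketch.new({}).perform_processing                  # => :bounced
HandleMailboxSketch.new("dmarc" => "pass").perform_processing   # => :delivered
```

With the callback in place, mail to a handle with a missing or failed DMARC result never reaches process, which is the behavior the flipped assertion demands.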

Suite green. Except for the two Bcc rows still failing for reasons I hadn't worked out yet.

What failed

The two failures were the Bcc rows. Both asserted that an email with the Stash address in Bcc would route to its expected handler. Neither did. They both landed in UnknownMailbox.

I sat with the failures for a minute. They didn't fit. The code had a Bcc branch in the recipient extractor — I'd read it, the agent had read it, the schema had encoded it. Something downstream was wrong.

The schema had described a routing dimension that wasn't actually a dimension. Bcc is not meant to survive delivery: under RFC 5322, the current successor to RFC 822, the sending side normally removes the Bcc header before the message goes out. By the time our code touches the message, the Bcc field is empty. The branch in the extractor was dead code. The agent and I had both modeled a behavior the system couldn't actually exhibit.
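A toy illustration of why that branch could never fire, assuming the sender drops Bcc before transmission. All three helpers are hypothetical and use plain strings instead of a real mail library; they are not the app's extractor.

```ruby
# Hypothetical sender-side step: drop Bcc before the message goes on the wire.
def serialize_for_delivery(headers)
  headers.reject { |name, _| name.casecmp?("Bcc") }
         .map { |name, value| "#{name}: #{value}" }
         .join("\r\n")
end

# Hypothetical receiver-side step: turn the raw header block back into a hash.
def parse_headers(raw)
  raw.split("\r\n").to_h { |line| line.split(": ", 2) }
end

# Which header field, if any, mentions the Stash address?
def stash_address_position(headers, stash_address)
  %w[To Cc Bcc].find { |field| headers.fetch(field, "").include?(stash_address) }
end

composed = {
  "From" => "tyler@example.com",
  "To"   => "friend@example.com",
  "Bcc"  => "capture@stash.bar"
}
received = parse_headers(serialize_for_delivery(composed))

stash_address_position(composed, "capture@stash.bar") # => "Bcc"
stash_address_position(received, "capture@stash.bar") # => nil
```

The message still arrives, because delivery is driven by the SMTP envelope recipient and not by the headers. A router that inspects only To, Cc, and Bcc headers has nothing left to match on.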

The new schema helped me as the reviewer identify an assumption that was clearly wrong. I dropped Bcc from the dimension. 576 combinations now. 21 pairwise tests. They all passed.

Pruning

Then I went back to the routing code and deleted the Bcc branch. Two lines from ApplicationMailbox.recipient, two lines from EmailProcessable#extract_recipient_address. I pruned the surface area that future me and future agents could fuck up.

The final count: 21 pairwise tests. One fewer than the 22 the first schema produced. Three fewer than the 24 single-axis tests I started with. Two real bugs found. More coverage. That's the old Hexawise pitch showing up in reality.

Still in the driver's seat

What I came away with isn't really a craft note. It's a reminder.

My role has shifted from sole system designer to system verifier, steerer, and shaper, with agents doing most of the grunt work of iteration. Humans iterate on, steer, and shape the code. Some may still design, but that depends on their level of rigor.

Whatever altitude you operate at — shaping schemas, steering plans, designing systems — the job is finding the right altitude. The agent does the typing. You read the schema. You decide the policy. You catch the wrong assumption. You commit the change.

Humans are still in the driver's seat.