Wiki Regeneration Convergence
Stash's WikiGenerationJob was supposed to re-queue itself when marked stale during generation.
The check it used to catch that scenario never fired.
Stash generates wikis by calling out to an LLM provider. To avoid spending tokens unnecessarily, wikis only generate when opened. When new content is added to the knowledge base that's relevant to a particular wiki, it's marked stale. To know when we need to regenerate a wiki vs. render what we've already got, we use a state machine to track its status.
Wikis can be generated for entities or topics within your Stash. Lattice modeled the lifecycle of one generation job: does the wikiable still exist when we get there? Did the controller pre-claim, or did we claim it ourselves — and did somebody else beat us to it? Did the LLM service finish cleanly, fail, or blow up? When it succeeded, did a stale mark land mid-flight, and is the wikiable active enough to bother refreshing? When it failed, was there an existing wiki worth preserving?
1,728 raw combinations. 16 pairwise rows.
Despite 18 unit tests covering this subsystem (one for the job, seventeen for the model), Lattice uncovered a non-trivial bug when four of its test cases lit up red.
If a new connection was made to something that was wikiable while the wiki was regenerating — and it was successful in doing so — the system had no way to realize that the wiki was actually stale. The in-flight process was lacking the new intel, but when it finished up, the last generation timestamp was after the timestamp where it was marked as stale. On the surface, the wiki must be up-to-date, right?
All four failing rows had the common suspects. The staleness signal showed up during a successful generation.
The same broken check ran in the broadcast that fires after a successful regeneration. The UI's "Updating with new information..." indicator was dead for the same reason.
The fix was quick. The job needed a check for its lifecycle, not the user-facing one that was already there. The re-queue fired. The UI's "updating" indicator fired too.
The bug was sitting in production until I told Claude to take a look at Lattice and find a spot in the codebase to take it for a test drive.