Memory Ep. 9: the fix that did nothing

Last episode ended with the one thing hipocampo had to fix: recall. The brain was cheap and it scaled, but it still went quiet sometimes when the answer was sitting right there. So I went looking for a fix, and I found a lovely one.

The candidate

The misses I cared about had a pattern. The system fails when the question is phrased nothing like the note that answers it. Ask “that fast scratchpad where you stash things so you don’t recompute them” and the note is titled “in-memory cache”. Same thing, not one shared word. The embedding puts them far apart, so the right note never surfaces.

There is a known, cheap fix for exactly this: a linear adapter. A small matrix you apply to the question’s vector, just before searching, to nudge it toward where its answer actually lives. You do not retrain the embedding model, you do not re-index anything. It is reversible, it trains on a laptop, and if it learns nothing useful it collapses to “do nothing” and cannot make things worse.

I did not even have to build it. Santander’s AI team had just open-sourced exactly this (a linear-adapter-trainer, of all things for a bank to publish). The code was clean, the safety design was honest, the theory was gorgeous. I was sure this was the piece.

The honest test

I had learned my lesson from the graph in Episode 5: build the ruler before you fall in love. So:

I wrote questions the way a non-expert would ask them, all vague phrasing and no jargon, which is exactly the register gap the adapter is supposed to close.
I held out by topic. The adapter trained on some notes and was tested only on notes it had never seen. If it was just memorising the training set, that would show.
Recall measured where it matters: did the right note come back, with the whole vault as distractors.

Then I trained it and compared base against adapted.

Nothing. Exactly nothing.

Not “a small gain”. Not “a small loss”. The delta was zero on every single question. The adapter that got saved was, literally, the do-nothing one.

That is the safety mechanism working as designed. It keeps whichever version scored best on the held-out questions, and the best version was the one from before any training. Every trained version made the unseen notes worse. So it shipped the identity, the no-op, and reported zero.

My first instinct was that I had broken something.

Suspect yourself first

So I went after my own setup, not the tool.

Maybe my targets were blurry. I had been embedding each whole note as a single vector, which averages a long document into mush. I chunked everything into clean paragraphs instead. Re-ran. Still nothing.

Maybe my questions were too alike. All cast in the same “that thing where…” mould, so they clustered into one tight blob in the space. I rewrote them: more distinct, more topics, a genuinely varied set. Re-ran. Still nothing.

Five honest configurations, all zero. At that point it was not an accident of my setup. It was telling me something about the shape of the problem.

The shape of the problem

A linear adapter can only do one thing: apply the same global nudge to every question. It can rotate or stretch the whole space, but the rule is identical for everyone, and a hard consequence is that questions sitting close together must land close together. It cannot send two near-identical questions to two far-apart answers.

My vault is wildly mixed: notes about code next to notes about completely unrelated corners of my life. The nudge that pulls one vague question toward its answer drags an unrelated one the wrong way. A single rule cannot serve regions that point in different directions, so it settles on a compromise that helps nobody, and “do nothing” honestly beats it.

Then came the twist that closed the case. I tried the opposite: a small, homogeneous slice, all notes of one kind, one register. There a single nudge would be exactly the right shape. And there the adapter still did nothing, for the opposite reason: on a tight, uniform set the base embedder already nailed it. The right note was already coming back first. No gap left to close.

That is the pincer. Where there is a gap, the corpus is too varied for one linear nudge to fix it. Where a single nudge would fit, there is no gap. The tool gets squeezed out from both sides. (Which, incidentally, is exactly why the tool’s own demo shows roughly zero improvement: it runs on a corpus where the base was already strong.)

Same lesson, sharper

This is Episode 5 all over again. An elegant idea, a beautiful theory, killed by an honest measurement, and the real prize was understanding the geometry underneath rather than the win I was hoping for. A straight line will not fit a curve no matter how many points you throw at it. More data does not change the shape of the tool.

I wrote the whole thing up and sent it back to the people who built it, as a public issue, with the suggestion that they document when it helps and when it cannot. Their tool is good. It is just built for a homogeneous, in-domain world, and my vault is the opposite of that.

So the fix that was meant to close the recall gap closed nothing, and taught me precisely why no single global rule ever could. Which points straight at the way out: stop trying to fix the whole space at once. Judge one pair at a time, this question against this candidate, and decide.

That is a reranker, and it does not have the linear ceiling. Next episode.