Memory Ep. 6: the hybrid that almost worked

Last episode the graph lost to plain vector search, and I refused to bury it. “Pure graph versus pure vectors” is a false choice. What if they were never meant to compete but to cooperate: vectors to find where to look, the graph to decide what comes along?

So I built the hybrid. This is the episode where it almost worked, and where I finally noticed I had been fixing the wrong thing for weeks.

The blend

The combination was the obvious one. Vector search keeps doing what it is good at: find the handful of notes whose meaning sits closest to the question. Those become the seeds. Then the graph does what it is good at: from those seeds, follow the links and let well-connected, on-topic notes rise, even when their words never matched the question. The final ranking mixes two signals: how close in meaning, and how connected to what we already trust.
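To make the shape of that blend concrete, here is a minimal sketch. It is not the actual hipocampo code. The function name hybrid_rank, the parameters (n_seeds, hops, decay, alpha) and the plain-dict data layout are all assumptions made for illustration, and simple spreading activation stands in for whatever graph scoring the real system used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_rank(query_vec, embeddings, links,
                n_seeds=2, hops=1, decay=0.5, alpha=0.6):
    """Rank notes by a mix of vector similarity and graph activation.

    embeddings: {note_id: vector}, links: {note_id: [linked note_ids]}.
    All parameter names and defaults are hypothetical.
    """
    # 1. Vector search: similarity of every note to the question.
    sim = {note: cosine(query_vec, vec) for note, vec in embeddings.items()}

    # 2. The closest notes become the seeds.
    seeds = sorted(sim, key=sim.get, reverse=True)[:n_seeds]

    # 3. Spread activation outward from the seeds along the links.
    activation = {s: sim[s] for s in seeds}
    frontier = dict(activation)
    for _ in range(hops):
        received = {}
        for note, energy in frontier.items():
            for neighbour in links.get(note, ()):
                received[neighbour] = received.get(neighbour, 0.0) + energy * decay
        for note, energy in received.items():
            activation[note] = activation.get(note, 0.0) + energy
        frontier = received

    # 4. Final score mixes the two signals: meaning and connectedness.
    score = {note: alpha * sim[note] + (1 - alpha) * activation.get(note, 0.0)
             for note in embeddings}
    return sorted(score, key=score.get, reverse=True)
```

In this sketch, a note that shares nothing with the question still gets a non-zero score if a seed links to it, which is the whole point of the associative case.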

And on the associative queries, the ones where the answer lived in a note that shared no vocabulary with the question, the hybrid genuinely helped. Pure vector search missed them. The hybrid caught a few, because the link was sitting right there and now it finally counted for something.

For a moment, it looked like the save.

…that almost worked

Then the problems from last episode walked back in wearing a new coat.

The bag-note issue had not gone anywhere. A daily log still leaked activation into unrelated clusters. So I patched it: make heterogeneous notes propagate less, damp down the noisy spreaders. It helped, but now I was hand-deciding which nodes counted as “trustworthy.” A patch on top of a patch.
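A sketch of what that kind of patch looks like, under loud assumptions: the trust table, the note names and the spread_once function are all made up for illustration. The thing to notice is that the trust values are written by hand.

```python
def spread_once(activation, links, trust, decay=0.5, default_trust=1.0):
    """One hop of spreading activation, scaled by a per-note trust factor.

    trust maps note_id -> multiplier in [0, 1]; notes not listed
    propagate at full strength. All names here are hypothetical.
    """
    received = {}
    for note, energy in activation.items():
        # A low-trust "bag" note passes on only a fraction of its energy.
        outgoing = energy * decay * trust.get(note, default_trust)
        for neighbour in links.get(note, ()):
            received[neighbour] = received.get(neighbour, 0.0) + outgoing
    return received

# Hand-decided: the daily log is a noisy spreader, so damp it.
trust = {"daily-log": 0.2}
```

Every entry in that table is a judgment call someone has to make and maintain, which is exactly the problem.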

And every fix added a knob. How much vector versus graph? How far should activation travel? How many seeds? Each knob is one more thing to tune, and tuning against a small eval set is an excellent way to fit the test and fool yourself into thinking you improved.

When I measured it honestly, the verdict was deflating: the hybrid was a little better on a few associative cases, neutral or slightly worse on the rest, and noticeably more complex everywhere. “Almost worked” is the kind thing to say. It was the sort of result you only ship if you have already decided to ship it.

The thing I was refusing to see

I had spent weeks shuffling between vector, graph, and hybrid, tuning weights, chasing single-digit moves on the metric. Then one evening, staring at the queries that still failed, it finally landed: I was polishing the wrong layer.

Take a question like “who is Ana’s brother?” No retrieval score answers that well, not vectors, not the graph, not any blend, because it is not really a similarity question. The right behaviour is a tiny plan: find the note about Ana, read who is linked there as her brother, then go and fetch that person’s note. Two steps. The first search is not the answer at all; it is how you discover what to search for next.
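The tiny plan can be sketched in a few lines. This assumes notes carry typed links (a "brother" link, say), which the source does not confirm. The dict layout, the function name and the sample person "Bruno" are invented for the example, and a plain dict lookup stands in for the first search.

```python
def fetch_related(notes, person, relation):
    """Two-step lookup: find the person's note, then follow a typed link.

    notes: {note_id: {"text": str, "links": {relation: note_id}}}.
    The layout and all names are hypothetical.
    """
    # Step 1: find the note about the person. In a real system this
    # would be a search; a dict lookup stands in for it here.
    note = notes.get(person)
    if note is None:
        return None

    # Read who is linked there under the requested relation.
    target = note.get("links", {}).get(relation)
    if target is None:
        return None

    # Step 2: fetch that person's note. This is the actual answer.
    return notes.get(target)

# Made-up sample data.
notes = {
    "Ana": {"text": "Ana, met at the conference.", "links": {"brother": "Bruno"}},
    "Bruno": {"text": "Bruno runs the bike shop.", "links": {}},
}
answer = fetch_related(notes, "Ana", "brother")
```

Note that nothing in the second step depends on how similar Bruno's note is to the question. The first result only tells you where to go next.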

No mixture of cosine and PageRank produces that. The cases I kept failing were not failing because the math was weak. They were failing because the system had no idea what kind of question it was being asked. It treated “who is Ana’s brother?” exactly like “summarise everything I know about databases,” and those two need completely different moves.

I had been tuning the engine while standing next to a car with no steering wheel.

That was the real turn of the whole project. The missing piece was never a better retrieval score. It was a layer above retrieval: something that reads the question first, works out what kind of question it is, and decides how to go and get the answer, one search or two, narrow or wide, recent or all-time.
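As a rough illustration of what "reads the question first" could mean, here is a toy router. It is an assumption-heavy sketch, not the orchestrator from the next episode: the Plan fields, the regular expressions and the keyword lists are all invented, and a real version would need something far less brittle than pattern matching.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    steps: int      # one search or two
    breadth: str    # "narrow" or "wide"
    window: str     # "recent" or "all-time"

# Hypothetical patterns, chosen only to cover the examples in the text.
RELATION = re.compile(r"^who is (\w+)['’]s (\w+)\??$", re.IGNORECASE)
RECENT = re.compile(r"\b(today|yesterday|this week|last week|recently)\b",
                    re.IGNORECASE)
BROAD = re.compile(r"\b(summari[sz]e|everything|overview)\b", re.IGNORECASE)

def plan_for(question: str) -> Plan:
    """Decide how to retrieve before retrieving anything."""
    q = question.strip()
    window = "recent" if RECENT.search(q) else "all-time"
    if RELATION.match(q):
        # Relationship lookup: find the person, then follow the link.
        return Plan(steps=2, breadth="narrow", window=window)
    if BROAD.search(q):
        # Survey question: one search, cast wide.
        return Plan(steps=1, breadth="wide", window=window)
    # Default: a single focused search.
    return Plan(steps=1, breadth="narrow", window=window)
```

With these rules, “who is Ana’s brother?” gets a two-step narrow plan and “summarise everything I know about databases” gets a single wide one, so the two questions finally stop being treated the same way.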

An orchestrator. That is the next episode, and it is where hipocampo stopped being a clever search box and started becoming the thing I actually wanted.
