Memory Ep. 8: hipocampo wins on tokens, loses on recall

By now I had built a lot: embeddings, a knowledge graph, a hybrid, and an orchestrator that plans before it searches. A real pile of machinery.

And the way I actually use my vault most days is none of it. I open Claude Code, point it at the folder, and let it work. So before falling in love with my own contraption, I owed myself one honest test, against exactly that.

The baseline that actually matters

It would have been easy, and flattering, to benchmark against a toy. A naive RAG with bad chunking loses to anything, and a “grep one word, dump whole files into context” agent is barely better. I deliberately did not do that.

The baseline I care about is the one I would genuinely use instead of hipocampo: Claude Code itself, turned loose on the vault with its real tools, grep with bounded context, read by line ranges, glob, the lot. A competent file-navigating agent, not a strawman. (So when I say “grep” in this series, that is what I mean: the capable thing I use every day, not a cardboard cut-out.) If hipocampo cannot beat that, it has no reason to exist.

Same questions for both sides. And I measured in tokens rather than euros, because comparing costs across different models is noise, and token counts are not.

The clear win

On tokens, against real Claude Code, it was not close. Claude Code answers by pulling files into its context and reasoning over them, and to be thorough it pulls a lot, turn after turn. hipocampo plans, retrieves a small bounded set, and answers from that. Across the battery, hipocampo answered with roughly 85% fewer input tokens.

And the part that matters more than the headline number: it scales the right way. Claude Code’s cost grows with the corpus, more documents means more matches and more files to open. hipocampo’s stays roughly flat, because it always serves a bounded pool no matter how big the corpus gets. So the gap does not stay at 85%. It widens as the corpus grows.

Which is the right moment to be clear about something I have been sloppy about.

This was never really about my vault

I keep saying “my vault” because it is the corpus I happen to have on hand. My own notes are the lab I run everything in. But hipocampo was never meant to be a tool for my 200 notes. At that size, honestly, Claude Code grepping the folder is perfectly fine, the token cost does not hurt anyone.

The thesis is about any repository of knowledge where the corpus is big enough that letting a model rummage through files stops being cheap: a company’s documentation, a large codebase, years of support history. The personal vault is the test rig, not the use case. The cost-and-scaling argument should be judged at that scale, not at mine, where it does not yet bite.

The uncomfortable part

The quality win, though, was not the clean victory the token number suggests.

When I checked the answers by hand, real Claude Code held its own, and on a fair share of questions it answered as well or better than hipocampo. Not because it is clever, but because hipocampo’s weak spot showed up: recall. When the orchestrator misreads a question, or a filter is too aggressive, hipocampo does the honest thing and goes quiet, “I don’t have that”, while Claude Code, given enough reads, eventually stumbles onto the note.

In a memory system, that is the worst possible failure. A confident wrong answer you can at least catch. A silent miss, where the note was sitting right there and the system simply did not surface it, is the one that quietly erodes your trust in the whole thing. Claude Code’s sin is the opposite, it over-reads and occasionally pads with something it should not, but it rarely just fails to find what is there.

So the honest scoreboard today: hipocampo wins decisively on cost and scaling, and is roughly a draw, sometimes behind, on actually getting the right answer.

I’m not going to call recall a “fixable detail”

I was tempted to. It is comforting to say “the architecture is right, recall is just a bug I’ll squash.” And some of it is bugs, I found a filter that could empty the result set and fixed it.

But the core of it is not a bug. Recall depends on the orchestrator correctly reading what kind of question it is being asked, and that is a model making a fuzzy judgement call. Making that reliable is a hard, open problem, the same class of thing as the graph I killed in Episode 5, not a line item on a TODO. I would rather name it honestly than promise a fix I have not earned yet.

A detour on measuring the fuzzy stuff

I did try to automate the quality judgement with an LLM-as-judge. It was too noisy to trust: it flagged true statements as hallucinations because it only saw a slice of the source, and its verdict drifted depending on which model was doing the judging. So I threw out the automated score and read the answers myself.

The lesson generalises: measure what you can measure cleanly and defend, tokens, and be deeply suspicious of using a model to judge something fuzzy just because it is convenient. A number you cannot trust is worse than no number, because it feels like evidence.

Where this actually leaves it

So here is the sober state of hipocampo, including the part that looks like a contradiction, because someone will spot it otherwise.

Hipocampo is still research, not a finished tool. It wins on cost and scales the right way, it ties or loses on recall, and my own daily driver is still Claude Code grepping the vault.

And that fixes the order of operations, which is the reverse of what I have been chasing. Close the recall gap first, get the answers to where they match Claude Code, and then the cost-and-scaling advantage is worth cashing in, by wiring hipocampo into a real workflow on a corpus big enough to need it. Cost without recall is just a cheaper way to miss things.

If there is a thread through all eight episodes, it is not the graph or the orchestrator. It is that every time I fell for an idea because it was elegant, measuring it talked me back down, and the project got better for it. The graph that lost, the hybrid that almost worked, the judge I could not trust, and now a token win I have to refuse to oversell. Building the ruler mattered more than anything I measured with it.

Next: recall, attacked with the same honesty as the graph. The brain is cheap and it scales. Now it has to stop going quiet when it actually knows.