Home Blog Tags About

Memory Ep. 7: plan before you search

Last episode ended with the realisation that had been hiding from me for weeks: I was tuning retrieval scores when the real problem was that the system had no idea what kind of question it was being asked. It treated “who is Ana’s brother?” exactly like “summarise everything I know about databases,” and those two need completely different moves.

So I stopped touching the retrieval math and built the layer that was missing: a planner that runs before any search happens.

The idea is small and it changed everything. Before retrieving a single note, the question goes through one step whose only job is to produce a plan. Not the answer, the plan: what kind of question is this, and how should we go and get it?

A few intents, each with a completely different move:

  • Direct / factual (“what’s the deadline for X?”): one narrow search, a few notes, done.
  • Relational (“who is Ana’s brother?”): this needs two hops, not a bigger search.
  • Aggregation (“everything I know about databases”): cast wide, pull many notes, expect overlap.
  • Episodic (“what did I work on last week?”): this is not a similarity question at all. Filter by date, return everything in the window, ranked by time, not by how well the words match.
  • Conversational (“thanks, that’s it”): don’t search at all. The cheapest win is knowing when to do nothing.

Same vault, same retrieval engine underneath. What changes is the plan: how wide to cast, how many notes, whether to filter by time, whether to chain a second search. The plan turns one dumb search box into something that adapts to the question.

The opinionated bit: the plan lives in code

Here is the decision I will defend hardest, because it is the one most people do the other way.

The popular pattern is: give the model a search tool and let it decide when and how to use it. Hand it search() and trust that, mid-answer, it will choose to call it, with good terms, the right number of times. I tried exactly that first. It was unreliable in the ways that matter: sometimes it answered confidently from nothing rather than searching, sometimes it searched with terrible terms, sometimes it stopped after one hop when the answer needed two. The behaviour I needed most, “always look before you speak, and look the right way,” was precisely the one I could not count on.

So I inverted it. Step 0 is not optional and not the model’s call. It always runs, and it produces an explicit, structured plan, intent, search terms, filters, how many results, whether to chain. The model fills that plan in within tight rails; the control flow, what actually happens next, is owned by plain code. The orchestrator decides to search, to chain, to filter, to stop. The model advises; the code is in charge.

And the whole thing is biased toward recall on purpose: when the planner is unsure, it searches, and it searches wider rather than narrower. I would much rather carry a few irrelevant notes than miss the one that mattered. Missing is silent and fatal; a bit of extra context is cheap and visible.

Multi-hop, finally

This is where “who is Ana’s brother?” finally works.

The planner reads it as relational, so the code runs the two-step dance: first search finds the note about Ana, the orchestrator reads who is linked there as her brother, and then it chains a second search for that person and answers from their note. The first search was never the answer. It was how the system found out what to ask next.

No cosine score, no PageRank, no blend ever produced that, because it is not a ranking problem. It is a tiny bit of reasoning about how to look, and that belongs in the orchestrator, not in the similarity metric I had spent three episodes polishing.

Where this left me

This is the layer that made the thing finally feel like memory instead of search. Ask it something indirect and it goes and works for the answer, the way you would dig through your own notes: find the thread, follow it, come back.

It is not magic. The planner is still a model reading a question, and it can misread one, decide a relational question is just factual and skip the hop it needed. Making that step consistent turned out to be its own small saga (there is a reproducibility lesson in there I will come back to). But the architecture was finally right: understand the question, then retrieve, with the control in code.

Next

Which left one uncomfortable question I had been carefully not asking. I had built embeddings, a graph, a hybrid, and now an orchestrator with intent and multi-hop. A lot of machinery.

But the way I actually use my vault day to day is much dumber: I open Claude Code and let it grep and read files. So before getting attached to all this, I owed myself one honest test, against that, not against a toy. Is my orchestrator actually better than just grepping the vault?

That is the next episode, and the answer was not the clean win I was hoping for.

×