Getting my hands dirty with the Anthropic SDK

When I started building my first real agent, the obvious move was to grab LangChain or one of the other frameworks. Everyone uses them. There are tutorials everywhere. You can have something “working” in 20 minutes.

I tried that for two days. Then I deleted everything and started over with the raw SDK.

Not because the frameworks are bad. Because I realised I didn’t understand what was actually happening, and that felt like a problem I’d pay for later.

What I didn’t know

I’d read about tool use. I understood the concept. But there’s a difference between understanding something and having written it yourself.

The model doesn’t “call a tool”. It responds with a message that says “I’d like to call this tool with these arguments”: a tool_use block carrying the tool’s name, its input, and an id. You call the tool. You send the result back as a tool_result that quotes that id. The model reads it and decides what to do next. That loop is the whole thing.
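To make that concrete, here is roughly what one round of the exchange looks like as plain data. Treat it as a sketch: the get_weather tool, its arguments, and the id are invented for illustration, and the SDK hands you typed objects rather than dicts, though the fields carry the same names.

```python
# Hypothetical exchange; the tool name, arguments and id are invented.

# What the model sends back: a request, not an action.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "Let me check the weather."},
        {
            "type": "tool_use",
            "id": "toolu_example_123",
            "name": "get_weather",
            "input": {"city": "Berlin"},
        },
    ],
}

# What you send back after running the tool yourself.
user_turn = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_example_123",  # must match the tool_use id
            "content": "14°C, light rain",
        }
    ],
}


def pair_ids(assistant, user):
    """Return (requested ids, answered ids) so mismatches are easy to spot."""
    asked = [b["id"] for b in assistant["content"] if b["type"] == "tool_use"]
    answered = [
        b["tool_use_id"] for b in user["content"] if b["type"] == "tool_result"
    ]
    return asked, answered


asked, answered = pair_ids(assistant_turn, user_turn)
print(asked == answered)
```

The only thing tying a result to its request is the id, which is why a small check like pair_ids is handy when a conversation gets rejected or goes sideways.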

When a framework handles that for you, you miss the part where you realise: this is just messages. The model is stateless. The “memory” is the conversation history you’re explicitly building and passing on every request. There’s no magic.

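import anthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

# Defined elsewhere: tools (the tool definitions), messages (the conversation
# so far) and dispatch_tool (runs a tool by name and returns a string).
while True:
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the API; pick a limit that suits you
        tools=tools,
        messages=messages,
    )

    # The model finished its answer without asking for a tool.
    if response.stop_reason == "end_turn":
        break

    # Collect every tool request in this turn; there can be more than one.
    tool_calls = [b for b in response.content if b.type == "tool_use"]
    if not tool_calls:
        break

    # The assistant turn goes into the history before the results do.
    messages.append({"role": "assistant", "content": response.content})

    # Run each requested tool and pair the result with its request id.
    results = []
    for tool_call in tool_calls:
        result = dispatch_tool(tool_call.name, tool_call.input)
        results.append({
            "type": "tool_result",
            "tool_use_id": tool_call.id,
            "content": result,
        })

    # Tool results go back to the model as a user message.
    messages.append({"role": "user", "content": results})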

That’s about 30 lines. Once you’ve written it, you own it. You know exactly what’s in messages at every step. When something breaks at 3am you know where to look.
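The loop leans on a dispatch_tool function that I haven’t shown. One way to write it, as a minimal sketch: a dict that maps tool names to ordinary Python functions. The two tools and the TOOL_REGISTRY name are made up for the example; the only real requirement is that the names match whatever you declared in tools.

```python
import json


# Hypothetical tools; real ones would hit an API, a database, the filesystem.
def get_weather(city: str) -> str:
    return f"Weather in {city}: 14°C, light rain"


def add(a: float, b: float) -> str:
    return str(a + b)


# Registry: tool name (as declared in the tools list) -> Python callable.
TOOL_REGISTRY = {"get_weather": get_weather, "add": add}


def dispatch_tool(name: str, tool_input: dict) -> str:
    """Run the named tool with the model's arguments and return a string."""
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        # Return text rather than raising, so the model can read the problem.
        return json.dumps({"error": f"unknown tool: {name}"})
    return handler(**tool_input)


print(dispatch_tool("add", {"a": 2, "b": 3}))
print(dispatch_tool("nope", {}))
```

Returning a string for the unknown-tool case is a design choice, not a rule. It keeps the loop alive and lets the model see what went wrong instead of crashing the whole run.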

The debugging argument

This is where I feel most strongly about it.

When you’re debugging an agent that’s misbehaving, the first question is always: what did the model actually see? What was in the conversation at the point where it made the wrong decision?

If you wrote the loop yourself, you can log the messages list and answer that question in 30 seconds. You can reproduce the exact state, change one thing, and see if it fixes it.
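Logging and replaying can be as small as the sketch below. It assumes the SDK’s response blocks are pydantic models with a model_dump() method, and that anything you built yourself is already a dict or a string. snapshot, restore and to_jsonable are names I made up, and you may need to strip fields from the dumped blocks before sending them back.

```python
import json


def to_jsonable(block):
    """Turn a content block into plain data.

    Assumption: SDK response blocks expose model_dump(); blocks you built
    yourself are already dicts or strings and pass through unchanged.
    """
    if hasattr(block, "model_dump"):
        return block.model_dump()
    return block


def snapshot(messages):
    """Return the conversation as a JSON string you can log or save."""
    plain = []
    for message in messages:
        content = message["content"]
        if isinstance(content, list):
            content = [to_jsonable(b) for b in content]
        plain.append({"role": message["role"], "content": content})
    return json.dumps(plain, indent=2, ensure_ascii=False)


def restore(text):
    """Load a saved snapshot back into a messages list for a replay."""
    return json.loads(text)


# Hypothetical history, just to show the round trip.
history = [
    {"role": "user", "content": "What's the weather in Berlin?"},
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_example_123",
         "name": "get_weather", "input": {"city": "Berlin"}},
    ]},
]
saved = snapshot(history)
replay = restore(saved)
print(replay == history)
```

Call snapshot(messages) just before each request and you have a record of exactly what the model saw at every step. To test a fix, restore one of those records, edit a single message, and send it again.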

If the loop lives inside a framework, you’re reading documentation to figure out how to get access to the internals. You’re not debugging your agent; you’re debugging your understanding of the framework.

For a toy project that’s fine. For something running in production with real data, I want the debugging surface to be as small and transparent as possible.

What I’d say to someone starting out

Use the SDK first. Write the loop yourself. Break it deliberately. Feed it bad tool results and watch what happens. Put the model in a situation where it has to make three tool calls in a row and inspect every message in the chain.
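For the “feed it bad tool results” part, one approach is to wrap the tool call so that a failure becomes a tool_result marked with is_error, then watch how the model reacts. This is a sketch under assumptions: run_tool_safely and flaky_dispatch are hypothetical names, and the broken tool exists only to force the error path.

```python
def run_tool_safely(tool_use_id, name, tool_input, dispatch):
    """Build a tool_result block, turning exceptions into error results."""
    try:
        content = dispatch(name, tool_input)
        is_error = False
    except Exception as exc:
        # Pass the failure to the model as text it can reason about.
        content = f"{type(exc).__name__}: {exc}"
        is_error = True
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": content,
        "is_error": is_error,
    }


# A deliberately broken tool for experiments (hypothetical).
def flaky_dispatch(name, tool_input):
    raise TimeoutError("upstream service did not respond")


result = run_tool_safely(
    "toolu_example_123", "get_weather", {"city": "Berlin"}, flaky_dispatch
)
print(result)
```

Swap flaky_dispatch in for your real dispatcher for a run or two and read the next assistant message. Does it retry, apologise, or invent an answer? That tells you a lot about what your prompt and tool descriptions need to say.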

Once you’ve done that, you’ll know what the frameworks are abstracting. Then you can make an informed decision about whether you want that abstraction or not.

I’m not against frameworks in general. But I think there’s a version of building with AI where you skip the foundations and end up with something that works until it doesn’t, and you have no idea why. I’d rather spend an extra week understanding the plumbing.