Inside the Memory Layer

How I built a persistent AI memory layer — MCP server, control panel, embeddings, and a query UI that turns conversation history into usable knowledge.

I didn’t set out to build infrastructure. I set out to stop repeating myself to Claude every morning. Three months later, the thing I built runs on a cloud VM, serves all the tools I use through a single API, and includes a control panel, a query interface, and its own embedding pipeline. This is how it came together: the decisions I made, the architecture, and the parts that didn’t go to plan.

The foundation: an MCP server on Cloud Build

The whole system runs as an MCP server, built on Google Cloud Build and deployed alongside my AI assistant on a cloud VM. Model Context Protocol has become the most widely adopted standard for connecting AI tools to external capabilities, and I chose it because every tool I care about already speaks it.

Why cloud and not local? Local servers die when your laptop sleeps, which is a dealbreaker when scheduled tasks, background processing, and autonomous operations all depend on the system being reachable at 3am. Cloud-hosted means Claude Code, Claude Desktop, Google’s CLI, and Claude Co-Work all connect to the same persistent instance.

The control panel

I built a control panel to handle the operational side. Four things turned out to be essential.

Workspaces. Different projects need different contexts, so the system supports isolated workspaces, each with its own facts, notes, and knowledge graph. Switch projects, and the tools automatically pull from the right workspace.
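
As a rough illustration of workspace isolation (not the actual implementation), here is a minimal in-memory sketch. The names `Workspace` and `MemoryStore`, and the idea of storing the knowledge graph as triples, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """One isolated context with its own facts, notes, and graph edges."""
    name: str
    facts: dict[str, str] = field(default_factory=dict)
    notes: list[str] = field(default_factory=list)
    # Assumption: the knowledge graph is kept as (subject, relation, object) triples.
    edges: set[tuple[str, str, str]] = field(default_factory=set)


class MemoryStore:
    """Hypothetical store that keeps every workspace separate."""

    def __init__(self) -> None:
        self._workspaces: dict[str, Workspace] = {}

    def workspace(self, name: str) -> Workspace:
        # Create on first use so switching projects needs no setup step.
        if name not in self._workspaces:
            self._workspaces[name] = Workspace(name)
        return self._workspaces[name]


store = MemoryStore()
store.workspace("client-a").facts["deadline"] = "Friday"
store.workspace("client-b").notes.append("Kickoff call moved to Monday")
```

Because each tool call names a workspace, a fact written for one project never shows up in another project's lookups.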

Permissions. Every tool that connects gets explicit permissions: read, write, or admin. My tools get write access to my workspace, while my wife's API key authorises her tools to do the same in hers. A reporting dashboard that only needs to pull data can be limited to read access. When you are giving AI tools access to everything you know, you want granular control over who can change what.
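
A three-tier check like this can be sketched in a few lines. The grant table, the tool names, and the assumption that the tiers are cumulative (write implies read, admin implies both) are illustrative choices, not details from the real system.

```python
from enum import IntEnum


class Permission(IntEnum):
    READ = 1
    WRITE = 2
    ADMIN = 3


# Hypothetical grants: (tool, workspace) -> highest permitted tier.
GRANTS = {
    ("claude-code", "mine"): Permission.WRITE,
    ("dashboard", "mine"): Permission.READ,
}


def is_allowed(tool: str, workspace: str, needed: Permission) -> bool:
    """Return True if the tool's grant on this workspace covers `needed`."""
    granted = GRANTS.get((tool, workspace))
    # No grant means no access. Tiers are ordered, so WRITE also covers READ.
    return granted is not None and granted >= needed
```

With this table, `is_allowed("dashboard", "mine", Permission.WRITE)` is false, and any tool asking about a workspace it has no grant for is refused outright.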

Session management. Every time a tool queries the system, it registers as a live session. I can see in real time which tools are hitting which workspaces, what they are pulling, and how often. Sessions get bundled, and the system automatically extracts what is worth keeping, saves it as new memory, and discards the rest. Without this, the knowledge graph would be drowning in conversational noise within a month.
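
The bundle-then-filter step could look something like the sketch below. How the real system decides what is worth keeping is not described here, so the example takes that decision as a caller-supplied `keep` function. The event shape and all names are hypothetical.

```python
from collections import defaultdict
from typing import Callable

# Assumption: each event is a (session_id, text) pair.
Event = tuple[str, str]


def consolidate(events: list[Event], keep: Callable[[str], bool]) -> dict[str, list[str]]:
    """Bundle events by session, then keep only the texts `keep` approves."""
    bundles: dict[str, list[str]] = defaultdict(list)
    for session_id, text in events:
        bundles[session_id].append(text)

    memories: dict[str, list[str]] = {}
    for session_id, texts in bundles.items():
        kept = [t for t in texts if keep(t)]
        if kept:  # sessions with nothing worth keeping are dropped entirely
            memories[session_id] = kept
    return memories


events = [
    ("s1", "thanks, that works"),
    ("s1", "deadline for the brief is Friday"),
    ("s2", "ok"),
]
# Stand-in filter for the example; a real one would be far less crude.
memories = consolidate(events, keep=lambda text: "deadline" in text)
```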

Conflict resolution. When multiple tools write to the same store, they occasionally record conflicting facts. One might log a project deadline as Friday, while another might log it as Thursday from a different conversation. The system flags contradictions and surfaces them for resolution rather than silently keeping both. I initially treated this as a nice-to-have, but within a week conflicting facts were degrading every tool’s output. Conflict resolution is infrastructure, and you should build it early.
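
The simplest form of this check is to group facts by what they describe and flag any key that has more than one distinct value. The sketch below assumes facts are stored with an explicit key, which is a simplification. The `Fact` fields and the sample data are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    key: str     # what the fact is about, e.g. "project-x.deadline"
    value: str
    source: str  # which tool recorded it


def find_conflicts(facts: list[Fact]) -> dict[str, list[Fact]]:
    """Group facts by key and return the keys with more than one distinct value."""
    by_key: dict[str, list[Fact]] = defaultdict(list)
    for fact in facts:
        by_key[fact.key].append(fact)
    return {
        key: group
        for key, group in by_key.items()
        if len({f.value for f in group}) > 1
    }


facts = [
    Fact("project-x.deadline", "Friday", "claude-code"),
    Fact("project-x.deadline", "Thursday", "claude-desktop"),
    Fact("project-x.status", "in progress", "claude-code"),
]
conflicts = find_conflicts(facts)
```

Here only the deadline is flagged, and both versions come back with their source attached, which is what a person needs in order to pick the right one.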

Embeddings: text and multimodal

The system uses Google’s embedding models: embedding-001 for text, embedding-002 for multimodal content. Text embeddings handle the bulk of the work through semantic search across notes, facts, and conversation history. Multimodal lets me store and retrieve from images, diagrams, and documents that are not purely text.

I went with Google over OpenAI here because the cost difference compounds fast. Every session, every fact, and every note gets embedded, and Google's pricing is meaningfully cheaper per million tokens for comparable retrieval quality.

Having both models in the same pipeline means I can search across everything in a single query. Ask about a project, and you get relevant notes, conversation fragments, and the diagram from a whiteboard session three weeks ago. That cross-modal retrieval is what makes it behave like actual recall rather than keyword matching.
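
To make the single-query idea concrete, here is a toy sketch of ranking mixed content by cosine similarity. It assumes text and image vectors live in one comparable vector space, and it uses hand-written three-dimensional vectors in place of real embeddings. The index entries and function names are invented for the example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy index of (kind, label, vector). Real vectors would come from the embedding models.
INDEX = [
    ("note", "Project X kickoff notes", [0.9, 0.1, 0.0]),
    ("image", "Whiteboard diagram, Project X", [0.8, 0.2, 0.1]),
    ("conversation", "Chat about holiday plans", [0.0, 0.1, 0.9]),
]


def search(query_vec: list[float], top_k: int = 2) -> list[tuple[float, str, str]]:
    """Rank every item, whatever its kind, against one query vector."""
    scored = [(cosine(query_vec, vec), kind, label) for kind, label, vec in INDEX]
    return sorted(scored, reverse=True)[:top_k]


results = search([1.0, 0.0, 0.0])
```

The ranking never looks at the `kind` field, so a note and a diagram about the same project surface side by side while the unrelated conversation falls away.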

The query UI

Once the core was stable, I built a query interface on top of it. This uses Haiku, Anthropic’s smallest and fastest model, as a lightweight LLM layer between the user and the database.

It does not return raw results. It summarises what it finds, answers the question directly, and shows its working. Each piece of evidence can be saved in a “pack,” a curated collection of facts, quotes, and references on a specific topic.

The query UI also supports cross-thread search, pulling information across different conversation histories, sessions, and workspaces, which means I can find what I discussed in Claude Code last Tuesday while I am working in Claude Desktop today. The external memory layer pays for itself here.

Packs turned out to be the feature I reach for most. When I am preparing for a meeting or writing a brief, I query the system, save relevant evidence into a pack, and ask it to summarise the pack into a structured brief, which turns what used to be an hour of work into a few minutes. That alone would have justified the build.
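
A pack is a simple structure at heart. The sketch below shows one way to model it, assuming each piece of evidence carries its text and where it came from. The class names, the deduplication rule, and the `to_prompt` rendering are illustrative rather than taken from the real system.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    text: str
    source: str  # e.g. the session or workspace it came from


@dataclass
class Pack:
    topic: str
    items: list[Evidence] = field(default_factory=list)

    def add(self, item: Evidence) -> None:
        # Skip exact duplicates so repeated queries don't bloat the pack.
        if item not in self.items:
            self.items.append(item)

    def to_prompt(self) -> str:
        """Render the pack as plain text to hand to a summarising model."""
        lines = [f"Topic: {self.topic}", "Evidence:"]
        lines += [f"- {e.text} (source: {e.source})" for e in self.items]
        return "\n".join(lines)


pack = Pack("Project X review")
pack.add(Evidence("Deadline moved to Friday", "claude-code session"))
pack.add(Evidence("Deadline moved to Friday", "claude-code session"))  # ignored duplicate
prompt = pack.to_prompt()
```

The rendered text is what would be passed to the model with an instruction to turn it into a structured brief.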

What I’d change

I underestimated the operational cost of schema decisions early on, because the way you structure facts, relationships, and metadata in the first week shapes every query you write for the next six months. I refactored the schema twice, and both times it was painful enough that if I were starting over, I would spend a full week on schema design before writing a single line of application code.

I also overbuilt the permissions system at first. Three tiers (read, write, admin) turned out to be the right granularity, but my first version had seven permission levels that nobody needed. Simplify early; you can always add complexity when a real use case demands it.

Next: connecting to a live workspace

With the core built, the control panel working, and the query UI live, the next post covers what happens when you connect all of this to a real workspace (Drive, calendar, email) and let it operate autonomously. That is where the system stops being a tool and starts being infrastructure.