Week 1: 508 Tools, a Talking Agent, and $60/Month

TL;DR

Built a self-hosted AI platform with 88+ Docker containers across 2 servers, serving 500+ tools to Claude through a custom pipeline I designed from scratch.

Gave my AI agent (Jerry) the ability to make and receive phone calls, manage calendars, send emails, and run my household shopping/chores.

Designed two architectures that appear to independently parallel cutting-edge MIT research on recursive context management for LLMs.

All of this runs on two $30/month VPS servers. No cloud. No vendor lock-in. Just Docker and stubbornness.

18+ Projects Shipped
508 AI Tools Online
88+ Docker Containers
500+ Tool Calls Made
Days of Work
$60/mo Total Server Cost

The Big Picture

Here’s the setup: I talk to Claude (Anthropic’s AI) through their chat interface. But instead of just having a conversation, I’ve built a pipeline of servers that gives Claude access to over 500 tools — it can manage my Docker containers, edit files on my servers, search my knowledge base, deploy new services, post to LinkedIn, manage my household shopping list, and now make phone calls. All self-hosted. All running on two VPS nodes connected by WireGuard VPN.

This week was about going from “it works” to “it works well, it’s resilient, and it can grow.” I rebuilt core pipeline components, designed the next-generation architecture, gave my AI agent a voice, and shipped a full household management app. Here’s the breakdown.

Part 1 — The MCP Pipeline

⚙️ Custom Dual-Upstream Tool Filter

This is the heart of the whole system. MCP (Model Context Protocol) is how Claude connects to external tools. I run 25+ backend servers that each provide different tools (Docker management, GitHub, file operations, databases, etc.), and they all feed into a central aggregator called ContextForge.

The problem was: I wanted to add a second tool source (MAGG, a different aggregator), but the tool filter — the component that sits between Claude and all those backends — only supported one upstream. So I rewrote it from scratch.

The new server.js establishes independent sessions with both ContextForge and MAGG, polls MAGG every 15 seconds for new tools, deduplicates them, and merges everything into a unified search layer. When Claude asks for tools, the filter uses vector embeddings to find the 15 most relevant ones out of 508, cutting token usage by ~95%.
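Here's a minimal Python sketch of the merge-and-search step described above. The real filter is a Node server.js, so treat this as an illustration only: the function names, the tool dictionary shape, and the letter-count `toy_embed` are all assumptions, and a real deployment would call an actual embedding model instead.

```python
import math
from typing import Callable

Tool = dict  # assumed shape: {"name": str, "description": str}


def merge_tools(primary: list[Tool], secondary: list[Tool]) -> list[Tool]:
    """Merge two upstream tool lists, keeping the first tool seen per name."""
    merged: dict[str, Tool] = {}
    for tool in primary + secondary:
        merged.setdefault(tool["name"], tool)
    return list(merged.values())


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, tools: list[Tool],
          embed: Callable[[str], list[float]], k: int = 15) -> list[Tool]:
    """Return the k tools whose descriptions are most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(t["description"])), t) for t in tools]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:k]]


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding (letter counts). Not a real semantic embedding."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts
```

In practice the embeddings for tool descriptions would be computed once and cached, so only the query is embedded per request.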

Getting it working required fixing a cascade of issues: Accept headers that the upstream expected but nobody documented, a container rename that silently broke DNS resolution, a port mismatch cached from a stale Docker build, and a MAGG healthcheck targeting the wrong endpoint. Each one was invisible until the next one appeared.

Why it matters: As far as I can tell, this is a novel architecture. I haven’t found anyone else running 500+ federated tools through semantic search at this scale on self-hosted infrastructure. The dual-upstream pattern means I can add entirely new tool ecosystems without touching the existing pipeline.

⚡ Dynamic Provisioning Architecture

Here’s the scaling problem: 25+ always-on containers eat memory and CPU even when nobody’s using them. And adding a new MCP server requires restarting the tool filter, which kills ALL 508 tools temporarily. That’s a cascading failure from a single addition.

I designed (but haven’t built yet) a serverless MCP platform that fixes both problems. The core idea: separate tool advertisement from tool availability.

A static JSON manifest lists every tool from every server — including servers that aren’t running. Claude sees all tools as available. When it actually calls one, a provisioner checks if the target container is running. If yes, forward the call. If no, start the container (takes 2–3 seconds), wait for healthy, then forward. After 5 minutes of inactivity, stop it again.

Three pool tiers: Hot (always running: 4–6 critical tools), Warm (stopped but ready: most servers, fast cold start), and Cold (not even pulled yet: deploy on first call). This means I could have 200 MCP servers defined with as few as 6 running when the system is idle.
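Since the provisioner is still a design, here is one way the call path and idle reaping could look in Python. Everything here is a sketch under assumptions: `InMemoryRuntime` stands in for a real container runtime, the manifest is reduced to a tool-to-container mapping, and names like `mcp-github` are hypothetical. Cold-tier image pulls are left out.

```python
import time
from typing import Callable


class InMemoryRuntime:
    """Stand-in for a container runtime; a real one would talk to Docker."""

    def __init__(self) -> None:
        self.running: set[str] = set()

    def is_running(self, name: str) -> bool:
        return name in self.running

    def is_healthy(self, name: str) -> bool:
        return name in self.running  # assumption: healthy as soon as started

    def start(self, name: str) -> None:
        self.running.add(name)

    def stop(self, name: str) -> None:
        self.running.discard(name)


class Provisioner:
    def __init__(self, runtime, manifest: dict[str, str],
                 hot: frozenset[str] = frozenset(),
                 idle_seconds: float = 300.0, start_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.runtime = runtime
        self.manifest = manifest        # tool name -> container name
        self.hot = hot                  # containers that are never stopped
        self.idle_seconds = idle_seconds
        self.start_timeout = start_timeout
        self.clock = clock
        self.last_used: dict[str, float] = {}

    def call(self, tool: str, forward: Callable[[str], object]) -> object:
        """Start the backing container if needed, then forward the call."""
        container = self.manifest[tool]
        if not self.runtime.is_running(container):
            self.runtime.start(container)
            self._wait_healthy(container)
        self.last_used[container] = self.clock()
        return forward(container)

    def _wait_healthy(self, container: str) -> None:
        deadline = self.clock() + self.start_timeout
        while not self.runtime.is_healthy(container):
            if self.clock() > deadline:
                raise TimeoutError(f"{container} did not become healthy")
            time.sleep(0.1)

    def reap_idle(self) -> list[str]:
        """Stop non-hot containers idle for at least idle_seconds."""
        now = self.clock()
        stopped = []
        for container, used in list(self.last_used.items()):
            if container in self.hot:
                continue
            if now - used >= self.idle_seconds:
                self.runtime.stop(container)
                del self.last_used[container]
                stopped.append(container)
        return stopped
```

`reap_idle` would run on a timer; the injectable `clock` exists only so the idle logic can be exercised without waiting five real minutes.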

The really exciting part: n8n automation workflows that Claude itself can trigger to deploy, update, promote, or remove MCP servers. For standard servers (90% of them), Claude can deploy a new tool server fully autonomously — no human in the loop.

Why it matters: This is basically building a serverless platform for AI tools. I haven’t seen anyone in the MCP ecosystem doing this yet, but everyone running large tool deployments will need something like it. It’s also a step toward self-expanding AI infrastructure, where the AI can grow its own capabilities.

Part 2 — Context & Memory

📚 Knowledge Base Gateway

Every time I start a new conversation with Claude, it has no memory of what we did before. My first solution: a searchable knowledge base of 13 markdown documents covering my entire infrastructure — servers, containers, ports, domains, architecture decisions. Claude can read, search, and update these docs through MCP tools.

The key insight is the feedback loop: Claude reads the docs to understand the infrastructure, does work on the infrastructure, then updates the docs to reflect what changed. The documentation stays accurate because it’s maintained by the same entity that operates on the systems.

I also built a sneaky trick: a meta-tool whose description contains a 200-token infrastructure summary. Since Claude receives all tool descriptions in every conversation, it passively gets baseline context without anyone explicitly loading anything. Zero-cost, automatic awareness.
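To make the meta-tool trick concrete, here is a rough Python sketch of a tool definition that carries the summary in its description. The `name`, `description`, and `inputSchema` fields follow the MCP tool definition shape; the tool name, the description wording, and the 4-characters-per-token estimate are assumptions for illustration.

```python
def build_context_tool(summary: str, max_tokens: int = 200) -> dict:
    """Build a tool definition whose description doubles as baseline context."""
    # Rough heuristic (assumption): about 4 characters per token.
    approx_tokens = (len(summary) + 3) // 4
    if approx_tokens > max_tokens:
        raise ValueError(
            f"summary is ~{approx_tokens} tokens, budget is {max_tokens}"
        )
    return {
        "name": "infrastructure_context",  # hypothetical tool name
        "description": f"Returns infrastructure docs. Baseline context: {summary}",
        "inputSchema": {"type": "object", "properties": {}},
    }
```

The budget check matters because the description is sent with every conversation, so any growth in the summary is paid for on every request.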

🧠 Recursive Context Engine (Design)

This is the most ambitious design from the week, and it connects to cutting-edge academic research in a way I didn’t expect.

In January 2026, MIT’s CSAIL published a paper on Recursive Language Models (RLMs) — a framework where the LLM’s input is stored as a variable in a persistent Python REPL, and the model interacts with it programmatically instead of trying to hold everything in memory at once. It handles inputs up to 10 million tokens and is being called “the paradigm of 2026.”

My design independently mirrors several RLM concepts but applies them to a different problem. MIT’s RLM handles single-session long-context tasks (analyzing a massive document). Mine handles multi-session persistent context (maintaining coherent operational state across hundreds of conversations over weeks).

The architecture has three tiers. Tier 1 (Hot): a master context document that gets loaded every session — active projects, recent decisions, infrastructure state. Tier 2 (Warm): a ChromaDB vector database of historical context — completed projects, past decisions, searchable by semantic similarity. Tier 3 (Cold): raw session logs, never modified, the source of truth.

The magic is the compression worker: after every session, an LLM reads the current master context plus the new session summary and decides what to keep, archive, merge, or discard. Over time, the context becomes self-improving — each iteration produces a better briefing for the next session. Stale info falls off. Important patterns get reinforced.
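The engine is still a design, so the following is only a sketch of how one compression pass might be structured in Python. The `decide` callback stands in for the LLM, the warm tier is a plain list rather than ChromaDB, and "merge" is simplified to dropping exact duplicates. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ContextStore:
    hot: list[str] = field(default_factory=list)   # Tier 1: loaded every session
    warm: list[str] = field(default_factory=list)  # Tier 2: archived history
    cold: list[str] = field(default_factory=list)  # Tier 3: raw logs, append-only


def compress(store: ContextStore, session_log: str, session_items: list[str],
             decide: Callable[[str], str]) -> None:
    """Run one post-session pass over the master context.

    decide(item) returns "keep", "archive", or "discard". In the real design
    an LLM would make this call; here it is any function you pass in.
    """
    # The raw log is always preserved and never modified afterwards.
    store.cold.append(session_log)

    next_hot: list[str] = []
    for item in store.hot + session_items:
        decision = decide(item)
        if decision == "keep":
            if item not in next_hot:  # simplified "merge": drop duplicates
                next_hot.append(item)
        elif decision == "archive":
            store.warm.append(item)
        # "discard": dropped from hot context, still recoverable from cold logs
    store.hot = next_hot
```

Because the cold tier is append-only, a bad compression decision is recoverable: the discarded detail still exists in the raw log.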

Why it matters: If this works, every new conversation with Claude would feel like continuing the last one. No re-explaining, no lost momentum, no 10-minute warmup of tool calls to get oriented. It’s the difference between a new surgeon reading scattered chart notes vs. receiving a perfect briefing from the last surgeon.

Part 3 — Jerry the AI Agent

📞 Voice Integration (Retell + Twilio)

Jerry is my AI agent — an OpenClaw instance running 24/7 on VPS1, reachable via Telegram and Discord. This week I gave him a phone number and taught him to make and receive calls.

The first attempt was a disaster. I configured OpenClaw’s native voice plugin with Twilio. Jerry called me using Amazon Polly’s default robot voice, said one sentence, and hung up. Turns out the plugin defaults to “notify” mode (deliver a message, disconnect) rather than “conversation” mode. Even after fixing that, Twilio’s built-in TTS sounded terrible.

The fix: I already had Retell AI set up for inbound calls with a natural-sounding voice. So I built a custom outbound tool — when Jerry needs to call someone, he hits Retell’s API instead of going through Twilio’s TTS. Same voice, same quality, same low latency for both directions.

Getting the webhook working required its own debugging adventure. The Traefik reverse proxy config had separate HTTP and HTTPS routers, and the HTTP router was intercepting Let’s Encrypt’s certificate challenges and forwarding them to Jerry’s backend (which returned 404). No cert = 502 Bad Gateway on every inbound call. Had to consolidate to a single router matching the pattern used by my other services.

Jerry can now: text chat (Telegram/Discord), make and receive phone calls, read/write Google Calendar, send emails, manage shopping lists, track chores, post to LinkedIn, and access infrastructure tools. He’s a full multi-modal assistant running entirely on my own servers.

🔒 Security Hardening & Containerization

Jerry was running as root. That’s fine for prototyping but terrifying for a production agent that has API access, can execute shell commands, and manages household data.

I migrated him to a dedicated “jerry” Linux user with a security model I call “lock the vault, open the house.” Critical infrastructure (Docker compose files, root configs, WireGuard keys) got locked down with chmod 700. Everything else stays open — Jerry needs broad capabilities to be useful, so I hardened the things that could break the infrastructure rather than restricting everything.

This immediately surfaced 18 hardcoded /root/ paths in Jerry’s session file. Every Telegram message triggered a permission error because Jerry was trying to write session data to root’s home directory. Had to sync files, rewrite paths, fix permissions on auth profiles. Then I designed (but haven’t built yet) a Docker container to replace the systemd services entirely.

Why it matters: This is the real-world lifecycle of deploying an autonomous AI agent: prototype fast as root, realize that’s dangerous, harden in production, discover all the assumptions baked into the prototype. It’s a pattern that’s going to play out at every company deploying AI agents.

Part 4 — Project Gnome (Household AI)

🏠 Full-Stack Household Management

Project Gnome started as a shopping list bot and turned into a complete household management system. My wife Ashley and I use it daily through Jerry on Telegram.

The backend is a FastAPI app with PostgreSQL, deployed at gnome-api.millyweb.com. It parses natural language: “add 2 gallons of milk” correctly identifies quantity (2), unit (gallons), item (milk), and infers category (grocery). Four categories: grocery, household, personal, kid. Two user aliases: “Early Bird” (me) and “Night Owl” (Ashley).
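For a feel of what that parsing involves, here is a small regex-based Python sketch. It is not the Gnome API's actual parser: the unit list, the keyword table, and the default category are assumptions, and a production version could just as easily use an LLM.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed vocabularies, for illustration only.
UNITS = {"gallon", "gallons", "box", "boxes", "can", "cans", "bag", "bags",
         "pack", "packs", "bottle", "bottles", "lb", "lbs"}
CATEGORY_KEYWORDS = {
    "grocery": ("milk", "eggs", "bread", "soup"),
    "household": ("paper towels", "detergent", "trash bags"),
    "personal": ("shampoo", "toothpaste"),
    "kid": ("diapers", "wipes"),
}
PATTERN = re.compile(
    r"^add\s+(?:(?P<qty>\d+(?:\.\d+)?)\s+)?(?:(?P<unit>[a-z]+)\s+of\s+)?(?P<item>.+)$",
    re.IGNORECASE,
)


@dataclass
class ParsedItem:
    quantity: float
    unit: Optional[str]
    item: str
    category: str


def infer_category(item: str) -> str:
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in item for keyword in keywords):
            return category
    return "grocery"  # assumed default when nothing matches


def parse_add(text: str) -> Optional[ParsedItem]:
    """Parse 'add [qty] [unit of] item'; return None for anything else."""
    match = PATTERN.match(text.strip())
    if match is None:
        return None
    quantity = float(match.group("qty")) if match.group("qty") else 1.0
    unit = match.group("unit")
    item = match.group("item").strip().lower()
    if unit is not None:
        unit = unit.lower()
        if unit not in UNITS:
            # "cream of mushroom soup": the word before "of" is not a unit.
            item = f"{unit} of {item}"
            unit = None
    return ParsedItem(quantity, unit, item, infer_category(item))
```

With these tables, `parse_add("add 2 gallons of milk")` yields quantity 2.0, unit "gallons", item "milk", and category "grocery".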

The dashboard is a React web app at gnome.millyweb.com with four tabs: Home (scores, chores, leaderboard, achievements, budget summary), Shopping (full CRUD), Trades (chore swapping between us), and Budget (spending history with category breakdown).

The automation layer ties it together: n8n workflows monitor spending and fire Telegram alerts at 75%, 90%, and 100% of budget. A Sunday weekly digest summarizes everything. Jerry has a meal planning skill that suggests 5–7 dinner plans based on our family’s preferences (I do most cooking and like hearty meals; Ashley prefers comfort food with simple prep; our toddler Legend needs soft, mild foods).
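The budget alerts in that automation layer reduce to a small threshold check. The workflows themselves live in n8n, so this Python version is only a sketch of the logic: it assumes amounts are tracked in integer cents and that an alert should fire once, when a purchase moves spending across a threshold.

```python
THRESHOLD_PERCENTS = (75, 90, 100)


def crossed_thresholds(previous_cents: int, new_cents: int,
                       budget_cents: int) -> list[int]:
    """Return the thresholds crossed when spending moves from previous to new.

    Integer arithmetic avoids floating-point edge cases at exactly 75% or 90%.
    """
    if budget_cents <= 0:
        raise ValueError("budget must be positive")
    return [
        pct for pct in THRESHOLD_PERCENTS
        if previous_cents * 100 < pct * budget_cents <= new_cents * 100
    ]
```

Comparing against the previous total is what keeps the same alert from firing again on every later purchase.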

The chore trading system is the part Ashley actually got excited about. Either of us can propose a swap (“I’ll take dishes tonight if you handle bath time”), and the other accepts or rejects through Telegram. The dashboard and chore lists automatically update to reflect trades.
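A chore swap like that is essentially a two-state decision applied to an assignment table. Here is a minimal Python sketch; the `Trade` fields and the chore-to-person dictionary are assumptions, not the actual Gnome schema.

```python
from dataclasses import dataclass


@dataclass
class Trade:
    proposer: str
    responder: str
    takes: str   # chore the proposer offers to take on
    gives: str   # chore the proposer wants the responder to handle
    status: str = "pending"


def resolve(trade: Trade, assignments: dict[str, str], accepted: bool) -> None:
    """Apply the responder's decision; swap chore owners only on acceptance."""
    if trade.status != "pending":
        raise ValueError(f"trade already {trade.status}")
    if accepted:
        assignments[trade.takes] = trade.proposer
        assignments[trade.gives] = trade.responder
        trade.status = "accepted"
    else:
        trade.status = "rejected"
```

For the example above, "Early Bird" proposes with `takes="dishes"` and `gives="bath time"`; if "Night Owl" accepts, the two chores change hands and the dashboard can read the updated assignments.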

Part 5 — Everything Else

🔧 Supporting Work

Coolify MCP Gateway: Consolidated two broken Coolify deployments into one working gateway with 69 tools. Claude can now manage its own deployment platform — a step toward recursive infrastructure management.

KB Documentation Audit: Rewrote 6 knowledge base documents and created a new “lessons learned” doc capturing 14 critical gotchas from production operations. Things like: Alpine Linux’s wget resolves localhost to IPv6 but your Node server only binds IPv4. Container renaming silently breaks Docker DNS. Tool filter cache requires full container recreation, not just restart.

Dart API Reverse Engineering: Extracted the dart-tools npm package, found the REST API hiding behind the MCP wrapper. Now Jerry and my n8n workflows can access Dart directly without MCP overhead.

Vaultwarden + MinIO + Infisical: Self-hosted password manager at vault.millyweb.com. Object storage standards for MinIO. Three-phase migration plan from plaintext .env files to proper secrets management via Infisical.

GitHub Restructuring: Split one combined repo into 12 separate repositories with standardized folder structures. Every project now has docs/, assets/, notes/, archive/.

Disk Space Recovery: VPS1 was running out of storage. Discovered BTRFS backup snapshots were keeping full copies of all Docker images (150GB+). Deleted obsolete snapshots, modified backup scripts to exclude the Docker subvolume. Freed 54GB. Key lesson: Docker images are reproducible from registries — backing them up is a waste.

What’s Next

1. Build the Recursive Context Engine: The three-tier system with ChromaDB and the compression worker. This is the highest-impact project because it fixes the fundamental problem of every conversation starting from scratch.

2. Build the Dynamic Provisioner: Move from 25+ always-on containers to on-demand provisioning. This unlocks scaling to 100+ tool servers without proportional resource costs.

3. Containerize Jerry: Get him out of bare-metal systemd services and into Docker for portability and reproducibility.

4. Write it up: Several of these systems parallel or extend recent academic work. The dual-upstream tool filter and recursive context engine in particular could be legitimate peer-reviewed papers.

The Meta-Point

Everything documented in this update was built by me and Claude working together through the very pipeline this update describes. Claude used its own tools to deploy its own infrastructure, then used its knowledge base tools to document what it built, then used its social media tools to post about it.

I can’t write production code. I describe systems and debug with AI assistance. The fact that this level of infrastructure exists on $60/month of VPS hosting, built by one person who describes themselves as “not a developer,” is either a testament to what’s possible with AI-assisted development or a warning about what happens when you give a determined hobbyist access to Claude at 2am. Probably both.

— Ryan Milly, Millyweb Development, February 2026

Building Something Similar?

I help small service businesses automate their operations with AI. If any of this sounds useful for your business, let’s talk.
