The Self-Hosted AI Stack I’d Build If I Were Starting Over

I took a winding road to get my current AI homelab working. I’d make different choices if I were starting from scratch, and most of them come down to doing less, and doing it sooner.

The first thing I’d do is separate the model serving layer from everything else. Ollama as a standalone container, exposed on a private network, nothing else bundled in. A lot of guides will tell you to start with a full OpenWebUI stack, and OpenWebUI is fine, but it creates a coupling that makes things harder to reason about later. If your UI and your model server are the same deployment, you end up with friction when you want to swap one out or add a second frontend. Keep them separate from the start.
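To make the separation concrete, here is a minimal sketch of a client that talks to a standalone Ollama container over its HTTP API, which is the same interface any frontend would use. The host address and the model tag are placeholders I made up for illustration; `/api/generate` and port 11434 are Ollama's documented defaults.

```python
import json
import urllib.request

# Assumption: the Ollama container is reachable at this private address.
# 11434 is Ollama's default port; the IP is a hypothetical placeholder.
OLLAMA_URL = "http://10.0.0.5:11434"


def build_generate_request(
    prompt: str,
    model: str = "llama3.1:8b",  # placeholder tag; use whatever you have pulled
    base_url: str = OLLAMA_URL,
) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b",
             base_url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(prompt, model, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Because nothing here knows about a UI, a second frontend or a script can point at the same URL without touching the model server.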

For hardware, I’d be more honest with myself about the model size tradeoff. My current build is a Ryzen 9 5900X on an ASUS TUF Gaming X570-Plus (Wi-Fi) board, and it handles everything I throw at it for inference routing and container management. A 7B-parameter model fits comfortably in consumer GPU memory, responds quickly, and handles most practical tasks. I spent too long trying to run 34B models on hardware that wasn’t really right for them, putting up with slow responses and convincing myself the capability justified the latency. It usually didn’t. For day-to-day assistant work, a well-quantized 7B or 8B model is more useful than a sluggish 34B. Save the bigger models for tasks where reasoning quality actually matters.
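The size tradeoff is mostly arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below works that out. The 20% overhead for KV cache and runtime buffers is my own rough assumption, not a measured figure, and real quantization formats use slightly more than their nominal bit width.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billions: float, bits_per_weight: float,
                 vram_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Rough check: weights plus an assumed overhead must fit in VRAM.

    overhead_fraction is a hypothetical allowance for KV cache and
    runtime buffers; actual usage depends on context length and runtime.
    """
    needed = estimate_weight_gb(params_billions, bits_per_weight)
    return needed * (1 + overhead_fraction) <= vram_gb


if __name__ == "__main__":
    for size in (7, 34):
        print(f"{size}B at 4-bit: about {estimate_weight_gb(size, 4):.1f} GB of weights")
```

By this estimate a 4-bit 7B model needs about 3.5 GB for weights and a 4-bit 34B model about 17 GB, which is why the larger model spills out of a typical 12 GB consumer card and slows to a crawl.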

The gateway layer is where I’d invest more early effort. This is the piece that connects LLM inference to real tools: file system access, APIs, shell commands, memory. I’m running OpenClaw for this. If I were starting fresh, I’d still choose a purpose-built gateway over trying to wire this together myself with n8n or LangChain. The operational overhead of maintaining custom orchestration code is real. A gateway that’s designed to manage agent lifecycles, credential handling, and tool permissions out of the box is worth the setup time.

Memory is something I’d take seriously from day one. The difference between an AI that knows the state of your environment and one that starts fresh every session is enormous in practice. That means deciding early on where state lives, how agents read and write it, and what format it’s in. Markdown files on a shared volume have worked well for me: human-readable, easy to edit when something’s wrong, git-friendly if you want version history.
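As a rough illustration of the Markdown-on-a-shared-volume idea, here is a sketch of two helpers an agent could use to append to and read back a per-topic notes file. The function names, the one-file-per-topic layout, and the timestamped bullet format are all assumptions of mine, not a description of any particular tool.

```python
from datetime import datetime, timezone
from pathlib import Path


def append_note(memory_dir: Path, topic: str, note: str) -> Path:
    """Append a timestamped bullet to <memory_dir>/<topic>.md, creating it if needed."""
    memory_dir.mkdir(parents=True, exist_ok=True)
    path = memory_dir / f"{topic}.md"
    is_new = not path.exists()
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with path.open("a", encoding="utf-8") as handle:
        if is_new:
            # A heading keeps the file readable when opened by hand.
            handle.write(f"# {topic}\n\n")
        handle.write(f"- {stamp}: {note}\n")
    return path


def read_notes(memory_dir: Path, topic: str) -> str:
    """Return the full Markdown for a topic, or an empty string if none exists."""
    path = memory_dir / f"{topic}.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""
```

Append-only bullets keep diffs small, so putting the directory under git gives a usable history of what the agents believed and when.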

For API keys and credentials, I’d use a secrets directory with tight permissions from the start rather than environment variables scattered across docker-compose files. It’s easier to audit, easier to rotate, and easier to scope to specific containers when something needs to change. This sounds like overkill when you’re standing up one container. It pays off when you have eight.
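One way to enforce "tight permissions" rather than just intend them is to have the code refuse to read a secret that is too open. This is a minimal sketch for POSIX systems; the one-secret-per-file layout and the `read_secret` helper are my own assumptions.

```python
import stat
from pathlib import Path


def read_secret(name: str, secrets_dir: Path) -> str:
    """Read one secret file, refusing it if group or others have any access."""
    path = secrets_dir / name
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        # Any group/other permission bit set means the file is too open.
        raise PermissionError(
            f"{path} has mode {oct(mode)}; expected owner-only access such as 0o600"
        )
    return path.read_text(encoding="utf-8").strip()
```

With one file per credential, rotating a key means replacing one file, and scoping it means mounting only that file into the container that needs it.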

The thing I’d skip entirely on a first build is trying to run everything locally. Ollama handles local inference well. But for tasks that genuinely need a frontier model, the cost of API calls is low and the capability gap is large enough to matter. Don’t try to replace Claude with a local model for complex reasoning. Use local models where they’re good enough and cloud APIs where they’re not. That hybrid approach is cheaper and more capable than either extreme.
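The hybrid approach boils down to a routing decision made per task. Here is a deliberately simple sketch of that decision. The `Task` fields and the backend labels are hypothetical, and how you decide a task "needs" a frontier model is the hard part this code does not solve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    prompt: str
    # Hypothetical flag set by the caller for work where reasoning quality matters.
    needs_deep_reasoning: bool = False


def choose_backend(task: Task) -> str:
    """Return "cloud" for tasks flagged as complex reasoning, otherwise "local"."""
    return "cloud" if task.needs_deep_reasoning else "local"
```

Defaulting to local keeps routine requests free and fast, and only the flagged minority of tasks incur API costs.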

Finally, I’d document my container layout before it gets complicated: which container serves which purpose, which ports are mapped, and what credentials each one needs. This sounds tedious and it is. Three months later when you’re trying to figure out why something stopped working, you’ll be glad you did it.
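If writing the layout by hand feels like too much friction, one option is to keep it as data and render the table from it. This is only a sketch; the `ContainerDoc` fields mirror the three things listed above, and the example entries (other than Ollama's default port 11434) are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ContainerDoc:
    name: str
    purpose: str
    ports: str
    secrets: str


def render_layout(containers: Iterable[ContainerDoc]) -> str:
    """Render the container inventory as a plain-text table, one row per container."""
    lines = [
        "| Container | Purpose | Ports | Secrets |",
        "| --- | --- | --- | --- |",
    ]
    for c in containers:
        lines.append(f"| {c.name} | {c.purpose} | {c.ports} | {c.secrets} |")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical entries for illustration only.
    print(render_layout([
        ContainerDoc("ollama", "model serving", "11434", "none"),
        ContainerDoc("gateway", "agent and tool orchestration", "8080", "api_key"),
    ]))
```

Committing the rendered file alongside the compose files means the documentation changes in the same commit as the layout.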

Affiliate disclosure: Some links in this post are Amazon affiliate links. If you buy through them, I get a small commission at no cost to you. It helps keep the lights on here.