The Self-Hosted AI Stack: Privacy, Power, and Local Models

If you’re doing serious work in 2026, you’re using AI tools. The question isn’t whether — it’s what you’re handing over when you do.

Cloud AI is capable and convenient. It also logs your requests, uses interactions to improve future models, and builds a picture of what you’re working on. For most tasks that tradeoff is fine. For security research, unreleased code, and infrastructure configs with real hostnames and IP ranges in them, it’s not.

The move isn’t to refuse all cloud AI. It’s to be deliberate about what leaves your network and what doesn’t.

The two-tier model

The Mosburn Lab runs AI at two levels:

Local inference via Ollama, running open-weight models on-device. No network required, no API keys, no logging anywhere but your own machine. Quality is lower than frontier models for complex reasoning — but for code completion, summarization, and exploratory work, it’s often good enough. And it’s always private.

Cloud access via CLI tools for frontier models: Claude, Gemini, Codex. Used when the task actually needs frontier-quality reasoning, with the understanding that requests are processed by the provider. The tradeoff is explicit and accepted, not invisible and assumed.

The mosburn.ai Ansible role manages both tiers. Installs CLI tools via npm across Fedora, Debian, Ubuntu, Arch, and Gentoo. Each tool is independently toggleable. Ollama handled separately — binary install, systemd service registration.

Ollama and local models

Ollama is the most approachable entry point to local LLM inference. It handles model download, quantization selection, and serving through a REST API that’s OpenAI-compatible — tools built against the OpenAI API can point at a local Ollama instance without modification. That compatibility matters more than it sounds.

Models I keep running locally:

Llama 3.1 8B — fast, reasonable quality for most tasks, fits in 8GB VRAM
Qwen2.5-Coder 7B — noticeably better than general models for code completion and explanation
Mistral 7B — solid for summarization and classification

The honest tradeoff: these aren’t GPT-4o or Claude Opus for complex multi-step reasoning. Reviewing a pull request or explaining unfamiliar code — genuinely useful. Designing a distributed system architecture from scratch — reach for a frontier model. The capability gap is real for certain tasks.

Claude CLI and the case for frontier access

Claude is my primary cloud tool. CLI integration means it’s available from any terminal — claude "explain this error" or claude "review this Ansible role" without switching to a browser.

What makes Claude specifically useful for infrastructure work is the combination of context length and instruction-following. Feed it an entire Ansible role and ask for a review of idempotence and error handling, and you get useful output. The same task on a local 7B model produces inconsistent results.

Requests go to Anthropic’s infrastructure. I don’t send security research artifacts, unreleased project code, or anything with internal hostnames through cloud tools. That’s not paranoia — that’s just being deliberate about it.

The mosburn.ai role

# defaults/main.yml
mosburn_ai_install_claude: true
mosburn_ai_install_gemini: true
mosburn_ai_install_codex: false
mosburn_ai_install_ollama: true

Flip the booleans for what you want. The role installs Node.js if it’s not present and uses npm for the CLI tools. Ollama gets binary installation plus systemd. Runs the same across all supported distributions, which matters because my workstations are Fedora and my test VMs are Ubuntu.

The data posture in practice

Rules I actually follow:

Security research artifacts — local only. Malware analysis, exploit research, threat intelligence stays on-device.
Personal project code — local for routine tasks, frontier models for architecture review, with awareness that the latter is logged.
Public or open-source work — cloud tools freely. No privacy concern with code going public anyway.
Infrastructure code — careful. Ansible roles with internal hostnames, IP ranges, and role-specific variable names reveal your environment topology.

The Ansible approach makes this easier to enforce. AI tooling consistently available across all hosts means I can make deliberate choices about which tool for which task without worrying about what’s installed where.

What’s next

The integration I want to build is RAG against the lab’s own documentation. Docmost generates docs. Forgejo stores code. A local vector store and embedding model would let me query both without sending anything off-device.

The technology exists — Ollama for embedding, ChromaDB or Qdrant for vector storage. The missing piece is the ingestion pipeline, which is its own interesting infrastructure problem.