How I Built a Private AI Assistant on a $250 Edge Device (Full Stack, No Cloud)

Most personal AI assistants have one thing in common: your data leaves your device.

Every message you send to ChatGPT, Claude, or Gemini travels to a server you don't control, gets processed by a model you don't own, and is stored under a terms of service you probably haven't read. For casual conversation, that's fine. For anything involving business data, client information, proprietary research, or just plain privacy - it's a problem.


I'm a chemical engineer and founder running two companies simultaneously. When I started feeding supplier pricing, SDS sheets, and customer databases into AI tools to speed up my workflow, I realized I needed a different approach.


So I built one.

Maximus-X Sentinel is a fully private, GPU-accelerated multi-agent AI assistant. It runs on hardware I own, on my local network, with zero cloud inference. The full codebase is open source:

github.com/shehanmakani/MaximusX

Here's exactly how it works and how you can build it too.



What You Actually Need (The Hardware List)

You don't need a gaming PC. The entire inference stack runs on three devices:


NVIDIA Jetson Orin Nano (~$250) — This is the brain. 67 TOPS of AI performance, 8GB unified memory, ARM64 with a full CUDA stack. It runs 24/7 at about 10W. This is where all LLM inference happens.

Raspberry Pi 5 (~$80) — The voice edge. Handles wake word detection, speech-to-text, and text-to-speech entirely offline.

Mac or Linux laptop — The dashboard and messaging gateway. No compute required here — it just routes messages and serves the web UI.


One important note: get an NVMe SSD for the Jetson. SD card I/O is the single biggest bottleneck for model loading. A $30 NVMe drive makes a noticeable difference.

Total hardware cost: under $400. No subscriptions. No per-token pricing.



The Architecture — How All the Pieces Connect


Here's the data flow in plain English:

  1. A message comes in — Telegram, Discord, WhatsApp, iMessage, whatever
  2. OpenClaw on the Mac receives it and forwards it to the Jetson over the local network
  3. A FastAPI server on the Jetson hands it to a LangGraph supervisor agent
  4. The supervisor classifies the intent and routes to one of four specialized sub-agents
  5. The sub-agent queries Ollama (local LLM) and/or Qdrant (local vector database)
  6. The reply goes back through OpenClaw to the original messaging channel
  7. If you asked via voice, the Pi speaks the reply through Kokoro TTS

Every step of this happens on hardware you own. Nothing touches the internet except the messaging delivery.
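Steps 2 and 3, seen from the Mac's side, boil down to one HTTP POST. The hostname, port, endpoint path, and payload fields below are illustrative assumptions, not the repo's exact schema:

```python
import json
from urllib import request

# Hostname, port, and payload shape are assumptions for illustration.
JETSON_URL = "http://jetson.local:8000/chat"

def build_chat_request(channel: str, sender: str, text: str) -> request.Request:
    """Package an inbound message for the Jetson's /chat endpoint."""
    payload = json.dumps({"channel": channel, "sender": sender, "message": text})
    return request.Request(
        JETSON_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# On a live network, the gateway would then do something like:
#   reply = json.loads(request.urlopen(build_chat_request("telegram", "me", "hi")).read())
```

Because everything rides over plain HTTP on the LAN, any messaging gateway that can make a POST request can be swapped in for OpenClaw.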



The Software Stack — What's Actually Running

On the Jetson (Docker)


Ollama (dustynv/ollama:r36.4.0) — local LLM inference, GPU-accelerated

Qdrant v1.13.0 — vector database for RAG

FastAPI server — agent API, exposes the /chat endpoint

Open WebUI — web dashboard, served at :3000

Context Membrane — nightly document ingestion into Qdrant
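A minimal docker-compose sketch of the two core services. The image tags are the ones named above; the ports are the defaults for each service, and the volume paths are assumptions:

```yaml
# Illustrative fragment, not the repo's full compose file.
services:
  ollama:
    image: dustynv/ollama:r36.4.0   # Jetson-native build for JetPack 6 / r36.4
    runtime: nvidia                 # expose the Orin's GPU to the container
    ports:
      - "11434:11434"
    volumes:
      - ./models:/data/models       # keep pulled models on the NVMe (path assumed)
  qdrant:
    image: qdrant/qdrant:v1.13.0
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant:/qdrant/storage    # persist vectors across restarts (path assumed)
```

Pinning exact image tags matters on Jetson: CUDA libraries in the container have to match the L4T release on the host.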

On the Mac


OpenClaw — unified messaging gateway

Three custom skills — message routing, domain-context injection, and the 7am briefing


On the Pi 5

openWakeWord — wake word detection ("hey Maximus")

faster-whisper — offline speech-to-text (CTranslate2 backend)

Kokoro-82M — offline text-to-speech (#1 on the Hugging Face TTS Arena)


The LangGraph Supervisor — Why This Is Different From a Chatbot

Most "local AI" setups are just a model with a system prompt. You ask it something, it answers. That's a chatbot, not an assistant.

What makes Maximus different is the LangGraph supervisor agent. Instead of one model trying to do everything, there are four specialized sub-agents, each with its own tools and context:

Research agent — searches your personal document RAG (the Context Membrane) and can answer questions grounded in your own files, not just training data.

ChemBiz agent — has deep context on my specific business: ChemRich Global's product catalog, pricing, active leads, IntelliForm R&D project details. It answers domain questions with actual precision.

Home agent — talks directly to Home Assistant. "Turn off the office lights" executes, not just acknowledges.

Schedule agent — reads your Google Calendar, sets reminders that fire back through Telegram, no manual cron setup.

The supervisor is a LangGraph graph that classifies intent with a single fast LLM call, then routes. The routing decision happens in under 500ms. The actual response is where latency lives — typically 3–8 seconds for a full answer at 20–30 tokens/sec on the Orin Nano.
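The supervisor pattern is simpler than it sounds. In the real system the classification is a single fast LLM call; in this sketch a keyword stub stands in for it, and the four agent names mirror the ones above:

```python
from typing import Callable

# Stub classifier: the real supervisor makes one fast LLM call here.
def classify_intent(message: str) -> str:
    text = message.lower()
    if any(w in text for w in ("light", "thermostat", "lock")):
        return "home"
    if any(w in text for w in ("calendar", "remind", "meeting")):
        return "schedule"
    if any(w in text for w in ("pricing", "lead", "catalog")):
        return "chembiz"
    return "research"   # default: RAG over your own documents

# Each sub-agent is just a callable with its own tools and context.
AGENTS: dict[str, Callable[[str], str]] = {
    "research": lambda m: f"[research] searching Context Membrane for: {m}",
    "chembiz":  lambda m: f"[chembiz] answering from business context: {m}",
    "home":     lambda m: f"[home] calling Home Assistant for: {m}",
    "schedule": lambda m: f"[schedule] checking calendar for: {m}",
}

def supervise(message: str) -> str:
    """Classify once, then hand the whole message to one specialist."""
    return AGENTS[classify_intent(message)](message)
```

LangGraph adds state, retries, and multi-step tool loops on top of this, but the core idea is exactly this route-once-then-delegate shape.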



The Context Membrane — The Feature That Actually Changes Your Day

This is the piece nobody builds in "local AI" tutorials and the piece that makes the biggest daily difference.

Every night at 2am, a background Docker container walks through /data/notes/, /data/emails/, /data/chembiz/ — whatever documents you drop there. It chunks them, generates embeddings using nomic-embed-text running locally on Ollama, and upserts them into Qdrant.
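The embedding step is just a call to Ollama's local REST API. A sketch of building that call, assuming Ollama's default port; the endpoint and model name are Ollama's standard embeddings interface:

```python
import json
from urllib import request

OLLAMA = "http://localhost:11434"   # Ollama's default port

def embedding_request(chunk: str) -> request.Request:
    """Build a request for Ollama's /api/embeddings endpoint."""
    payload = json.dumps({"model": "nomic-embed-text", "prompt": chunk})
    return request.Request(
        f"{OLLAMA}/api/embeddings",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With Ollama running, each call returns {"embedding": [...]} — a vector
# ready to upsert into a Qdrant collection as a point.
```

One request per chunk, every night, entirely on the Jetson. No embedding API bill, no data leaving the box.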

The next morning, when you ask Maximus anything, it's not guessing. It's searching your documents first.

Drop a supplier quote PDF into the folder tonight. Ask about pricing tomorrow morning. The answer comes back grounded in that actual document.

This pattern — a RAG layer over your own private document store, auto-ingested nightly — is the thing that separates a personal AI from a general AI. The model is not the product. Your context is.



The Voice Stack — Fully Offline, Actually Good

Three models chained on the Pi 5:

openWakeWord listens for "hey Maximus" passively. It uses about 2% CPU and runs continuously.

faster-whisper transcribes your request. The small model at int8 quantization runs ~2x real-time on the Pi 5 CPU. An 8-second recording transcribes in about 4 seconds.

Kokoro-82M synthesizes the reply. As of early 2026 it sits at #1 on the Hugging Face TTS Arena. The voice quality is natural enough that you stop noticing it.

Full round-trip — wake word to spoken reply — is about 6–10 seconds. That's fast enough for home control and reminders, and for complex questions the wait reads as a natural pause.



The OpenClaw Skill That Changed My Morning

The nightly-digest skill is 15 lines of YAML. Every weekday at 7am, Maximus texts me — without being asked — with:

  • Today's calendar events
  • Any pending reminders
  • Outstanding business tasks I flagged
  • One line of weather for my location

No app to open. No prompt to type. It just shows up in Telegram.
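For flavor, here is roughly the shape of such a skill. This schema is illustrative only — check the repo for OpenClaw's actual skill format:

```yaml
# Illustrative sketch, not OpenClaw's real schema.
name: nightly-digest
schedule: "0 7 * * 1-5"        # weekdays at 7am, standard cron syntax
channel: telegram
prompt: |
  Summarize for today:
  - calendar events
  - pending reminders
  - flagged business tasks
  - one line of weather
```

The point is the shape: a cron trigger, a target channel, and a prompt. The agent stack does the rest.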

That's the version of AI I actually wanted. Not a chatbot I have to initiate. An assistant that runs in the background and surfaces relevant information at the right time.



Three Things I Got Wrong (And How to Avoid Them)

1. The base Docker image matters more than the model. Using the generic NVIDIA L4T PyTorch image caused GPU inconsistencies and package conflicts. The dustynv/ollama:r36.4.0 image — built specifically for Jetson — fixed all of it. Always use Jetson-native community images for Ollama, not the generic Docker Hub version.

2. RAG chunk size is not a detail. My first ingestion used 1000-character chunks. Retrieved context was broken mid-sentence, mid-table. Dropping to 500-char chunks with 50-char overlap, and splitting personal vs business documents into separate Qdrant collections, fixed retrieval quality dramatically.
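The fix is one small function: fixed-size character chunks with overlap. A minimal sketch, with sizes matching the numbers above:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks. Each chunk repeats the last
    `overlap` characters of the previous one, so a sentence that straddles
    a boundary survives intact in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```

Keeping personal and business documents in separate Qdrant collections is then just a matter of choosing the collection name at upsert time.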

3. STT latency is the actual bottleneck, not LLM inference. I spent time optimizing the LLM side when the real lag was faster-whisper loading the model on first use. Warming it on startup and keeping it resident in memory cut perceived latency by roughly half.
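The fix is the classic load-once pattern: pay the model-load cost at process start, not on the first voice request. A generic sketch — in the real stack the loader would be faster_whisper's WhisperModel("small", device="cpu", compute_type="int8"):

```python
_stt_model = None

def load_stt():
    """Placeholder for the expensive load; in practice this would be
    faster_whisper.WhisperModel("small", device="cpu", compute_type="int8")."""
    return object()   # stands in for the loaded model

def get_stt():
    """Return the resident model, loading it at most once per process."""
    global _stt_model
    if _stt_model is None:
        _stt_model = load_stt()
    return _stt_model

# Call get_stt() once during startup so the first request pays no load time.
```

The same pattern applies to the TTS model: anything that takes seconds to load should be resident before the first user interaction.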



How to Get Started

The full codebase with all 20 files, documented and ready to deploy, is at:

github.com/shehanmakani/MaximusX

After cloning, the quickstart is:

# On Jetson
docker compose up -d
python3 setup.py        # pulls models, validates GPU, runs end-to-end test

# Fill in .env with your tokens, then:
openclaw start          # on Mac
python3 voice_edge.py   # on Pi 5


The setup.py script handles model pulls, Qdrant collection creation, and a GPU inference check automatically. It'll tell you if something isn't wired correctly before you try to use it.



Who This Is For

Privacy-conscious builders — hardware cost is one-time, inference is free, data stays on your network.

Domain-specific professionals — doctors, lawyers, engineers, researchers. The Context Membrane is specifically valuable when your data can't leave your control.

Founders and entrepreneurs — especially if you're running multiple ventures with context that doesn't fit neatly into a single chat session.

Anyone tired of paying per token — after initial hardware, the marginal cost of every query is zero.



The full code is open source. If you build on it, improve it, or break it in interesting ways — the repo is there. That's the point.

github.com/shehanmakani/MaximusX





Shehan Makani is a chemical engineer, co-founder of ChemRich Global and ChemeNova LLC, and an NJIT Tech MBA candidate. He builds at the intersection of specialty chemicals and machine intelligence.
