AI Engine Architecture Reference
From UI request to local model inference — every layer documented. This is the complete technical architecture of a production-grade AI engine: how the UI layer routes requests through the agent controller, how RAG retrieval decides when external knowledge is needed, how the coding agent and MCP tool layer extend capability, and how local models run entirely on your own hardware.
The AI engine architecture documented here is designed around a clear principle: maximum capability at minimum cost and maximum data control. By running models locally via Ollama, using free-tier hosting on Vercel and Supabase, and keeping vector storage on-device with ChromaDB or Qdrant, the entire stack can operate at zero infrastructure cost while handling production workloads for individual developers and small teams. The architecture is also a teaching document — every layer is replaceable, every tool has a clear purpose, and the data flow between layers is explicit.
The eight-layer structure separates concerns cleanly: L1 (UI) handles user interaction; L2 (Agent Controller) handles orchestration and workflow logic; L3 (Monitoring) ensures observability without interrupting execution; L4 (RAG) handles knowledge retrieval when the model alone is insufficient; L5 (Database) persists structured data; L6 (Coding Agent) handles specialised code generation tasks; L7 (MCP) extends the agent’s reach to external tools and APIs through a universal protocol; and L8 (AI Model Layer) handles the actual inference — all locally, with no data leaving the machine. The stack reflects the 2026 maturation of open-weight models and local inference tooling — Llama 3.3 70B on Ollama can now match many frontier API capabilities at zero per-token cost.
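The separation of concerns described above can be sketched as a small pipeline in which the controller (L2) decides whether a request passes through retrieval (L4) before it reaches inference (L8). This is a minimal illustration, not the architecture's actual code: the `Request` fields, the `KNOWLEDGE` dictionary, and the keyword match standing in for semantic search are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """State passed between layers; all field names are illustrative."""
    query: str
    context: list[str] = field(default_factory=list)
    trace: list[str] = field(default_factory=list)
    answer: str = ""


# Hypothetical in-memory stand-in for the L4 vector store.
KNOWLEDGE = {
    "refund": "Refunds are processed within 14 days.",
    "shipping": "Orders ship within 2 business days.",
}


def needs_retrieval(query: str) -> bool:
    """L4 gate: retrieve only when the query touches an indexed topic."""
    return any(topic in query.lower() for topic in KNOWLEDGE)


def retrieve(req: Request) -> Request:
    """L4: attach matching documents (keyword match stands in for semantic search)."""
    req.context = [doc for topic, doc in KNOWLEDGE.items() if topic in req.query.lower()]
    req.trace.append("L4:retrieve")
    return req


def infer(req: Request) -> Request:
    """L8: placeholder for local model inference."""
    grounding = " ".join(req.context) or "no external context"
    req.answer = f"[model answer to {req.query!r} using: {grounding}]"
    req.trace.append("L8:infer")
    return req


def handle(query: str) -> Request:
    """L2: the controller decides which layers a request passes through."""
    req = Request(query=query)
    req.trace.append("L2:controller")
    if needs_retrieval(query):
        req = retrieve(req)
    return infer(req)
```

Calling `handle("What is your refund policy?")` records a trace that includes the retrieval step, while `handle("Say hello")` goes straight from the controller to inference. Keeping the retrieval decision inside the controller is one way to make the data flow between layers explicit.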
The architectural choices reflect deliberate trade-offs. LangGraph (L2) is chosen over simpler alternatives because typed state machines and checkpoint-based persistence are non-negotiable for reliable multi-step agent workflows; its 27,100 monthly developer searches (Langfuse, 2026) indicate broad adoption, though search volume alone does not prove production readiness. LangSmith (L3) serves as a dedicated monitoring layer because production agents that cannot be traced cannot be debugged, and LangSmith’s integration with LangGraph makes trace correlation across complex workflows automatic. LlamaIndex (L4) handles RAG orchestration because its abstraction over vector stores allows switching between ChromaDB (lightweight local) and Qdrant (production-grade local) without changing retrieval logic.
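To make "typed state machines and checkpoint-based persistence" concrete, here is a minimal sketch of the idea using only the Python standard library. It is not LangGraph's API: the `AgentState` fields, the `plan` and `act` steps, the `checkpoints` table, and the `fail_at` parameter (used only to simulate a crash) are all hypothetical.

```python
import json
import sqlite3
from typing import Callable, Optional, TypedDict


class AgentState(TypedDict):
    """Typed state shared by every step; the fields are illustrative."""
    question: str
    notes: list[str]


Step = Callable[[AgentState], AgentState]


def plan(state: AgentState) -> AgentState:
    return {**state, "notes": state["notes"] + ["planned"]}


def act(state: AgentState) -> AgentState:
    return {**state, "notes": state["notes"] + ["acted"]}


STEPS: list[tuple[str, Step]] = [("plan", plan), ("act", act)]


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the checkpoint table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, next_step INTEGER, state TEXT)"
    )
    return conn


def run(conn: sqlite3.Connection, thread_id: str, initial: AgentState,
        fail_at: Optional[str] = None) -> AgentState:
    """Run the steps in order, saving a checkpoint after each one.

    If a checkpoint exists for thread_id, execution resumes from it.
    """
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    index, state = (row[0], json.loads(row[1])) if row else (0, initial)
    for i in range(index, len(STEPS)):
        name, step = STEPS[i]
        if name == fail_at:
            raise RuntimeError(f"simulated crash before {name}")
        state = step(state)
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
            (thread_id, i + 1, json.dumps(state)),
        )
        conn.commit()
    return state
```

If a first call to `run` crashes before `act`, a second call with the same `thread_id` picks up from the saved checkpoint and does not repeat `plan`. That is the crash-recovery property the paragraph above treats as essential; a real deployment would rely on the orchestration framework's own checkpointer.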
The Model Context Protocol (L7) reflects the most significant infrastructure shift of 2025–2026: MCP achieved 97 million monthly SDK downloads and is now supported across every major agent framework, making tool integration a configuration task rather than a custom development task. Connecting this architecture to any MCP-compatible tool (web search, file systems, databases, calendars, email) means registering that tool’s MCP server in the agent’s configuration, not writing a custom integration; only a tool that does not yet expose an MCP server needs one written for it. The coding agent layer (L6), Claude Code CLI paired with Cursor, represents the specialised capability that general-purpose agents handle poorly: code generation, refactoring, and debugging where context depth, multi-file reasoning, and iterative testing matter more than general instruction-following.
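Part of why MCP turns tool integration into configuration is that every tool is reached through the same JSON-RPC 2.0 message shapes. The sketch below builds `tools/list` and `tools/call` requests and reads the text out of a result. It only constructs and parses messages, with no transport; the tool name `web_search` in the usage note is a hypothetical example, and the helper function names are my own.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC request ids must be unique per session


def list_tools_request() -> dict:
    """Ask an MCP server which tools it exposes."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": "tools/list"}


def call_tool_request(name: str, arguments: dict) -> dict:
    """Invoke one tool by name with JSON arguments."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def text_from_result(response: dict) -> str:
    """Join the text items of a tools/call response; raise on reported errors."""
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "JSON-RPC error"))
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "\n".join(
        item["text"] for item in result["content"] if item.get("type") == "text"
    )


def to_wire(message: dict) -> str:
    """Serialize a message for whichever transport the server uses."""
    return json.dumps(message)
```

For example, `to_wire(call_tool_request("web_search", {"query": "local inference"}))` produces the same message shape regardless of which server implements the tool, so the agent side does not change when a new tool is added.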
“The 2026 shift in AI engineering is not about which frontier model to call — it is about architecture. Llama 3.3 70B on a local GPU matches many frontier API outputs at zero per-token cost. ChromaDB provides production-quality vector search with zero infrastructure. Vercel deploys globally for free. The architectural decisions — which orchestration framework, which protocol for tool connectivity, how to structure RAG retrieval — these are what determine whether a system works reliably at scale. The model is almost the least interesting choice.”
Aishwarya Naresh Reganti, The AI Agent Stack in 2026 · April 2026 / 47Billion, AI Agents in Production 2026 · April 2026

| # | Layer | Function | Primary Tools | Key Capability | Free Tier? | Data Privacy |
|---|---|---|---|---|---|---|
| L1 | UI Layer | User interface — request entry, output display, session management | Next.js · Streamlit · Vercel | Streaming output, authentication, global CDN deployment | Yes — Vercel | Depends on hosting config |
| L2 | Agent Controller | Orchestration — workflow logic, state machines, multi-agent coordination | LangGraph · CrewAI | Typed state machines, DAGs, checkpoints, crash recovery | Yes — OSS | Local by default |
| L3 | Monitoring | Observability — distributed tracing, evaluation, cost tracking | LangSmith | Native LangGraph integration, LLM-as-judge eval, token cost tracking | Free dev tier | Traces to LangSmith cloud |
| L4 | RAG Workflow | Knowledge retrieval — conditional external knowledge injection | LlamaIndex · ChromaDB · Qdrant | Semantic retrieval, conditional context injection, local vector stores | Yes — all OSS | Fully local |
| L5 | Database Layer | Persistence — structured state, conversation history, analytical data | SQLite · DuckDB · Supabase | LangGraph checkpoint target, agent state persistence, analytical queries | Yes — all options | SQLite/DuckDB fully local |
| L6 | Coding Agent | Specialised code generation, refactoring, debugging, multi-file reasoning | Claude Code CLI · Cursor | Repo-aware context, autonomous file operations, iterative test loops | Paid tools | Sends code to Anthropic API |
| L7 | MCP Tool Access | Universal external tool connectivity via Model Context Protocol | Model Context Protocol | Any MCP server = instant tool access; 97M monthly downloads; universal standard | Yes — OSS protocol | Depends on tool called |
| L8 | AI Model Layer | Local LLM inference on open-weight models — zero data egress | Ollama · Gemma 4 · Llama 3.3 · Mistral | Zero per-token cost, complete privacy, offline-capable, no API dependency | Yes — all OSS | Fully local — zero egress |
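
The "Free Tier?" and "Data Privacy" columns can be encoded as data so that a deployment checks them mechanically. This is a rough sketch: the three-way `privacy` classification (`local`, `external`, `depends`) is my own simplification of the table, and L5 is marked local on the assumption that SQLite or DuckDB is used rather than hosted Supabase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    id: str
    name: str
    free_tier: bool
    privacy: str  # "local", "external", or "depends" (simplified from the table)


LAYERS = [
    Layer("L1", "UI Layer", True, "depends"),
    Layer("L2", "Agent Controller", True, "local"),
    Layer("L3", "Monitoring", True, "external"),     # traces go to LangSmith cloud
    Layer("L4", "RAG Workflow", True, "local"),
    Layer("L5", "Database Layer", True, "local"),    # assumes SQLite/DuckDB, not Supabase
    Layer("L6", "Coding Agent", False, "external"),  # sends code to the Anthropic API
    Layer("L7", "MCP Tool Access", True, "depends"),
    Layer("L8", "AI Model Layer", True, "local"),
]


def audit(layers: list[Layer]) -> dict[str, list[str]]:
    """Group layer ids by the questions a privacy or cost review would ask."""
    return {
        "paid": [l.id for l in layers if not l.free_tier],
        "egress": [l.id for l in layers if l.privacy == "external"],
        "review": [l.id for l in layers if l.privacy == "depends"],
    }
```

Running `audit(LAYERS)` separates the layers that always send data off the machine from those whose behaviour depends on hosting or on the tool being called, which is the distinction a zero-egress deployment needs to resolve first.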
Local First.
Open Always.
Eight Layers Deep.
The architecture’s defining characteristic is its commitment to local-first, open-source, free-tier deployment. With the exception of the coding agent (L6), which relies on paid tools, every layer in the primary configuration costs zero dollars to operate: Vercel free tier for hosting, Ollama with open-weight models for inference, ChromaDB for local vector storage, SQLite for persistence, LangSmith’s free developer tier for monitoring. This is not a compromise; it is a design choice that reflects the 2026 reality of the open-source AI stack: Llama 3.3 70B delivers near-frontier performance at zero per-token cost; ChromaDB provides production-quality semantic search locally; LangGraph orchestrates complex multi-step workflows with the same reliability as commercial alternatives.
The local inference layer (L8) is the most architecturally significant choice because it determines the privacy and cost profile of the system. When all inference runs on local hardware via Ollama, the model layer sends no prompts, outputs, context, or user data off the machine. Two other layers do send data out, as the table above notes: LangSmith traces (L3) go to LangSmith’s cloud, and the coding agent (L6) sends code to the Anthropic API, so a strictly zero-egress deployment has to disable or replace both. With that caveat, the architecture is viable for regulated industries (healthcare, legal, finance), privacy-sensitive personal tools, airgapped enterprise environments, and any application where data sovereignty is non-negotiable. The trade-off is hardware dependency: Llama 3.3 70B requires a GPU; Gemma 4 E4B runs on CPU. The architecture provides both options, allowing deployment to be scaled to available hardware.
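The hardware trade-off can be sketched as a small model selector in front of Ollama's local HTTP API. The example assumes Ollama is listening on its default port 11434 and uses the `/api/generate` endpoint with streaming disabled. The model tags are assumptions, in particular `gemma4:e4b`, and should be checked against the output of `ollama list` on the target machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

# Assumed model tags; verify with `ollama list` before relying on them.
GPU_MODEL = "llama3.3:70b"
CPU_MODEL = "gemma4:e4b"


def pick_model(has_gpu: bool) -> str:
    """Choose the larger model only when a GPU is available."""
    return GPU_MODEL if has_gpu else CPU_MODEL


def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build a non-streaming generate request without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, has_gpu: bool) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    req = build_request(prompt, pick_model(has_gpu))
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Because the request targets `localhost`, the prompt and the response stay on the machine. `generate` needs a running Ollama server with the chosen model already pulled; `build_request` can be inspected without one.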
The protocol choices reflect the 2026 consolidation of the agentic ecosystem around open standards. MCP (L7) is the universal tool protocol — once a tool exposes an MCP server, it works with any MCP-compatible agent regardless of framework. This eliminates the integration debt that characterised pre-MCP agent development. A2A (Agent-to-Agent) provides the horizontal complement — standardising how agents in the architecture communicate with external agent systems. Together, MCP and A2A form the protocol backbone that makes this architecture composable with the broader enterprise agent ecosystem rather than a proprietary island.
The monitoring layer (L3) deserves particular emphasis: LangSmith is positioned not as optional instrumentation but as a required architectural component, because a production agent without observability is not a production agent — it is a production liability. When an agent produces an incorrect output, the trace in LangSmith shows exactly which retrieval step returned the wrong document, which reasoning step made the wrong inference, which tool call returned unexpected data. Without that trace, debugging requires reproducing the failure from scratch. With it, diagnosis takes seconds. The eight-layer architecture is only as reliable as its least-observed layer. Build L3 early, not after the first incident.
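The debugging benefit described above comes from recording one span per step, so that a failure can be attributed to a specific retrieval, reasoning, or tool-call step. The following is a minimal standard-library sketch of that idea, not LangSmith's API; the `Span` fields, the `traced` decorator, and the two example steps are all hypothetical.

```python
import functools
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    ok: bool
    duration_ms: float
    detail: str  # truncated result on success, exception text on failure


TRACE: list[Span] = []


def traced(name: str):
    """Record a span for each call of the wrapped step, including failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                elapsed = (time.perf_counter() - start) * 1000
                TRACE.append(Span(name, False, elapsed, repr(exc)))
                raise
            elapsed = (time.perf_counter() - start) * 1000
            TRACE.append(Span(name, True, elapsed, repr(result)[:200]))
            return result
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> list[str]:
    return ["doc-17"]  # placeholder retrieval result


@traced("tool_call")
def call_tool(doc_id: str) -> str:
    raise ValueError(f"unexpected payload for {doc_id}")  # simulated tool failure


def first_failure(trace: list[Span]) -> Optional[Span]:
    """Return the earliest failed span, or None if every step succeeded."""
    return next((span for span in trace if not span.ok), None)
```

After running `retrieve` and then `call_tool`, `first_failure(TRACE)` points at the `tool_call` span and its recorded exception, which is the kind of attribution that otherwise requires reproducing the failure from scratch.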
The UI layer receives the request. The Agent Controller orchestrates the response. LangSmith traces every step. LlamaIndex retrieves what the model doesn’t know. The database persists what needs to persist. The coding agent handles what general models handle poorly. MCP connects anything the agent needs to reach. And Ollama runs Llama 3.3 70B — locally, privately, at zero cost per token — producing responses that return through the same eight layers in reverse. That is the architecture. Build it once. It runs everywhere you have hardware.