AI Engine Architecture — System Reference 2026
AI Engine Architecture
System Reference · 8-Layer Stack · 2026
Full-Stack AI Engine · 8 Architectural Layers

AI Engine
Architecture
Reference

From UI request to local model inference — every layer documented. This is the complete technical architecture of a production-grade AI engine: how the UI layer routes requests through the agent controller, how RAG retrieval decides when external knowledge is needed, how the coding agent and MCP tool layer extend capability, and how local models run entirely on your own hardware.

27K
LangGraph monthly searches — #1 production agent orchestration framework · Langfuse 2026
97M
MCP SDK monthly downloads — the universal agent-to-tool protocol · Anthropic Apr 2026
70B
Llama 3.3 70B parameters — open-weight model runnable locally on consumer hardware via Ollama
$0
Infrastructure cost on free-tier stack: Vercel + Supabase + ChromaDB local + Ollama
// Layer Index — Top to Bottom
L1
UI Layer
Next.js · Streamlit · Vercel
L2
Agent Controller
LangGraph · CrewAI
L3
Monitoring Layer
LangSmith
L4
RAG Workflow
LlamaIndex · ChromaDB · Qdrant
L5
Database Layer
SQLite · DuckDB · Supabase
L6
Coding Agent
Claude Code CLI · Cursor
L7
MCP Tool Access
Model Context Protocol
L8
AI Model Layer
Ollama · Gemma 4 · Llama 3.3 · Mistral
Architecture Philosophy — Zero to Production, Zero to Cost

The AI engine architecture documented here is designed around a clear principle: maximum capability at minimum cost and maximum data control. By running models locally via Ollama, using free-tier hosting on Vercel and Supabase, and keeping vector storage on-device with ChromaDB or Qdrant, the entire stack can operate at zero infrastructure cost while handling production workloads for individual developers and small teams. The architecture is also a teaching document — every layer is replaceable, every tool has a clear purpose, and the data flow between layers is explicit.

The eight-layer structure separates concerns cleanly: L1 (UI) handles user interaction; L2 (Agent Controller) handles orchestration and workflow logic; L3 (Monitoring) ensures observability without interrupting execution; L4 (RAG) handles knowledge retrieval when the model alone is insufficient; L5 (Database) persists structured data; L6 (Coding Agent) handles specialised code generation tasks; L7 (MCP) extends the agent’s reach to external tools and APIs through a universal protocol; and L8 (AI Model Layer) handles the actual inference — all locally, with no data leaving the machine. The stack reflects the 2026 maturation of open-weight models and local inference tooling — Llama 3.3 70B on Ollama can now match many frontier API capabilities at zero per-token cost.

The architectural choices reflect deliberate trade-offs. LangGraph (L2) over simpler alternatives because typed state machines and checkpoint-based persistence are non-negotiable for reliable multi-step agent workflows — its 27,100 monthly developer searches confirm it as the production standard (Langfuse, 2026). LangSmith (L3) as a dedicated monitoring layer because production agents that cannot be traced cannot be debugged — and LangSmith’s integration with LangGraph makes trace correlation across complex workflows automatic. LlamaIndex (L4) for RAG orchestration because its abstraction over vector stores allows switching between ChromaDB (lightweight local) and Qdrant (production-grade local) without changing retrieval logic.

The Model Context Protocol (L7) reflects the most significant infrastructure shift of 2025–2026: MCP achieved 97 million monthly SDK downloads and is now supported across every major agent framework, making tool integration a configuration task rather than a custom development task. Connecting this architecture to any MCP-compatible tool — web search, file systems, databases, calendars, email — requires publishing an MCP server definition, not writing a custom integration. The coding agent layer (L6) — Claude Code CLI paired with Cursor — represents the specialised capability that general-purpose agents handle poorly: code generation, refactoring, and debugging where context depth, multi-file reasoning, and iterative testing matter more than general instruction-following.

Eight Layers — Complete Technical Reference
L1
UI
// User Interface · Request Entry · Surface Layer
UI Layer
Sends requests to the Agent Controller — routes inputs, displays outputs, handles user sessions
The UI layer is the user’s entry point into the AI engine — it accepts queries, configuration, and uploaded files, formats them as structured requests, and routes them to the Agent Controller. Three deployment options serve different needs: Next.js provides a production-grade React framework with server-side rendering and API routes, ideal for polished user-facing applications with authentication and session management. Streamlit enables Python-native rapid prototyping — a data scientist can build a functional AI interface in hours without frontend expertise. Vercel (free tier) deploys Next.js applications globally with automatic SSL, edge CDN, and serverless function support — the standard deployment target for personal and small-team AI tools at zero hosting cost. The UI layer sends structured requests downstream and receives formatted responses back from the Agent Controller, handling streaming output, error states, and session persistence on behalf of the user.
Frameworks
Next.js
Streamlit
Hosting
Vercel (free tier)
Routes requests → Agent Controller. Displays streaming output. Zero hosting cost.
Sends request to
L2
CTL
// Core Orchestration · Workflow Logic · State Management
Agent Controller
The core system logic — manages the complete workflow from request intake to final output delivery
The Agent Controller is the intelligence coordinator of the entire architecture — the layer that receives requests from the UI, decides which capabilities to invoke, sequences tool calls and sub-agent delegations, manages state across multi-step workflows, and returns responses. LangGraph implements the primary orchestration logic via typed state machines and directed acyclic graphs: each step in the agent’s workflow is a node; control flow between nodes is defined by conditional edges that can branch based on model outputs. LangGraph’s checkpoint system writes state to persistent storage after each node, enabling crash recovery and human-in-the-loop pause/resume. CrewAI provides the multi-agent coordination layer when tasks benefit from specialised sub-agents — a research crew, a writing crew, a coding crew — each optimised for a specific task type and coordinated by a manager agent. Together, LangGraph handles deterministic workflow structure while CrewAI handles collaborative agent dynamics.
Orchestration
LangGraph
CrewAI
LangGraph: typed state machines, DAGs, checkpoints. CrewAI: role-based multi-agent teams.
Instrumented by (parallel)
L3
MON
// Observability · Tracing · Evaluation · Debug
Monitoring Layer
Captures every LLM call, agent step, tool invocation, and token count for debugging and quality tracking
The monitoring layer runs as a vertical rail alongside the Agent Controller — capturing every event without interrupting execution. LangSmith provides the primary observability platform: every LLM API call, tool invocation, sub-agent delegation, and state transition is captured as a trace span, assembled into a hierarchical trace tree that shows exactly what happened, in what order, with what latency and token cost, at every step of every run. LangSmith integrates natively with LangGraph, requiring minimal instrumentation code — the framework’s tracing hooks connect automatically. The monitoring layer enables three critical production capabilities: debugging (why did the agent produce this output?), evaluation (automated scoring of outputs against quality rubrics), and cost tracking (token usage per run, per workflow, per user). Without L3, the Agent Controller is a black box — powerful but undebuggable.
Platform
LangSmith
Trace every call. Debug every failure. Track cost per run. Vertical rail — runs alongside all layers.
Invokes when context needed
L4
RAG
// Retrieval-Augmented Generation · Knowledge Grounding
RAG Workflow
Retrieves verified external knowledge when the model’s training data alone is insufficient for the task
The RAG layer implements the most consequential pattern in production AI: grounding model responses in verified, current, domain-specific knowledge rather than relying on potentially stale or hallucinated training data. LlamaIndex orchestrates the full retrieval pipeline — document ingestion, chunking, embedding, storage, query-time retrieval, and context assembly. ChromaDB provides lightweight local vector storage with no external dependencies, ideal for development and small-scale production. Qdrant (local) provides production-grade local vector search with filtering, quantisation, and payload indexing for larger corpora. The RAG decision is conditional: if the query requires outside knowledge (domain documents, recent data, proprietary information), the context is populated and passed to the model. If the model’s training data is sufficient (general reasoning, code generation from specification, simple Q&A), retrieval is skipped to reduce latency and cost.
// Need Outside Knowledge?
YES → Retrieve from vector store → Inject context → Model call
NO → Direct model call (no retrieval)
Retrieval
LlamaIndex
Vector Stores
ChromaDB (local)
Qdrant (local)
Conditional: RAG fires only when outside knowledge is required.
Reads / writes structured data
L5
DB
// Persistence · Structured Data · State Storage
Database Layer
Persists structured agent state, conversation history, user preferences, and analytical data
The database layer stores everything that needs to persist beyond a single agent session: conversation history, user preferences, task state, workflow checkpoints, and structured data that agents query during reasoning. Three options serve different production profiles: SQLite is the zero-dependency embedded database for single-user or development deployments — a single file, no server process, perfect for local-first applications. DuckDB is the in-process analytical database optimised for columnar aggregations — ideal when agents need to reason over large structured datasets (logs, time-series, business analytics) without a separate server. Supabase (free tier) provides a fully-managed PostgreSQL database with a REST API, realtime subscriptions, authentication, and row-level security — the production path when data needs to be shared across users or accessible from multiple clients. LangGraph’s checkpoint system can write agent state to any of these backends, enabling the pause/resume and crash-recovery capabilities that production multi-step workflows require.
Local Databases
SQLite
DuckDB
Cloud (Free Tier)
Supabase
SQLite: local. DuckDB: analytics. Supabase: shared/multi-user. All free-tier viable.
Invokes specialised coding tasks
L6
CODE
// Code Generation · Refactoring · Specialised Reasoning
Coding Agent
Specialised AI agents for code generation, refactoring, debugging, and multi-file software reasoning
The coding agent layer handles the class of tasks where general-purpose agents underperform: software engineering workflows that require deep code context, multi-file reasoning, iterative testing loops, and awareness of project structure. Claude Code CLI is Anthropic’s terminal-based agentic coding tool — it reads the local file system, executes shell commands, writes and edits files, runs tests, and iterates until the coding task is complete. It operates with awareness of the entire repository context, not just a single file or prompt. Cursor is the AI-first code editor that integrates frontier model capabilities directly into the development environment — with multi-file context, codebase indexing, and inline AI editing that understands the full project structure. The two tools serve different workflows: Claude Code CLI for autonomous task completion in CI/CD or scripted pipelines; Cursor for interactive development where the engineer maintains control and uses AI as a collaborative partner. Both can be invoked by the Agent Controller for coding-specific sub-tasks within a larger workflow.
CLI Tool
Claude Code CLI
IDE Agent
Cursor
CLI: autonomous scripted pipelines. Cursor: interactive AI-assisted development.
Calls external tools via
L7
MCP
// Universal Tool Protocol · External Connectivity
MCP Tool Access
Uses Model Context Protocol — the universal standard connecting agents to any external tool or data source
The MCP layer provides the agent’s reach beyond the local system — connecting it to external tools, APIs, services, and data sources through the Model Context Protocol standard. MCP defines a universal interface between AI agents and tools: any tool that implements an MCP server can be discovered and invoked by any MCP-compatible agent, without custom integration code. This replaced the pre-MCP world where every agent framework needed its own tool wrapper for every tool — a LangChain web search tool couldn’t be reused in CrewAI without rewriting it. As the 2026 agent stack analysis documents: “Publishing an MCP server is starting to take the place of writing a custom integration for every tool. The work that used to take a sprint now takes a config file.” With 97 million monthly SDK downloads and support from every major AI framework, LLM provider, and cloud platform, MCP is the protocol backbone of the modern AI tool ecosystem. The architecture’s L7 acts as the gateway: when the Agent Controller’s reasoning determines a tool call is needed, L7 routes it to the appropriate MCP server and returns the result as structured context for the next reasoning step.
Protocol
Model Context Protocol
97M monthly downloads. Universal agent-to-tool standard. Any tool with an MCP server is instantly available.
Inference on local hardware
L8
LLM
// Local Inference · Open Weights · Zero Data Egress
AI Model Layer — Local Setup
Fully local inference on open-weight models — zero per-token cost, complete data privacy, no cloud dependency
The AI Model Layer is the foundation — where token prediction actually happens. The local setup choice is significant: Ollama is the local model runtime that makes running frontier-class open-weight models on consumer hardware accessible. A single command pulls and runs any supported model; Ollama handles quantisation, memory management, and a local API endpoint that is API-compatible with OpenAI’s SDK. Gemma 4 E4B (Google DeepMind) is a 4-billion parameter model designed for efficiency — runs on CPU with minimal RAM, suitable for low-resource environments. Llama 3.3 70B (Meta) is the full-scale open-weight model — 70 billion parameters requiring a GPU, but delivering benchmark performance comparable to many frontier API models at zero per-token cost. Mistral Small 4 (Mistral AI) is an instruction-optimised model balancing quality and speed, strong on European languages and code tasks. The local setup means all data stays on the machine — no prompts, no outputs, no context leaves the device. For privacy-sensitive applications, regulated industries, or airgapped environments, this is the architectural requirement that makes local inference non-negotiable.
Runtime
Ollama
Models
Gemma 4 E4B
Llama 3.3 70B
Mistral Small 4
Zero data egress. Zero per-token cost. Complete privacy. Runs on consumer hardware.
Complete Request Flow — UI to Inference and Back
// Request Lifecycle — Following a query through all 8 layers
User Query
Next.js / Streamlit
LangGraph Agent
LangSmith Trace ↕
L1 → L2 → L3 (monitoring fires alongside all steps below)
Agent decides:
Need outside knowledge?
YES: LlamaIndex retrieves
ChromaDB / Qdrant search
L2 decision point → L4 RAG workflow (conditional)
Coding task?
Claude Code CLI
MCP Tool Call
SQLite / DuckDB / Supabase
L6 (coding) or L7 (tool) → L5 (database reads/writes) — as needed
Context assembled
Ollama local inference
Llama 3.3 / Gemma / Mistral
Response tokens stream
L8 — all inference happens locally, data never leaves the machine
Agent processes output
Streamed to UI
Displayed to user
Trace complete in LangSmith
L2 → L1 response delivery. L3 trace closes with full run record.

“The 2026 shift in AI engineering is not about which frontier model to call — it is about architecture. Llama 3.3 70B on a local GPU matches many frontier API outputs at zero per-token cost. ChromaDB provides production-quality vector search with zero infrastructure. Vercel deploys globally for free. The architectural decisions — which orchestration framework, which protocol for tool connectivity, how to structure RAG retrieval — these are what determine whether a system works reliably at scale. The model is almost the least interesting choice.”

Aishwarya Naresh Reganti — The AI Agent Stack in 2026 · April 2026 / 47Billion — AI Agents in Production 2026 · April 2026
LangGraph monthly developer searches (#1 orchestration)
27,100
MCP monthly SDK downloads
97M
Llama 3.3 70B — local inference on consumer GPU
$0/token
Full stack infrastructure cost (free-tier config)
$0
Data leaving device in local-only setup
None
All 8 Layers — Quick Engineering Reference
#LayerFunctionPrimary ToolsKey CapabilityFree Tier?Data Privacy
L1UI LayerUser interface — request entry, output display, session managementNext.js · Streamlit · VercelStreaming output, authentication, global CDN deploymentYes — VercelDepends on hosting config
L2Agent ControllerOrchestration — workflow logic, state machines, multi-agent coordinationLangGraph · CrewAITyped state machines, DAGs, checkpoints, crash recoveryYes — OSSLocal by default
L3MonitoringObservability — distributed tracing, evaluation, cost trackingLangSmithNative LangGraph integration, LLM-as-judge eval, token cost trackingFree dev tierTraces to LangSmith cloud
L4RAG WorkflowKnowledge retrieval — conditional external knowledge injectionLlamaIndex · ChromaDB · QdrantSemantic retrieval, conditional context injection, local vector storesYes — all OSSFully local
L5Database LayerPersistence — structured state, conversation history, analytical dataSQLite · DuckDB · SupabaseLangGraph checkpoint target, agent state persistence, analytical queriesYes — all optionsSQLite/DuckDB fully local
L6Coding AgentSpecialised code generation, refactoring, debugging, multi-file reasoningClaude Code CLI · CursorRepo-aware context, autonomous file operations, iterative test loopsPaid toolsSends code to Anthropic API
L7MCP Tool AccessUniversal external tool connectivity via Model Context ProtocolModel Context ProtocolAny MCP server = instant tool access; 97M monthly downloads; universal standardYes — OSS protocolDepends on tool called
L8AI Model LayerLocal LLM inference on open-weight models — zero data egressOllama · Gemma 4 · Llama 3.3 · MistralZero per-token cost, complete privacy, offline-capable, no API dependencyYes — all OSSFully local — zero egress
Architectural Principle

Local First.
Open Always.
Eight Layers Deep.

The architecture’s defining characteristic is its commitment to local-first, open-source, free-tier deployment. Every layer in the primary configuration costs zero dollars to operate — Vercel free tier for hosting, Ollama with open-weight models for inference, ChromaDB for local vector storage, SQLite for persistence, LangSmith’s free developer tier for monitoring. This is not a compromise — it is a design choice that reflects the 2026 reality of the open-source AI stack: Llama 3.3 70B delivers near-frontier performance at zero per-token cost; ChromaDB provides production-quality semantic search locally; LangGraph orchestrates complex multi-step workflows with the same reliability as commercial alternatives.

The local inference layer (L8) is the most architecturally significant choice — because it determines the entire privacy and cost profile of the system. When all inference runs on local hardware via Ollama, no prompts, no outputs, no context, and no user data leave the machine. This makes the architecture viable for regulated industries (healthcare, legal, finance), privacy-sensitive personal tools, airgapped enterprise environments, and any application where data sovereignty is non-negotiable. The trade-off is hardware dependency: Llama 3.3 70B requires a GPU; Gemma 4 E4B runs on CPU. The architecture provides both options, allowing deployment to be scaled to available hardware.

The protocol choices reflect the 2026 consolidation of the agentic ecosystem around open standards. MCP (L7) is the universal tool protocol — once a tool exposes an MCP server, it works with any MCP-compatible agent regardless of framework. This eliminates the integration debt that characterised pre-MCP agent development. A2A (Agent-to-Agent) provides the horizontal complement — standardising how agents in the architecture communicate with external agent systems. Together, MCP and A2A form the protocol backbone that makes this architecture composable with the broader enterprise agent ecosystem rather than a proprietary island.

The monitoring layer (L3) deserves particular emphasis: LangSmith is positioned not as optional instrumentation but as a required architectural component, because a production agent without observability is not a production agent — it is a production liability. When an agent produces an incorrect output, the trace in LangSmith shows exactly which retrieval step returned the wrong document, which reasoning step made the wrong inference, which tool call returned unexpected data. Without that trace, debugging requires reproducing the failure from scratch. With it, diagnosis takes seconds. The eight-layer architecture is only as reliable as its least-observed layer. Build L3 early, not after the first incident.

The UI layer receives the request. The Agent Controller orchestrates the response. LangSmith traces every step. LlamaIndex retrieves what the model doesn’t know. The database persists what needs to persist. The coding agent handles what general models handle poorly. MCP connects anything the agent needs to reach. And Ollama runs Llama 3.3 70B — locally, privately, at zero cost per token — producing responses that return through the same eight layers in reverse. That is the architecture. Build it once. It runs everywhere you have hardware.

Sources: Aishwarya Naresh Reganti — The AI Agent Stack in 2026 (architectural decisions over model choice; MCP as universal standard; LangGraph typed state machines; April 2026) · 47Billion — AI Agents in Production: Frameworks, Protocols, What Actually Works (MCP sprint → config file; LangGraph #1 production choice; Ollama for local inference; April 2026) · Langfuse — LLM Framework Comparison 2026 (LangGraph 27,100 monthly searches #1; CrewAI 14,800; LangSmith native LangGraph integration) · Anthropic — Model Context Protocol 2025 (97M monthly SDK downloads; 10,000+ enterprise MCP servers; donated to Linux Foundation) · Meta AI — Llama 3.3 70B Release Notes (open-weight 70B model; comparable to frontier API models on benchmarks; November 2024; Ollama-compatible) · Google DeepMind — Gemma 4 E4B (4B parameter efficiency-focused model; CPU-capable local inference; 2025) · Mistral AI — Mistral Small 4 (instruction-optimised; strong on code and European languages; local deployment via Ollama) · Ollama.com — Local Model Runtime (open-source; OpenAI-compatible API; quantised model support; consumer GPU inference for 70B class models) · LangGraph Documentation — Production Deployment Guide (typed state machines; checkpoint system; PostgreSQL/SQLite backend; crash recovery; human-in-the-loop interrupt nodes) · LlamaIndex — Retrieval-Augmented Generation Documentation (ChromaDB integration; Qdrant local deployment; conditional context injection; 2025) · Vercel — Free Tier Capabilities (global CDN; serverless functions; Next.js native deployment; 2026) · Supabase — Free Tier (PostgreSQL; REST API; realtime; row-level security; 2026)