AI Engine Architecture — System Reference 2026

Full-Stack AI Engine · 8 Architectural Layers

AI Engine
Architecture
Reference

From UI request to local model inference — every layer documented. This is the complete technical architecture of a production-grade AI engine: how the UI layer routes requests through the agent controller, how RAG retrieval decides when external knowledge is needed, how the coding agent and MCP tool layer extend capability, and how local models run entirely on your own hardware.

27K

LangGraph monthly searches — #1 production agent orchestration framework · Langfuse 2026

97M

MCP SDK monthly downloads — the universal agent-to-tool protocol · Anthropic Apr 2026

70B

Llama 3.3 70B parameters — open-weight model runnable locally on consumer hardware via Ollama

Infrastructure cost on free-tier stack: Vercel + Supabase + ChromaDB local + Ollama

// Layer Index — Top to Bottom

UI Layer

Next.js · Streamlit · Vercel

Agent Controller

LangGraph · CrewAI

Monitoring Layer

LangSmith

RAG Workflow

LlamaIndex · ChromaDB · Qdrant

Database Layer

SQLite · DuckDB · Supabase

Coding Agent

Claude Code CLI · Cursor

MCP Tool Access

Model Context Protocol

AI Model Layer

Ollama · Gemma 4 · Llama 3.3 · Mistral

Architecture Philosophy — Zero to Production, Zero to Cost

The AI engine architecture documented here is designed around a clear principle: maximum capability at minimum cost and maximum data control. By running models locally via Ollama, using free-tier hosting on Vercel and Supabase, and keeping vector storage on-device with ChromaDB or Qdrant, the entire stack can operate at zero infrastructure cost while handling production workloads for individual developers and small teams. The architecture is also a teaching document — every layer is replaceable, every tool has a clear purpose, and the data flow between layers is explicit.

The eight-layer structure separates concerns cleanly: L1 (UI) handles user interaction; L2 (Agent Controller) handles orchestration and workflow logic; L3 (Monitoring) ensures observability without interrupting execution; L4 (RAG) handles knowledge retrieval when the model alone is insufficient; L5 (Database) persists structured data; L6 (Coding Agent) handles specialised code generation tasks; L7 (MCP) extends the agent’s reach to external tools and APIs through a universal protocol; and L8 (AI Model Layer) handles the actual inference — all locally, with no data leaving the machine. The stack reflects the 2026 maturation of open-weight models and local inference tooling — Llama 3.3 70B on Ollama can now match many frontier API capabilities at zero per-token cost.

The architectural choices reflect deliberate trade-offs. LangGraph (L2) over simpler alternatives because typed state machines and checkpoint-based persistence are non-negotiable for reliable multi-step agent workflows — its 27,100 monthly developer searches confirm it as the production standard (Langfuse, 2026). LangSmith (L3) as a dedicated monitoring layer because production agents that cannot be traced cannot be debugged — and LangSmith’s integration with LangGraph makes trace correlation across complex workflows automatic. LlamaIndex (L4) for RAG orchestration because its abstraction over vector stores allows switching between ChromaDB (lightweight local) and Qdrant (production-grade local) without changing retrieval logic.

The Model Context Protocol (L7) reflects the most significant infrastructure shift of 2025–2026: MCP achieved 97 million monthly SDK downloads and is now supported across every major agent framework, making tool integration a configuration task rather than a custom development task. Connecting this architecture to any MCP-compatible tool — web search, file systems, databases, calendars, email — requires publishing an MCP server definition, not writing a custom integration. The coding agent layer (L6) — Claude Code CLI paired with Cursor — represents the specialised capability that general-purpose agents handle poorly: code generation, refactoring, and debugging where context depth, multi-file reasoning, and iterative testing matter more than general instruction-following.

Eight Layers — Complete Technical Reference

// User Interface · Request Entry · Surface Layer

UI Layer

Sends requests to the Agent Controller — routes inputs, displays outputs, handles user sessions

The UI layer is the user’s entry point into the AI engine — it accepts queries, configuration, and uploaded files, formats them as structured requests, and routes them to the Agent Controller. Three deployment options serve different needs: Next.js provides a production-grade React framework with server-side rendering and API routes, ideal for polished user-facing applications with authentication and session management. Streamlit enables Python-native rapid prototyping — a data scientist can build a functional AI interface in hours without frontend expertise. Vercel (free tier) deploys Next.js applications globally with automatic SSL, edge CDN, and serverless function support — the standard deployment target for personal and small-team AI tools at zero hosting cost. The UI layer sends structured requests downstream and receives formatted responses back from the Agent Controller, handling streaming output, error states, and session persistence on behalf of the user.

Frameworks

Next.js

Streamlit

Hosting

Vercel (free tier)

Routes requests → Agent Controller. Displays streaming output. Zero hosting cost.

Sends request to

CTL

// Core Orchestration · Workflow Logic · State Management

Agent Controller

The core system logic — manages the complete workflow from request intake to final output delivery

The Agent Controller is the intelligence coordinator of the entire architecture — the layer that receives requests from the UI, decides which capabilities to invoke, sequences tool calls and sub-agent delegations, manages state across multi-step workflows, and returns responses. LangGraph implements the primary orchestration logic via typed state machines and directed acyclic graphs: each step in the agent’s workflow is a node; control flow between nodes is defined by conditional edges that can branch based on model outputs. LangGraph’s checkpoint system writes state to persistent storage after each node, enabling crash recovery and human-in-the-loop pause/resume. CrewAI provides the multi-agent coordination layer when tasks benefit from specialised sub-agents — a research crew, a writing crew, a coding crew — each optimised for a specific task type and coordinated by a manager agent. Together, LangGraph handles deterministic workflow structure while CrewAI handles collaborative agent dynamics.

Orchestration

LangGraph

CrewAI

LangGraph: typed state machines, DAGs, checkpoints. CrewAI: role-based multi-agent teams.

Instrumented by (parallel)

MON

// Observability · Tracing · Evaluation · Debug

Monitoring Layer

Captures every LLM call, agent step, tool invocation, and token count for debugging and quality tracking

The monitoring layer runs as a vertical rail alongside the Agent Controller — capturing every event without interrupting execution. LangSmith provides the primary observability platform: every LLM API call, tool invocation, sub-agent delegation, and state transition is captured as a trace span, assembled into a hierarchical trace tree that shows exactly what happened, in what order, with what latency and token cost, at every step of every run. LangSmith integrates natively with LangGraph, requiring minimal instrumentation code — the framework’s tracing hooks connect automatically. The monitoring layer enables three critical production capabilities: debugging (why did the agent produce this output?), evaluation (automated scoring of outputs against quality rubrics), and cost tracking (token usage per run, per workflow, per user). Without L3, the Agent Controller is a black box — powerful but undebuggable.

Platform

LangSmith

Trace every call. Debug every failure. Track cost per run. Vertical rail — runs alongside all layers.

Invokes when context needed

RAG

// Retrieval-Augmented Generation · Knowledge Grounding

RAG Workflow

Retrieves verified external knowledge when the model’s training data alone is insufficient for the task

The RAG layer implements the most consequential pattern in production AI: grounding model responses in verified, current, domain-specific knowledge rather than relying on potentially stale or hallucinated training data. LlamaIndex orchestrates the full retrieval pipeline — document ingestion, chunking, embedding, storage, query-time retrieval, and context assembly. ChromaDB provides lightweight local vector storage with no external dependencies, ideal for development and small-scale production. Qdrant (local) provides production-grade local vector search with filtering, quantisation, and payload indexing for larger corpora. The RAG decision is conditional: if the query requires outside knowledge (domain documents, recent data, proprietary information), the context is populated and passed to the model. If the model’s training data is sufficient (general reasoning, code generation from specification, simple Q&A), retrieval is skipped to reduce latency and cost.

// Need Outside Knowledge?

YES → Retrieve from vector store → Inject context → Model call

NO → Direct model call (no retrieval)

Retrieval

LlamaIndex

Vector Stores

ChromaDB (local)

Qdrant (local)

Conditional: RAG fires only when outside knowledge is required.

Reads / writes structured data

// Persistence · Structured Data · State Storage

Database Layer

Persists structured agent state, conversation history, user preferences, and analytical data

The database layer stores everything that needs to persist beyond a single agent session: conversation history, user preferences, task state, workflow checkpoints, and structured data that agents query during reasoning. Three options serve different production profiles: SQLite is the zero-dependency embedded database for single-user or development deployments — a single file, no server process, perfect for local-first applications. DuckDB is the in-process analytical database optimised for columnar aggregations — ideal when agents need to reason over large structured datasets (logs, time-series, business analytics) without a separate server. Supabase (free tier) provides a fully-managed PostgreSQL database with a REST API, realtime subscriptions, authentication, and row-level security — the production path when data needs to be shared across users or accessible from multiple clients. LangGraph’s checkpoint system can write agent state to any of these backends, enabling the pause/resume and crash-recovery capabilities that production multi-step workflows require.

Local Databases

SQLite

DuckDB

Cloud (Free Tier)

Supabase

SQLite: local. DuckDB: analytics. Supabase: shared/multi-user. All free-tier viable.

Invokes specialised coding tasks

CODE

// Code Generation · Refactoring · Specialised Reasoning

Coding Agent

Specialised AI agents for code generation, refactoring, debugging, and multi-file software reasoning

The coding agent layer handles the class of tasks where general-purpose agents underperform: software engineering workflows that require deep code context, multi-file reasoning, iterative testing loops, and awareness of project structure. Claude Code CLI is Anthropic’s terminal-based agentic coding tool — it reads the local file system, executes shell commands, writes and edits files, runs tests, and iterates until the coding task is complete. It operates with awareness of the entire repository context, not just a single file or prompt. Cursor is the AI-first code editor that integrates frontier model capabilities directly into the development environment — with multi-file context, codebase indexing, and inline AI editing that understands the full project structure. The two tools serve different workflows: Claude Code CLI for autonomous task completion in CI/CD or scripted pipelines; Cursor for interactive development where the engineer maintains control and uses AI as a collaborative partner. Both can be invoked by the Agent Controller for coding-specific sub-tasks within a larger workflow.

CLI Tool

Claude Code CLI

IDE Agent

Cursor

CLI: autonomous scripted pipelines. Cursor: interactive AI-assisted development.

Calls external tools via

MCP

// Universal Tool Protocol · External Connectivity

MCP Tool Access

Uses Model Context Protocol — the universal standard connecting agents to any external tool or data source

The MCP layer provides the agent’s reach beyond the local system — connecting it to external tools, APIs, services, and data sources through the Model Context Protocol standard. MCP defines a universal interface between AI agents and tools: any tool that implements an MCP server can be discovered and invoked by any MCP-compatible agent, without custom integration code. This replaced the pre-MCP world where every agent framework needed its own tool wrapper for every tool — a LangChain web search tool couldn’t be reused in CrewAI without rewriting it. As the 2026 agent stack analysis documents: “Publishing an MCP server is starting to take the place of writing a custom integration for every tool. The work that used to take a sprint now takes a config file.” With 97 million monthly SDK downloads and support from every major AI framework, LLM provider, and cloud platform, MCP is the protocol backbone of the modern AI tool ecosystem. The architecture’s L7 acts as the gateway: when the Agent Controller’s reasoning determines a tool call is needed, L7 routes it to the appropriate MCP server and returns the result as structured context for the next reasoning step.

Protocol

Model Context Protocol

97M monthly downloads. Universal agent-to-tool standard. Any tool with an MCP server is instantly available.

Inference on local hardware

LLM

// Local Inference · Open Weights · Zero Data Egress

AI Model Layer — Local Setup

Fully local inference on open-weight models — zero per-token cost, complete data privacy, no cloud dependency

The AI Model Layer is the foundation — where token prediction actually happens. The local setup choice is significant: Ollama is the local model runtime that makes running frontier-class open-weight models on consumer hardware accessible. A single command pulls and runs any supported model; Ollama handles quantisation, memory management, and a local API endpoint that is API-compatible with OpenAI’s SDK. Gemma 4 E4B (Google DeepMind) is a 4-billion parameter model designed for efficiency — runs on CPU with minimal RAM, suitable for low-resource environments. Llama 3.3 70B (Meta) is the full-scale open-weight model — 70 billion parameters requiring a GPU, but delivering benchmark performance comparable to many frontier API models at zero per-token cost. Mistral Small 4 (Mistral AI) is an instruction-optimised model balancing quality and speed, strong on European languages and code tasks. The local setup means all data stays on the machine — no prompts, no outputs, no context leaves the device. For privacy-sensitive applications, regulated industries, or airgapped environments, this is the architectural requirement that makes local inference non-negotiable.

Runtime

Ollama

Models

Gemma 4 E4B

Llama 3.3 70B

Mistral Small 4

Zero data egress. Zero per-token cost. Complete privacy. Runs on consumer hardware.

Complete Request Flow — UI to Inference and Back

// Request Lifecycle — Following a query through all 8 layers

User Query

→

Next.js / Streamlit

→

LangGraph Agent

→

LangSmith Trace ↕

L1 → L2 → L3 (monitoring fires alongside all steps below)

Agent decides:

→

Need outside knowledge?

→

YES: LlamaIndex retrieves

→

ChromaDB / Qdrant search

L2 decision point → L4 RAG workflow (conditional)

Coding task?

→

Claude Code CLI

→

MCP Tool Call

→

SQLite / DuckDB / Supabase

L6 (coding) or L7 (tool) → L5 (database reads/writes) — as needed

Context assembled

→

Ollama local inference

→

Llama 3.3 / Gemma / Mistral

→

Response tokens stream

L8 — all inference happens locally, data never leaves the machine

Agent processes output

→

Streamed to UI

→

Displayed to user

→

Trace complete in LangSmith

L2 → L1 response delivery. L3 trace closes with full run record.

“The 2026 shift in AI engineering is not about which frontier model to call — it is about architecture. Llama 3.3 70B on a local GPU matches many frontier API outputs at zero per-token cost. ChromaDB provides production-quality vector search with zero infrastructure. Vercel deploys globally for free. The architectural decisions — which orchestration framework, which protocol for tool connectivity, how to structure RAG retrieval — these are what determine whether a system works reliably at scale. The model is almost the least interesting choice.”

Aishwarya Naresh Reganti — The AI Agent Stack in 2026 · April 2026 / 47Billion — AI Agents in Production 2026 · April 2026

LangGraph monthly developer searches (#1 orchestration)

27,100

MCP monthly SDK downloads

97M

Llama 3.3 70B — local inference on consumer GPU

$0/token

Full stack infrastructure cost (free-tier config)

Data leaving device in local-only setup

None

All 8 Layers — Quick Engineering Reference

#	Layer	Function	Primary Tools	Key Capability	Free Tier?	Data Privacy
L1	UI Layer	User interface — request entry, output display, session management	Next.js · Streamlit · Vercel	Streaming output, authentication, global CDN deployment	Yes — Vercel	Depends on hosting config
L2	Agent Controller	Orchestration — workflow logic, state machines, multi-agent coordination	LangGraph · CrewAI	Typed state machines, DAGs, checkpoints, crash recovery	Yes — OSS	Local by default
L3	Monitoring	Observability — distributed tracing, evaluation, cost tracking	LangSmith	Native LangGraph integration, LLM-as-judge eval, token cost tracking	Free dev tier	Traces to LangSmith cloud
L4	RAG Workflow	Knowledge retrieval — conditional external knowledge injection	LlamaIndex · ChromaDB · Qdrant	Semantic retrieval, conditional context injection, local vector stores	Yes — all OSS	Fully local
L5	Database Layer	Persistence — structured state, conversation history, analytical data	SQLite · DuckDB · Supabase	LangGraph checkpoint target, agent state persistence, analytical queries	Yes — all options	SQLite/DuckDB fully local
L6	Coding Agent	Specialised code generation, refactoring, debugging, multi-file reasoning	Claude Code CLI · Cursor	Repo-aware context, autonomous file operations, iterative test loops	Paid tools	Sends code to Anthropic API
L7	MCP Tool Access	Universal external tool connectivity via Model Context Protocol	Model Context Protocol	Any MCP server = instant tool access; 97M monthly downloads; universal standard	Yes — OSS protocol	Depends on tool called
L8	AI Model Layer	Local LLM inference on open-weight models — zero data egress	Ollama · Gemma 4 · Llama 3.3 · Mistral	Zero per-token cost, complete privacy, offline-capable, no API dependency	Yes — all OSS	Fully local — zero egress

Architectural Principle

Local First.
Open Always.
Eight Layers Deep.

The architecture’s defining characteristic is its commitment to local-first, open-source, free-tier deployment. Every layer in the primary configuration costs zero dollars to operate — Vercel free tier for hosting, Ollama with open-weight models for inference, ChromaDB for local vector storage, SQLite for persistence, LangSmith’s free developer tier for monitoring. This is not a compromise — it is a design choice that reflects the 2026 reality of the open-source AI stack: Llama 3.3 70B delivers near-frontier performance at zero per-token cost; ChromaDB provides production-quality semantic search locally; LangGraph orchestrates complex multi-step workflows with the same reliability as commercial alternatives.

The local inference layer (L8) is the most architecturally significant choice — because it determines the entire privacy and cost profile of the system. When all inference runs on local hardware via Ollama, no prompts, no outputs, no context, and no user data leave the machine. This makes the architecture viable for regulated industries (healthcare, legal, finance), privacy-sensitive personal tools, airgapped enterprise environments, and any application where data sovereignty is non-negotiable. The trade-off is hardware dependency: Llama 3.3 70B requires a GPU; Gemma 4 E4B runs on CPU. The architecture provides both options, allowing deployment to be scaled to available hardware.

The protocol choices reflect the 2026 consolidation of the agentic ecosystem around open standards. MCP (L7) is the universal tool protocol — once a tool exposes an MCP server, it works with any MCP-compatible agent regardless of framework. This eliminates the integration debt that characterised pre-MCP agent development. A2A (Agent-to-Agent) provides the horizontal complement — standardising how agents in the architecture communicate with external agent systems. Together, MCP and A2A form the protocol backbone that makes this architecture composable with the broader enterprise agent ecosystem rather than a proprietary island.

The monitoring layer (L3) deserves particular emphasis: LangSmith is positioned not as optional instrumentation but as a required architectural component, because a production agent without observability is not a production agent — it is a production liability. When an agent produces an incorrect output, the trace in LangSmith shows exactly which retrieval step returned the wrong document, which reasoning step made the wrong inference, which tool call returned unexpected data. Without that trace, debugging requires reproducing the failure from scratch. With it, diagnosis takes seconds. The eight-layer architecture is only as reliable as its least-observed layer. Build L3 early, not after the first incident.

The UI layer receives the request. The Agent Controller orchestrates the response. LangSmith traces every step. LlamaIndex retrieves what the model doesn’t know. The database persists what needs to persist. The coding agent handles what general models handle poorly. MCP connects anything the agent needs to reach. And Ollama runs Llama 3.3 70B — locally, privately, at zero cost per token — producing responses that return through the same eight layers in reverse. That is the architecture. Build it once. It runs everywhere you have hardware.

Sources: Aishwarya Naresh Reganti — The AI Agent Stack in 2026 (architectural decisions over model choice; MCP as universal standard; LangGraph typed state machines; April 2026) · 47Billion — AI Agents in Production: Frameworks, Protocols, What Actually Works (MCP sprint → config file; LangGraph #1 production choice; Ollama for local inference; April 2026) · Langfuse — LLM Framework Comparison 2026 (LangGraph 27,100 monthly searches #1; CrewAI 14,800; LangSmith native LangGraph integration) · Anthropic — Model Context Protocol 2025 (97M monthly SDK downloads; 10,000+ enterprise MCP servers; donated to Linux Foundation) · Meta AI — Llama 3.3 70B Release Notes (open-weight 70B model; comparable to frontier API models on benchmarks; November 2024; Ollama-compatible) · Google DeepMind — Gemma 4 E4B (4B parameter efficiency-focused model; CPU-capable local inference; 2025) · Mistral AI — Mistral Small 4 (instruction-optimised; strong on code and European languages; local deployment via Ollama) · Ollama.com — Local Model Runtime (open-source; OpenAI-compatible API; quantised model support; consumer GPU inference for 70B class models) · LangGraph Documentation — Production Deployment Guide (typed state machines; checkpoint system; PostgreSQL/SQLite backend; crash recovery; human-in-the-loop interrupt nodes) · LlamaIndex — Retrieval-Augmented Generation Documentation (ChromaDB integration; Qdrant local deployment; conditional context injection; 2025) · Vercel — Free Tier Capabilities (global CDN; serverless functions; Next.js native deployment; 2026) · Supabase — Free Tier (PostgreSQL; REST API; realtime; row-level security; 2026)

AI EngineArchitectureReference

Local First.Open Always.Eight Layers Deep.

AI Engine
Architecture
Reference

Local First.
Open Always.
Eight Layers Deep.