Live demo Generative & Agentic AI Full-stack Single-container deploy

A guardrailed AI shopping assistant
that can't make up a price.

BuildRight AI is a full-stack hardware-store web app where customers search products, get grounded policy answers, plan whole projects, and check out — just by chatting. It pairs a hybrid retrieval (RAG) engine with a deterministic guardrail that validates every answer against retrieved data, a cost-aware multi-model router, and multimodal (image / voice / OCR) input — all deployed as one Docker container.

FastAPIReact + TypeScriptAnthropic Claude pgvectorfastembed / BGECLIPStripe DockerHugging Face Spaces
3
retrieval arms fused (lexical · vector · visual)
0.86→0.93
retrieval hit@5 (re-ranker lift, measured)
10K+
SKU catalog (deterministically generated)
~267
automated tests · CI on SQLite + Postgres
2-layer
guardrail (prompt + deterministic validator)
Overview

The problem, and the engineering bet

Conversational commerce is easy to fake and hard to trust. A chatbot that invents a price or a return policy is worse than no chatbot. BuildRight is built around one thesis: an assistant should be structurally incapable of stating an ungrounded fact — and still feel fast, cheap, and genuinely helpful.

🎯

What it does

A real storefront (browse, filter, cart, Stripe checkout) fronted by an AI assistant that searches products, answers policy questions with citations, plans project material lists from room dimensions, reorders past purchases, and recommends complementary items.

🛡️

Why it's different

Two independent guardrail layers — prompt rules plus a deterministic validator that checks every price and citation against the data actually retrieved for that turn. A fabricated price never renders; it's replaced by a safe fallback.

⚙️

How it's built

A hand-built agentic tool-use loop on the raw Anthropic SDK (no framework between me and the model), hybrid RAG with measured evaluation, a Haiku→Sonnet cost router, and production scaffolding: RBAC, payments, observability, migrations, and CI against two databases.

Architecture

One container, single origin

A single FastAPI process serves the built React SPA and the JSON/SSE API from one origin — no CORS, no cross-site-cookie problems, one deployable unit. It runs on ephemeral SQLite locally and on Postgres + pgvector in the cloud with the same code path.

Client
React SPA — storefront · cart · chat widget · manager & admin dashboards
↓   HTTPS · cookies (JWT access/refresh + CSRF) · JSON + Server-Sent Events   ↓
FastAPI (single origin)
auth
menu
cart
orders
chat (SSE)
payments
admin
analytics
media
middleware: CSRF · rate limit · login lockout · request-id · structured errors
Services
cart · order · payment
memory · recommender
analytics · audit · eval
AI subsystem
model router (Haiku→Sonnet)
agentic tool-use loop (11 tools)
hybrid RAG + re-rank
guardrail validator
External
Anthropic Claude
Stripe
STT (optional)
Data — SQLAlchemy ORM → SQLite (dev) | Postgres + pgvector (prod)
menu_items
orders / carts
conversations / messages
documents / chunks
*_embeddings
audit_logs

On startup the container runs an idempotent seed — generate catalog, ingest the knowledge base, compute embeddings — so a fresh deploy always boots into a complete, searchable dataset.

Retrieval & reasoning

A chat turn, end to end

Every turn is routed for cost, runs an agentic tool loop that grounds itself in real data, and is validated before a single token reaches the user.

01

Route

Heuristics + a 4-token Haiku classifier pick Haiku (simple) or Sonnet (complex / multimodal).

02

Tool loop

Up to 6 rounds of Claude tool-use — search, plan, reorder, recommend — accumulating grounded data.

03

Retrieve

Hybrid RAG fuses lexical + vector + CLIP via RRF, then a re-ranker sharpens precision.

04

Guard

Deterministic validator checks every price/citation vs. retrieved data; else safe fallback.

05

Stream

Validated answer streams over SSE with shortlist chips + a CSAT prompt.

Hybrid retrieval, fused and measured

🔤

Lexical arm

Tokenized keyword scoring with field weighting (name ×3, keyword/category ×2, description ×1) and singularization, plus an exact-SKU fast path.

🧭

Dense-vector arm

BGE-small embeddings (384-d) via fastembed, searched with pgvector's <=> on Postgres (HNSW-indexed) or NumPy cosine on SQLite. Hash-embedding fallback keeps it populated anywhere.

🖼️

Visual (CLIP) arm

CLIP image embeddings (512-d) let a text query rank products by what they look like. Optional and torch-guarded — absent torch it returns nothing and the fuser is unchanged.

⚖️

Reciprocal Rank Fusion + re-ranking

The three ranked lists are fused with RRF (k=60), then a cross-encoder re-ranks the candidate pool for precision — with a deterministic feature re-ranker fallback (term coverage, heading match, exact 2-gram, rank prior) when torch isn't present.

📊

Measured, not guessed

An evaluation harness scores hit@k, MRR, and nDCG@k over labeled question→doc sets. It was built first, so the re-ranker's lift (hit@5 0.86→0.93) is a measurement, not a claim. A query-decomposition step handles "X vs Y" comparisons via multi-hop retrieval.

Under the hood

Engineering depth, subsystem by subsystem

The pieces that make it trustworthy and cheap, not just functional.

TrustTwo-layer guardrail

Layer one is prompt discipline (only discuss tool-returned products/policies; never invent prices; cite as Title › Section). Layer two is a deterministic Python validator that regex-extracts every claimed price, distinguishes thresholds ("under $5") from assertions ("it's $49.99"), and checks each against the grounded items for the turn — allowing quantity × unit-price line totals. On a mismatch the answer is replaced with a safe fallback and a guardrail_violation flag is recorded. The safety net is code, not a prompt.

validate_response()price faithfulnesscitation soft-check
CostMulti-model router

Each turn is classified cheaply: an image forces the heavy model; obvious project language ("paint my room", "build a deck") escalates without a round-trip; very short asks stay simple; ambiguous mid-length asks get a 4-token Haiku micro-classification that fails safe to the cheap model. Simple turns run on Haiku, only complex/multimodal turns pay for Sonnet — and the chosen route is recorded per message for the cost dashboard.

Haiku → Sonnetheuristic fast-pathfail-safe
AgencyTool-use & project planner

An 11-tool agent loop on the raw Anthropic SDK: product & knowledge-base search, order history, reorder (fuzzy id/SKU/name match), preferences, recommenders (collaborative + content-based), and a project planner that computes a bill of materials from room dimensions (paint / tile / laminate / drywall) and batch-adds it to the cart — every quantity and line total grounded so the guardrail can verify it.

compute_materialsreorderfrequently_bought_with
ModalityVision · Voice · OCR

Image product search (photo → Claude vision → catalog retrieval), voice ordering (Whisper STT via an OpenAI-compatible provider, returning a clean 503 rather than faking a transcript when unconfigured), and handwritten stock-sheet OCR with a human-confirm-then-audit write path. Manual/spec-sheet OCR also feeds documents back into the knowledge base.

find-by-imagetranscribeocr-stock
MemoryPersonalization

Durable user preferences (brand, pro/DIY, project) are captured during chat and injected as a fenced, injection-hardened preamble into every turn; each conversation also carries a one-line need-summary. A "what we remember" panel surfaces this for logged-in users, and chat→cart→order attribution measures revenue the assistant actually drove.

UserPreferenceconversation.summaryAI attribution
OpsObservability

Every assistant message persists telemetry — model, route, per-turn input/output tokens, tools used, guardrail violation. A manager dashboard surfaces per-turn cost, Haiku-vs-Sonnet route mix, guardrail-violation rate, an offline answer-quality eval (price-faithfulness / relevance / context-utilization), CSAT, inventory, and COGS-based margins.

AI-Opscost / route mixCSAT
Product surface

A complete store, not a chatbot demo

The assistant lives inside a real, secure e-commerce app.

🛒

Storefront & cart

Browse 10K+ generated SKUs across 19 categories with search, faceted filters, product options, and a guest-capable cart (session-based, no login required).

💳

Stripe checkout

Real test-mode payments — PaymentIntents, server-side confirmation (no webhook tunnel needed), manager-gated refunds, integer-cents money handling.

🔐

RBAC & security

4-tier roles (customer / staff / manager / admin), JWT with refresh-token rotation, CSRF double-submit, rate limiting, login lockout, and an audit log.

📦

Catalog generation

Deterministic generator: categories × product types × variant axes (voltage, grade, size…) × brands, scaling to 10K+ SKUs with round-robin category coverage.

📚

Knowledge base

Policy / FAQ / warranty docs plus ~195 buying guides and 19 category guides (with spec-comparison tables) — heading-aware chunked and vector-indexed for cited answers.

📈

Manager analytics

AI Operations, AI-attributed revenue, chat-quality eval, CSAT, inventory health, and profit/margins — each role-gated behind least-privilege RBAC.

Results

Quantified, because it was built to be measured

The retrieval-evaluation harness was written before the re-ranker, so every change is a measurement against a labeled set.

Metric (labeled KB set)Baseline (hybrid)+ Re-rankerΔ
hit@50.860.93+0.07
MRR0.540.59+0.05
nDCG0.600.63+0.03

Plus a deterministic answer-quality evaluator (price-faithfulness, answer-relevance, context-utilization) scored offline per turn, and a 1–5 CSAT loop — all without an LLM-as-judge, so results are reproducible and free.

Technology

Stack

AI / Retrieval
Anthropic Claude (Haiku + Sonnet)fastembed · BGE-smallCLIP (ViT-B/32)cross-encoder re-rankerRRFLlamaIndex (alt retriever)
Backend
FastAPISQLAlchemy 2.0AlembicPydanticUvicornslowapi
Data
PostgreSQL + pgvectorSQLiteHNSW indexes
Frontend
React 18 + TypeScriptViteTailwind CSSTanStack QueryZustandStripe.js
Payments & Auth
StripeJWT (refresh rotation)Argon2CSRF
Deploy / MLOps
Docker (multi-stage)Hugging Face SpacesGitHub Actions CI/CDpytest (~267)gitleaks
Design decisions

Choices I can defend

Guardrail as code, not prompt

The deterministic validator is the real safety net; prompt rules are the first line. A hallucinated price never reaches the user.

Hand-built agent loop

Built on the raw Anthropic SDK rather than LangChain/LangGraph, so guardrails, routing, and grounding are fully under my control.

Cheap-by-default routing

Haiku handles simple turns; Sonnet is reserved for hard ones — lower cost with no quality loss where it matters.

Same code, SQLite ↔ Postgres

A vector-store abstraction lets the app run on SQLite locally and Postgres + pgvector in prod; CI proves both. "Scale up" is a config change.

Graceful degradation everywhere

torch / CLIP / STT / cross-encoder are all optional with deterministic fallbacks, so the deployed image stays lean and tests stay green.

Measured before optimized

The eval harness came first, so retrieval changes are quantified — and I knew when not to add complexity (e.g. skipping a trained router that a heuristic matched).