BuildRight AI is a full-stack hardware-store web app where customers search products, get grounded policy answers, plan whole projects, and check out — just by chatting. It pairs a hybrid retrieval (RAG) engine with a deterministic guardrail that validates every answer against retrieved data, a cost-aware multi-model router, and multimodal (image / voice / OCR) input — all deployed as one Docker container.
Conversational commerce is easy to fake and hard to trust. A chatbot that invents a price or a return policy is worse than no chatbot. BuildRight is built around one thesis: an assistant should be structurally incapable of stating an ungrounded fact — and still feel fast, cheap, and genuinely helpful.
A real storefront (browse, filter, cart, Stripe checkout) fronted by an AI assistant that searches products, answers policy questions with citations, plans project material lists from room dimensions, reorders past purchases, and recommends complementary items.
Two independent guardrail layers — prompt rules plus a deterministic validator that checks every price and citation against the data actually retrieved for that turn. A fabricated price never renders; it's replaced by a safe fallback.
A hand-built agentic tool-use loop on the raw Anthropic SDK (no framework between me and the model), hybrid RAG with measured evaluation, a Haiku→Sonnet cost router, and production scaffolding: RBAC, payments, observability, migrations, and CI against two databases.
A single FastAPI process serves the built React SPA and the JSON/SSE API from one origin — no CORS, no cross-site-cookie problems, one deployable unit. It runs on ephemeral SQLite locally and on Postgres + pgvector in the cloud with the same code path.
On startup the container runs an idempotent seed — generate catalog, ingest the knowledge base, compute embeddings — so a fresh deploy always boots into a complete, searchable dataset.
Every turn is routed for cost, runs an agentic tool loop that grounds itself in real data, and is validated before a single token reaches the user.
Heuristics + a 4-token Haiku classifier pick Haiku (simple) or Sonnet (complex / multimodal).
Up to 6 rounds of Claude tool-use — search, plan, reorder, recommend — accumulating grounded data.
Hybrid RAG fuses lexical + vector + CLIP via RRF, then a re-ranker sharpens precision.
Deterministic validator checks every price/citation vs. retrieved data; else safe fallback.
Validated answer streams over SSE with shortlist chips + a CSAT prompt.
Tokenized keyword scoring with field weighting (name ×3, keyword/category ×2, description ×1) and singularization, plus an exact-SKU fast path.
BGE-small embeddings (384-d) via fastembed, searched with pgvector's <=> on Postgres (HNSW-indexed) or NumPy cosine on SQLite. Hash-embedding fallback keeps it populated anywhere.
CLIP image embeddings (512-d) let a text query rank products by what they look like. Optional and torch-guarded — absent torch it returns nothing and the fuser is unchanged.
The three ranked lists are fused with RRF (k=60), then a cross-encoder re-ranks the candidate pool for precision — with a deterministic feature re-ranker fallback (term coverage, heading match, exact 2-gram, rank prior) when torch isn't present.
An evaluation harness scores hit@k, MRR, and nDCG@k over labeled question→doc sets. It was built first, so the re-ranker's lift (hit@5 0.86→0.93) is a measurement, not a claim. A query-decomposition step handles "X vs Y" comparisons via multi-hop retrieval.
The pieces that make it trustworthy and cheap, not just functional.
Layer one is prompt discipline (only discuss tool-returned products/policies; never invent prices; cite as Title › Section). Layer two is a deterministic Python validator that regex-extracts every claimed price, distinguishes thresholds ("under $5") from assertions ("it's $49.99"), and checks each against the grounded items for the turn — allowing quantity × unit-price line totals. On a mismatch the answer is replaced with a safe fallback and a guardrail_violation flag is recorded. The safety net is code, not a prompt.
Each turn is classified cheaply: an image forces the heavy model; obvious project language ("paint my room", "build a deck") escalates without a round-trip; very short asks stay simple; ambiguous mid-length asks get a 4-token Haiku micro-classification that fails safe to the cheap model. Simple turns run on Haiku, only complex/multimodal turns pay for Sonnet — and the chosen route is recorded per message for the cost dashboard.
An 11-tool agent loop on the raw Anthropic SDK: product & knowledge-base search, order history, reorder (fuzzy id/SKU/name match), preferences, recommenders (collaborative + content-based), and a project planner that computes a bill of materials from room dimensions (paint / tile / laminate / drywall) and batch-adds it to the cart — every quantity and line total grounded so the guardrail can verify it.
Image product search (photo → Claude vision → catalog retrieval), voice ordering (Whisper STT via an OpenAI-compatible provider, returning a clean 503 rather than faking a transcript when unconfigured), and handwritten stock-sheet OCR with a human-confirm-then-audit write path. Manual/spec-sheet OCR also feeds documents back into the knowledge base.
Durable user preferences (brand, pro/DIY, project) are captured during chat and injected as a fenced, injection-hardened preamble into every turn; each conversation also carries a one-line need-summary. A "what we remember" panel surfaces this for logged-in users, and chat→cart→order attribution measures revenue the assistant actually drove.
Every assistant message persists telemetry — model, route, per-turn input/output tokens, tools used, guardrail violation. A manager dashboard surfaces per-turn cost, Haiku-vs-Sonnet route mix, guardrail-violation rate, an offline answer-quality eval (price-faithfulness / relevance / context-utilization), CSAT, inventory, and COGS-based margins.
The assistant lives inside a real, secure e-commerce app.
Browse 10K+ generated SKUs across 19 categories with search, faceted filters, product options, and a guest-capable cart (session-based, no login required).
Real test-mode payments — PaymentIntents, server-side confirmation (no webhook tunnel needed), manager-gated refunds, integer-cents money handling.
4-tier roles (customer / staff / manager / admin), JWT with refresh-token rotation, CSRF double-submit, rate limiting, login lockout, and an audit log.
Deterministic generator: categories × product types × variant axes (voltage, grade, size…) × brands, scaling to 10K+ SKUs with round-robin category coverage.
Policy / FAQ / warranty docs plus ~195 buying guides and 19 category guides (with spec-comparison tables) — heading-aware chunked and vector-indexed for cited answers.
AI Operations, AI-attributed revenue, chat-quality eval, CSAT, inventory health, and profit/margins — each role-gated behind least-privilege RBAC.
The retrieval-evaluation harness was written before the re-ranker, so every change is a measurement against a labeled set.
| Metric (labeled KB set) | Baseline (hybrid) | + Re-ranker | Δ |
|---|---|---|---|
| hit@5 | 0.86 | 0.93 | +0.07 |
| MRR | 0.54 | 0.59 | +0.05 |
| nDCG | 0.60 | 0.63 | +0.03 |
Plus a deterministic answer-quality evaluator (price-faithfulness, answer-relevance, context-utilization) scored offline per turn, and a 1–5 CSAT loop — all without an LLM-as-judge, so results are reproducible and free.
The deterministic validator is the real safety net; prompt rules are the first line. A hallucinated price never reaches the user.
Built on the raw Anthropic SDK rather than LangChain/LangGraph, so guardrails, routing, and grounding are fully under my control.
Haiku handles simple turns; Sonnet is reserved for hard ones — lower cost with no quality loss where it matters.
A vector-store abstraction lets the app run on SQLite locally and Postgres + pgvector in prod; CI proves both. "Scale up" is a config change.
torch / CLIP / STT / cross-encoder are all optional with deterministic fallbacks, so the deployed image stays lean and tests stay green.
The eval harness came first, so retrieval changes are quantified — and I knew when not to add complexity (e.g. skipping a trained router that a heuristic matched).