BuildRight AI — Guardrailed RAG Commerce Assistant

Overview

The problem, and the engineering bet

Conversational commerce is easy to fake and hard to trust. A chatbot that invents a price or a return policy is worse than no chatbot. BuildRight is built around one thesis: an assistant should be structurally incapable of stating an ungrounded fact — and still feel fast, cheap, and genuinely helpful.

🎯

What it does

A real storefront (browse, filter, cart, Stripe checkout) fronted by an AI assistant that searches products, answers policy questions with citations, plans project material lists from room dimensions, reorders past purchases, and recommends complementary items.

🛡️

Why it's different

Two independent guardrail layers — prompt rules plus a deterministic validator that checks every price and citation against the data actually retrieved for that turn. A fabricated price never renders; it's replaced by a safe fallback.

⚙️

How it's built

A hand-built agentic tool-use loop on the raw Anthropic SDK (no framework between me and the model), hybrid RAG with measured evaluation, a Haiku→Sonnet cost router, and production scaffolding: RBAC, payments, observability, migrations, and CI against two databases.

Architecture

One container, single origin

A single FastAPI process serves the built React SPA and the JSON/SSE API from one origin — no CORS, no cross-site-cookie problems, one deployable unit. It runs on ephemeral SQLite locally and on Postgres + pgvector in the cloud with the same code path.

Client

React SPA — storefront · cart · chat widget · manager & admin dashboards

↓ HTTPS · cookies (JWT access/refresh + CSRF) · JSON + Server-Sent Events ↓

FastAPI (single origin)

auth

cart

orders

chat (SSE)

payments

admin

analytics

media

middleware: CSRF · rate limit · login lockout · request-id · structured errors

↓

Services

cart · order · payment

memory · recommender

analytics · audit · eval

AI subsystem

model router (Haiku→Sonnet)

agentic tool-use loop (11 tools)

hybrid RAG + re-rank

guardrail validator

External

Anthropic Claude

Stripe

STT (optional)

↓

Data — SQLAlchemy ORM → SQLite (dev) | Postgres + pgvector (prod)

menu_items

orders / carts

conversations / messages

documents / chunks

*_embeddings

audit_logs

On startup the container runs an idempotent seed — generate catalog, ingest the knowledge base, compute embeddings — so a fresh deploy always boots into a complete, searchable dataset.

Retrieval & reasoning

A chat turn, end to end

Every turn is routed for cost, runs an agentic tool loop that grounds itself in real data, and is validated before a single token reaches the user.

Route

Heuristics + a 4-token Haiku classifier pick Haiku (simple) or Sonnet (complex / multimodal).

→

Tool loop

Up to 6 rounds of Claude tool-use — search, plan, reorder, recommend — accumulating grounded data.

→

Retrieve

Hybrid RAG fuses lexical + vector + CLIP via RRF, then a re-ranker sharpens precision.

→

Guard

Deterministic validator checks every price/citation vs. retrieved data; else safe fallback.

→

Stream

Validated answer streams over SSE with shortlist chips + a CSAT prompt.

Hybrid retrieval, fused and measured

🔤

Lexical arm

Tokenized keyword scoring with field weighting (name ×3, keyword/category ×2, description ×1) and singularization, plus an exact-SKU fast path.

🧭

Dense-vector arm

BGE-small embeddings (384-d) via fastembed, searched with pgvector's <=> on Postgres (HNSW-indexed) or NumPy cosine on SQLite. Hash-embedding fallback keeps it populated anywhere.

🖼️

Visual (CLIP) arm

CLIP image embeddings (512-d) let a text query rank products by what they look like. Optional and torch-guarded — absent torch it returns nothing and the fuser is unchanged.

⚖️

Reciprocal Rank Fusion + re-ranking

The three ranked lists are fused with RRF (k=60), then a cross-encoder re-ranks the candidate pool for precision — with a deterministic feature re-ranker fallback (term coverage, heading match, exact 2-gram, rank prior) when torch isn't present.

📊

Measured, not guessed

An evaluation harness scores hit@k, MRR, and nDCG@k over labeled question→doc sets. It was built first, so the re-ranker's lift (hit@5 0.86→0.93) is a measurement, not a claim. A query-decomposition step handles "X vs Y" comparisons via multi-hop retrieval.

Under the hood

Engineering depth, subsystem by subsystem

The pieces that make it trustworthy and cheap, not just functional.

TrustTwo-layer guardrail

Layer one is prompt discipline (only discuss tool-returned products/policies; never invent prices; cite as Title › Section). Layer two is a deterministic Python validator that regex-extracts every claimed price, distinguishes thresholds ("under $5") from assertions ("it's $49.99"), and checks each against the grounded items for the turn — allowing quantity × unit-price line totals. On a mismatch the answer is replaced with a safe fallback and a guardrail_violation flag is recorded. The safety net is code, not a prompt.

validate_response()price faithfulnesscitation soft-check

CostMulti-model router

Each turn is classified cheaply: an image forces the heavy model; obvious project language ("paint my room", "build a deck") escalates without a round-trip; very short asks stay simple; ambiguous mid-length asks get a 4-token Haiku micro-classification that fails safe to the cheap model. Simple turns run on Haiku, only complex/multimodal turns pay for Sonnet — and the chosen route is recorded per message for the cost dashboard.

Haiku → Sonnetheuristic fast-pathfail-safe

AgencyTool-use & project planner

An 11-tool agent loop on the raw Anthropic SDK: product & knowledge-base search, order history, reorder (fuzzy id/SKU/name match), preferences, recommenders (collaborative + content-based), and a project planner that computes a bill of materials from room dimensions (paint / tile / laminate / drywall) and batch-adds it to the cart — every quantity and line total grounded so the guardrail can verify it.

compute_materialsreorderfrequently_bought_with

ModalityVision · Voice · OCR

Image product search (photo → Claude vision → catalog retrieval), voice ordering (Whisper STT via an OpenAI-compatible provider, returning a clean 503 rather than faking a transcript when unconfigured), and handwritten stock-sheet OCR with a human-confirm-then-audit write path. Manual/spec-sheet OCR also feeds documents back into the knowledge base.

find-by-imagetranscribeocr-stock

MemoryPersonalization

Durable user preferences (brand, pro/DIY, project) are captured during chat and injected as a fenced, injection-hardened preamble into every turn; each conversation also carries a one-line need-summary. A "what we remember" panel surfaces this for logged-in users, and chat→cart→order attribution measures revenue the assistant actually drove.

UserPreferenceconversation.summaryAI attribution

OpsObservability

Every assistant message persists telemetry — model, route, per-turn input/output tokens, tools used, guardrail violation. A manager dashboard surfaces per-turn cost, Haiku-vs-Sonnet route mix, guardrail-violation rate, an offline answer-quality eval (price-faithfulness / relevance / context-utilization), CSAT, inventory, and COGS-based margins.

AI-Opscost / route mixCSAT

Product surface

A complete store, not a chatbot demo

The assistant lives inside a real, secure e-commerce app.

🛒

Storefront & cart

Browse 10K+ generated SKUs across 19 categories with search, faceted filters, product options, and a guest-capable cart (session-based, no login required).

💳

Stripe checkout

Real test-mode payments — PaymentIntents, server-side confirmation (no webhook tunnel needed), manager-gated refunds, integer-cents money handling.

🔐

RBAC & security

4-tier roles (customer / staff / manager / admin), JWT with refresh-token rotation, CSRF double-submit, rate limiting, login lockout, and an audit log.

📦

Catalog generation

Deterministic generator: categories × product types × variant axes (voltage, grade, size…) × brands, scaling to 10K+ SKUs with round-robin category coverage.

📚

Knowledge base

Policy / FAQ / warranty docs plus ~195 buying guides and 19 category guides (with spec-comparison tables) — heading-aware chunked and vector-indexed for cited answers.

📈

Manager analytics

AI Operations, AI-attributed revenue, chat-quality eval, CSAT, inventory health, and profit/margins — each role-gated behind least-privilege RBAC.

Metric (labeled KB set)	Baseline (hybrid)	+ Re-ranker	Δ
hit@5	0.86	0.93	+0.07
MRR	0.54	0.59	+0.05
nDCG	0.60	0.63	+0.03

Design decisions

Choices I can defend

Guardrail as code, not prompt

The deterministic validator is the real safety net; prompt rules are the first line. A hallucinated price never reaches the user.

Hand-built agent loop

Built on the raw Anthropic SDK rather than LangChain/LangGraph, so guardrails, routing, and grounding are fully under my control.

Cheap-by-default routing

Haiku handles simple turns; Sonnet is reserved for hard ones — lower cost with no quality loss where it matters.

Same code, SQLite ↔ Postgres

A vector-store abstraction lets the app run on SQLite locally and Postgres + pgvector in prod; CI proves both. "Scale up" is a config change.

Graceful degradation everywhere

torch / CLIP / STT / cross-encoder are all optional with deterministic fallbacks, so the deployed image stays lean and tests stay green.

Measured before optimized

The eval harness came first, so retrieval changes are quantified — and I knew when not to add complexity (e.g. skipping a trained router that a heuristic matched).

A guardrailed AI shopping assistant
that can't make up a price.