
Building a RAG Diagnostic Agent on AWS Bedrock: Lessons from Production

We went from zero Bedrock experience to a production-grade diagnostic agent for a transportation management system (TMS) in six weeks. The architectural decisions that got us there, and the ones we wish we'd made earlier.

In late 2024 we were embedded with a logistics client whose TMS had become a black box. Engineers spent hours each week manually reading dense error logs to diagnose route failures and carrier sync issues. Our brief: turn that tribal knowledge into an agent.

The problem we were solving

The system ingested structured events from BlueYonder and Samsara — carrier updates, ETA revisions, exception flags — plus unstructured documents: carrier contracts, SLA definitions, historical incident reports. No single engineer held the full picture. Onboarding took weeks.

The diagnostic flow was always the same: a route would fail, an engineer would grep logs, cross-reference the carrier contract PDF, check the SLA threshold, and write a root-cause note. We wanted an agent that could do the first 80% of that automatically and surface a structured diagnosis with confidence scores.

"The goal wasn't to replace the engineer. It was to give them back the hour they spent context-switching before they could even start thinking."

Architecture overview

We built on AWS Bedrock with Claude 3 Sonnet as the reasoning model and Titan Embeddings v2 for the vector layer. The retrieval store is pgvector running on RDS PostgreSQL — a deliberate choice we'll get to. The orchestration layer is a Spring Boot service with a custom tool-use loop rather than a framework like LangChain.

The high-level flow: inbound events are chunked and embedded, stored in pgvector, then retrieved at query time to augment the model prompt. The model returns a structured diagnosis JSON with a confidence score and a list of cited sources.
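
As a sketch, that diagnosis payload can be modeled like this (field names are illustrative assumptions, not the exact production schema):

```java
// Illustrative model of the structured diagnosis the agent returns.
// Field names are assumptions; the exact schema isn't published here.
import java.util.List;

public record Diagnosis(
        String routeId,              // the failed route under investigation
        String rootCause,            // the model's root-cause summary
        double confidence,           // model-reported confidence, 0.0 to 1.0
        List<CitedSource> sources) { // chunks the diagnosis is grounded in

    public record CitedSource(
            String documentId,   // e.g. a carrier contract or incident report
            String chunkId,      // the specific retrieved chunk
            String excerpt) {}   // verbatim text supporting the claim
}
```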

The hard parts

Chunking strategy

Carrier contracts are written by lawyers, not engineers. A single clause about weight-based surcharges can span three paragraphs with backward references. Naive fixed-size chunking (512 tokens, 10% overlap) produced retrieval results that were syntactically correct but semantically incomplete — the model would retrieve the surcharge definition but miss the exception clause two paragraphs earlier.

We switched to a hierarchical chunking strategy: document-level summaries as parent chunks, clause-level chunks as children, with parent IDs stored as metadata. At retrieval time we fetch the child by similarity, then always pull its parent summary as additional context. Retrieval quality improved significantly on our evaluation set.

For legal and contractual documents, semantic boundaries matter more than token budgets. Chunk at clause boundaries, not character counts.
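
A minimal sketch of that parent/child lookup, assuming a single chunks table holding both levels with a parent_id column, and pgvector's <=> cosine-distance operator (table and column names are ours, not the actual production schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Sketch of hierarchical retrieval: fetch top-k clause-level children by
// similarity, then always pull each child's parent summary as extra context.
public final class HierarchicalRetriever {

    private static final String CHILDREN_BY_SIMILARITY = """
            SELECT content, parent_id
            FROM chunks
            WHERE parent_id IS NOT NULL           -- clause-level children only
            ORDER BY embedding <=> ?::vector      -- pgvector cosine distance
            LIMIT ?""";

    private static final String PARENT_SUMMARY =
            "SELECT content FROM chunks WHERE id = ?";

    /** Returns child chunks by similarity, each followed by its parent summary. */
    public List<String> retrieve(Connection conn, String queryEmbedding, int topK)
            throws SQLException {
        List<String> context = new ArrayList<>();
        try (PreparedStatement children = conn.prepareStatement(CHILDREN_BY_SIMILARITY);
             PreparedStatement parent = conn.prepareStatement(PARENT_SUMMARY)) {
            children.setString(1, queryEmbedding); // e.g. "[0.12, -0.03, ...]"
            children.setInt(2, topK);
            try (ResultSet rs = children.executeQuery()) {
                while (rs.next()) {
                    context.add(rs.getString("content"));
                    parent.setLong(1, rs.getLong("parent_id"));
                    try (ResultSet pr = parent.executeQuery()) {
                        if (pr.next()) context.add(pr.getString("content"));
                    }
                }
            }
        }
        return context;
    }
}
```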

Why pgvector over OpenSearch

We evaluated OpenSearch with the k-NN plugin, Pinecone, and pgvector. For our scale — roughly 2M vectors — pgvector on a db.r7g.large was 83% cheaper per month than a comparable Pinecone index and eliminated an additional managed service from the stack. Query latency at p95 was 38ms with an HNSW index, well within our 200ms budget.

The real win: our structured metadata (carrier ID, contract version, effective date) could be filtered using standard SQL predicates before the vector search, which neither Pinecone nor OpenSearch handled as cleanly without additional plumbing.
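
In practice that means the metadata filter and the nearest-neighbor ordering live in a single SQL statement. A sketch, with column names mirroring the metadata above (the exact table layout is an assumption):

```java
// Sketch: structured filters and vector search in one statement.
// Column names (carrier_id, effective_date) mirror the metadata described
// above; the actual schema is an assumption.
final class FilteredVectorSearch {
    static final String SQL = """
            SELECT chunk_id, content
            FROM contract_chunks
            WHERE carrier_id = ?                  -- plain SQL predicate
              AND effective_date <= CURRENT_DATE  -- only contracts in force
            ORDER BY embedding <=> ?::vector      -- pgvector cosine distance
            LIMIT 5""";
}
```

One statement, one round trip, no cross-service plumbing.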

Production lessons

Latency budget

Our end-to-end target was under two seconds for a diagnosis. Bedrock inference for Claude 3 Sonnet typically runs 900ms–1.4s for our prompt sizes. That left us roughly 600ms for retrieval, reranking, and prompt assembly. We had to be aggressive: top-k=5 (not 10), no reranker in the hot path, and prompt templates pre-compiled at startup.
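
A sketch of the template pre-loading, assuming templates live on the classpath and use simple {{placeholder}} interpolation (the file layout and placeholder syntax are both assumptions):

```java
import jakarta.annotation.PostConstruct;
import org.springframework.core.io.ClassPathResource;
import org.springframework.stereotype.Component;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: load and cache prompt templates once at startup, so the hot path
// does a map lookup plus string interpolation, never disk I/O.
@Component
public class PromptTemplates {

    private final Map<String, String> templates = new ConcurrentHashMap<>();

    @PostConstruct
    void preload() {
        for (String name : List.of("diagnosis", "triage")) { // names illustrative
            try (var in = new ClassPathResource("prompts/" + name + ".txt").getInputStream()) {
                templates.put(name, new String(in.readAllBytes(), StandardCharsets.UTF_8));
            } catch (IOException e) {
                throw new UncheckedIOException("missing prompt template: " + name, e);
            }
        }
    }

    public String render(String name, Map<String, String> vars) {
        String prompt = templates.get(name);
        for (var e : vars.entrySet()) {
            prompt = prompt.replace("{{" + e.getKey() + "}}", e.getValue());
        }
        return prompt;
    }
}
```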

Hallucination surface

The model would occasionally cite SLA thresholds with high confidence that weren't in the retrieved context — it was completing from training knowledge. We addressed this with two changes: an explicit system prompt instruction ("only cite figures present in the provided context documents") and a post-processing step that validates every numeric claim in the response against the retrieved chunks using regex matching.

"The model's confidence is not the same as factual grounding. Treat them as separate signals."

Cost model

At launch we were naively embedding every inbound event, including heartbeat pings with no diagnostic value. Filtering those at the ingest layer reduced our Titan Embeddings spend by 61%. Rule of thumb: embeddings are cheap per call but expensive at volume — filter aggressively before you embed.
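
The fix itself is a one-screen filter at the ingest layer. A sketch, with event-type names that are illustrative rather than the real taxonomy:

```java
import java.util.List;
import java.util.Set;

// Sketch: drop zero-signal events before they ever reach the embedding call.
// Event-type names are illustrative assumptions.
public final class IngestFilter {

    private static final Set<String> NO_DIAGNOSTIC_VALUE =
            Set.of("HEARTBEAT", "KEEPALIVE", "METRICS_PING");

    public static List<Event> worthEmbedding(List<Event> inbound) {
        return inbound.stream()
                .filter(e -> !NO_DIAGNOSTIC_VALUE.contains(e.type()))
                .toList();
    }

    public record Event(String type, String payload) {}
}
```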

What we'd do differently

We'd invest earlier in an evaluation harness. We built ours in week four, after we'd already made several chunking decisions we later reversed. Having even a small golden dataset of 50 question/answer pairs from week one would have made those decisions data-driven rather than intuition-driven.
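
Even a tiny harness makes those comparisons concrete. A minimal sketch, assuming the golden set is question/answer pairs and treating "the expected answer text appears in a retrieved chunk" as a crude recall signal (both the record shapes and the metric are our assumptions):

```java
import java.util.List;
import java.util.function.Function;

// Sketch of a minimal evaluation harness: for each golden question, check
// whether any retrieved chunk contains the expected answer string. Crude,
// but deterministic and enough to compare chunking strategies.
public final class GoldenSetEval {

    public record GoldenPair(String question, String expectedAnswer) {}

    /** Fraction of golden questions whose answer appears in a retrieved chunk. */
    public static double retrievalRecall(List<GoldenPair> golden,
                                         Function<String, List<String>> retrieveTopK) {
        long hits = golden.stream()
                .filter(p -> retrieveTopK.apply(p.question()).stream()
                        .anyMatch(chunk -> chunk.contains(p.expectedAnswer())))
                .count();
        return (double) hits / golden.size();
    }
}
```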

We'd also separate the retrieval service from the orchestration service earlier. Running both in the same Spring Boot instance made local development fast but created a deployment coupling that slowed us down in the final weeks when retrieval needed to scale independently of the agent logic.

Five things to take away

  1. Chunk at semantic boundaries, not token budgets — especially for legal documents.
  2. pgvector + SQL predicates beats a dedicated vector DB at moderate scale, at a fraction of the cost.
  3. Build your evaluation harness in week one, not week four.
  4. Filter before you embed — heartbeats and noise inflate cost with no retrieval benefit.
  5. Model confidence ≠ factual grounding. Validate numeric claims post-generation.