From RAG Pipeline to AI Agent: 14 Months of Building a Production Chatbot

Tags: RAG, AI Agents, LLM, Azure OpenAI, Architecture

Every RAG strategy we carefully engineered — HyDE, reranking, Lazy Graph RAG — was eventually made redundant by the next paradigm shift. This is the story of 14 months building a production chatbot for 12 000 users, and what I learned about the shelf life of AI architecture.

The Setting

The project is a support chatbot for a Swiss defense client. The application it serves is a large HR and logistics web portal — hundreds of screens across 9 modules. New recruits were generating a flood of support tickets because the documentation was sparse and outdated.

One constraint shaped everything: Azure OpenAI only. The client required all LLM inference to run on Swiss-hosted servers. No OpenAI API, no Anthropic, no open-source models on our own infra. Whatever Azure made available in the Switzerland region, that was our toolbox.

In early 2024, that meant GPT-3.5 Turbo.

Phase 1: Chunks in the System Prompt (Early 2024)

The first version was as naive as it gets. We chunked the existing documentation, ran a vector search against the user's question, and stuffed the top results into the system prompt. The LLM's job was simply to answer based on whatever context it received.
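Concretely, the whole thing fit in a few dozen lines. The sketch below is illustrative rather than our production code: it assumes the Azure OpenAI Python SDK, an in-memory chunk index, and placeholder deployment names.

```python
# Phase 1 in miniature: embed the question, take the top chunks by cosine
# similarity, and stuff them into the system prompt. Deployment names and the
# prompt wording are illustrative placeholders, not the production values.
import numpy as np
from openai import AzureOpenAI

client = AzureOpenAI()  # endpoint, key and api_version come from environment variables

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)  # embedding deployment name
    return np.array(resp.data[0].embedding)

def top_chunks(question: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 5) -> list[str]:
    q = embed(question)
    scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    context = "\n\n".join(top_chunks(question, chunks, chunk_vecs))
    resp = client.chat.completions.create(
        model="gpt-35-turbo",  # the GPT-3.5 Turbo deployment available to us in early 2024
        messages=[
            {"role": "system", "content": "Answer using only this context:\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```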

It barely worked. GPT-3.5 had two critical weaknesses:

It hallucinated constantly. When the retrieved chunks didn't contain a clear answer, GPT-3.5 would confidently fabricate one. Users quickly lost trust. Our accuracy measurement showed 40% correct answers — worse than useless, because wrong answers delivered with confidence are more damaging than no answer at all.

It couldn't handle context. Feeding more chunks to compensate for poor retrieval just made things worse. GPT-3.5's small context window and weak instruction-following meant that more context led to more confusion, not better answers. The model would latch onto irrelevant chunks or blend information from unrelated topics.

We had a chatbot that was live, but nobody trusted it.

Phase 2: The RAG Tricks Era (Mid 2024)

RAG was still relatively new, and the community was producing a steady stream of techniques to improve retrieval quality. We tried everything:

HyDE (Hypothetical Document Embeddings) — instead of embedding the user's question directly, you ask the LLM to generate a hypothetical answer first, then embed that to find similar real documents. The theory: a hypothetical answer is semantically closer to the actual answer than a short question is. In practice, it helped marginally on some query types and hurt on others. The improvement was inconsistent and hard to measure.
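As a hedged sketch of the HyDE step, reusing the client and embed helper from the earlier snippet (the prompt wording is illustrative, not our production prompt):

```python
# HyDE in miniature: ask the model for a hypothetical answer, then embed that
# passage instead of the raw question before running the vector search.
def hyde_query_vector(question: str) -> np.ndarray:
    hypothetical = client.chat.completions.create(
        model="gpt-35-turbo",
        messages=[{
            "role": "user",
            "content": "Write a short passage that plausibly answers this question:\n" + question,
        }],
    ).choices[0].message.content
    # The hypothetical passage is, in theory, closer in embedding space to the
    # real documentation than the short question is.
    return embed(hypothetical)
```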

Reranking — after the initial vector search returns candidates, a cross-encoder reranker scores each chunk against the original question and reorders them. This did help. The reranker caught cases where vector similarity was misleading — chunks that were topically related but didn't actually answer the question.
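The reranking stage looked roughly like this; treat it as a sketch, with a commonly used open-source cross-encoder standing in for whichever reranker you deploy:

```python
# Reranking in miniature: score each candidate chunk against the original
# question with a cross-encoder and keep the best ones.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder reranker model

def rerank(question: str, candidates: list[str], k: int = 5) -> list[str]:
    scores = reranker.predict([(question, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```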

Chunk size tuning, overlap strategies, metadata filtering — the usual suspects. Each one moved the needle by a few percentage points in our RAGAS evaluation suite, but none was transformative.

We were playing whack-a-mole. Every technique we added improved one failure mode and introduced another. The pipeline grew more complex with each iteration: query expansion, multi-stage retrieval, chunk deduplication, context window management. It felt like we were building an increasingly elaborate machine to compensate for the model's limitations.

Then GPT-4 arrived on Azure Switzerland.

Phase 3: GPT-4 Changes Everything (Q3 2024)

The upgrade from GPT-3.5 to GPT-4 was the single biggest improvement I have witnessed in this project — bigger than any retrieval trick, any pipeline optimization, any architectural change.

Overnight, the hallucination problem was cut in half. GPT-4 could actually follow instructions: "if the context doesn't contain the answer, say so." GPT-3.5 had treated this instruction as a suggestion. GPT-4 treated it as a rule.

The model could also handle substantially more context without losing coherence. Chunks that confused GPT-3.5 were synthesized correctly by GPT-4. Our accuracy jumped, and user feedback shifted from skepticism to cautious adoption.

This was a humbling moment. We had spent months engineering retrieval tricks to compensate for model weakness. A single model upgrade delivered more improvement than all of those tricks combined.

We kept the retrieval improvements — they still helped at the margins — but the lesson was clear: don't over-engineer around model limitations that will be solved by the next model.

When GPT-4o became available around Q2 2025, we switched again. No dramatic quality improvement this time — just significantly cheaper tokens, which mattered at 100+ messages per day.

Phase 4: Lazy Graph RAG — The Last Pipeline Trick (Late 2025)

Microsoft published a paper on Graph RAG, and a variant called Lazy Graph RAG caught my attention. The idea: instead of pre-summarizing your entire dataset into a knowledge graph (expensive and brittle), you build graph structure lazily at query time, allowing the search to traverse relationships across the full dataset.
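To make that concrete, here is a drastically simplified sketch of the query-time flavour of the idea. It is not Microsoft's implementation: the one-hop entity expansion and the prompt are my illustrative stand-ins, and it reuses the client and top_chunks helper from the Phase 1 sketch.

```python
# "Lazy" graph expansion in miniature: seed with a vector search, extract
# entities at query time, then pull in other chunks that mention the same
# entities (a one-hop traversal over an implicit entity graph).
def lazy_graph_context(question: str, chunks: list[str], chunk_vecs: np.ndarray) -> list[str]:
    seeds = top_chunks(question, chunks, chunk_vecs, k=3)
    entity_resp = client.chat.completions.create(
        model="gpt-4o",  # by late 2025 we were on the GPT-4o deployment
        messages=[{
            "role": "user",
            "content": "List the key entities, one per line, mentioned in:\n\n" + "\n\n".join(seeds),
        }],
    )
    entities = [e.strip() for e in entity_resp.choices[0].message.content.splitlines() if e.strip()]
    # One hop: any other chunk that mentions one of the seed entities joins the context.
    neighbours = [c for c in chunks if c not in seeds and any(e.lower() in c.lower() for e in entities)]
    return seeds + neighbours[:5]
```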

I implemented it. In theory, it meant our chatbot could reason across the entire portal's knowledge base instead of being limited to whatever the vector search happened to retrieve.

In practice, we couldn't measure significant improvement. The evaluation metrics barely moved. The approach was elegant on paper but didn't translate to meaningful gains in our use case. The same pattern as the other RAG tricks: marginal improvement at the cost of substantial complexity.

I was starting to suspect that the problem wasn't retrieval quality. The problem was the entire paradigm of pre-built retrieval pipelines.

Phase 5: The Agent Revolution (Early 2026)

The turning point came from an unexpected direction. I was experimenting with browser-use, an open-source library that lets an LLM control a web browser. It's built as a simple ReAct agent: observe the page, think about what to do, take an action, observe the result, repeat.

What struck me wasn't the browser automation. It was how well the agent loop worked. The ReAct pattern — think, act, observe — was solving problems that would have required elaborate pre-built pipelines if you tried to engineer them manually. The agent didn't need a sophisticated retrieval strategy. It just searched, evaluated what it found, refined its query, and searched again. It reasoned about retrieval instead of following a fixed pipeline.

I studied browser-use's architecture and realized the same pattern could replace our entire RAG pipeline. Instead of: user question → query expansion → vector search → reranking → context assembly → LLM answer, the agent could:

  1. Read the user's question and think about what information it needs
  2. Search the knowledge base with a self-chosen query
  3. Evaluate whether the results are sufficient
  4. Refine and search again if needed, with different terms or in different data sources
  5. Synthesize an answer once it has enough context

No HyDE. No reranking. No chunk size optimization. The agent handles all of that implicitly through reasoning. It tries a search, sees that the results aren't great, and figures out a better query on its own — something our hand-crafted pipeline could never do.
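Stripped down, the loop is not much code. The sketch below uses the standard chat-completions tool-calling interface; the tool name, prompts and step limit are illustrative, and search_fn stands in for whatever actually queries the knowledge base.

```python
# A ReAct-style loop in miniature: the model decides when to call the search
# tool, inspects the results, and either searches again with a better query or
# answers. No hand-crafted retrieval pipeline in sight.
import json

SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_knowledge_base",
        "description": "Search the portal documentation. Returns the top matching chunks.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def agent_answer(question: str, search_fn, max_steps: int = 6) -> str:
    messages = [
        {"role": "system", "content": "Answer the user's question about the portal. "
                                      "Search as often as you need to. If the knowledge base "
                                      "does not contain the answer, say so."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=[SEARCH_TOOL])
        msg = resp.choices[0].message
        if not msg.tool_calls:          # the model decided it has enough context
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:     # the model picked its own query and can refine it next turn
            query = json.loads(call.function.arguments)["query"]
            messages.append({"role": "tool", "tool_call_id": call.id, "content": search_fn(query)})
    return "I could not find a reliable answer to that question."
```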

The improvement was immediate and obvious. The agent found answers that our pipeline had missed, because it could adapt its search strategy in real time. A user asking a vague question no longer depended on our query expansion logic getting it right on the first try. The agent would try multiple angles until it found what it needed.

Agents are as big a revolution as GPT-3.5 itself was. Not the models getting smarter — the architecture around them getting smarter. An LLM in an agent loop is fundamentally more capable than the same LLM at the end of a fixed pipeline.

Phase 6: The Next Horizon — Off-the-Shelf Agents + MCP

Here's the uncomfortable conclusion I've arrived at: agent architectures are now too sophisticated to build yourself.

Our custom ReAct agent works well. But when I look at Claude Code, OpenAI Codex, or other production agent frameworks, they've solved problems we haven't even encountered yet — context management, tool orchestration, error recovery, multi-step planning. These teams have hundreds of engineers iterating on agent architectures daily.

The next evolution of our chatbot won't be another custom agent. It will be an off-the-shelf agent product that we connect to our domain knowledge through MCP servers. Instead of our Python code directly calling Azure AI Search and querying the database, we'll expose those capabilities as MCP tools. The agent will decide when and how to use them.
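A minimal sketch of what that exposure could look like with the Python MCP SDK; the server name, tool names and the stubbed helpers are hypothetical, not our actual server.

```python
# Exposing domain knowledge as MCP tools: any MCP-capable agent can call these,
# regardless of which agent product ends up on top. The helpers are stubs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("portal-knowledge")

def _search_documents(query: str, top: int) -> str:
    # Stub: the real implementation would call Azure AI Search.
    return f"(top {top} documentation chunks matching: {query})"

def _run_sql(sql: str) -> str:
    # Stub: the real implementation would run a read-only query against the portal database.
    return f"(rows returned by: {sql})"

@mcp.tool()
def search_documentation(query: str, top: int = 5) -> str:
    """Search the portal documentation and return the top matching chunks."""
    return _search_documents(query, top)

@mcp.tool()
def query_portal_database(sql: str) -> str:
    """Run a read-only SQL query against the portal database."""
    return _run_sql(sql)

if __name__ == "__main__":
    mcp.run()
```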

This is why I built the MCP server for the codebase and generated structured action data from the portal's UI. Those aren't just improvements to the current chatbot — they're the foundation for the next architecture. The data and tools will outlive any specific agent implementation.

The Pattern

Looking back at 14 months, one pattern stands out:

| Phase | What we built | What made it obsolete |
|---|---|---|
| Chunks in system prompt | Basic vector search | RAG techniques improved retrieval |
| RAG tricks (HyDE, reranking) | Sophisticated retrieval pipeline | GPT-4 made half the tricks unnecessary |
| Lazy Graph RAG | Graph-based cross-dataset reasoning | Agent loop with simple search beat it |
| Custom ReAct agent | Domain-specific agent architecture | Off-the-shelf agents with MCP will replace it |

Every layer of engineering eventually got absorbed by the next paradigm shift. The RAG tricks compensated for weak models. The agent loop made the tricks redundant. Production-grade agent frameworks will make custom agent code redundant.

What Actually Survived

Not everything was throwaway. Looking at what's still valuable after 14 months:

The evaluation framework. RAGAS in CI, running on every pull request. Regardless of the architecture underneath, we can always measure whether a change improves or degrades answer quality. This survived every paradigm shift because it measures outcomes, not implementation.
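The CI gate amounts to a few lines. The exact ragas API varies across versions, so treat this as a sketch in the older evaluate-on-a-dataset style, with toy examples and an illustrative threshold.

```python
# RAGAS as a CI gate: evaluate the chatbot's answers on a curated test set and
# fail the pipeline if aggregate quality drops below a threshold.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

eval_set = Dataset.from_dict({
    "question": ["How do I reset a recruit's login?"],                      # curated test questions
    "answer": ["Open the Admin module and use the 'Reset login' action."],  # answers from the system under test
    "contexts": [["Administrators can reset a login from the Admin module."]],
    "ground_truth": ["Use the 'Reset login' action in the Admin module."],
})

scores = evaluate(eval_set, metrics=[faithfulness, answer_relevancy])
assert scores["faithfulness"] >= 0.8, "Answer quality regressed, failing the build"
```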

The structured data. The 715 action files, the code intelligence extractions, the curated UI walkthroughs. Good data outlives every architecture. It worked for the RAG pipeline, it works for the agent, and it will work for whatever comes next.

The MCP interfaces. Tools that expose structured access to domain knowledge. These are architecture-agnostic by design — any agent that speaks MCP can use them.

What didn't survive: every retrieval trick, every custom pipeline stage, every piece of code that encoded assumptions about how search should work rather than letting the model figure it out.

Takeaways

  1. Don't over-engineer around current model limitations. The limitations you're working around today will likely be solved by the next model or the next architectural pattern. Build the simplest thing that works and invest in what's durable — data, evaluation, and clean interfaces.

  2. Agents are a paradigm shift, not an incremental improvement. The jump from RAG pipeline to agent loop is not "a better retrieval strategy." It's a fundamentally different approach where the LLM drives the process instead of being a component at the end of a fixed pipeline.

  3. Invest in data and interfaces, not pipeline code. After 14 months, the custom retrieval logic is heading for replacement. The structured action data, the MCP server, and the evaluation suite are more valuable than ever. Build things that are useful regardless of the agent architecture consuming them.

  4. Know when to stop building and start buying. There's a window where building a custom agent is the right call — when off-the-shelf options don't exist or don't fit. That window is closing. The engineering effort is better spent on domain-specific data and tooling that makes any agent effective in your context.