Your RAG Pipeline Is Lying to You


You built a RAG pipeline. You tested it manually with a few queries, got reasonable answers, and shipped it. Your stakeholders are happy. The chatbot returns answers that sound authoritative.

But here is the uncomfortable question: do you actually know if it's right?

Not "does it return something", but does it return the correct thing, consistently, across the full distribution of queries your users will actually ask?

If you can't answer that with a number, you don't have a production system. You have a demo with a deployment URL.

This article is for senior engineers who build RAG systems and want to know if they actually work. We will walk through a concrete RAG example - a pipeline over a corporate annual report - and build the testing layer that most teams skip entirely. The code is real and runnable. The failures are not hypothetical.

All code in this article is available in the companion GitHub repository: github.com/nunombispo/rag-pipeline-article. The repository includes a sample golden dataset, the full pytest suite, and a minimal monitoring wrapper.


Why RAG Is Harder Than It Looks

RAG sounds simple on paper: retrieve relevant context, pass it to the model, get a grounded answer. The reality is that there are three independent failure layers, any one of which can silently corrupt your results.

Layer 1 - Retrieval failure: The vector search returns chunks that are topically adjacent to the query but don't contain the actual answer. Similarity is not the same as relevance. A question about Q3 revenue may surface chunks about Q3 strategy, not the income statement.

Layer 2 - Context failure: The right information is in the document, but it was split across a chunk boundary during preprocessing. The number is on page 47, the label is on page 46, and your chunker put them in separate vectors. Neither chunk alone answers the question correctly.

Layer 3 - Generation failure: The model receives good context but still produces a wrong answer - either by hallucinating when the context is ambiguous, by confusing units or time periods, or by confidently synthesizing an answer from two chunks that should not be combined.

Each layer fails independently and silently. There is no exception thrown. The user gets an answer. The answer looks plausible.
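Layer 2 is easy to reproduce in isolation. The sketch below applies a fixed-size, overlapping split to a made-up snippet and checks whether any single chunk holds both a label and its figure. The text, the `fixed_size_chunks` helper, and the tiny chunk size are illustrative assumptions, not taken from the report or the pipeline code.

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# Illustrative text only: a label and its figure separated by filler prose.
LABEL = "Widget revenue"
FIGURE = "1,234"
text = (
    f"{LABEL} is reported at fiscal year end. "
    + "Filler sentence. " * 5
    + f"The figure was {FIGURE}."
)

chunks = fixed_size_chunks(text, size=60, overlap=10)

# A chunk can answer the question only if it holds both label and figure.
complete = [c for c in chunks if LABEL in c and FIGURE in c]
print(f"{len(chunks)} chunks, {len(complete)} contain both label and figure")
```

Both pieces of information are indexed, but never together, so no retrieved chunk can answer the question on its own.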

Query ──► [Embedding] ──► [Vector Search] ──► [Context Assembly] ──► [LLM] ──► Answer
              │                  │                     │               │
           Layer 0            Layer 1               Layer 2         Layer 3
         (usually OK)      (common fail)         (common fail)   (occasional fail)

Layer 0 is the embedding model itself: given a query string, does it produce a vector? This step almost never fails in practice - hosted embedding APIs are reliable, and vector dimension mismatches surface immediately at index time. Most teams verify it once and move on.

The layers that actually kill production systems are 1 and 2. They are typically invisible until a user complains - and most users don't complain. They just stop trusting the system.

RAG failures are not crashes. They are confidence-eroding wrong answers delivered at scale. The risk is not a system outage - it is the gradual destruction of user trust in your AI investment.

The Concrete Example: RAG Over an Annual Report

The easiest way to see all three layers fail is on a real system, with a real document, running real queries. Annual reports are the canonical enterprise RAG use case. Dense prose, financial tables, footnotes, forward-looking statements, and regulatory boilerplate - all in one PDF. They are also a perfect failure showcase because the data is precise, verifiable, and publicly available.

We will use Microsoft's 2025 Annual Report (publicly available). It's 83 pages of shareholder letters, financial tables, segment breakdowns, accounting notes, and legal disclosures - exactly the kind of document that gets thrown at enterprise RAG pipelines and produces interesting failures. The principles apply to any dense financial document.

Setup

pip install anthropic "pydantic>=2" chromadb pypdf python-dotenv rich pytest

Step 1: Ingest the PDF

# ingest.py
import hashlib
from pypdf import PdfReader
import chromadb

CHUNK_SIZE = 800        # characters
CHUNK_OVERLAP = 100


def load_pdf(path: str) -> list[dict]:
    """Extract text from PDF, page by page."""
    reader = PdfReader(path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        pages.append({"page": i + 1, "text": text.strip()})
    return pages


def chunk_text(pages: list[dict]) -> list[dict]:
    """Naive fixed-size chunking with overlap."""
    chunks = []
    for page in pages:
        text = page["text"]
        start = 0
        while start < len(text):
            end = start + CHUNK_SIZE
            chunk = text[start:end]
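            # Hash the page and offset along with the text so that repeated
            # text (headers, footers) cannot produce colliding IDs.
            chunk_id = hashlib.md5(
                f"{page['page']}:{start}:{chunk}".encode()
            ).hexdigest()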
            chunks.append({
                "id": chunk_id,
                "text": chunk,
                "page": page["page"],
                "start": start,
            })
            start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks


def build_collection(pdf_path: str, collection_name: str = "annual_report"):
    client = chromadb.Client()
    collection = client.get_or_create_collection(collection_name)

    pages = load_pdf(pdf_path)
    chunks = chunk_text(pages)

    collection.add(
        ids=[c["id"] for c in chunks],
        documents=[c["text"] for c in chunks],
        metadatas=[{"page": c["page"]} for c in chunks],
    )

    print(f"Indexed {len(chunks)} chunks from {len(pages)} pages.")
    return collection

Step 2: Query the Pipeline

# rag.py
import anthropic
import chromadb
import os
from dotenv import load_dotenv

MODEL = "claude-sonnet-4-6"
TOP_K = 3

load_dotenv()
API_KEY = os.getenv("ANTHROPIC_API_KEY")
client = anthropic.Anthropic(api_key=API_KEY)


def retrieve(collection, query: str, top_k: int = TOP_K) -> list[dict]:
    results = collection.query(query_texts=[query], n_results=top_k)
    chunks = []
    for doc, meta, distance in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
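        # Chroma's default distance is squared L2, so 1 - distance is only a
        # rough ranking score (it can go negative), not a cosine similarity.
        chunks.append({"text": doc, "page": meta["page"], "score": 1 - distance})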
    return chunks


def answer(query: str, chunks: list[dict]) -> str:
    context = "\n\n---\n\n".join(
        f"[Page {c['page']}]\n{c['text']}" for c in chunks
    )
    system = (
        "You are a financial analyst assistant. Answer questions based strictly "
        "on the provided context. If the answer is not present in the context, "
        "say 'I could not find this in the provided document.' "
        "Do not speculate or use external knowledge."
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=512,
        thinking={"type": "disabled"},  # disable extended thinking; not needed here
        system=system,
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            }
        ],
    )
    parts = [b.text for b in response.content if getattr(b, "type", None) == "text"]
    return "\n".join(parts).strip() if parts else ""


def rag_query(collection, query: str) -> dict:
    chunks = retrieve(collection, query)
    response = answer(query, chunks)
    return {
        "query": query,
        "chunks": chunks,
        "answer": response,
    }

Step 3: Run It

# main.py
from ingest import build_collection
from rag import rag_query

collection = build_collection("2025_AnnualReport.pdf")  # optional: collection_name="my_docs"
result = rag_query(collection, "What were the main risks discussed?")

print(result["answer"])
# result also has "chunks" (text, page, score) and "query"

This is a typical minimal RAG implementation. Straightforward, readable - and full of silent failure modes. Let's expose them.


What Happens Without Evaluation

Before building the test suite, let's make the failures concrete by running the pipeline against four representative queries. When retrieval and generation align, the pipeline delivers a clean, grounded answer with no intervention needed - Query 1 shows this. But "correct" and "useful" are not the same thing, and the remaining queries show how quietly the gap between them can widen.

Query 1: "What was Microsoft's total revenue in fiscal year 2025?"

Based on the provided context, Microsoft's total revenue in fiscal year 2025 was
**$281,724 million ($281.7 billion)**, representing a **15% increase** compared
to fiscal year 2024's revenue of $245,122 million.

The answer is correct. But the same figure also appears narratively in the shareholder letter on page 2 - "revenue was $281.7 billion, up 15 percent" - without the structured financial context an analyst would expect. If the retriever ranks the letter chunk higher than the financial tables on page 25, the answer is technically right but stripped of the gross margin, operating income, and EPS data that make it meaningful. It passes a keyword check. It fails a usefulness check.

Query 2: "How many employees does Microsoft have?"

I could not find this in the provided document. The provided context does not
contain any information about Microsoft's employee headcount.

The answer - 228,000 full-time employees as of June 30, 2025 - is in the Human Capital Resources section on page 16. But the section spans pages 15–17, and naive fixed-size chunking splits it mid-paragraph. Depending on where the chunk boundary falls, the retriever may rank pages 15 or 17 higher than page 16, returning context about compensation programs or operations centers rather than the headcount figure. The model refuses. The refusal looks correct. The underlying cause is a retrieval failure.

A less disciplined model - or the same model without an explicit "do not speculate" instruction in the system prompt - may hallucinate a figure by latching onto other numeric patterns in the retrieved chunks. The correct behavior is not guaranteed; it is a product of your system prompt design, and without a test asserting it, a prompt change can silently remove it.
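One way to pin that behavior down is a regression check on the refusal itself. The following is a minimal sketch: `REFUSAL_MARKER` mirrors the sentence the system prompt in rag.py asks for, while `is_refusal` and `unsupported_numbers` are hypothetical helpers introduced here for illustration. The strings at the bottom are stubs standing in for a real `rag_query` result.

```python
import re

# Must match the refusal sentence requested in the system prompt.
REFUSAL_MARKER = "I could not find this in the provided document."


def is_refusal(answer: str) -> bool:
    """True if the answer contains the agreed refusal sentence."""
    return REFUSAL_MARKER.lower() in answer.lower()


def unsupported_numbers(answer: str, context: str) -> list[str]:
    """Return numbers that appear in the answer but nowhere in the context."""
    numbers = re.findall(r"\d[\d,.]*\d|\d", answer)
    return [n for n in numbers if n not in context]


# Stub data: retrieved context that lacks the headcount figure.
context = "Compensation programs are reviewed annually across 3 regions."
good_answer = "I could not find this in the provided document."
bad_answer = "Microsoft has 150,000 employees."  # invented, hallucination-style

assert is_refusal(good_answer)
assert unsupported_numbers(good_answer, context) == []
assert not is_refusal(bad_answer)
assert unsupported_numbers(bad_answer, context) == ["150,000"]
```

The number check is deliberately crude (plain substring matching), but it is enough to fail a build when a prompt change starts producing figures that the retrieved context does not contain.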

Query 3: "What percentage of revenue came from cloud services?"

Based on the provided context, I can identify the following cloud-related revenue
items for fiscal year 2025:

| Product/Service                                      | Revenue (in millions) |
|------------------------------------------------------|-----------------------|
| Server products and cloud services                   | $98,435               |
| Microsoft 365 Commercial products and cloud services | $87,767               |
| Microsoft 365 Consumer products and cloud services   | $7,404                |

**However, I cannot provide a precise "cloud-only" percentage** for the following
reasons:

1. Several line items **bundle** cloud services with non-cloud products (e.g.,
   "Server products **and** cloud services"), making it impossible to isolate
   pure cloud revenue from the data provided.
2. Total revenue is stated as **$281,724 million** for FY2025.

If the bundled categories above are used as a rough proxy, those three lines sum
to approximately **$193,606 million**, representing roughly **~69%** of total
revenue — but this would **overstate** true cloud-only revenue since non-cloud
products are included in those figures.

The retriever surfaces chunks that bundle cloud and non-cloud revenue under the same line items. The model correctly flags the ambiguity, but still calculates a ~69% figure using the bundled totals as a proxy. The answer looks rigorous: it has a table, a caveat, and a percentage. It is also wrong in a way that is easy to miss. A reader who skims the headline number walks away with a figure that overstates cloud revenue by an unknown margin. Transparency about uncertainty is not the same as a correct answer.
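The arithmetic behind that headline figure is easy to reproduce, which is part of why it feels trustworthy. The check below uses only numbers quoted in this article. The 168.9 billion Microsoft Cloud figure comes from the golden dataset entry further down; treating it as the intended definition of "cloud services" is an assumption made here for comparison.

```python
# Line items quoted in the model's answer, in millions of USD.
bundled_lines = {
    "Server products and cloud services": 98_435,
    "Microsoft 365 Commercial products and cloud services": 87_767,
    "Microsoft 365 Consumer products and cloud services": 7_404,
}
total_revenue = 281_724

bundled_total = sum(bundled_lines.values())
bundled_share = bundled_total / total_revenue

print(f"Bundled total: ${bundled_total:,} million")  # $193,606 million
print(f"Bundled share: {bundled_share:.1%}")         # 68.7%

# Assumption: the reported Microsoft Cloud revenue is what the user meant.
microsoft_cloud = 168_900
cloud_share = microsoft_cloud / total_revenue
print(f"Microsoft Cloud share: {cloud_share:.1%}")   # 60.0%
```

The sum and the division are correct. The error is in the inputs, and under the assumption above the proxy overstates the share by roughly nine percentage points.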

Query 4: "Did Microsoft mention any regulatory risks related to AI?"

Based on the provided context, **no specific mention of regulatory risks related
to AI** is made in the excerpts provided. The pages included focus primarily on:

- Microsoft's mission and AI platform strategy (Page 22)
- The company's positioning in the AI platform shift (Page 2)
- The Microsoft Elevate initiative and AI skills investments (Page 4)

**I could not find this in the provided document.** To get a comprehensive view
of Microsoft's AI-related regulatory risks, you would likely need to refer to the
**Risk Factors section** of their Annual Report on Form 10-K, which is not
included in the provided context.

The model declines, and it is right to do so: the actual regulatory risk language lives in the 10-K filing, a separate document that is not in the pipeline. But the user doesn't know that. From their perspective, they asked a reasonable question and got a dead end. The real failure is architectural: the pipeline has no mechanism to tell the user why it can't answer - whether the information doesn't exist, exists in the document but wasn't retrieved, or exists in a document that was never ingested. All three scenarios return the same response. Without instrumentation, you cannot distinguish between them.
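A first step toward that instrumentation is to classify each refusal against a labeled expectation. The sketch below is an illustration under stated assumptions: `NoAnswerCause`, `diagnose_refusal`, and the idea of tagging each test question with an expected source document and page are hypothetical and not part of the pipeline above. The retrieved page lists in the usage lines are made up.

```python
from enum import Enum
from typing import Optional


class NoAnswerCause(Enum):
    NOT_IN_CORPUS = "no known source contains the answer"
    NOT_INGESTED = "source document was never ingested"
    NOT_RETRIEVED = "answer is indexed but was not retrieved"
    RETRIEVED_BUT_REFUSED = "answer page was retrieved but the model refused"


def diagnose_refusal(
    expected_source: Optional[str],
    expected_page: Optional[int],
    ingested_sources: set[str],
    retrieved_pages: list[int],
) -> NoAnswerCause:
    """Classify why a refusal happened, given labeled expectations."""
    if expected_source is None:
        return NoAnswerCause.NOT_IN_CORPUS
    if expected_source not in ingested_sources:
        return NoAnswerCause.NOT_INGESTED
    if expected_page not in retrieved_pages:
        return NoAnswerCause.NOT_RETRIEVED
    return NoAnswerCause.RETRIEVED_BUT_REFUSED


ingested = {"annual_report"}

# Query 4: the answer lives in a document that was never ingested.
print(diagnose_refusal("form_10k", None, ingested, [22, 2, 4]))

# Query 2: the answer is on page 16, but other pages were retrieved.
print(diagnose_refusal("annual_report", 16, ingested, [15, 17, 15]))
```

The same refusal text now maps to different causes, each of which points to a different fix: ingest more documents, repair retrieval, or revisit the prompt.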

Without a ground truth dataset and a structured test harness, you have no way to know which of these answers are right - or whether a change to your chunking strategy or system prompt just made things worse.


The Testing Layer

Good RAG evaluation has three stages. Each catches a different class of failure.

Stage 1: Chunk Quality

Before any query is run, validate that your chunking strategy produces coherent, complete units of information.

# tests/test_chunks.py
import pytest
from ingest import load_pdf, chunk_text

PDF_PATH = "2025_AnnualReport.pdf"

@pytest.fixture(scope="module")
def chunks():
    pages = load_pdf(PDF_PATH)
    return chunk_text(pages)


def test_no_empty_chunks(chunks):
    empty = [c for c in chunks if not c["text"].strip()]
    assert len(empty) == 0, f"{len(empty)} empty chunks found"


def test_chunk_size_bounds(chunks):
    oversized = [c for c in chunks if len(c["text"]) > 1000]
    assert len(oversized) == 0, f"{len(oversized)} chunks exceed 1000 chars"


def test_no_orphaned_numbers(chunks):
    """Flag chunks that are pure numeric tables with no prose context.
    These are extraction artifacts from PDF tables and answer nothing reliably."""
    import re
    suspicious = []
    for c in chunks:
        words = re.findall(r"[a-zA-Z]{3,}", c["text"])
        if len(words) < 5 and len(c["text"]) > 100:
            suspicious.append(c)
    assert len(suspicious) < 10, (
        f"{len(suspicious)} chunks look like structureless table extracts. "
        "Consider a table-aware PDF parser."
    )


def test_chunk_overlap_preserves_context(chunks):
    """Verify that consecutive chunks share some overlapping text."""
    # Sample the first 50 chunk pairs
    misses = 0
    for i in range(min(50, len(chunks) - 1)):
        a = chunks[i]["text"]
        b = chunks[i + 1]["text"]
        # Overlap window: last N chars of a should appear in start of b
        overlap_window = a[-100:]
        if overlap_window not in b:
            misses += 1
    # Allow some misses (page boundaries)
    assert misses < 15, f"Too many chunk pairs with no detectable overlap ({misses})"

These tests catch configuration mistakes before they touch the vector store.
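One more check that may be worth adding at this stage is ID uniqueness. Chunk IDs derived from a content hash can collide when the same text appears more than once, for instance a header repeated on every page. A minimal sketch, where `find_duplicate_ids` is a hypothetical helper rather than part of the companion repository:

```python
from collections import Counter


def find_duplicate_ids(chunks: list[dict]) -> dict[str, int]:
    """Return {chunk_id: count} for every ID that appears more than once."""
    counts = Counter(c["id"] for c in chunks)
    return {chunk_id: n for chunk_id, n in counts.items() if n > 1}


# Stub chunks standing in for the output of chunk_text().
sample = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]
print(find_duplicate_ids(sample))  # {'a1': 2}
```

Inside tests/test_chunks.py this would become a test that takes the `chunks` fixture and asserts that `find_duplicate_ids(chunks)` is empty.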

Stage 2: Retrieval Evaluation

This is the most important stage and the most commonly skipped. Build a golden dataset: a small set of (query, expected page, expected content) triples for which you know exactly where the answer lives.

# tests/golden_dataset.py
GOLDEN_QA = [
    {
        "query": "What was Microsoft's total revenue in fiscal year 2025?",
        "expected_page": 25,        # Summary Results of Operations table
        "expected_answer_contains": ["281", "billion"],
    },
    {
        "query": "How many employees does Microsoft have?",
        "expected_page": 16,        # Human Capital Resources section
        "expected_answer_contains": ["228,000", "employees"],
    },
    {
        "query": "What was the net income for fiscal year 2025?",
        "expected_page": 37,        # Income Statements
        "expected_answer_contains": ["101", "billion"],
    },
    {
        "query": "What was Microsoft Cloud revenue in fiscal year 2025?",
        "expected_page": 22,        # Overview highlights section
        "expected_answer_contains": ["168.9", "billion"],
    },
    {
        "query": "Did Microsoft pay dividends in fiscal year 2025?",
        "expected_page": 8,         # Dividends table
        "expected_answer_contains": ["3.32", "dividend"],
    },
]

And the test for the retrieval:
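A minimal sketch of what such a test might look like, assuming a page-level hit rate is the metric of interest (the version in the companion repository may differ). `hit_rate`, `retrieve_fn`, and `page_tolerance` are names assumed here. `retrieve_fn` is expected to behave like `retrieve` from rag.py with the collection already bound, and the stub retriever and its page lists are made up so the example runs without a vector store.

```python
from typing import Callable


def hit_rate(
    golden: list[dict],
    retrieve_fn: Callable[[str], list[dict]],
    page_tolerance: int = 0,
) -> tuple[float, list[str]]:
    """Fraction of golden queries whose expected page appears among the
    retrieved chunks, plus the list of queries that missed."""
    if not golden:
        raise ValueError("golden dataset is empty")
    misses = []
    for item in golden:
        pages = [c["page"] for c in retrieve_fn(item["query"])]
        hit = any(abs(p - item["expected_page"]) <= page_tolerance for p in pages)
        if not hit:
            misses.append(item["query"])
    return 1 - len(misses) / len(golden), misses


# Stub retriever standing in for `lambda q: retrieve(collection, q)`.
def fake_retrieve(query: str) -> list[dict]:
    canned = {"revenue": [25, 2, 22], "headcount": [15, 17, 15]}
    return [{"page": p} for p in canned[query]]


golden = [
    {"query": "revenue", "expected_page": 25},
    {"query": "headcount", "expected_page": 16},
]

rate, misses = hit_rate(golden, fake_retrieve)
print(rate, misses)  # 0.5 ['headcount']
```

In a pytest suite, the final step would be an assertion such as `assert rate >= 0.8, misses`, with the real `GOLDEN_QA` list and the real retriever. The threshold is a project decision, and 0.8 is only a placeholder.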

Nuno Bispo

Solutions Architect · Senior Python & AI Engineer · AI Audits · Helping teams fix what they shipped too fast
Netherlands