5 things I check before calling a RAG pipeline production-ready

Most RAG pipelines aren’t production-ready when teams think they are. I’ve reviewed a dozen of them across different projects, and the failure modes are surprisingly consistent. The problems are not in the LLM outputs themselves but in the evaluation and monitoring infrastructure around them.

Here are the five things I actually check before I’d sign off on calling a RAG pipeline production-ready.

1. Faithfulness Is Measured and Has a Threshold

Faithfulness — whether the generated answer is supported by the retrieved context — is the most important single metric for RAG quality, and the most commonly missing.

A pipeline that isn’t measuring faithfulness is flying blind. It might be hallucinating at a 20% rate and no one would know until a user reports it.

The check I run:

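from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# query and pipeline_response are strings and retrieved_chunks is a list of
# strings, all taken from one run of the pipeline under test.
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
test_case = LLMTestCase(
    input=query,
    actual_output=pipeline_response,
    retrieval_context=retrieved_chunks,  # the chunks actually passed to the LLM
)
# measure() calls an LLM judge, so model credentials (e.g. OPENAI_API_KEY) must be set.
faithfulness_metric.measure(test_case)
print(f"Score: {faithfulness_metric.score}")
print(f"Reason: {faithfulness_metric.reason}")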

The threshold depends on domain and risk tolerance, but 0.7 is a reasonable starting floor for most use cases. For anything medical, legal, or financial, I’d push that to 0.85+.

If the team can’t tell me what their faithfulness threshold is and show me it’s being enforced somewhere in the pipeline, the pipeline isn’t production-ready.
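One way to make "enforced somewhere in the pipeline" concrete is a small gate between generation and the user. The sketch below is an illustration under assumptions, not a reference implementation: `gate_response`, `THRESHOLDS`, the domain labels and the fallback message are all hypothetical names, and the scorer is passed in as a plain callable so that any faithfulness metric can sit behind it. The `toy_scorer` is a crude stand-in used only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical per-domain floors based on the rule of thumb above:
# 0.7 as a general starting point, 0.85 for high-risk domains.
THRESHOLDS = {"general": 0.7, "medical": 0.85, "legal": 0.85, "financial": 0.85}

FALLBACK = "I could not find a well-supported answer in the available documents."


@dataclass
class GatedResponse:
    text: str
    score: float
    passed: bool


def gate_response(
    answer: str,
    retrieved_chunks: Sequence[str],
    scorer: Callable[[str, Sequence[str]], float],
    domain: str = "general",
) -> GatedResponse:
    """Return the answer only if its faithfulness score clears the domain floor."""
    threshold = THRESHOLDS.get(domain, THRESHOLDS["general"])
    score = scorer(answer, retrieved_chunks)
    passed = score >= threshold
    return GatedResponse(text=answer if passed else FALLBACK, score=score, passed=passed)


def toy_scorer(answer: str, chunks: Sequence[str]) -> float:
    """Stand-in scorer: fraction of answer sentences found verbatim in a chunk.

    A real pipeline would call an LLM-based faithfulness metric here instead.
    """
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(any(s in c for c in chunks) for s in sentences)
    return supported / len(sentences)
```

Whether a failing answer is replaced, flagged, or only logged is a product decision; the point of the sketch is that the threshold lives in code that runs, not in a slide.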

2. A Golden Dataset Exists and Is Maintained

A golden dataset is a curated set of (question, expected answer) pairs that represents the real distribution of queries the system will face in production.

Without it, you can’t do regression testing. You can’t know if a prompt change made things better or worse. You can’t report to stakeholders on quality trajectory. You’re just guessing.

The minimum bar I set:

50 examples to start (more is better, but 50 covers the main patterns)
Domain expert reviewed — at least one person who knows the source material validated the expected answers
Updated when production queries reveal gaps — a living document, not a one-time artifact
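The bar above can be checked mechanically. The following is a minimal sketch, assuming each golden example is stored with a reviewer name and a tag saying where it came from; `GoldenExample`, `validate_golden_dataset`, and the field names `reviewed_by` and `source` are hypothetical and not part of any library.

```python
from dataclasses import dataclass
from typing import List

MIN_EXAMPLES = 50  # the starting floor named above


@dataclass
class GoldenExample:
    question: str
    expected_answer: str
    reviewed_by: str = ""    # domain expert who validated the expected answer
    source: str = "curated"  # hypothetical tag: "curated" or "production_gap"


def validate_golden_dataset(examples: List[GoldenExample]) -> List[str]:
    """Return human-readable problems; an empty list means the minimum bar is met."""
    problems = []
    if len(examples) < MIN_EXAMPLES:
        problems.append(f"only {len(examples)} examples, need at least {MIN_EXAMPLES}")
    unreviewed = [e for e in examples if not e.reviewed_by.strip()]
    if unreviewed:
        problems.append(f"{len(unreviewed)} examples lack a domain-expert review")
    if not any(e.source == "production_gap" for e in examples):
        problems.append("no examples added from production gaps yet")
    return problems
```

Running a check like this in CI turns "maintained" from an intention into something that fails loudly when the dataset goes stale.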

Most teams I’ve seen have either no golden dataset, or one that was created in a rush during a demo and never touched since. Neither is acceptable for production.

Building the dataset is unglamorous work. It’s also the thing that makes everything else work.

3. Retrieval Quality Is Measured Separately from Generation Quality

This one is subtle but important: retrieval failures and generation failures look identical from the output side, but they require different fixes.

If the retrieval step returns irrelevant chunks, the generation step can’t be blamed for a bad answer — it had nothing to work with. If the retrieval is good but the generation is unfaithful, that’s a different problem entirely.

The way to separate these is by measuring retrieval metrics independently:

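from ragas import evaluate
from ragas.metrics import context_precision, context_recall

# eval_dataset holds, per example, the question, the retrieved contexts and a
# ground-truth reference answer (exact column names depend on the ragas version).
# Building those references takes effort, but is worth it.
results = evaluate(
    dataset=eval_dataset,
    metrics=[context_precision, context_recall],
)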

The practical implication: when a user reports a bad answer, I want to be able to tell within 10 minutes whether the retrieval failed or the generation failed. Pipelines without separate retrieval metrics make this take hours.
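Once retrieval and generation scores are both available for a reported query, the first triage step can be a simple rule. This is a rough sketch, not a diagnostic standard: the function name `triage`, the 0.7 floors, and the wording of the verdicts are assumptions chosen for illustration.

```python
def triage(
    context_recall: float,
    context_precision: float,
    faithfulness: float,
    retrieval_floor: float = 0.7,     # hypothetical floor, tune per system
    faithfulness_floor: float = 0.7,  # hypothetical floor, tune per system
) -> str:
    """Classify a bad answer by the pipeline stage that most likely caused it."""
    # Retrieval is checked first: generation cannot recover missing context.
    if context_recall < retrieval_floor:
        return "retrieval: needed information was not retrieved"
    if context_precision < retrieval_floor:
        return "retrieval: relevant chunks are buried in noise"
    if faithfulness < faithfulness_floor:
        return "generation: answer is not supported by good context"
    return "unclear: check the query, the golden answer, or answer relevancy"
```

For example, `triage(0.3, 0.9, 0.9)` points at retrieval, while `triage(0.9, 0.9, 0.4)` points at generation.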

4. There Is a Regression Test That Runs on Every Significant Change

“Significant change” means: any modification to prompt templates, retrieval parameters, embedding models, chunking strategy, or the base LLM.

The regression test doesn’t need to be exhaustive — it needs to be fast enough to run in CI and comprehensive enough to catch the most common failure modes.

A practical setup:

# .github/workflows/eval.yml
- name: Run eval suite
  run: deepeval test run tests/eval/
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

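# tests/eval/test_rag.py
import pytest
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# load_golden_dataset() and run_pipeline() are the project's own helpers;
# import them from wherever they live.

@pytest.mark.parametrize("case", load_golden_dataset())
def test_rag_faithfulness(case):
    response = run_pipeline(case["question"])
    test_case = LLMTestCase(
        input=case["question"],
        actual_output=response,
        # Note: this scores faithfulness against the context stored in the golden
        # dataset. If run_pipeline can expose the chunks it actually retrieved,
        # pass those instead so retrieval changes are reflected in the score.
        retrieval_context=case["retrieved_context"],
        expected_output=case["expected_answer"],
    )
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.7),
        AnswerRelevancyMetric(threshold=0.75),
    ])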

The number I watch: what percentage of the golden dataset is covered by this test? If it’s less than 80%, the test suite isn’t catching enough of the real distribution.
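Coverage only drops below 100% when CI runs a subset of the golden dataset, for example to keep the job fast. In that case the number can be computed directly. The sketch below assumes every golden example carries a stable ID and that the test run records which IDs it exercised; `golden_coverage`, `meets_coverage_bar`, and `COVERAGE_BAR` are hypothetical names.

```python
from typing import Iterable

COVERAGE_BAR = 0.8  # the 80% figure from the text


def golden_coverage(golden_ids: Iterable[str], tested_ids: Iterable[str]) -> float:
    """Fraction of golden examples that the regression run actually exercised."""
    golden = set(golden_ids)
    if not golden:
        return 0.0
    # IDs that were tested but are not in the golden set do not count.
    return len(golden & set(tested_ids)) / len(golden)


def meets_coverage_bar(
    golden_ids: Iterable[str],
    tested_ids: Iterable[str],
    bar: float = COVERAGE_BAR,
) -> bool:
    """True when the regression run covers at least `bar` of the golden dataset."""
    return golden_coverage(golden_ids, tested_ids) >= bar
```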

5. There Is a Plan for Monitoring in Production

You can have perfect evals in CI and still get surprised in production. Real user queries are different from your golden dataset. The distribution shifts. New failure modes appear that you didn’t anticipate.

The minimum production monitoring setup I’d consider acceptable:

Sample-based eval: Run faithfulness and relevancy metrics on a random 5–10% sample of production queries. Not all of them (too slow and expensive), but enough to detect distribution-level problems.

Explicit feedback collection: A simple thumbs up/down or “was this helpful?” button on responses. Not because users will always give useful feedback, but because negative feedback spikes are the fastest way to catch a serious regression.

Drift detection on retrieval: Track context precision and recall over time. If your retrieval quality starts degrading (because the document corpus grew, or embeddings drifted, or query patterns shifted), you want to know before it affects answer quality.
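For the sample-based eval, one simple approach is to pick requests deterministically by hashing a request ID, so the same request is always either in or out of the sample regardless of which worker handles it. This is a minimal sketch under that assumption; `should_evaluate` and the notion of a string `request_id` are hypothetical, and the 5% default is just the low end of the range above.

```python
import hashlib


def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly `sample_rate` of requests by hashing their ID."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Requests that pass the check can be queued for offline scoring with the same faithfulness and relevancy metrics used in CI, which keeps the eval cost off the request path.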

I usually recommend LangSmith for this layer — the tracing is easy to add and the dashboard makes production monitoring approachable without building custom infrastructure.
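Independent of the tooling, the drift check itself is small. The sketch below compares a rolling mean of a retrieval score (context precision, say) against a fixed baseline and flags a sustained drop. `DriftMonitor`, the window size, and the allowed drop are hypothetical choices for illustration, not values from any library.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flag drift when the rolling mean falls too far below a fixed baseline."""

    def __init__(self, baseline: float, window: int = 200, max_drop: float = 0.1):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)  # keeps only the most recent scores

    def add(self, score: float) -> bool:
        """Record a score; return True once a full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return self.baseline - mean(self.scores) > self.max_drop
```

A reasonable baseline is the score measured on the golden dataset at release time; the monitor then answers whether production retrieval still looks like what was signed off.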

The Common Thread

All five of these checks are about one thing: knowing whether the system is working.

That sounds obvious. In practice, most RAG pipelines ship without any systematic way to answer that question. They work in demos, they look fine in limited testing, and then they fail in production in ways that take weeks to diagnose.

The eval infrastructure isn’t the interesting engineering problem — it’s never as exciting as the retrieval architecture or the prompt engineering. But it’s the thing that makes the interesting work trustworthy enough to actually ship.

If you can’t confidently answer “is this getting better or worse?”, it’s not production-ready yet.

I write about LLM evaluation and testing infrastructure from the trenches at Mercedes-Benz R&D India.