RAGAS vs TruLens vs DeepEval — which eval framework should you actually use?
Evaluating RAG pipelines is one of those things where everyone agrees it matters, but very few teams do it rigorously. I’ve run evals on production systems with all three of the major frameworks. Here’s what I actually think.
The Short Version
- RAGAS if you want fast, metric-based pipeline evals with minimal setup
- TruLens if you want deep tracing + LLM-as-judge feedback integrated into your dev workflow
- DeepEval if you want a pytest-style testing experience with the most comprehensive metric library
RAGAS: The Benchmark Workhorse
RAGAS (Retrieval Augmented Generation Assessment) gives you five core metrics out of the box:
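    # Assumes `eval_dataset` is a Hugging Face `datasets.Dataset`
    # with the columns described below.
    from ragas import evaluate
    from ragas.metrics import (
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness,
    )

    # Score the dataset on three of the five metrics
    result = evaluate(
        dataset=eval_dataset,
        metrics=[faithfulness, answer_relevancy, context_precision],
    )
    print(result)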
What’s good: Zero config. You feed it a dataset with question, answer, contexts, and optional ground_truth, and you get scores back. The faithfulness metric is genuinely useful — it catches hallucinations by verifying whether every claim in the answer is supported by the retrieved context.
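To make the faithfulness idea concrete, here is a minimal sketch of the ratio behind it: supported claims divided by total claims. This is not RAGAS code. RAGAS uses an LLM both to extract the claims and to verify them, while `faithfulness_score` and `toy_judge` below are hypothetical names, and the judge is a naive substring match standing in for the LLM.

```python
from typing import Callable, List


def faithfulness_score(
    claims: List[str],
    contexts: List[str],
    is_supported: Callable[[str, List[str]], bool],
) -> float:
    """Fraction of answer claims that the judge finds supported by the contexts."""
    if not claims:
        return 0.0  # assumption: an answer with no extractable claims scores 0
    supported = sum(1 for claim in claims if is_supported(claim, contexts))
    return supported / len(claims)


def toy_judge(claim: str, contexts: List[str]) -> bool:
    """Stand-in for the LLM judge: naive case-insensitive substring match."""
    return any(claim.lower() in ctx.lower() for ctx in contexts)


contexts = ["France is a country in Western Europe. Its capital city is Paris."]
claims = ["its capital city is Paris", "Paris has 10 million residents"]
score = faithfulness_score(claims, contexts, toy_judge)
print(score)  # 0.5: one of the two claims is supported
```

The second claim is the hallucination: nothing in the context mentions population, so it drags the score down.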
What’s not: RAGAS metrics are LLM-dependent. You’re essentially asking an LLM to evaluate your LLM, which introduces variance. Run the same eval twice and you’ll get slightly different scores. Also, context precision and recall need a ground-truth reference answer for every question, which most teams don’t have.
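One way to live with that variance is to run the eval several times and look at the spread, not a single number. The sketch below assumes you can wrap your eval in a zero-argument callable that returns a dict of metric scores. `summarize_runs` and `fake_eval` are hypothetical names, and the fake scores are made up purely to show the arithmetic.

```python
import statistics
from typing import Callable, Dict


def summarize_runs(
    run_eval: Callable[[], Dict[str, float]], n_runs: int = 5
) -> Dict[str, Dict[str, float]]:
    """Call run_eval n_runs times and report mean and sample stdev per metric."""
    if n_runs < 1:
        raise ValueError("n_runs must be at least 1")
    runs = [run_eval() for _ in range(n_runs)]
    summary = {}
    for metric in runs[0]:
        values = [run[metric] for run in runs]
        summary[metric] = {
            "mean": statistics.mean(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Hypothetical stand-in for a real eval call; it cycles through fixed scores
fake_scores = iter([0.80, 0.84, 0.82, 0.86, 0.78])


def fake_eval() -> Dict[str, float]:
    return {"faithfulness": next(fake_scores)}


summary = summarize_runs(fake_eval, n_runs=5)
print(summary)
```

If the standard deviation is larger than the difference you are trying to detect between two pipeline versions, a single run cannot tell them apart.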
Verdict: Best for a quick pipeline health check. Not a full testing strategy.
TruLens: The Observability Play
TruLens positions itself more as an observability tool than a pure eval framework. It instruments your LLM calls and lets you log traces alongside feedback functions.
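    # Written against the trulens_eval 0.2x API; `chain` is your existing LangChain RAG chain.
    from trulens_eval import Feedback, Tru, TruChain
    from trulens_eval.feedback import Groundedness
    from trulens_eval.feedback.provider import OpenAI

    tru = Tru()
    grounded = Groundedness(groundedness_provider=OpenAI())  # LLM judge
    # Wrap the measure in a Feedback and point it at the retrieved context and the output
    f_groundedness = (
        Feedback(grounded.groundedness_measure_with_cot_reasons)
        .on(TruChain.select_context(chain).collect())
        .on_output()
        .aggregate(grounded.grounded_statements_aggregator)
    )
    tru_recorder = TruChain(
        chain,
        app_id="my_rag_v1",
        feedbacks=[f_groundedness],
    )
    # Every call made inside this block is traced and scored
    with tru_recorder as recording:
        response = chain.invoke(query)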
What’s good: The dashboard is excellent. You can see every retrieval step, every LLM call, and the corresponding feedback scores in a timeline view. For debugging why your pipeline is failing, nothing beats it. The Chain-of-Thought reasons in feedback functions are particularly useful — you don’t just get a score, you get an explanation.
What’s not: Setup is heavier. TruLens wraps your existing chain, which means you need to structure your code in a way it can instrument. If you’re using a custom pipeline that doesn’t fit neatly into LangChain/LlamaIndex conventions, you’ll spend time on integration before you see any value.
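If it helps to picture what instrumenting a custom pipeline means, here is a minimal sketch of the idea in plain Python. This is not the TruLens API. `traced`, `TRACE`, and the placeholder `retrieve` and `generate` steps are all hypothetical, and a real tool would persist the records and attach feedback scores to them.

```python
import functools
import time
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # in-memory trace log (hypothetical; a real tool persists this)


def traced(step_name: str) -> Callable:
    """Decorator that records inputs, output, and latency for one pipeline step."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({
                "step": step_name,
                "args": args,
                "kwargs": kwargs,
                "output": result,
                "seconds": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> List[str]:
    # Placeholder retriever; swap in your vector store lookup
    return ["France is a country in Western Europe. Its capital city is Paris."]


@traced("generate")
def generate(query: str, contexts: List[str]) -> str:
    # Placeholder generator; swap in your LLM call
    return "The capital of France is Paris."


def answer(query: str) -> str:
    return generate(query, retrieve(query))


response = answer("What is the capital of France?")
print([record["step"] for record in TRACE])  # ['retrieve', 'generate']
```

The integration cost comes from exactly this: every step you want to see on the dashboard has to be a function the tool can wrap.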
Verdict: Excellent for iterative development and debugging. Less practical for CI/CD eval gates.
DeepEval: The Test-Driven Approach
DeepEval is the most developer-friendly of the three. It looks like pytest, feels like pytest, and integrates with pytest.
    import pytest
    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import (
        FaithfulnessMetric,
        AnswerRelevancyMetric,
        HallucinationMetric,
    )


    def test_rag_response():
        # One test case: the question, the pipeline's answer, and the retrieved context
        test_case = LLMTestCase(
            input="What is the capital of France?",
            actual_output="The capital of France is Paris.",
            retrieval_context=["France is a country in Western Europe. Its capital city is Paris."],
        )
        # Each metric fails the test if its score falls below the threshold
        faithfulness = FaithfulnessMetric(threshold=0.7)
        relevancy = AnswerRelevancyMetric(threshold=0.8)
        assert_test(test_case, [faithfulness, relevancy])
Run with deepeval test run test_rag.py and you get a structured test report.
What’s good: The metric library is the most comprehensive of the three. Beyond standard RAG metrics, DeepEval has metrics for conversational AI, summarization, toxicity, bias, and more. The threshold parameter makes pass/fail gates explicit. It also has Confident AI cloud integration for team dashboards.
What’s not: Some of the advanced metrics are slow — running a full eval suite on 100 test cases can take 10+ minutes depending on the LLM you’re using as judge. Cost can add up if you’re running evals frequently in CI.
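One way to cut both the time and the bill is to cache judge results, so a test case whose inputs have not changed is not scored again. The sketch below is an assumption about how you might do that yourself, not a DeepEval feature. `CachedJudge`, `case_key`, and `fake_judge` are hypothetical names, and the cache lives in memory only.

```python
import hashlib
import json
from typing import Callable, Dict, List


def case_key(metric: str, question: str, answer: str, contexts: List[str]) -> str:
    """Stable hash of everything the judge sees, so identical cases hit the cache."""
    payload = json.dumps([metric, question, answer, contexts], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedJudge:
    """Wraps a judge callable and skips it when the same case was scored before."""

    def __init__(self, judge: Callable[[str, str, List[str]], float], metric: str):
        self.judge = judge
        self.metric = metric
        self.cache: Dict[str, float] = {}  # in-memory; persist to disk for CI reuse
        self.calls = 0  # number of real judge invocations

    def score(self, question: str, answer: str, contexts: List[str]) -> float:
        key = case_key(self.metric, question, answer, contexts)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.judge(question, answer, contexts)
        return self.cache[key]


def fake_judge(question: str, answer: str, contexts: List[str]) -> float:
    # Hypothetical stand-in for an LLM-as-judge call
    return 1.0 if "Paris" in answer else 0.0


judge = CachedJudge(fake_judge, metric="faithfulness")
case = (
    "What is the capital of France?",
    "The capital of France is Paris.",
    ["Its capital city is Paris."],
)
first = judge.score(*case)
second = judge.score(*case)
print(first, second, judge.calls)  # 1.0 1.0 1
```

If you change the judge model or the judging prompt, add that to the key as well, or you will serve stale scores.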
Verdict: The most production-ready option for teams that want eval as part of their CI pipeline.
What I Actually Use
For production RAG at MBRDI, our setup is:
- DeepEval in CI — a small golden test set (~50 cases) runs on every PR. Anything below threshold blocks merge.
- RAGAS for dataset-level reporting — weekly report on pipeline metrics across our full eval set.
- TruLens during active development — when I’m debugging a specific retrieval or generation issue, I spin up the dashboard.
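The merge gate in the first bullet boils down to a threshold check over per-case scores. Here is a rough sketch of that logic on its own, independent of any framework. The result format, `find_failures`, and the case IDs are hypothetical; the two thresholds reuse the values from the DeepEval example above.

```python
from typing import Dict, List

# Hypothetical thresholds; tune them to your own golden set
THRESHOLDS: Dict[str, float] = {"faithfulness": 0.7, "answer_relevancy": 0.8}


def find_failures(results: List[Dict], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per (case, metric) pair that scores below its threshold."""
    failures = []
    for case in results:
        for metric, minimum in thresholds.items():
            score = case["scores"].get(metric)
            # A missing score counts as a failure so gaps never pass silently
            if score is None or score < minimum:
                failures.append(f"{case['id']}: {metric}={score} (min {minimum})")
    return failures


results = [
    {"id": "case-001", "scores": {"faithfulness": 0.92, "answer_relevancy": 0.88}},
    {"id": "case-002", "scores": {"faithfulness": 0.55, "answer_relevancy": 0.91}},
]
failures = find_failures(results, THRESHOLDS)
for message in failures:
    print(message)
exit_code = 1 if failures else 0  # in CI, pass this to sys.exit() to block the merge
```

Treating a missing score as a failure is a deliberate choice here: a metric that silently stops running should block the PR just like a low score does.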
The three frameworks are more complementary than competitive. The mistake teams make is picking one and expecting it to do everything.
One Thing None of Them Solve
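None of these frameworks can build a good eval dataset for you. RAGAS and DeepEval can generate synthetic test cases, but deciding which questions matter and what a correct answer looks like is still a human problem. Without high-quality test cases with known-good answers, you’re just measuring noise. The eval framework is only as good as the data you feed it.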
If you’re starting from scratch, spend the first week on dataset curation. Not on picking a framework.
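A small amount of tooling helps during that week. The sketch below shows one possible shape for a golden test case plus a sanity check that catches empty fields and duplicate questions before they reach an eval run. `GoldenCase`, its field names, and the document IDs are all hypothetical, so adapt them to whatever your pipeline actually stores.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoldenCase:
    question: str
    expected_answer: str
    source_doc_ids: List[str]  # documents a reviewer confirmed contain the answer


def validate(cases: List[GoldenCase]) -> List[str]:
    """Return human-readable problems: empty fields and duplicate questions."""
    problems = []
    seen = set()
    for index, case in enumerate(cases):
        # Normalize case and whitespace so near-identical questions count as duplicates
        normalized = " ".join(case.question.lower().split())
        if not normalized:
            problems.append(f"case {index}: empty question")
        elif normalized in seen:
            problems.append(f"case {index}: duplicate question")
        seen.add(normalized)
        if not case.expected_answer.strip():
            problems.append(f"case {index}: empty expected answer")
        if not case.source_doc_ids:
            problems.append(f"case {index}: no source documents")
    return problems


cases = [
    GoldenCase("What is the capital of France?", "Paris", ["doc-geo-12"]),
    GoldenCase("what is the capital of  France?", "Paris", ["doc-geo-12"]),
    GoldenCase("Who wrote the report?", "", []),
]
problems = validate(cases)
for problem in problems:
    print(problem)
```

A check like this only catches structural problems. Whether the expected answers are actually right still needs a person who knows the domain.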
Questions? I’m on X at @amitk_builds or you can reach me via the newsletter.