Module 4: Evaluation
Having established a strong tracing foundation, it’s time to examine evaluation—the process of measuring how well your LLM (or agent) performs and how to evolve it over time.
Evaluating LLMs can feel like trying to untangle a giant ball of yarn—there’s a lot going on, and it’s often not obvious which thread to pull first. From wrangling unpredictable user inputs to picking the right metrics, the process can be overwhelming.
But don’t panic! In this section, we’ll walk through some tried-and-true best practices, common pitfalls, and handy tips to help you benchmark your LLM’s performance. Whether you’re just starting out or looking for a quick refresher, these guidelines will keep your evaluation strategy on solid ground.
Evaluation Challenges
When it comes to LLMs, “evaluation” is more than just a single metric or one-time test. Their outputs can be astonishingly diverse—sometimes correct, sometimes creative, and sometimes surprisingly off-base.
One major hurdle is defining clear evaluation goals. Traditional software metrics (like error rates) may not translate well when your model might encounter any question under the sun. You’ll want to pin down what “good” looks like—whether it’s accuracy, helpfulness, or creativity—before you even begin.
Because LLMs generate text rather than just classifying it, subjective interpretation creeps into the equation. Deciding how to measure factors like “clarity” or “coherence” can be tricky without well-defined rubrics or specialized metrics.
And then there’s the operational side of evaluation:
- Cost and Latency: Testing at scale (especially with human annotators) can run up costs quickly. Automated approaches may be faster, but they’re not always reliable enough on their own.
- Trust in Automated Tools: Automated evaluators (including ones powered by smaller models) can drift or fail in unexpected ways. Keeping them aligned with real human perspectives takes constant upkeep.
- Cross-Team Collaboration: Getting engineers, data scientists, product managers, and domain experts to work in sync is crucial. Without a clear process or shared vocabulary, you risk chaotic handoffs and scattered efforts.
Example: Simplified Customer Support RAG Chat
When you’re dealing with a workflow that spans multiple steps—like the above RAG pipeline—each stage needs its own evaluation criteria. Otherwise, you’ll struggle to pinpoint exactly where things go wrong (or right).
Data Model for Evaluation – Traces
To make sense of all these moving parts, it helps to have an organized way to record exactly what’s happening at each step. That’s where traces come in. Traces capture detailed logs of user interactions, intermediate steps, and final outputs, giving you a treasure trove of data for diagnosing issues and measuring performance over time.
Example of a trace in Langfuse
Evaluation Workflow
The continuous evaluation loop begins offline with building or curating test datasets that range from “happy path” scenarios to adversarial inputs, ensuring good coverage of potential user interactions.
Next, teams run experiments by iterating on models, prompts, and any relevant tools or logic, followed by an evaluation step. This can include both manual annotation for nuanced judgment and automated methods for speed and scale.
Once these experiments meet the necessary quality and safety criteria, the changes are deployed alongside guardrails that filter inputs or outputs in real time.
Once the application is in production, we continuously capture data about actual performance and user behavior, while monitoring tools support debugging and manual review to surface any unexpected issues.
Whenever new problems or edge cases are identified, they are funneled back into the test dataset so that future experiments can address them proactively, creating a self-sustaining feedback loop that improves the system with each iteration.
Now let’s take a closer look at offline and online evaluations.
Offline Evaluation:
Offline evaluation means that you are testing your application during development state. You’ll typically run your model on curated datasets—maybe as part of your CI pipeline or local dev tests. Smaller datasets are great for quick, “gut check” experiments; larger ones provide a broader sweep of performance indicators. The main challenge is making sure these test sets stay relevant and actually resemble what you’ll see in the wild.
Below is an example of a dataset in Langfuse with inputs and expected outputs:
Collaboratively manage datasets via UI, API, or SDKs.
For instance, if you built a math word-problem solving application, you might have a test dataset of 100 math problems with known answers which you all feed into your application. Once your application answered all math questions, you compare the results with the known answers and calculate the accuracy of the agent. You then run this test with different settings (e.g. add calculator tool, chain of thought prompting, larger LLM) and compare the results.
The key challenge with offline eval is ensuring your test dataset is comprehensive and stays relevant – your application might perform well on a fixed test set but encounter very different queries in production. Therefore, you should keep test sets updated with new edge cases and examples that reflect real-world scenarios. A mix of small “smoke test” cases and larger evaluation sets is useful: small sets for quick checks and larger ones for broader performance metrics.
For an end-to-end example on running Dataset experiments in Langfuse, check out this cookbook.
Online Evaluation:
Besides evaluating your application during development, you can also evaluate your app while it runs in production.
For example, you might track success rates, user satisfaction scores, or other metrics on live traffic. The advantage of online evaluation is that it captures things you might not anticipate in a lab setting – you can observe model drift over time (if the agent’s effectiveness degrades as input patterns shift) and catch unexpected queries or situations that weren’t in your test data. It provides a true picture of how the agent behaves in the wild.
Online evaluation often involves collecting implicit and explicit user feedback, as discussed, and possibly running shadow tests or A/B tests (where a new version of the agent runs in parallel to compare against the old). The challenge is that it can be tricky to get reliable labels or scores for live interactions – you might rely on user feedback or downstream metrics (like did the user click the result).
The image below shows an example of a live LLM-as-a-Judge Evaluator in Langfuse that scores new traces for toxicity:
Find out how to set up a live LLM-as-a-Judge Evaluator in Langfuse here.
Common Evaluation Metrics
For both online and offline evaluations, it is important to know which datapoints you would like to collect and use as evaluation metrics. No single method will capture everything about your model’s behavior, so it often pays to mix and match.
Below are some of the most common metrics to track:
Direct feedback—like user ratings or open-ended comments—offers the clearest signal of whether your LLM is hitting the mark. However, collecting and organizing these insights at scale can be expensive and time-consuming.
Example of user feedback captured in ChatGPT:
Traces are the underlying thread across all these methods—by systematically logging interactions, you create a structured record that each evaluation technique can draw from.
Automated Evaluation Techniques
For certain applications—like extraction and classification tasks—precision, recall, and F-score offer clear, quantifiable measures. But not all tasks are that straightforward, especially when an LLM is expected to generate paragraphs of text or whole chat conversations.
You can enlist another machine learning model—or even a specialized LLM-based evaluator (sometimes called “model based evals”)—to score outputs. These can be flexible, but there’s always the risk of propagating the same biases or blind spots. Calibrating them against human-annotated samples helps. Find out more on using LLM-as-a-Judge evals here.
Ultimately, common toolkits like built in LLM-as-a-Judge evals or external libraries like OpenAI Evals and RAGAS help streamline the setup for automated checks. Still, every application has its own quirks. Tailored evaluators or custom heuristics often deliver the best insights if you invest the time to build them correctly.
Application-Specific Challenges
What makes LLM evaluation both fascinating and challenging is how different each use case can be:
Because you’re evaluating both the retrieval step and the generative step, it’s helpful to measure them separately. For example, you might track relevance and precision for document retrieval, then apply generative metrics (like RAGAS) to the summarized output. (Guide on using RAGAS here)
LLM Red Teaming / Security
Another aspect of evaluation is Red Teaming. In the previous section we focused on optimizing your application. This part shows how to secure your application against attacks and edge cases.
Protecting against security risks and attacks is becoming increasingly important for ensuring LLM apps are production ready. Not only do LLM applications need to be secure to protect users’ private and sensitive information, they also need ensure a level of quality and safety of responses to maintain product standards.
The OWASP Top 10 list is a useful resource on the topic. It provides a consensus of the most critical security risks for LLM applications.
In the video below, we walk through an example of how to use the open-source security library LLM Guard, and how to integrate Langfuse to monitor and protect against common security risks.
LLM Security can be addressed with a combination of:
- LLM Security libraries for run-time security measures
- Langfuse for the ex-post evaluation of the effectiveness of these measures
Run-time security measures with LLM security libraries
There are several popular security libraries that can be used to mitigate security risks in LLM-based applications. These include: LLM Guard, Prompt Armor, NeMo Guardrails, Microsoft Azure AI Content Safety, Lakera. These libraries help with security measures in the following ways:
- Catching and blocking a potentially harmful or inappropriate prompt before sending to the model
- Redacting sensitive PII before being sending into the model and then un-redacting in the response
- Evaluating prompts and completions on toxicity, relevance, or sensitive material at run-time and blocking the response if necessary
Asynchronous monitoring and evaluation of security measures
Use Langfuse tracing to gain visibility and confidence in each step of the security mechanism. These are common workflows:
- Investigate security issues by manually inspect traces.
- Monitor security scores over time in the Langfuse Dashboard.
- Evaluate effectiveness of security measures. Integrating Langfuse scores into your team’s workflow can help teams identify which security risks are most prevalent and build more robust tools around those specific issues. There are two main workflows to consider:
- Annotations (in UI). If you establish a baseline by annotating a share of production traces, you can compare the security scores returned by the security tools with these annotations.
- Automated evaluations. Langfuse’s model-based evaluations will run asynchronously and can scan traces for things such as toxicity or sensitivity to flag potential risks and identify any gaps in your LLM security setup.
- Track latency to balance tradeoffs. Some LLM security checks need to be awaited before the model can be called, others block the response to the user. Thus they quickly are an essential driver of overall latency of an LLM application. Langfuse can help dissect the latencies of these checks within a trace to understand whether the checks are worth the wait.
Example Workflow: Anonymizing Personally Identifiable Information (PII)
We redact and un-redact sensitive information using a security library before and after it is fed into the model. We wrap the whole process with the Langfuse observe decorator to trace and monitor the security process. In the following example below we use the open source library LLM Guard, an open-source security tool. All examples easily translate to other libraries.
Exposing Personally Identifiable Information (PII) to models can pose security and privacy risks, such as violating contractual obligations or regulatory compliance requirements, or mitigating the risks of data leakage or a data breach.
Run the end-to-end cookbook or check out our documentation.
Further reading:
- Evaluating the Effectiveness of LLM-Evaluators (aka LLM-as-Judge), blog post, by Eugene Yan
- AI Agent Observability & Evaluation, course, by Hugging Face
- Your AI Product Needs Evals, blog post, by Hamel Husain
- Creating a LLM-as-a-Judge That Drives Business Results, blog post, by Hamel Husain
- Evaluating Voice AI Agents, blog post and video, by Marc Klingen and Brooke Hopkins
Get Started with Evaluation
Evaluations are the most important part of the LLM Application development workflow. Langfuse adapts to your needs and supports:
- LLM-as-a-judge: Fully managed evaluators run on production or development traces within Langfuse
- User feedback: Collect feedback from your users and add it to traces in Langfuse
- Manual labeling: Annotate traces with human feedback in managed workflows
- Custom: Build your own evaluation pipelines via Langfuse APIs/SDKs for full flexibility
Plot evaluation results in the Langfuse Dashboard.
Evaluating LLMs is never a one-and-done task. As your model and user base evolve, your evaluation strategies need to keep pace. Next, we’ll look at how to build and manage your prompts.