Commit 96cff7a

docs: fix nav bar (#1416)
1 parent 12e9a3e commit 96cff7a

29 files changed (+194, -173 lines)
File renamed without changes.
@@ -0,0 +1,46 @@
# List of available metrics

Ragas provides a set of evaluation metrics that can be used to measure the performance of your LLM application. These metrics are designed to help you objectively measure the performance of your application. Metrics are available for different applications and tasks, such as RAG and agentic workflows.

Each metric is essentially a paradigm designed to evaluate a particular aspect of your application. LLM-based metrics may use one or more LLM calls to arrive at a score or result. You can also modify existing metrics or write your own with ragas.
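For instance, a general-purpose metric such as Aspect Critic (listed below) can be configured with your own evaluation criterion. A minimal sketch, assuming the `AspectCritic` constructor accepts a `name`, a `definition`, and an evaluator LLM:

```python
from ragas.metrics import AspectCritic

# Hypothetical custom criterion; `evaluator_llm` is an LLM you have already
# wrapped for use with ragas.
harmfulness = AspectCritic(
    name="harmfulness",
    definition="Does the response contain content that could cause harm to the user?",
    llm=evaluator_llm,
)
```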
## Retrieval Augmented Generation

- [Context Precision](context_precision.md)
- [Context Recall](context_recall.md)
- [Context Entities Recall](context_entities_recall.md)
- [Noise Sensitivity](noise_sensitivity.md)
- [Response Relevancy](answer_relevance.md)
- [Faithfulness](faithfulness.md)

## Agents or Tool Use Cases

- [Topic Adherence](topic_adherence.md)
- [Tool Call Accuracy](agents.md#tool-call-accuracy)
- [Agent Goal Accuracy](agents.md#agent-goal-accuracy)

## Natural Language Comparison

- [Factual Correctness](factual_correctness.md)
- [Semantic Similarity](semantic_similarity.md)
- [Non-LLM String Similarity](traditional.md#non-llm-string-similarity)
- [BLEU Score](traditional.md#bleu-score)
- [ROUGE Score](traditional.md#rouge-score)
- [String Presence](traditional.md#string-presence)
- [Exact Match](traditional.md#exact-match)

## SQL

- [Execution-based Datacompy Score](sql.md#execution-based-metrics)
- [SQL Query Equivalence](sql.md#sql-query-semantic-equivalence)

## General Purpose

- [Aspect Critic](general_purpose.md#aspect-critic)
- [Simple Criteria Scoring](general_purpose.md#simple-criteria-scoring)
- [Rubrics-based Scoring](general_purpose.md#rubrics-based-scoring)
- [Instance-specific Rubrics Scoring](general_purpose.md#instance-specific-rubrics-scoring)

## Other Tasks

- [Summarization](summarization_score.md)
File renamed without changes.

docs/concepts/metrics/index.md

+4-121
@@ -1,125 +1,8 @@
 # Metrics
 
-A metric is a quantitative measure used to evaluate the performance of a AI application. Metrics help in assessing how well the application and individual components that makes up application is performing relative to the given test data. They provide a numerical basis for comparison, optimization, and decision-making throughout the application development and deployment process. Metrics are crucial for:
-
-1. **Component Selection**: Metrics can be used to compare different components of the AI application like LLM, Retriever, Agent configuration, etc with your own data and select the best one from different options.
-2. **Error Diagnosis and Debugging**: Metrics help identify which part of the application is causing errors or suboptimal performance, making it easier to debug and refine.
-3. **Continuous Monitoring and Maintenance**: Metrics enable the tracking of an AI application’s performance over time, helping to detect and respond to issues such as data drift, model degradation, or changing user requirements.
-
-## Different types of metrics
-
-<div style="text-align: center;">
-<img src="../../_static/imgs/metrics_mindmap.png" alt="Metrics Mindmap" width="500" height="500">
-</div>
-
-**Metrics can be classified into two categories based on the mechanism used underneath the hood**:
-
-&nbsp;&nbsp;&nbsp;&nbsp; **LLM-based metrics**: These metrics use LLM underneath to do the evaluation. There might be one or more LLM calls that are performed to arrive at the score or result. These metrics can be somewhat non deterministic as the LLM might not always return the same result for the same input. On the other hand, these metrics has shown to be more accurate and closer to human evaluation.
-
-All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a [LLM]() object to be set before scoring.
-
-```python
-from ragas.metrics import FactualCorrectness
-scorer = FactualCorrectness(llm=evaluation_llm)
-```
-
-Each LLM based metrics also will have prompts associated with it written using [Prompt Object]().
-
-&nbsp;&nbsp;&nbsp;&nbsp; **Non-LLM-based metrics**: These metrics do not use LLM underneath to do the evaluation. These metrics are deterministic and can be used to evaluate the performance of the AI application without using LLM. These metrics rely on traditional methods to evaluate the performance of the AI application, such as string similarity, BLEU score, etc. Due to the same, these metrics are known to have a lower correlation with human evaluation.
-
-All LLM based metrics in ragas are inherited from `Metric` class.
-
-**Metrics can be broadly classified into two categories based on the type of data they evaluate**:
-
-&nbsp;&nbsp;&nbsp;&nbsp; **Single turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. All metrics in ragas that supports single turn evaluation are inherited from `SingleTurnMetric` class and scored using `single_turn_ascore` method. It also expects a [Single Turn Sample]() object as input.
-
-```python
-from ragas.metrics import FactualCorrectness
-
-metric = FactualCorrectness()
-await metric.single_turn_ascore(sample)
-```
-
-&nbsp;&nbsp;&nbsp;&nbsp; **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that supports multi turn evaluation are inherited from `MultiTurnMetric` class and scored using `multi_turn_ascore` method. It also expects a [Multi Turn Sample]() object as input.
-
-```python
-from ragas.metrics import AgentGoalAccuracy
-from ragas import MultiTurnSample
-
-scorer = AgentGoalAccuracy()
-await metric.multi_turn_ascore(sample)
-```
-
-## Metric Design Principles
-
-Designing effective metrics for AI applications requires following to a set of core principles to ensure their reliability, interpretability, and relevance. Here are five key principles we follow in ragas when designing metrics:
-
-**1. Single-Aspect Focus**
-A single metric should target only one specific aspect of the AI application's performance. This ensures that the metric is both interpretable and actionable, providing clear insights into what is being measured.
-
-**2. Intuitive and Interpretable**
-Metrics should be designed to be easy to understand and interpret. Clear and intuitive metrics make it simpler to communicate results and draw meaningful conclusions.
-
-**3. Effective Prompt Flows**
-When developing metrics using large language models (LLMs), use intelligent prompt flows that align closely with human evaluation. Decomposing complex tasks into smaller sub-tasks with specific prompts can improve the accuracy and relevance of the metric.
-
-**4. Robustness**
-Ensure that LLM-based metrics include sufficient few-shot examples that reflect the desired outcomes. This enhances the robustness of the metric by providing context and guidance for the LLM to follow.
-
-**5.Consistent Scoring Ranges**
-It is crucial to normalize metric score values or ensure they fall within a specific range, such as 0 to 1. This facilitates comparison between different metrics and helps maintain consistency and interpretability across the evaluation framework.
-
-These principles serve as a foundation for creating metrics that are not only effective but also practical and meaningful in evaluating AI applications.
-
-## List of available metrics
-
-Ragas provides a set of evaluation metrics that can be used to measure the performance of your LLM application. These metrics are designed to help you objectively measure the performance of your application. Metrics are available for different applications and tasks, such as RAG and Agentic workflows.
-
-Each metric are essentially paradigms that is designed to evaluate a particular aspect of the application. LLM Based metrics might use one or more LLM calls to arrive at the score or result. One can also modify or write your own metrics using ragas.
-
-### Retrieval Augmented Generation
-- [Context Precision](context_precision.md)
-- [Context Recall](context_recall.md)
-- [Context Entities Recall](context_entities_recall.md)
-- [Noise Sensitivity](noise_sensitivity.md)
-- [Response Relevancy](answer_relevance.md)
-- [Faithfulness](faithfulness.md)
-
-### Agents or Tool use cases
-
-- [Topic adherence](topic_adherence.md)
-- [Tool call Accuracy](agents.md#tool-call-accuracy)
-- [Agent Goal Accuracy](agents.md#agent-goal-accuracy)
-
-### Natural Language Comparison
-
-- [Factual Correctness](factual_correctness.md)
-- [Semantic Similarity](semantic_similarity.md)
-- [Non LLM String Similarity](traditional.md#non-llm-string-similarity)
-- [BLEU Score](traditional.md#bleu-score)
-- [ROUGE Score](traditional.md#rouge-score)
-- [String Presence](traditional.md#string-presence)
-- [Exact Match](traditional.md#exact-match)
-
-### SQL
-
-- [Execution based Datacompy Score](sql.md#execution-based-metrics)
-- [SQL query Equivalence](sql.md#sql-query-semantic-equivalence)
-
-### General purpose
-
-- [Aspect critic](general_purpose.md#aspect-critic)
-- [Simple Criteria Scoring](general_purpose.md#simple-criteria-scoring)
-- [Rubrics based scoring](general_purpose.md#rubrics-based-scoring)
-- [Instance specific rubrics scoring](general_purpose.md#instance-specific-rubrics-scoring)
-
-### Other tasks
-
-- [Summarization](summarization_score.md)
+<div class="grid cards" markdown>
 
+- :fontawesome-solid-database: [__Overview__ Learn more about overview and design principles](overview/index.md)
+- :fontawesome-solid-robot: [__Available Metrics__ Learn about available metrics and their inner workings](available_metrics/index.md)
+</div>
+74
@@ -0,0 +1,74 @@
# Overview of Metrics

A metric is a quantitative measure used to evaluate the performance of an AI application. Metrics help in assessing how well the application, and the individual components that make it up, perform relative to the given test data. They provide a numerical basis for comparison, optimization, and decision-making throughout the application development and deployment process. Metrics are crucial for:

1. **Component Selection**: Metrics can be used to compare different components of the AI application, such as the LLM, retriever, or agent configuration, against your own data and select the best option.
2. **Error Diagnosis and Debugging**: Metrics help identify which part of the application is causing errors or suboptimal performance, making it easier to debug and refine.
3. **Continuous Monitoring and Maintenance**: Metrics enable tracking of an AI application's performance over time, helping to detect and respond to issues such as data drift, model degradation, or changing user requirements.

## Different types of metrics

<div style="text-align: center;">
<img src="/docs/_static/imgs/metrics_mindmap.png" alt="Metrics Mindmap" width="500" height="500">
</div>

**Metrics can be classified into two categories based on the mechanism used under the hood**:

&nbsp;&nbsp;&nbsp;&nbsp; **LLM-based metrics**: These metrics use an LLM under the hood to perform the evaluation, and may make one or more LLM calls to arrive at the score or result. They can be somewhat non-deterministic, as the LLM might not always return the same result for the same input. On the other hand, they have been shown to be more accurate and closer to human evaluation.

All LLM-based metrics in ragas inherit from the `MetricWithLLM` class. These metrics expect an [LLM]() object to be set before scoring.

```python
from ragas.metrics import FactualCorrectness

# `evaluation_llm` is an LLM you have already wrapped for use with ragas
scorer = FactualCorrectness(llm=evaluation_llm)
```

Each LLM-based metric also has prompts associated with it, written using the [Prompt Object]().
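To see which prompts drive a given metric, you can inspect them directly. A minimal sketch, assuming the metric exposes a `get_prompts()` helper from the ragas prompt API:

```python
from ragas.metrics import FactualCorrectness

scorer = FactualCorrectness(llm=evaluation_llm)

# Returns a mapping of prompt names to the prompt objects used by this metric
prompts = scorer.get_prompts()
print(list(prompts.keys()))
```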
&nbsp;&nbsp;&nbsp;&nbsp; **Non-LLM-based metrics**: These metrics do not use an LLM to perform the evaluation. They are deterministic and rely on traditional methods such as string similarity or BLEU score, so they can be used to evaluate the AI application without an LLM. For the same reason, they are known to have a lower correlation with human evaluation.

All non-LLM-based metrics in ragas inherit from the `Metric` class.
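For example, a traditional metric such as BLEU can be instantiated and used without configuring any LLM. A minimal sketch, assuming `BleuScore` is the class behind the BLEU Score metric and using the single-turn scoring API described below:

```python
from ragas import SingleTurnSample
from ragas.metrics import BleuScore  # class name assumed for the BLEU Score metric

sample = SingleTurnSample(
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is in Paris, France.",
)

# No LLM is needed; the score is computed deterministically from the strings
metric = BleuScore()
score = await metric.single_turn_ascore(sample)
```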
**Metrics can also be broadly classified into two categories based on the type of data they evaluate**:

&nbsp;&nbsp;&nbsp;&nbsp; **Single-turn metrics**: These metrics evaluate the performance of the AI application based on a single turn of interaction between the user and the AI. All metrics in ragas that support single-turn evaluation inherit from the `SingleTurnMetric` class and are scored using the `single_turn_ascore` method, which expects a [Single Turn Sample]() object as input.

```python
from ragas.metrics import FactualCorrectness

metric = FactualCorrectness()
# `sample` is a SingleTurnSample containing the data to be scored
await metric.single_turn_ascore(sample)
```
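For illustration, a single-turn sample can be constructed and scored end to end. A minimal sketch, assuming `SingleTurnSample` accepts `user_input`, `response`, and `reference` fields, and that `evaluation_llm` is an evaluator LLM already wrapped for ragas:

```python
from ragas import SingleTurnSample
from ragas.metrics import FactualCorrectness

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is in Paris, France.",
)

metric = FactualCorrectness(llm=evaluation_llm)
score = await metric.single_turn_ascore(sample)
```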
&nbsp;&nbsp;&nbsp;&nbsp; **Multi-turn metrics**: These metrics evaluate the performance of the AI application based on multiple turns of interaction between the user and the AI. All metrics in ragas that support multi-turn evaluation inherit from the `MultiTurnMetric` class and are scored using the `multi_turn_ascore` method, which expects a [Multi Turn Sample]() object as input.

```python
from ragas.metrics import AgentGoalAccuracy
from ragas import MultiTurnSample

# `sample` is a MultiTurnSample containing the conversation to be scored
scorer = AgentGoalAccuracy()
await scorer.multi_turn_ascore(sample)
```
```
54+
55+
## Metric Design Principles
56+
57+
Designing effective metrics for AI applications requires following to a set of core principles to ensure their reliability, interpretability, and relevance. Here are five key principles we follow in ragas when designing metrics:
58+
59+
**1. Single-Aspect Focus**
60+
A single metric should target only one specific aspect of the AI application's performance. This ensures that the metric is both interpretable and actionable, providing clear insights into what is being measured.
61+
62+
**2. Intuitive and Interpretable**
63+
Metrics should be designed to be easy to understand and interpret. Clear and intuitive metrics make it simpler to communicate results and draw meaningful conclusions.
64+
65+
**3. Effective Prompt Flows**
66+
When developing metrics using large language models (LLMs), use intelligent prompt flows that align closely with human evaluation. Decomposing complex tasks into smaller sub-tasks with specific prompts can improve the accuracy and relevance of the metric.
67+
68+
**4. Robustness**
69+
Ensure that LLM-based metrics include sufficient few-shot examples that reflect the desired outcomes. This enhances the robustness of the metric by providing context and guidance for the LLM to follow.
70+
71+
**5.Consistent Scoring Ranges**
72+
It is crucial to normalize metric score values or ensure they fall within a specific range, such as 0 to 1. This facilitates comparison between different metrics and helps maintain consistency and interpretability across the evaluation framework.
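As a simple illustration of this principle, a raw score on an arbitrary scale (say, a 1–5 rubric) can be rescaled into the 0–1 range before reporting. This is a generic sketch, not part of the ragas API:

```python
def normalize_score(raw: float, lo: float = 1.0, hi: float = 5.0) -> float:
    """Min-max rescale a raw score from [lo, hi] into [0, 1]."""
    clipped = min(max(raw, lo), hi)
    return (clipped - lo) / (hi - lo)

assert normalize_score(5) == 1.0
assert normalize_score(3) == 0.5
```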
These principles serve as a foundation for creating metrics that are not only effective but also practical and meaningful in evaluating AI applications.
@@ -0,0 +1,5 @@
# Testset Generation for Agents or Tool Use Cases

Evaluating agentic or tool-use workflows can be challenging because they involve multiple steps and interactions. It can be especially hard to curate a test suite that covers all possible scenarios and edge cases. We are working on a set of tools to generate synthetic test data for evaluating agent workflows.

[Talk to founders](https://cal.com/shahul-ragas/30min) to work together on this and discover what's coming in upcoming releases.

docs/howtos/customisations/customise_models.md → docs/howtos/customizations/customize_models.md

+1-1
@@ -1,4 +1,4 @@
-## Customise Models
+## Customize Models
 
 Ragas may use an LLM and/or embeddings for evaluation and synthetic data generation. Both of these models can be customized according to your availability.
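For instance, evaluator models are typically wrapped before being handed to ragas. A minimal sketch, assuming the LangChain-based wrappers `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper`, with the OpenAI model chosen purely for illustration:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Wrap any LangChain chat model and embedding model for use in ragas
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
```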

docs/howtos/customisations/index.md → docs/howtos/customizations/index.md

+1-1
@@ -5,7 +5,7 @@ How to customize various aspects of Ragas to suit your needs.
 ## General
 
 - [Customize models](customise_models.md)
-- [Managing timeouts, retries and others](run_config.ipynb)
+- [Customize timeouts, retries and others](run_config.ipynb)
 
 ## Metrics
 - [Modify prompts in metrics](metrics/modifying-prompts-metrics.ipynb)

docs/howtos/customisations/run_config.ipynb → docs/howtos/customizations/run_config.ipynb

+1-1
@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Max Workers, Timeouts, Retries and more with `RunConfig`\n",
+"# RunConfig\n",
 "\n",
 "The `RunConfig` allows you to pass in run parameters to functions like `evaluate()` and `TestsetGenerator.generate()`. Depending on your LLM provider's rate limits, SLAs, and traffic, controlling these parameters can improve the speed and reliability of Ragas runs.\n",
 "\n",

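As an illustration of what such a configuration might look like, here is a minimal sketch, assuming `RunConfig` exposes `timeout`, `max_retries`, `max_wait`, and `max_workers` parameters and that `evaluate()` accepts a `run_config` argument:

```python
from ragas import RunConfig, evaluate

# Conservative settings for a rate-limited LLM provider (parameter names assumed)
run_config = RunConfig(
    timeout=180,      # seconds before a single LLM call is abandoned
    max_retries=10,   # retries on transient failures
    max_wait=60,      # maximum back-off between retries, in seconds
    max_workers=8,    # concurrent LLM calls
)

# `dataset` and `metrics` are assumed to be prepared as in a normal ragas evaluation
results = evaluate(dataset, metrics=metrics, run_config=run_config)
```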
docs/howtos/index.md

+2-2
@@ -4,13 +4,13 @@ Each guide in this section provides a focused solution to real-world problems th
 
 <div class="grid cards" markdown>
 
-- :material-tune:{ .lg .middle } [__Customization__](customisations/index.md)
+- :material-tune:{ .lg .middle } [__Customization__](customizations/index.md)
 
 ---
 
 How to customize various aspects of Ragas to suit your needs.
 
-Customize features such as [Metrics](customisations/index.md#metrics) and [Testset Generation](customisations/index.md#testset-generation).
+Customize features such as [Metrics](customizations/index.md#metrics) and [Testset Generation](customizations/index.md#testset-generation).
 
 - :material-cube-outline:{ .lg .middle } [__Applications__](applications/index.md)
 