feat: make general purpose metrics more general (#1666)
## Metrics Converted

- [x] Aspect Critic
- [x] Simple Criteria
- [x] Rubric Based - both Instance and Domain specific

Here are a few different examples.

### Aspect Critic
```py
from ragas.metrics import AspectCritic
from ragas.dataset_schema import SingleTurnSample

only_response = SingleTurnSample(
    response="The Eiffel Tower is located in Paris."
)

grammar_critic = AspectCritic(
    name="grammar",
    definition="Is the response grammatically correct?",
    llm=evaluator_llm
)

await grammar_critic.single_turn_ascore(only_response)
```

With reference:
```py
answer_correctness_critic = AspectCritic(
    name="answer_correctness",
    definition="Are the response and the reference answer the same?",
    llm=evaluator_llm
)

# data row
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="London"
)
await answer_correctness_critic.single_turn_ascore(sample)
```
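The point of the consolidation is that one metric class can serve both the reference-free and the reference-based case by looking at which fields the sample actually carries. A minimal sketch of that dispatch idea (illustrative names only, not the actual ragas internals):

```python
# Sketch: a single critic covers both reference-free and reference-based
# evaluation by branching on the fields the sample provides.
# All names here are illustrative, not the actual ragas implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    response: str
    user_input: Optional[str] = None
    reference: Optional[str] = None


def build_critic_prompt(definition: str, sample: Sample) -> str:
    """Assemble one judge prompt, including optional sections only when present."""
    parts = [f"Criteria: {definition}"]
    if sample.user_input is not None:
        parts.append(f"Question: {sample.user_input}")
    parts.append(f"Response: {sample.response}")
    if sample.reference is not None:
        parts.append(f"Reference: {sample.reference}")
    parts.append("Verdict (1 = pass, 0 = fail):")
    return "\n".join(parts)
```

With only `response` set, the prompt stays reference-free; adding `reference` switches the same metric into reference-based mode without needing a separate class.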

**Note:** For now, this only works for multi-turn metrics.
jjmachan authored Nov 19, 2024
1 parent 29f70cf commit f14cd85
Showing 27 changed files with 1,173 additions and 1,582 deletions.
121 changes: 11 additions & 110 deletions docs/concepts/metrics/available_metrics/general_purpose.md
@@ -6,7 +6,6 @@ General purpose evaluation metrics are used to evaluate any given task.

`AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects in free form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.

**Without reference**

### Example

@@ -28,32 +27,6 @@ scorer = AspectCritic(
await scorer.single_turn_ascore(sample)
```

**With reference**

### Example

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AspectCriticWithReference


sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris.",
)

scorer = AspectCritic(
name="correctness",
definition="Is the response factually similar to the reference?",
llm=evaluator_llm

)

await scorer.single_turn_ascore(sample)

```

### How it works

Critics are essentially basic LLM calls using the defined criteria. For example, let's see how the harmfulness critic works:
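Roughly, the critic wraps the criterion and the submission into a single judge prompt and asks for a binary verdict; to reduce variance, the verdict can be sampled several times and aggregated by majority vote. A sketch of that aggregation step (the majority-vote framing is an assumption here, not the exact ragas prompt flow):

```python
# Sketch: aggregate repeated binary critic verdicts by majority vote.
# Assumes an odd number of samples so that ties cannot occur.
from collections import Counter


def aggregate_verdicts(verdicts: list[int]) -> int:
    """Return the majority verdict from repeated LLM judgements (1 or 0)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return Counter(verdicts).most_common(1)[0][0]
```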
@@ -74,41 +47,22 @@ Critics are essentially basic LLM calls using the defined criteria. For example,

Coarse-grained evaluation is an evaluation metric that can be used to score responses with an integer based on a single predefined free-form scoring criterion. The output is an integer score within the range specified in the criteria.

**Without Reference**

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScoreWithoutReference


sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
)

scorer = SimpleCriteriaScoreWithoutReference(name="course_grained_score",
definition="Score 0 to 5 for correctness",
llm=evaluator_llm
)
await scorer.single_turn_ascore(sample)
```

**With Reference**

```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import SimpleCriteriaScoreWithReference
from ragas.metrics import SimpleCriteriaScore


sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Egypt"
)

scorer = SimpleCriteriaScoreWithReference(name="course_grained_score",
definition="Score 0 to 5 by similarity",
llm=evaluator_llm)
scorer = SimpleCriteriaScore(
name="course_grained_score",
definition="Score 0 to 5 by similarity",
llm=evaluator_llm
)

await scorer.single_turn_ascore(sample)
```
@@ -117,14 +71,10 @@ await scorer.single_turn_ascore(sample)

Domain-specific evaluation is a rubric-based metric used to evaluate responses within a specific domain. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric has both reference-free and reference-based variations.
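When the judge LLM answers, it effectively selects one rubric entry; mapping that choice back to an integer score is mechanical. A small helper illustrating the idea (hypothetical, not the actual ragas parser):

```python
# Sketch: map the rubric key chosen by the judge (e.g. "score3_description")
# back to its integer score, validating it against the supplied rubric.
import re


def rubric_key_to_score(rubrics: dict[str, str], chosen_key: str) -> int:
    if chosen_key not in rubrics:
        raise KeyError(f"{chosen_key!r} is not a key of the supplied rubric")
    match = re.match(r"score(\d+)_description$", chosen_key)
    if match is None:
        raise ValueError(f"malformed rubric key: {chosen_key!r}")
    return int(match.group(1))
```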

### With Reference

Used when you have a reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScoreWithReference
from ragas.metrics import RubricsScore
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
@@ -137,67 +87,18 @@ rubrics = {
"score4_description": "The response is mostly accurate and aligns well with the ground truth, with only minor issues or missing details.",
"score5_description": "The response is fully accurate, aligns completely with the ground truth, and is clear and detailed.",
}
scorer = RubricsScoreWithReference(rubrics=rubrics, llm=evaluator_llm)
scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have a reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import RubricsScoreWithoutReference
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
)

scorer = RubricsScoreWithoutReference(rubrics=rubrics, llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```


## Instance Specific rubrics criteria scoring

Instance-specific evaluation is a rubric-based metric used to evaluate responses per instance, i.e., each instance to be evaluated is annotated with its own rubric-based evaluation criteria. The rubric consists of a description for each score, typically ranging from 1 to 5. The response is evaluated and scored by the LLM using the descriptions specified in the rubric. This metric also has reference-free and reference-based variations. This scoring method is useful when each instance in your dataset requires highly customized evaluation criteria.
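Because the rubric travels with the sample here, the metric only needs to prefer an instance-level rubric and fall back to any metric-level default. A sketch of that resolution order (illustrative names, not the ragas internals):

```python
# Sketch: resolve which rubric to score with — the rubric attached to the
# sample wins; otherwise fall back to a metric-level default.
from typing import Optional


def resolve_rubrics(
    sample_rubrics: Optional[dict[str, str]],
    metric_rubrics: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    if sample_rubrics:
        return sample_rubrics
    if metric_rubrics:
        return metric_rubrics
    raise ValueError("no rubric provided on the sample or the metric")
```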

### With Reference

Used when you have a reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubricsWithReference


SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris.",
rubrics = {
"score1": "The response is completely incorrect or irrelevant (e.g., 'The Eiffel Tower is in London.' or no mention of the Eiffel Tower).",
"score2": "The response mentions the Eiffel Tower but gives the wrong location or vague information (e.g., 'The Eiffel Tower is in Europe.' or 'It is in France.' without specifying Paris).",
"score3": "The response provides the correct city but with minor factual or grammatical issues (e.g., 'The Eiffel Tower is in Paris, Germany.' or 'The tower is located at Paris.').",
"score4": "The response is correct but lacks some clarity or extra detail (e.g., 'The Eiffel Tower is in Paris, France.' without other useful context or slightly awkward phrasing).",
"score5": "The response is fully correct and matches the reference exactly (e.g., 'The Eiffel Tower is located in Paris.' with no errors or unnecessary details)."
}
)

scorer = InstanceRubricsWithReference(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```

### Without Reference

Used when you don't have a reference answer to evaluate the responses against.

#### Example
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import InstanceRubricsScoreWithoutReference
from ragas.metrics import InstanceRubricsScore


SingleTurnSample(
@@ -212,6 +113,6 @@ SingleTurnSample(
}
)

scorer = InstanceRubricsScoreWithoutReference(llm=evaluator_llm)
scorer = InstanceRubricsScore(llm=evaluator_llm)
await scorer.single_turn_ascore(sample)
```
11 changes: 7 additions & 4 deletions docs/howtos/customizations/metrics/_cost.md
@@ -13,6 +13,7 @@ For an example here is one that will parse OpenAI by using a parser we have defined

```python
import os

os.environ["OPENAI_API_KEY"] = "your-api-key"
```

@@ -61,8 +62,6 @@ metric = AspectCriticWithReference(
name="answer_correctness",
definition="is the response correct compared to reference",
)


```

Repo card metadata block was not found. Setting CardData to empty.
@@ -73,8 +72,12 @@ metric = AspectCriticWithReference(
from ragas import evaluate
from ragas.cost import get_token_usage_for_openai

results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,
token_usage_parser=get_token_usage_for_openai,)
results = evaluate(
eval_dataset[:5],
metrics=[metric],
llm=gpt4o,
token_usage_parser=get_token_usage_for_openai,
)
```

Evaluating: 100%|██████████| 5/5 [00:01<00:00, 2.81it/s]
4 changes: 2 additions & 2 deletions docs/howtos/customizations/metrics/_write_your_own_metric.md
@@ -90,9 +90,9 @@ Now lets init the metric with the rubric and evaluator llm and evaluate the dataset


```python
from ragas.metrics import RubricsScoreWithoutReference
from ragas.metrics import RubricsScore

hallucinations_rubric = RubricsScoreWithoutReference(
hallucinations_rubric = RubricsScore(
name="hallucinations_rubric", llm=evaluator_llm, rubrics=rubric
)

12 changes: 8 additions & 4 deletions docs/howtos/customizations/metrics/cost.ipynb
@@ -29,6 +29,7 @@
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"OPENAI_API_KEY\"] = \"your-api-key\""
]
},
@@ -105,8 +106,7 @@
"metric = AspectCriticWithReference(\n",
" name=\"answer_correctness\",\n",
" definition=\"is the response correct compared to reference\",\n",
")\n",
"\n"
")"
]
},
{
@@ -126,8 +126,12 @@
"from ragas import evaluate\n",
"from ragas.cost import get_token_usage_for_openai\n",
"\n",
"results = evaluate(eval_dataset[:5], metrics=[metric], llm=gpt4o,\n",
" token_usage_parser=get_token_usage_for_openai,)"
"results = evaluate(\n",
" eval_dataset[:5],\n",
" metrics=[metric],\n",
" llm=gpt4o,\n",
" token_usage_parser=get_token_usage_for_openai,\n",
")"
]
},
{
@@ -160,9 +160,9 @@
}
],
"source": [
"from ragas.metrics import RubricsScoreWithoutReference\n",
"from ragas.metrics import RubricsScore\n",
"\n",
"hallucinations_rubric = RubricsScoreWithoutReference(\n",
"hallucinations_rubric = RubricsScore(\n",
" name=\"hallucinations_rubric\", llm=evaluator_llm, rubrics=rubric\n",
")\n",
"\n",
16 changes: 12 additions & 4 deletions docs/howtos/customizations/testgenerator/_persona_generator.md
@@ -14,9 +14,18 @@ Which we can define as follows:
```python
from ragas.testset.persona import Persona

persona_new_joinee = Persona(name="New Joinee", role_description="Don't know much about the company and is looking for information on how to get started.")
persona_manager = Persona(name="Manager", role_description="Wants to know about the different teams and how they collaborate with each other.")
persona_senior_manager = Persona(name="Senior Manager", role_description="Wants to know about the company vision and how it is executed.")
persona_new_joinee = Persona(
name="New Joinee",
role_description="Don't know much about the company and is looking for information on how to get started.",
)
persona_manager = Persona(
name="Manager",
role_description="Wants to know about the different teams and how they collaborate with each other.",
)
persona_senior_manager = Persona(
name="Senior Manager",
role_description="Wants to know about the company vision and how it is executed.",
)

personas = [persona_new_joinee, persona_manager, persona_senior_manager]
personas
@@ -49,7 +58,6 @@ testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas, llm=llm)
# Generate the Testset
testset = testset_generator.generate(testset_size=10)
testset

```


17 changes: 13 additions & 4 deletions docs/howtos/customizations/testgenerator/persona_generator.ipynb
@@ -38,9 +38,18 @@
"source": [
"from ragas.testset.persona import Persona\n",
"\n",
"persona_new_joinee = Persona(name=\"New Joinee\", role_description=\"Don't know much about the company and is looking for information on how to get started.\")\n",
"persona_manager = Persona(name=\"Manager\", role_description=\"Wants to know about the different teams and how they collaborate with each other.\")\n",
"persona_senior_manager = Persona(name=\"Senior Manager\", role_description=\"Wants to know about the company vision and how it is executed.\")\n",
"persona_new_joinee = Persona(\n",
" name=\"New Joinee\",\n",
" role_description=\"Don't know much about the company and is looking for information on how to get started.\",\n",
")\n",
"persona_manager = Persona(\n",
" name=\"Manager\",\n",
" role_description=\"Wants to know about the different teams and how they collaborate with each other.\",\n",
")\n",
"persona_senior_manager = Persona(\n",
" name=\"Senior Manager\",\n",
" role_description=\"Wants to know about the company vision and how it is executed.\",\n",
")\n",
"\n",
"personas = [persona_new_joinee, persona_manager, persona_senior_manager]\n",
"personas"
@@ -72,7 +81,7 @@
"testset_generator = TestsetGenerator(knowledge_graph=kg, persona_list=personas, llm=llm)\n",
"# Generate the Testset\n",
"testset = testset_generator.generate(testset_size=10)\n",
"testset\n"
"testset"
]
},
{
2 changes: 1 addition & 1 deletion docs/howtos/integrations/_langgraph_agent_evaluation.md
@@ -289,7 +289,7 @@ ragas_trace = convert_to_ragas_messages(result["messages"])


```python
ragas_trace # List of Ragas messages
ragas_trace # List of Ragas messages
```


29 changes: 25 additions & 4 deletions docs/howtos/integrations/langchain.ipynb
@@ -25,7 +25,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"id": "fb5deb25",
"metadata": {},
"outputs": [],
@@ -59,10 +59,31 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"id": "4aa9a986",
"metadata": {},
"outputs": [],
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/jjmachan/.pyenv/versions/ragas/lib/python3.10/site-packages/langchain/indexes/vectorstore.py:128: UserWarning: Using InMemoryVectorStore as the default vectorstore.This memory store won't persist data. You should explicitlyspecify a vectorstore when using VectorstoreIndexCreator\n",
" warnings.warn(\n"
]
},
{
"ename": "ValidationError",
"evalue": "1 validation error for VectorstoreIndexCreator\nembedding\n Field required [type=missing, input_value={}, input_type=dict]\n For further information visit https://errors.pydantic.dev/2.9/v/missing",
"output_type": "error",
"traceback": [
"---------------------------------------------------------------------------",
"ValidationError                           Traceback (most recent call last)",
"Cell In[2], line 7\n      6 loader = TextLoader(\"./nyc_wikipedia/nyc_text.txt\")\n----> 7 index = VectorstoreIndexCreator().from_loaders([loader])",
"ValidationError: 1 validation error for VectorstoreIndexCreator\nembedding\n  Field required [type=missing, input_value={}, input_type=dict]\n    For further information visit https://errors.pydantic.dev/2.9/v/missing"
]
}
],
"source": [
"from langchain_community.document_loaders import TextLoader\n",
"from langchain.indexes import VectorstoreIndexCreator\n",
@@ -495,7 +516,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.5"
"version": "3.10.12"
}
},
"nbformat": 4,
