feat: support for multiple n in llms (#197)
- adds support for all the open-source models
- Make `OPENAI_API_KEY` not required when initializing the library
fixes: #115 #74
jjmachan authored Oct 20, 2023
1 parent c477e1e commit c2a64d5
Showing 18 changed files with 376 additions and 171 deletions.
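In short, this commit lets every metric run against any Langchain LLM and only checks for the OpenAI key when an OpenAI-backed metric is actually initialized. Below is a minimal sketch of the resulting usage pattern; the model name and server URL are illustrative and mirror the vLLM example in the updated `llms.ipynb`.

```python
# Sketch of the pattern this commit enables; endpoint and model are illustrative.
from langchain.chat_models import ChatOpenAI

from ragas.metrics import faithfulness

# Any Langchain chat model works; here an OpenAI-compatible server (e.g. vLLM).
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",  # OPENAI_API_KEY is no longer required up front
    openai_api_base="http://localhost:8080/v1",
)

# Metrics wrap their LLM, so the Langchain model is swapped via `langchain_llm`.
faithfulness.llm.langchain_llm = chat

# ...then call ragas.evaluate(dataset, metrics=[faithfulness]) as usual.
```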
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -95,7 +95,7 @@ jobs:
OPTS=(--dist loadfile -n auto)
fi
# Now run the unit tests
OPENAI_API_KEY="test" pytest tests/unit "${OPTS[@]}"
pytest tests/unit "${OPTS[@]}"
codestyle_check:
runs-on: ubuntu-latest
31 changes: 12 additions & 19 deletions docs/getstarted/evaluation.md
@@ -5,9 +5,10 @@ welcome to the ragas quickstart. We're going to get you up and running with raga

to kick things off, let's start with the data

```{note}
Are you using Azure OpenAI endpoints? Then checkout [this quickstart guide](./guides/quickstart-azure-openai.ipynb)
```
:::{note}
Are you using Azure OpenAI endpoints? Then check out [this quickstart
guide](../howtos/customisations/azure-openai.ipynb)
:::

```bash
pip install ragas
@@ -21,22 +22,15 @@ os.environ["OPENAI_API_KEY"] = "your-openai-key"
```
## The Data

Ragas performs a `ground_truth` free evaluation of your RAG pipelines. This is because for most people building a gold labeled dataset which represents in the distribution they get in production is a very expensive process.

```{note}
While originally ragas was aimed at `ground_truth` free evaluations there is some aspects of the RAG pipeline that need `ground_truth` in order to measure. We're in the process of building a testset generation features that will make it easier. Checkout [issue#136](https://github.com/explodinggradients/ragas/issues/136) for more details.
```
For this tutorial we are going to use an example dataset from one of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/). The dataset has the following columns.

Hence, to work with ragas, all you need is the following data:
- question: `list[str]` - These are the questions your RAG pipeline will be evaluated on.
- answer: `list[str]` - The answer generated from the RAG pipeline and given to the user.
- contexts: `list[list[str]]` - The contexts which were passed into the LLM to answer the question.
- ground_truths: `list[list[str]]` - The ground truth answer to the questions. (only required if you are using context_recall)

Ideally, your list of questions should reflect the questions your users ask, including those that have been problematic in the past.

Here we're using an example dataset from on of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/) we created.


```{code-block} python
:caption: import sample dataset
@@ -46,10 +40,10 @@ fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval
```

```{seealso}
See [prepare-data](/docs/concepts/prepare_data.md) to learn how to prepare your own custom data for evaluation.
:::{seealso}
See [testset generation](./testset_generation.md) to learn how to generate your own synthetic data for evaluation.
:::

```
## Metrics

Ragas provides you with a few metrics to evaluate the different aspects of your RAG systems, namely
@@ -78,9 +72,9 @@ here you can see that we are using 4 metrics, but what do they represent?
4. context_recall: measures the ability of the retriever to retrieve all the information needed to answer the question.


```{note}
by default these metrics are using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key. You can also try other LLMs for evaluation, check the [llm guide](./guides/llms.ipynb) to learn more
```
:::{note}
By default these metrics use OpenAI's API to compute the score. If you are using these metrics, make sure you set the `OPENAI_API_KEY` environment variable with your API key. You can also try other LLMs for evaluation; check the [llm guide](../howtos/customisations/llms.ipynb) to learn more
:::

## Evaluation

@@ -91,13 +85,12 @@ Running the evaluation is as simple as calling evaluate on the `Dataset` with th
from ragas import evaluate
result = evaluate(
fiqa_eval["baseline"].select(range(1)),
fiqa_eval["baseline"].select(range(3)), # selecting only 3
metrics=[
context_precision,
faithfulness,
answer_relevancy,
context_recall,
harmfulness,
],
)
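# Editor's sketch, not part of the diff: the returned Result is commonly
# converted to a dataframe for inspection (`to_pandas()` is assumed here).
df = result.to_pandas()
df.head()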
docs/howtos/customisations/azure-openai.ipynb
@@ -5,25 +5,19 @@
"id": "7c249b40",
"metadata": {},
"source": [
"# Using Azure OpenAI Endpoints\n"
"# Using Azure OpenAI\n",
"\n",
"This tutorial will show you how to use Azure OpenAI endpoints instead of OpenAI endpoints."
]
},
{
"cell_type": "markdown",
"id": "2e63f667",
"metadata": {},
"source": [
"<p>\n",
" <a href=\"https://colab.research.google.com/github/explodinggradients/ragas/blob/main/docs/quickstart.ipynb\">\n",
" <img alt=\"Open In Colab\" \n",
" align=\"left\"\n",
" src=\"https://colab.research.google.com/assets/colab-badge.svg\">\n",
" </a>\n",
" <br>\n",
"</p>\n",
"\n",
"\n",
"> **Note:** this guide is for folks who are using the Azure OpenAI endpoints. Check the [quickstart guide](../../getstarted/evaluation.md) if your using OpenAI endpoints."
":::{Note}\n",
"this guide is for folks who are using the Azure OpenAI endpoints. Check the [evaluation guide](../../getstarted/evaluation.md) if your using OpenAI endpoints.\n",
":::"
]
},
{
2 changes: 1 addition & 1 deletion docs/howtos/customisations/index.md
@@ -4,5 +4,5 @@ How to customize Ragas for your needs

:::{toctree}
llms.ipynb
quickstart-azure-openai.ipynb
azure-openai.ipynb
:::
151 changes: 142 additions & 9 deletions docs/howtos/customisations/llms.ipynb
@@ -12,17 +12,25 @@
"- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)\n",
"- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)\n",
"\n",
"This guide will show you how to use another or LLM API for evaluation.\n",
"\n",
"> **Note**: If your looking to use Azure OpenAI for evaluation checkout [this guide](./quickstart-azure-openai.ipynb)"
"This guide will show you how to use another or LLM API for evaluation."
]
},
{
"cell_type": "markdown",
"id": "43b57fcd-5f3f-4dc5-9ba1-c3b152c501cc",
"metadata": {},
"source": [
":::{Note}\n",
"If your looking to use Azure OpenAI for evaluation checkout [this guide](./azure-openai.ipynb)\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "55f0f9b9",
"metadata": {},
"source": [
"### Evaluating with GPT4\n",
"## Evaluating with GPT4\n",
"\n",
"Ragas uses gpt3.5 by default but using gpt4 for evaluation can improve the results so lets use that for the `Faithfulness` metric\n",
"\n",
@@ -71,7 +79,7 @@
"source": [
"from ragas.metrics import faithfulness\n",
"\n",
"faithfulness.llm = gpt4"
"faithfulness.llm.langchain_llm = gpt4"
]
},
{
@@ -100,7 +108,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9fb581d4057d4e70a0b70830b2f5f487",
"model_id": "6ecc1636c4f84c7292fc9d8675e691c7",
"version_major": 2,
"version_minor": 0
},
@@ -152,13 +160,13 @@
"name": "stderr",
"output_type": "stream",
"text": [
"100%|████████████████████████████████████████████████████████████| 2/2 [22:28<00:00, 674.38s/it]\n"
"100%|████████████████████████████████████████████████████████████| 1/1 [07:10<00:00, 430.26s/it]\n"
]
},
{
"data": {
"text/plain": [
"{'faithfulness': 0.7237}"
"{'faithfulness': 0.8867}"
]
},
"execution_count": 5,
@@ -170,7 +178,132 @@
"# evaluate\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(fiqa_eval[\"baseline\"], metrics=[faithfulness])\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n",
" metrics=[faithfulness]\n",
")\n",
"\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "f490031e-fb73-4170-8762-61cadb4031e6",
"metadata": {},
"source": [
"## Evaluating with Open-Source LLMs\n",
"\n",
"You can also use any of the Open-Source LLM for evaluating. Ragas support most the the deployment methods like [HuggingFace TGI](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference), [Anyscale](https://python.langchain.com/docs/integrations/llms/anyscale), [vLLM](https://python.langchain.com/docs/integrations/llms/vllm) and many [more](https://python.langchain.com/docs/integrations/llms/) through Langchain. \n",
"\n",
"When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model's quality:\n",
"\n",
"1. Opt for models with more than 7 billion parameters. This choice ensures a minimum level of quality in the results for ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.\n",
"2. Always prioritize finetuned models over base models. Finetuned models tend to follow instructions more effectively, which can significantly improve their performance.\n",
"3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.\n",
"\n",
":::{note}\n",
"Choosing the right Open-Source LLM for evaluation can by tricky. You can also fine-tune these models to get even better performance on Ragas meterics. If you need some help/advice on that feel free to [talk to us](https://calendly.com/shahules/30min)\n",
":::\n",
"\n",
"In this example we are going to use [vLLM](https://github.com/vllm-project/vllm) for hosting a `HuggingFaceH4/zephyr-7b-alpha`. Checkout the [quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for more details on how to get started with vLLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85e313f2-e45c-4551-ab20-4e526e098740",
"metadata": {},
"outputs": [],
"source": [
"# start the vLLM server\n",
"!python -m vllm.entrypoints.openai.api_server \\\n",
" --model HuggingFaceH4/zephyr-7b-alpha \\\n",
" --host 0.0.0.0 \\\n",
" --port 8080"
]
},
{
"cell_type": "markdown",
"id": "c9ddf74a-9830-4e1a-a4dd-7e5ec17a71e4",
"metadata": {},
"source": [
"Now lets create an Langchain llm instance. Because vLLM can run in OpenAI compatibilitiy mode, we can use the `ChatOpenAI` class like this."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fd4adf3-db15-4c95-bf7c-407266517214",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"inference_server_url = \"http://localhost:8080/v1\"\n",
"\n",
"chat = ChatOpenAI(\n",
" model=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
" openai_api_key=\"no-key\",\n",
" openai_api_base=inference_server_url,\n",
" max_tokens=5,\n",
" temperature=0,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2dd7932a-7933-4de8-a6af-2830457e02a0",
"metadata": {},
"source": [
"Now lets import all the metrics you want to use and change the llm."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20882d05-1b54-4d17-88a0-f7ada2d6a576",
"metadata": {},
"outputs": [],
"source": [
"from ragas.metrics import (\n",
" context_precision,\n",
" answer_relevancy,\n",
" faithfulness,\n",
" context_recall,\n",
")\n",
"from ragas.metrics.critique import harmfulness\n",
"\n",
"# change the LLM\n",
"\n",
"faithfulness.llm.langchain_llm = chat\n",
"answer_relevancy.llm.langchain_llm = chat\n",
"context_precision.llm.langchain_llm = chat\n",
"context_recall.llm.langchain_llm = chat\n",
"harmfulness.llm.langchain_llm = chat"
]
},
{
"cell_type": "markdown",
"id": "58a610f2-19e5-40ec-bb7d-760c1d608a85",
"metadata": {},
"source": [
"Now you can run the evaluations with and analyse the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8858300-7985-4c79-8d03-c671afd645ac",
"metadata": {},
"outputs": [],
"source": [
"# evaluate\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n",
" metrics=[faithfulness]\n",
")\n",
"\n",
"result"
]
7 changes: 7 additions & 0 deletions src/ragas/exceptions.py
@@ -9,3 +9,10 @@ class RagasException(Exception):
def __init__(self, message: str):
self.message = message
super().__init__(message)


class OpenAIKeyNotFound(RagasException):
message: str = "OpenAI API key not found! Seems like you're trying to use Ragas metrics with OpenAI endpoints. Please set the 'OPENAI_API_KEY' environment variable" # noqa

def __init__(self):
super().__init__(self.message)
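
For illustration, here is a hedged sketch of how the new exception surfaces to a caller. It assumes, as the `answer_relevance.py` change below suggests, that `evaluate()` triggers each metric's `init_model()`, which raises `OpenAIKeyNotFound` when no key is configured.

```python
# Hedged sketch: assumes evaluate() calls each metric's init_model(), which
# raises OpenAIKeyNotFound when no OpenAI key is configured.
import os

from datasets import Dataset

from ragas import evaluate
from ragas.exceptions import OpenAIKeyNotFound
from ragas.metrics import answer_relevancy

dataset = Dataset.from_dict(
    {
        "question": ["How do I invest my savings?"],
        "answer": ["Consider low-cost index funds."],
        "contexts": [["Index funds spread risk across many stocks."]],
    }
)

try:
    result = evaluate(dataset, metrics=[answer_relevancy])
except OpenAIKeyNotFound:
    os.environ["OPENAI_API_KEY"] = "sk-..."  # set a real key, then retry
```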
3 changes: 0 additions & 3 deletions src/ragas/metrics/answer_correctness.py
@@ -51,9 +51,6 @@ def __post_init__(self: t.Self):
if self.faithfulness is None:
self.faithfulness = Faithfulness(llm=self.llm, batch_size=self.batch_size)

def init_model(self: t.Self):
pass

def _score_batch(
self: t.Self,
dataset: Dataset,
20 changes: 11 additions & 9 deletions src/ragas/metrics/answer_relevance.py
@@ -1,5 +1,6 @@
from __future__ import annotations

import os
import typing as t
from dataclasses import dataclass

@@ -10,8 +11,8 @@
from langchain.embeddings.base import Embeddings
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

from ragas.exceptions import OpenAIKeyNotFound
from ragas.metrics.base import EvaluationMode, MetricWithLLM
from ragas.metrics.llms import generate

if t.TYPE_CHECKING:
from langchain.callbacks.manager import CallbackManager
@@ -57,13 +58,16 @@ class AnswerRelevancy(MetricWithLLM):
embeddings: Embeddings | None = None

def __post_init__(self: t.Self):
self.temperature = 0.2 if self.strictness > 0 else 0

if self.embeddings is None:
self.embeddings = OpenAIEmbeddings() # type: ignore
oai_key = os.getenv("OPENAI_API_KEY", "no-key")
self.embeddings = OpenAIEmbeddings(openai_api_key=oai_key) # type: ignore

def init_model(self):
super().init_model()

def init_model(self: t.Self):
pass
if isinstance(self.embeddings, OpenAIEmbeddings):
if self.embeddings.openai_api_key == "no-key":
raise OpenAIKeyNotFound

def _score_batch(
self: t.Self,
@@ -80,11 +84,9 @@ def _score_batch(
human_prompt = QUESTION_GEN.format(answer=ans)
prompts.append(ChatPromptTemplate.from_messages([human_prompt]))

results = generate(
results = self.llm.generate(
prompts,
self.llm,
n=self.strictness,
temperature=self.temperature,
callbacks=batch_group,
)
results = [[i.text for i in r] for r in results.generations]
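
The `n=self.strictness` argument above is the "multiple n" from the commit title: a single call now requests several completions per prompt. Here is a small self-contained sketch of the shape that the list comprehension unpacks, built from hand-made Langchain objects with no API calls.

```python
# Sketch: results.generations with two prompts and n = strictness = 2.
from langchain.schema import Generation, LLMResult

results = LLMResult(
    generations=[
        [Generation(text="question 1a"), Generation(text="question 1b")],  # prompt 1
        [Generation(text="question 2a"), Generation(text="question 2b")],  # prompt 2
    ]
)

# Same unpacking as in AnswerRelevancy._score_batch: one inner list per prompt,
# `strictness` generated texts per inner list.
texts = [[i.text for i in r] for r in results.generations]
assert texts == [["question 1a", "question 1b"], ["question 2a", "question 2b"]]
```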
