feat: support for multiple n in llms (#197)
- adds support for all the open-source models
- Make `OPENAI_API_KEY` not required when initializing the library
fixes: #115 #74
jjmachan authored Oct 20, 2023
1 parent c477e1e commit c2a64d5
Showing 18 changed files with 376 additions and 171 deletions.
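In short, this commit lets every metric run against any Langchain LLM and only checks for the OpenAI key when an OpenAI-backed metric is actually initialized. Below is a minimal sketch of the resulting usage pattern; the model name and server URL are illustrative and mirror the vLLM example in the updated `llms.ipynb`.

```python
# Sketch of the pattern this commit enables; endpoint and model are illustrative.
from langchain.chat_models import ChatOpenAI

from ragas.metrics import faithfulness

# Any Langchain chat model works; here an OpenAI-compatible server (e.g. vLLM).
chat = ChatOpenAI(
    model="HuggingFaceH4/zephyr-7b-alpha",
    openai_api_key="no-key",  # OPENAI_API_KEY is no longer required up front
    openai_api_base="http://localhost:8080/v1",
)

# Metrics wrap their LLM, so the Langchain model is swapped via `langchain_llm`.
faithfulness.llm.langchain_llm = chat

# ...then call ragas.evaluate(dataset, metrics=[faithfulness]) as usual.
```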
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -95,7 +95,7 @@ jobs:
OPTS=(--dist loadfile -n auto)
fi
# Now run the unit tests
OPENAI_API_KEY="test" pytest tests/unit "${OPTS[@]}"
pytest tests/unit "${OPTS[@]}"
codestyle_check:
runs-on: ubuntu-latest
31 changes: 12 additions & 19 deletions docs/getstarted/evaluation.md
@@ -5,9 +5,10 @@ welcome to the ragas quickstart. We're going to get you up and running with raga

to kick things off, let's start with the data

```{note}
Are you using Azure OpenAI endpoints? Then checkout [this quickstart guide](./guides/quickstart-azure-openai.ipynb)
```
:::{note}
Are you using Azure OpenAI endpoints? Then check out [this quickstart
guide](../howtos/customisations/azure-openai.ipynb)
:::

```bash
pip install ragas
@@ -21,22 +22,15 @@ os.environ["OPENAI_API_KEY"] = "your-openai-key"
```
## The Data

Ragas performs a `ground_truth` free evaluation of your RAG pipelines. This is because for most people building a gold labeled dataset which represents in the distribution they get in production is a very expensive process.

```{note}
While originally ragas was aimed at `ground_truth` free evaluations there is some aspects of the RAG pipeline that need `ground_truth` in order to measure. We're in the process of building a testset generation features that will make it easier. Checkout [issue#136](https://github.com/explodinggradients/ragas/issues/136) for more details.
```
For this tutorial we are going to use an example dataset from one of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/). The dataset has the following columns.

Hence, to work with ragas, all you need is the following data:
- question: `list[str]` - These are the questions your RAG pipeline will be evaluated on.
- answer: `list[str]` - The answer generated from the RAG pipeline and given to the user.
- contexts: `list[list[str]]` - The contexts which were passed into the LLM to answer the question.
- ground_truths: `list[list[str]]` - The ground truth answer to the questions. (only required if you are using context_recall)

Ideally, your list of questions should reflect the questions your users ask, including those that have been problematic in the past.

Here we're using an example dataset from on of the baselines we created for the [Financial Opinion Mining and Question Answering (fiqa) Dataset](https://sites.google.com/view/fiqa/) we created.


```{code-block} python
:caption: import sample dataset
@@ -46,10 +40,10 @@ fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
fiqa_eval
```

```{seealso}
See [prepare-data](/docs/concepts/prepare_data.md) to learn how to prepare your own custom data for evaluation.
:::{seealso}
See [testset generation](./testset_generation.md) to learn how to generate your own synthetic data for evaluation.
:::

```
## Metrics

Ragas provides you with a few metrics to evaluate the different aspects of your RAG systems, namely
@@ -78,9 +72,9 @@ here you can see that we are using 4 metrics, but what do they represent?
4. context_recall: measures the ability of the retriever to retrieve all the information needed to answer the question.


```{note}
by default these metrics are using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key. You can also try other LLMs for evaluation, check the [llm guide](./guides/llms.ipynb) to learn more
```
:::{note}
By default these metrics use OpenAI's API to compute the score. If you are using these metrics, make sure you set the `OPENAI_API_KEY` environment variable with your API key. You can also try other LLMs for evaluation; check the [llm guide](../howtos/customisations/llms.ipynb) to learn more
:::

## Evaluation

@@ -91,13 +85,12 @@ Running the evaluation is as simple as calling evaluate on the `Dataset` with th
from ragas import evaluate
result = evaluate(
fiqa_eval["baseline"].select(range(1)),
fiqa_eval["baseline"].select(range(3)), # selecting only 3
metrics=[
context_precision,
faithfulness,
answer_relevancy,
context_recall,
harmfulness,
],
)
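# Editor's sketch, not part of the diff: the returned Result is commonly
# converted to a dataframe for inspection (`to_pandas()` is assumed here).
df = result.to_pandas()
df.head()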
docs/howtos/customisations/azure-openai.ipynb
@@ -5,25 +5,19 @@
"id": "7c249b40",
"metadata": {},
"source": [
"# Using Azure OpenAI Endpoints\n"
"# Using Azure OpenAI\n",
"\n",
"This tutorial will show you how to use Azure OpenAI endpoints instead of OpenAI endpoints."
]
},
{
"cell_type": "markdown",
"id": "2e63f667",
"metadata": {},
"source": [
"<p>\n",
" <a href=\"https://colab.research.google.com/github/explodinggradients/ragas/blob/main/docs/quickstart.ipynb\">\n",
" <img alt=\"Open In Colab\" \n",
" align=\"left\"\n",
" src=\"https://colab.research.google.com/assets/colab-badge.svg\">\n",
" </a>\n",
" <br>\n",
"</p>\n",
"\n",
"\n",
"> **Note:** this guide is for folks who are using the Azure OpenAI endpoints. Check the [quickstart guide](../../getstarted/evaluation.md) if your using OpenAI endpoints."
":::{Note}\n",
"this guide is for folks who are using the Azure OpenAI endpoints. Check the [evaluation guide](../../getstarted/evaluation.md) if your using OpenAI endpoints.\n",
":::"
]
},
{
2 changes: 1 addition & 1 deletion docs/howtos/customisations/index.md
@@ -4,5 +4,5 @@ How to customize Ragas for your needs

:::{toctree}
llms.ipynb
quickstart-azure-openai.ipynb
azure-openai.ipynb
:::
151 changes: 142 additions & 9 deletions docs/howtos/customisations/llms.ipynb
@@ -12,17 +12,25 @@
"- [Completion LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.llms)\n",
"- [Chat based LLMs Supported](https://api.python.langchain.com/en/latest/api_reference.html#module-langchain.chat_models)\n",
"\n",
"This guide will show you how to use another or LLM API for evaluation.\n",
"\n",
"> **Note**: If your looking to use Azure OpenAI for evaluation checkout [this guide](./quickstart-azure-openai.ipynb)"
"This guide will show you how to use another or LLM API for evaluation."
]
},
{
"cell_type": "markdown",
"id": "43b57fcd-5f3f-4dc5-9ba1-c3b152c501cc",
"metadata": {},
"source": [
":::{Note}\n",
"If your looking to use Azure OpenAI for evaluation checkout [this guide](./azure-openai.ipynb)\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "55f0f9b9",
"metadata": {},
"source": [
"### Evaluating with GPT4\n",
"## Evaluating with GPT4\n",
"\n",
"Ragas uses gpt3.5 by default but using gpt4 for evaluation can improve the results so lets use that for the `Faithfulness` metric\n",
"\n",
@@ -71,7 +79,7 @@
"source": [
"from ragas.metrics import faithfulness\n",
"\n",
"faithfulness.llm = gpt4"
"faithfulness.llm.langchain_llm = gpt4"
]
},
{
@@ -100,7 +108,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9fb581d4057d4e70a0b70830b2f5f487",
"model_id": "6ecc1636c4f84c7292fc9d8675e691c7",
"version_major": 2,
"version_minor": 0
},
@@ -152,13 +160,13 @@
"name": "stderr",
"output_type": "stream",
"text": [
"100%|████████████████████████████████████████████████████████████| 2/2 [22:28<00:00, 674.38s/it]\n"
"100%|████████████████████████████████████████████████████████████| 1/1 [07:10<00:00, 430.26s/it]\n"
]
},
{
"data": {
"text/plain": [
"{'faithfulness': 0.7237}"
"{'faithfulness': 0.8867}"
]
},
"execution_count": 5,
@@ -170,7 +178,132 @@
"# evaluate\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(fiqa_eval[\"baseline\"], metrics=[faithfulness])\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n",
" metrics=[faithfulness]\n",
")\n",
"\n",
"result"
]
},
{
"cell_type": "markdown",
"id": "f490031e-fb73-4170-8762-61cadb4031e6",
"metadata": {},
"source": [
"## Evaluating with Open-Source LLMs\n",
"\n",
"You can also use any of the Open-Source LLM for evaluating. Ragas support most the the deployment methods like [HuggingFace TGI](https://python.langchain.com/docs/integrations/llms/huggingface_textgen_inference), [Anyscale](https://python.langchain.com/docs/integrations/llms/anyscale), [vLLM](https://python.langchain.com/docs/integrations/llms/vllm) and many [more](https://python.langchain.com/docs/integrations/llms/) through Langchain. \n",
"\n",
"When it comes to selecting open-source language models, there are some rules of thumb to follow, given that the quality of evaluation metrics depends heavily on the model's quality:\n",
"\n",
"1. Opt for models with more than 7 billion parameters. This choice ensures a minimum level of quality in the results for ragas metrics. Models like Llama-2 or Mistral can be an excellent starting point.\n",
"2. Always prioritize finetuned models over base models. Finetuned models tend to follow instructions more effectively, which can significantly improve their performance.\n",
"3. If your project focuses on a specific domain, such as science or finance, prioritize models that have been pre-trained on a larger volume of tokens from your domain of interest. For instance, if you are working with research data, consider models pre-trained on a substantial number of tokens from platforms like arXiv or Semantic Scholar.\n",
"\n",
":::{note}\n",
"Choosing the right Open-Source LLM for evaluation can by tricky. You can also fine-tune these models to get even better performance on Ragas meterics. If you need some help/advice on that feel free to [talk to us](https://calendly.com/shahules/30min)\n",
":::\n",
"\n",
"In this example we are going to use [vLLM](https://github.com/vllm-project/vllm) for hosting a `HuggingFaceH4/zephyr-7b-alpha`. Checkout the [quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) for more details on how to get started with vLLM."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "85e313f2-e45c-4551-ab20-4e526e098740",
"metadata": {},
"outputs": [],
"source": [
"# start the vLLM server\n",
"!python -m vllm.entrypoints.openai.api_server \\\n",
" --model HuggingFaceH4/zephyr-7b-alpha \\\n",
" --host 0.0.0.0 \\\n",
" --port 8080"
]
},
{
"cell_type": "markdown",
"id": "c9ddf74a-9830-4e1a-a4dd-7e5ec17a71e4",
"metadata": {},
"source": [
"Now lets create an Langchain llm instance. Because vLLM can run in OpenAI compatibilitiy mode, we can use the `ChatOpenAI` class like this."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2fd4adf3-db15-4c95-bf7c-407266517214",
"metadata": {},
"outputs": [],
"source": [
"from langchain.chat_models import ChatOpenAI\n",
"\n",
"inference_server_url = \"http://localhost:8080/v1\"\n",
"\n",
"chat = ChatOpenAI(\n",
" model=\"HuggingFaceH4/zephyr-7b-alpha\",\n",
" openai_api_key=\"no-key\",\n",
" openai_api_base=inference_server_url,\n",
" max_tokens=5,\n",
" temperature=0,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "2dd7932a-7933-4de8-a6af-2830457e02a0",
"metadata": {},
"source": [
"Now lets import all the metrics you want to use and change the llm."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20882d05-1b54-4d17-88a0-f7ada2d6a576",
"metadata": {},
"outputs": [],
"source": [
"from ragas.metrics import (\n",
" context_precision,\n",
" answer_relevancy,\n",
" faithfulness,\n",
" context_recall,\n",
")\n",
"from ragas.metrics.critique import harmfulness\n",
"\n",
"# change the LLM\n",
"\n",
"faithfulness.llm.langchain_llm = chat\n",
"answer_relevancy.llm.langchain_llm = chat\n",
"context_precision.llm.langchain_llm = chat\n",
"context_recall.llm.langchain_llm = chat\n",
"harmfulness.llm.langchain_llm = chat"
]
},
{
"cell_type": "markdown",
"id": "58a610f2-19e5-40ec-bb7d-760c1d608a85",
"metadata": {},
"source": [
"Now you can run the evaluations with and analyse the results."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d8858300-7985-4c79-8d03-c671afd645ac",
"metadata": {},
"outputs": [],
"source": [
"# evaluate\n",
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"].select(range(5)), # showing only 5 for demonstration \n",
" metrics=[faithfulness]\n",
")\n",
"\n",
"result"
]
7 changes: 7 additions & 0 deletions src/ragas/exceptions.py
@@ -9,3 +9,10 @@ class RagasException(Exception):
def __init__(self, message: str):
self.message = message
super().__init__(message)


class OpenAIKeyNotFound(RagasException):
message: str = "OpenAI API key not found! Seems like you're trying to use Ragas metrics with OpenAI endpoints. Please set the 'OPENAI_API_KEY' environment variable" # noqa

def __init__(self):
super().__init__(self.message)
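
For illustration, here is a hedged sketch of how the new exception surfaces to a caller. It assumes, as the `answer_relevance.py` change below suggests, that `evaluate()` triggers each metric's `init_model()`, which raises `OpenAIKeyNotFound` when no key is configured.

```python
# Hedged sketch: assumes evaluate() calls each metric's init_model(), which
# raises OpenAIKeyNotFound when no OpenAI key is configured.
import os

from datasets import Dataset

from ragas import evaluate
from ragas.exceptions import OpenAIKeyNotFound
from ragas.metrics import answer_relevancy

dataset = Dataset.from_dict(
    {
        "question": ["How do I invest my savings?"],
        "answer": ["Consider low-cost index funds."],
        "contexts": [["Index funds spread risk across many stocks."]],
    }
)

try:
    result = evaluate(dataset, metrics=[answer_relevancy])
except OpenAIKeyNotFound:
    os.environ["OPENAI_API_KEY"] = "sk-..."  # set a real key, then retry
```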
3 changes: 0 additions & 3 deletions src/ragas/metrics/answer_correctness.py
@@ -51,9 +51,6 @@ def __post_init__(self: t.Self):
if self.faithfulness is None:
self.faithfulness = Faithfulness(llm=self.llm, batch_size=self.batch_size)

def init_model(self: t.Self):
pass

def _score_batch(
self: t.Self,
dataset: Dataset,
20 changes: 11 additions & 9 deletions src/ragas/metrics/answer_relevance.py
@@ -1,5 +1,6 @@
from __future__ import annotations

import os
import typing as t
from dataclasses import dataclass

@@ -10,8 +11,8 @@
from langchain.embeddings.base import Embeddings
from langchain.prompts import ChatPromptTemplate, HumanMessagePromptTemplate

from ragas.exceptions import OpenAIKeyNotFound
from ragas.metrics.base import EvaluationMode, MetricWithLLM
from ragas.metrics.llms import generate

if t.TYPE_CHECKING:
from langchain.callbacks.manager import CallbackManager
@@ -57,13 +58,16 @@ class AnswerRelevancy(MetricWithLLM):
embeddings: Embeddings | None = None

def __post_init__(self: t.Self):
self.temperature = 0.2 if self.strictness > 0 else 0

if self.embeddings is None:
self.embeddings = OpenAIEmbeddings() # type: ignore
oai_key = os.getenv("OPENAI_API_KEY", "no-key")
self.embeddings = OpenAIEmbeddings(openai_api_key=oai_key) # type: ignore

def init_model(self):
super().init_model()

def init_model(self: t.Self):
pass
if isinstance(self.embeddings, OpenAIEmbeddings):
if self.embeddings.openai_api_key == "no-key":
raise OpenAIKeyNotFound

def _score_batch(
self: t.Self,
@@ -80,11 +84,9 @@ def _score_batch(
human_prompt = QUESTION_GEN.format(answer=ans)
prompts.append(ChatPromptTemplate.from_messages([human_prompt]))

results = generate(
results = self.llm.generate(
prompts,
self.llm,
n=self.strictness,
temperature=self.temperature,
callbacks=batch_group,
)
results = [[i.text for i in r] for r in results.generations]
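
The `n=self.strictness` argument above is the "multiple n" from the commit title: a single call now requests several completions per prompt. Here is a small self-contained sketch of the shape that the list comprehension unpacks, built from hand-made Langchain objects with no API calls.

```python
# Sketch: results.generations with two prompts and n = strictness = 2.
from langchain.schema import Generation, LLMResult

results = LLMResult(
    generations=[
        [Generation(text="question 1a"), Generation(text="question 1b")],  # prompt 1
        [Generation(text="question 2a"), Generation(text="question 2b")],  # prompt 2
    ]
)

# Same unpacking as in AnswerRelevancy._score_batch: one inner list per prompt,
# `strictness` generated texts per inner list.
texts = [[i.text for i in r] for r in results.generations]
assert texts == [["question 1a", "question 1b"], ["question 2a", "question 2b"]]
```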
