feat: open usage tracking (#52)
Adds the following env vars for debugging:
```
export RAGAS_DEBUG=True
export __RAGAS_DEBUG_TRACKING=True
export RAGAS_DO_NOT_TRACK=True
```

fixes #51
jjmachan authored Jul 6, 2023
1 parent 8e29433 commit 7df8cbc
Showing 5 changed files with 129 additions and 16 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -27,7 +27,7 @@ clean: ## Clean all generated files
run-ci: format lint type ## Running all CI checks
run-benchmarks: ## Run benchmarks
	@echo "Running benchmarks..."
	@cd $(GIT_ROOT)/tests/benchmarks && python benchmark.py
	@cd $(GIT_ROOT)/tests/benchmarks && python benchmark_eval.py
test: ## Run tests
	@echo "Running tests..."
	@pytest tests/unit
26 changes: 14 additions & 12 deletions README.md
@@ -33,6 +33,7 @@
<a href="#fire-quickstart">Quickstart</a> |
<a href="#luggage-metrics">Metrics</a> |
<a href="#-community">Community</a> |
<a href="#-open-analytics">Open Analytics</a> |
<a href="#raising_hand_man-faq">FAQ</a> |
<a href="https://huggingface.co/explodinggradients">Hugging Face</a>
<p>
@@ -86,28 +87,29 @@ Ragas measures your pipeline's performance against two dimensions
Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.
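
Concretely, if $F$ and $R$ denote the two dimension scores (faithfulness and relevancy in the FAQ below), the harmonic mean works out to:

$$\text{ragas\_score} = \frac{2 \cdot F \cdot R}{F + R}$$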

To read more about our metrics, check out the [docs](/docs/metrics.md).
## :question: How to use Ragas to improve your pipeline?
*"Measurement is the first step that leads to control and eventually to improvement" - James Harrington*
## 🫂 Community
If you want to get more involved with Ragas, check out our [discord server](https://discord.gg/5djav8GGNZ). It's a fun community where we geek out about LLMs, retrieval, production issues, and more.

Here we assume that you already have your RAG pipeline ready. A RAG pipeline has two main parts - a retriever and a generator - and a change to either of them will affect your pipeline's quality.
## 🔍 Open Analytics
We track very basic usage metrics to help us figure out what our users want, what is working, and what's not. As a young startup we have to be brutally honest about this, which is why we are tracking these metrics. But as an Open Startup, we open-source all the data we collect. You can read more about this [here](https://github.com/explodinggradients/ragas/issues/49). If you want to see exactly what we track, feel free to check the [code](./src/ragas/_analytics.py).
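
For reference, the event reported by `_analytics.py` (added in this commit) is a small dataclass serialised with `asdict`; a representative payload, with an illustrative metric name, looks roughly like this:

```python
from dataclasses import asdict

from ragas._analytics import EvaluationEvent

# Build the same kind of event that evaluate() reports (values illustrative).
event = EvaluationEvent(
    event_type="evaluation",
    metrics=["faithfulness"],  # illustrative metric name
    evaluation_mode="",
    num_rows=20,
)
print(asdict(event))
# {'event_type': 'evaluation', 'metrics': ['faithfulness'], 'evaluation_mode': '', 'num_rows': 20}
```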

1. First, decide on one parameter that you're interested in adjusting, for example the number of retrieved documents, K.
2. Collect a set of sample prompts (minimum 20) to form your test set.
3. Run your pipeline using the test set before and after the change. Each time, record the prompts along with the retrieved context and the generated output.
4. Run a ragas evaluation for each of them to generate evaluation scores.
5. Compare the scores to see how much the change has affected your pipeline's performance.
You can disable usage tracking by setting the `RAGAS_DO_NOT_TRACK` environment variable to `true`.
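
Because the check is cached with `lru_cache`, the variable has to be set before the first evaluation runs. A minimal sketch of doing this from Python instead of the shell:

```python
import os

# Opt out of usage tracking; set this before ragas runs its first evaluation,
# since the do_not_track() check is cached after its first call.
os.environ["RAGAS_DO_NOT_TRACK"] = "true"
```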

## 🫂 Community
If you want to get more involved with Ragas, check out our [discord server](https://discord.gg/5djav8GGNZ). It's a fun community where we geek out about LLMs, retrieval, production issues, and more.

## :raising_hand_man: FAQ
1. Why harmonic mean?

The harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (faithfulness = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, while the harmonic mean gives you 0.0.
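
Worked out in code, using the numbers from the example above:

```python
faithfulness, relevancy = 1.0, 0.0

simple_average = (faithfulness + relevancy) / 2
# Harmonic mean of two scores; guard against dividing by zero when both are 0.
harmonic_mean = (
    2 * faithfulness * relevancy / (faithfulness + relevancy)
    if faithfulness + relevancy > 0
    else 0.0
)

print(simple_average)  # 0.5
print(harmonic_mean)   # 0.0
```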

2. How to use Ragas to improve your pipeline?

*"Measurement is the first step that leads to control and eventually to improvement" - James Harrington*

Here we assume that you already have your RAG pipeline ready. A RAG pipeline has two main parts - a retriever and a generator - and a change to either of them will affect your pipeline's quality.



1. First, decide on one parameter that you're interested in adjusting, for example the number of retrieved documents, K.
2. Collect a set of sample prompts (minimum 20) to form your test set.
3. Run your pipeline using the test set before and after the change. Each time, record the prompts along with the retrieved context and the generated output.
4. Run a ragas evaluation for each of them to generate evaluation scores.
5. Compare the scores to see how much the change has affected your pipeline's performance (a rough sketch of steps 3-5 follows this list).
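
A rough sketch of steps 3-5, assuming a Hugging Face `Dataset` with the columns your metrics expect; the column names, metric names, and the exact `evaluate` signature below are illustrative and may differ in your ragas version:

```python
from datasets import Dataset

from ragas.evaluation import evaluate
from ragas.metrics import answer_relevancy, faithfulness  # illustrative metric names

# One record per test prompt, captured before and after changing K.
before = Dataset.from_dict(
    {"question": ["..."], "contexts": [["..."]], "answer": ["..."]}
)
after = Dataset.from_dict(
    {"question": ["..."], "contexts": [["..."]], "answer": ["..."]}
)

result_before = evaluate(before, metrics=[faithfulness, answer_relevancy])
result_after = evaluate(after, metrics=[faithfulness, answer_relevancy])

# Result.__repr__ prints the per-metric scores, e.g. {'faithfulness': 0.8800, ...}
print(result_before)
print(result_after)
```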

84 changes: 84 additions & 0 deletions src/ragas/_analytics.py
@@ -0,0 +1,84 @@
from __future__ import annotations

import logging
import os
import typing as t
from dataclasses import asdict, dataclass
from functools import lru_cache, wraps

import requests

from ragas.utils import get_debug_mode

if t.TYPE_CHECKING:
P = t.ParamSpec("P")
T = t.TypeVar("T")
AsyncFunc = t.Callable[P, t.Coroutine[t.Any, t.Any, t.Any]]

logger = logging.getLogger(__name__)


USAGE_TRACKING_URL = "https://t.explodinggradients.com"
RAGAS_DO_NOT_TRACK = "RAGAS_DO_NOT_TRACK"
RAGAS_DEBUG_TRACKING = "__RAGAS_DEBUG_TRACKING"
USAGE_REQUESTS_TIMEOUT_SEC = 1


@lru_cache(maxsize=1)
def do_not_track() -> bool: # pragma: no cover
    # Returns True if and only if the environment variable is defined and has value True
    # The function is cached for better performance.
    return os.environ.get(RAGAS_DO_NOT_TRACK, str(False)).lower() == "true"


@lru_cache(maxsize=1)
def _usage_event_debugging() -> bool:
    # For ragas developers only - debug and print event payload if turned on
    return os.environ.get(RAGAS_DEBUG_TRACKING, str(False)).lower() == "true"


def silent(func: t.Callable[P, T]) -> t.Callable[P, T]: # pragma: no cover
    # Silence errors when tracking
    @wraps(func)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> t.Any:
        try:
            return func(*args, **kwargs)
        except Exception as err:  # pylint: disable=broad-except
            if _usage_event_debugging():
                if get_debug_mode():
                    logger.error(
                        "Tracking Error: %s", err, stack_info=True, stacklevel=3
                    )
                else:
                    logger.info("Tracking Error: %s", err)
            else:
                logger.debug("Tracking Error: %s", err)

    return wrapper


@dataclass
class BaseEvent:
    event_type: str


@dataclass
class EvaluationEvent(BaseEvent):
    metrics: list[str]
    evaluation_mode: str
    num_rows: int


@silent
def track(event_properties: BaseEvent):
    if do_not_track():
        return

    payload = asdict(event_properties)

    if _usage_event_debugging():
        # For internal debugging purpose
        logger.info("Tracking Payload: %s", payload)
        return

    requests.post(USAGE_TRACKING_URL, json=payload, timeout=USAGE_REQUESTS_TIMEOUT_SEC)
20 changes: 17 additions & 3 deletions src/ragas/evaluation.py
@@ -6,6 +6,7 @@
import numpy as np
from datasets import Dataset, concatenate_datasets

from ragas._analytics import EvaluationEvent, track
from ragas.metrics.base import Metric

EvaluationMode = Enum("EvaluationMode", "generative retrieval grounded")
@@ -17,7 +18,7 @@ def get_evaluation_mode(ds: Dataset):
    possible evaluation types
    1. (q,a,c)
    2. (q)
    2. (q,a)
    3. (q,c)
    4. (g,a)
    """
@@ -87,6 +88,17 @@ def evaluate(
    for metric in metrics:
        scores.append(metric.score(dataset).select_columns(metric.name))

    # log the evaluation event
    metrics_names = [m.name for m in metrics]
    track(
        EvaluationEvent(
            event_type="evaluation",
            metrics=metrics_names,
            evaluation_mode="",
            num_rows=dataset.shape[0],
        )
    )

    return Result(scores=concatenate_datasets(scores, axis=1), dataset=dataset)


@@ -117,7 +129,9 @@ def to_pandas(self, batch_size: int | None = None, batched: bool = False):

    def __repr__(self) -> str:
        scores = self.copy()
        ragas_score = scores.pop("ragas_score")
        score_strs = [f"'ragas_score': {ragas_score:0.4f}"]
        score_strs = []
        if "ragas_score" in scores:
            ragas_score = scores.pop("ragas_score")
            score_strs.append(f"'ragas_score': {ragas_score:0.4f}")
        score_strs.extend([f"'{k}': {v:0.4f}" for k, v in scores.items()])
        return "{" + ", ".join(score_strs) + "}"
13 changes: 13 additions & 0 deletions src/ragas/utils.py
@@ -1,12 +1,16 @@
from __future__ import annotations

import logging
import os
import typing as t
from functools import lru_cache
from warnings import warn

import torch
from torch import device as Device

DEVICES = ["cpu", "cuda"]
DEBUG_ENV_VAR = "RAGAS_DEBUG"


def device_check(device: t.Literal["cpu", "cuda"] | Device) -> torch.device:
@@ -19,3 +23,12 @@ def device_check(device: t.Literal["cpu", "cuda"] | Device) -> torch.device:
device = "cpu"

return torch.device(device)


@lru_cache(maxsize=1)
def get_debug_mode() -> bool:
    if os.environ.get(DEBUG_ENV_VAR, str(False)).lower() == "true":
        logging.basicConfig(level=logging.DEBUG)
        return True
    else:
        return False
