Article: W&B Sweeps can be used to iterate over configurations and evaluate metrics such as tokens used, cost, and response quality across different prompt templates and other configuration options.
1. Understanding the Need for Evaluation
LLMs are now being used in various applications, from chatbots to data retrieval systems. However, while many examples are prototypes or proofs of concept, few are production-ready. A primary barrier is the challenge of evaluating their reliability and accuracy.
Consider a medical QA bot. It takes a diagnosis as input and recommends medicines with dosages based on data from a medicine database. The bot then processes the data and uses an LLM to generate a response. The stakes are high, as incorrect responses can have dire consequences.
2. Evaluation Methodologies
1. Eyeballing:
This is the simplest method: developers manually inspect the system's responses to gauge performance.
Tools like W&B Prompts, Langchain, and LlamaIndex facilitate this process, helping developers understand the underlying workings of the system.
2. Human Annotation (Supervised Evaluation):
The most reliable method involves creating an evaluation dataset for each component of the system.
This is cost-intensive and time-consuming, especially when specialized knowledge is required (e.g., medical expertise).
3. LLMs Evaluating LLMs:
An emerging practice is using one LLM to evaluate another.
This section introduces a Retrieval Augmented QA bot. The bot is designed to answer questions based on a specific paper.
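As a rough sketch, such a bot might be wired up as follows. This assumes the classic Langchain APIs (ChatOpenAI, OpenAIEmbeddings, Chroma, RetrievalQA), and the file name paper.pdf is a hypothetical placeholder; it is not the article's exact implementation.

```python
# Hedged sketch of a retrieval-augmented QA bot over a single paper.
# Import paths follow older Langchain releases and may differ in newer versions.
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Load and chunk the paper (the path is a hypothetical placeholder).
docs = PyPDFLoader("paper.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks and build a retriever over them.
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# Chain that retrieves relevant chunks and asks the LLM to answer from them.
qa_bot = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=retriever,
)

print(qa_bot.run("What problem does the paper address?"))
```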
Generate Eval Dataset using LLM:
Instead of human annotators, LLMs can be used to generate vast amounts of test data.
A tool from Langchain, called QAGenerationChain, can extract question-answer pairs from specific documents. The article provides a prompt example that instructs the LLM to generate QA pairs in a specific JSON format.
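A minimal sketch of that step, assuming the classic Langchain import path for QAGenerationChain and reusing the chunks variable from the bot sketch above:

```python
# Hedged sketch: generating an eval dataset of QA pairs with QAGenerationChain.
# Exact import paths vary between Langchain versions.
from langchain.chains import QAGenerationChain
from langchain.chat_models import ChatOpenAI

qa_gen = QAGenerationChain.from_llm(ChatOpenAI(temperature=0))

# The chain prompts the LLM to emit JSON with "question" and "answer" keys and
# parses it; run() returns a list of such dicts for the given text.
eval_dataset = []
for chunk in chunks:  # chunks from the retrieval bot sketch above
    eval_dataset.extend(qa_gen.run(chunk.page_content))

print(eval_dataset[0])  # e.g. {"question": "...", "answer": "..."}
```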
Metrics:
LLMs as a Metric: LLMs can be used to semantically compare predicted and true answers. Langchain's QAEvalChain is introduced as a tool to accomplish this.
Standard Metrics: Traditional metrics for question-answering tasks, such as Exact Match and F1 Score, are discussed. These metrics can be computed with the HuggingFace Evaluate library. A sketch of both metric approaches follows below.
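A sketch of both ideas, assuming the classic Langchain QAEvalChain API and the HuggingFace Evaluate "squad" metric; the names eval_dataset and qa_bot refer to the earlier sketches:

```python
# Hedged sketch: grading predictions with an LLM judge and with standard metrics.
import evaluate
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# Run the QA bot over the generated eval dataset.
predictions = [{"result": qa_bot.run(ex["question"])} for ex in eval_dataset]

# 1) LLM as a metric: another LLM judges whether each prediction matches the answer.
eval_chain = QAEvalChain.from_llm(ChatOpenAI(temperature=0))
graded = eval_chain.evaluate(
    eval_dataset,
    predictions,
    question_key="question",
    answer_key="answer",
    prediction_key="result",
)  # each entry is a verdict, e.g. {"results": "CORRECT"} or {"results": "INCORRECT"}

# 2) Standard metrics: Exact Match and F1 via the SQuAD metric in Evaluate.
squad = evaluate.load("squad")
scores = squad.compute(
    predictions=[{"id": str(i), "prediction_text": p["result"]}
                 for i, p in enumerate(predictions)],
    references=[{"id": str(i), "answers": {"text": [ex["answer"]], "answer_start": [0]}}
                for i, ex in enumerate(eval_dataset)],
)
print(scores)  # {"exact_match": ..., "f1": ...}
```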
Hyperparameter Tuning
Weights & Biases allows for hyperparameter optimization, specifically to improve the mean F1 score on an evaluation set (which we could have built with Langchain, as above). W&B runs your code with different combinations of these configurations, keeping track of the parameters used in each run. There are three ways it decides on the parameters:
Grid search – Iterate over every combination of hyperparameter values. Very effective, but can be computationally costly.
Random search – Select each new combination at random according to provided distributions. Surprisingly effective!
Bayesian search – Create a probabilistic model of the metric score as a function of the hyperparameters, and choose parameters with a high probability of improving the metric. Works well for small numbers of continuous parameters but scales poorly.
W&B Sweeps offer a way to explore hyperparameters through various strategies such as grid search, random search, and Bayesian optimization. The sweep configuration defines the strategy and can be specified in a Python nested dictionary or a YAML file.
Once the sweep is defined, it can be started, and agents will evaluate different hyperparameter combinations according to the chosen strategy. Here is an example sweep configuration:
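A minimal sketch of such a sweep, written as a Python nested dictionary rather than a YAML file. The hyperparameter names (temperature, chain_type, num_retrieved_docs) and the project name are illustrative assumptions for an LLM QA pipeline, not names prescribed by W&B:

```python
# Hedged sketch: defining and launching a W&B Sweep over LLM pipeline settings.
import wandb

sweep_configuration = {
    "method": "random",  # "grid", "random", or "bayes"
    "metric": {"name": "f1", "goal": "maximize"},
    "parameters": {
        "temperature": {"values": [0.0, 0.3, 0.7]},         # illustrative
        "chain_type": {"values": ["stuff", "map_reduce"]},   # illustrative
        "num_retrieved_docs": {"values": [2, 4, 8]},         # illustrative
    },
}

def run_eval():
    run = wandb.init()
    cfg = wandb.config
    # Build the QA chain with cfg.temperature, cfg.chain_type, cfg.num_retrieved_docs,
    # run it over the eval dataset, and compute the mean F1 (placeholder below).
    mean_f1 = 0.0
    wandb.log({"f1": mean_f1})

# "llm-qa-eval" is a hypothetical project name.
sweep_id = wandb.sweep(sweep=sweep_configuration, project="llm-qa-eval")
wandb.agent(sweep_id, function=run_eval, count=10)
```

Each agent pulls a new parameter combination from the sweep according to the chosen method, runs the function with it, and logs the metric the sweep optimizes against.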
Visualizing W&B Sweep Results:
W&B Sweeps offer visualization tools that allow you to view the results of your hyperparameter search and understand their impact on the LLM application's performance. You can customize these visualizations to get insights tailored to your specific needs.
Accessing the W&B App UI:
To visualize the results of W&B Sweeps, you need to visit the W&B App UI at https://wandb.ai/home.
Once there, select the project you specified when initializing your W&B Sweep.
After accessing your project workspace, click on the Sweep icon (which looks like a broom) on the left panel.
This will take you to the Sweep UI, where you can select the name of your Sweep from a list to see the visualizations.