diff --git a/docs/evaluation.md b/docs/evaluation.md
index 5102ec901..efac99152 100644
--- a/docs/evaluation.md
+++ b/docs/evaluation.md
@@ -1,7 +1,7 @@
# Evaluation
-Evaluations are a form of testing that helps you validate your LLM's responses
-and ensure they meet your quality bar.
+Evaluation is a form of testing that helps you validate your LLM's responses and
+ensure they meet your quality bar.
Firebase Genkit supports third-party evaluation tools through plugins, paired
with powerful observability features that provide insight into the runtime state
@@ -10,170 +10,421 @@ data including inputs, outputs, and information from intermediate steps to
evaluate the end-to-end quality of LLM responses as well as understand the
performance of your system's building blocks.
-For example, if you have a RAG flow, Genkit will extract the set of documents
-that was returned by the retriever so that you can evaluate the quality of your
-retriever while it runs in the context of the flow as shown below with the
-Genkit faithfulness and answer relevancy metrics:
+### Types of evaluation
-```ts
-import { genkit } from 'genkit';
-import { genkitEval, GenkitMetric } from '@genkit-ai/evaluator';
-import { vertexAI, textEmbedding004, gemini15Flash } from '@genkit-ai/vertexai';
+Genkit supports two types of evaluation:
-const ai = genkit({
- plugins: [
- vertexAI(),
- genkitEval({
- judge: gemini15Flash,
- metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
- embedder: textEmbedding004, // GenkitMetric.ANSWER_RELEVANCY requires an embedder
- }),
- ],
- // ...
-});
-```
+* **Inference-based evaluation**: This type of evaluation runs against a
+collection of pre-determined inputs, assessing the corresponding outputs for
+quality.
-**Note:** The configuration above requires installing the `genkit`,
-`@genkit-ai/googleai`, `@genkit-ai/evaluator` and `@genkit-ai/vertexai`
-packages.
+ This is the most common evaluation type, suitable for most use cases. This
+ approach tests a system's actual output for each evaluation run.
-```posix-terminal
- npm install @genkit-ai/evaluator @genkit-ai/vertexai
-```
+ You can perform the quality assessment manually, by visually inspecting the
+ results. Alternatively, you can automate the assessment by using an
+ evaluation metric.
-Start by defining a set of inputs that you want to use as an input dataset
-called `testInputs.json`. This input dataset represents the test cases you will
-use to generate output for evaluation.
+* **Raw evaluation**: This type of evaluation directly assesses the quality of
+inputs without any inference. This approach is typically used with automated
+evaluation using metrics. All fields required for evaluation (e.g., `input`,
+`context`, `output`, and `reference`) must be present in the input dataset. This
+is useful when your data comes from an external source (e.g., collected from
+your production traces) and you want an objective measurement of its quality.
-```json
-[
+ For more information, see the [Advanced use](#advanced_use) section of this
+ page.
+
+This section explains how to perform inference-based evaluation using Genkit.
+
+## Quick start
+
+### Setup
+
+- Use an existing Genkit app or create a new one by following our [Getting
+started](get-started) guide.
+- Add the following code to define a simple RAG application to evaluate. For
+this guide, we use a dummy retriever that always returns the same documents.
+
+```js
+import { genkit, z, Document } from "genkit";
+import {
+ googleAI,
+ gemini15Flash,
+ gemini15Pro,
+} from "@genkit-ai/googleai";
+
+// Initialize Genkit
+export const ai = genkit({
+ plugins: [
+ googleAI(),
+ ]
+});
+
+// Dummy retriever that always returns the same docs
+export const dummyRetriever = ai.defineRetriever(
{
- "input": "What is the French word for Cheese?"
+ name: "dummyRetriever",
},
- {
- "input": "What green vegetable looks like cauliflower?"
+ async (i) => {
+ const facts = [
+ "Dog is man's best friend",
+ "Dogs have evolved and were domesticated from wolves",
+ ];
+ // Just return facts as documents.
+ return { documents: facts.map((t) => Document.fromText(t)) };
}
-]
-```
-
-If the evaluator requires a reference output for evaluating a flow, you can pass both
-input and reference output using this format instead:
+);
-```json
-[
- {
- "input": "What is the French word for Cheese?",
- "reference": "Fromage"
+// A simple question-answering flow
+export const qaFlow = ai.defineFlow({
+ name: 'qaFlow',
+ inputSchema: z.string(),
+ outputSchema: z.string(),
},
- {
- "input": "What green vegetable looks like cauliflower?",
- "reference": "Broccoli"
+ async (query) => {
+ const factDocs = await ai.retrieve({
+ retriever: dummyRetriever,
+ query,
+ options: { k: 2 },
+ });
+
+ const llmResponse = await ai.generate({
+ model: gemini15Flash,
+ prompt: `Answer this question with the given context ${query}`,
+ docs: factDocs,
+ });
+ return llmResponse.text;
}
-]
+);
```
+
+- (Optional) Add evaluation metrics to your application to use while
+evaluating. This guide uses the `MALICIOUSNESS` metric from the
+`genkitEval` plugin.
-Note that you can use any JSON data type in the input JSON file. Genkit will pass them along with the same data type to your flow.
-
-You can then use the `eval:flow` command to evaluate your flow against the test
-cases provided in `testInputs.json`.
+```js
+import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";
+import { gemini15Pro } from "@genkit-ai/googleai";
-```posix-terminal
-genkit eval:flow menuSuggestionFlow --input testInputs.json
+export const ai = genkit({
+ plugins: [
+ ...
+ // Add this plugin to your Genkit initialization block
+ genkitEval({
+ judge: gemini15Pro,
+ metrics: [GenkitMetric.MALICIOUSNESS],
+ }),
+ ]
+});
```
-If your flow requires auth, you may specify it using the `--auth` argument:
+**Note:** The configuration above requires installing the
+[`@genkit-ai/evaluator`](https://www.npmjs.com/package/@genkit-ai/evaluator)
+package.
```posix-terminal
-genkit eval:flow menuSuggestionFlow --input testInputs.json --auth "{\"email_verified\": true}"
+ npm install @genkit-ai/evaluator
```
-
-You can then see evaluation results in the Developer UI by running:
+
+- Start your Genkit application:
```posix-terminal
-genkit start
+genkit start --
```
+
+
-Then navigate to `localhost:4000/evaluate`.
+### Create a dataset
-Alternatively, you can provide an output file to inspect the output in a JSON
-file.
+Create a dataset to define the examples we want to use for evaluating our flow.
-```posix-terminal
-genkit eval:flow menuSuggestionFlow --input testInputs.json --output eval-result.json
-```
+1. Go to the Dev UI at `http://localhost:4000` and click the **Datasets** button
+to open the Datasets page.
+
+2. Click on the **Create Dataset** button to open the create dataset dialog.
+
+ a. Provide a `datasetId` for your new dataset. This guide uses
+ `myFactsQaDataset`.
+
+   b. Select the `Flow` dataset type.
+
+   c. Leave the validation target field empty and click **Save**.
+
+3. Your new dataset page appears, showing an empty dataset. Add examples to it
+ by following these steps:
+
+ a. Click the **Add example** button to open the example editor panel.
+
+   b. Only the `input` field is required. Enter `"Who is man's best friend?"`
+   in the `input` field, and click **Save** to add the example to your
+   dataset.
+
+ c. Repeat steps (a) and (b) a couple more times to add more examples. This
+ guide adds the following example inputs to the dataset:
+
+ ```
+ "Can I give milk to my cats?"
+ "From which animals did dogs evolve?"
+ ```
+
+ By the end of this step, your dataset should have 3 examples in it, with the
+ values mentioned above.
+
+### Run evaluation and view results
+
+To start evaluating the flow, click the **Evaluations** tab in the Dev UI and
+then click the **Run new evaluation** button.
+
+1. Select the `Flow` radio button to evaluate a flow.
+
+2. Select `qaFlow` as the target flow to evaluate.
+
+3. Select `myFactsQaDataset` as the target dataset to use for evaluation.
+
+4. (Optional) If you have installed evaluator metrics through Genkit plugins,
+they appear on this page. Select the metrics that you want to use with this
+evaluation run. This is entirely optional: omitting this step will still return
+results in the evaluation run, but without any associated metrics.
+
+5. Finally, click **Run evaluation** to start evaluation. Depending on the flow
+you're testing, this may take a while. Once the evaluation is complete, a
+success message appears with a link to view the results. Click on the link to go
+to the _Evaluation details_ page.
+
+You can see the details of your evaluation on this page, including original
+input, extracted context and metrics (if any).
+
+## Core concepts
+
+### Terminology
+
+- **Evaluation**: An evaluation is a process that assesses system performance.
+In Genkit, such a system is usually a Genkit primitive, such as a flow or a
+model. An evaluation can be automated or manual (human evaluation).
+
+- **Bulk inference**: Inference is the act of running an input on a flow or
+model to get the corresponding output. Bulk inference involves performing
+inference on multiple inputs simultaneously.
+
+- **Metric**: An evaluation metric is a criterion on which an inference is
+scored. Examples include accuracy, faithfulness, maliciousness, whether the
+output is in English, etc.
+
+- **Dataset**: A dataset is a collection of examples to use for inference-based
+evaluation. A dataset typically consists of `input` and optional `reference`
+fields. The `reference` field does not affect the inference step of evaluation,
+but it is passed verbatim to any evaluation metrics. In Genkit, you can create a
+dataset through the Dev UI. There are two types of datasets in Genkit: _Flow_
+datasets and _Model_ datasets.
+
+### Schema validation
+
+Depending on the type, datasets have schema validation support in the Dev UI:
-**Note:** Below you can see an example of how an LLM can help you generate the
-test cases.
+* Flow datasets support validation of the `input` and `reference` fields of the
+dataset against a flow in the Genkit application. Schema validation is optional
+and is only enforced if a schema is specified on the target flow.
+
+* Model datasets have implicit schema, supporting both `string` and
+ `GenerateRequest` input types. String validation provides a convenient way to
+ evaluate simple text prompts, while `GenerateRequest` provides complete
+ control for advanced use cases (e.g. providing model parameters, message
+ history, tools, etc). You can find the full schema for `GenerateRequest` in
+ our [API reference
+ docs](https://genkit-js-api.web.app/interfaces/genkit._.GenerateRequest.html).
+
+
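+For example, a _Model_ dataset example whose `input` uses the `GenerateRequest`
+type might look like the following (a minimal sketch based on the
+`GenerateRequest` schema; the values are illustrative):
+
+```json
+{
+  "messages": [
+    { "role": "user", "content": [{ "text": "Who is man's best friend?" }] }
+  ],
+  "config": { "temperature": 0.2 }
+}
+```
+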
+Note: Schema validation is a helper tool for editing examples, but it is
+possible to save an example with an invalid schema. Such examples may fail when
+you run an evaluation.
## Supported evaluators
### Genkit evaluators
-Genkit includes a small number of native evaluators, inspired by RAGAS, to help
-you get started:
+Genkit includes a small number of native evaluators, inspired by
+[RAGAS](https://docs.ragas.io/en/stable/), to help you get started:
-* Faithfulness
-* Answer Relevancy
-* Maliciousness
+* Faithfulness -- Measures the factual consistency of the generated answer
+against the given context
+* Answer Relevancy -- Assesses how pertinent the generated answer is to the
+given prompt
+* Maliciousness -- Measures whether the generated output intends to deceive,
+harm, or exploit
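+
+To use these metrics, configure the `genkitEval` plugin with a judge model and,
+for Answer Relevancy, an embedder. The following sketch assumes the Vertex AI
+plugin's `gemini15Flash` judge and `textEmbedding004` embedder:
+
+```js
+import { genkit } from "genkit";
+import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";
+import { vertexAI, textEmbedding004, gemini15Flash } from "@genkit-ai/vertexai";
+
+const ai = genkit({
+  plugins: [
+    vertexAI(),
+    genkitEval({
+      judge: gemini15Flash,
+      metrics: [GenkitMetric.FAITHFULNESS, GenkitMetric.ANSWER_RELEVANCY],
+      // GenkitMetric.ANSWER_RELEVANCY requires an embedder
+      embedder: textEmbedding004,
+    }),
+  ],
+});
+```
+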
### Evaluator plugins
-Genkit supports additional evaluators through plugins like the VertexAI Rapid Evaluators via the [VertexAI Plugin](./plugins/vertex-ai#evaluators).
+Genkit supports additional evaluators through plugins, like the Vertex Rapid
+Evaluators, which you access via the [VertexAI
+Plugin](./plugins/vertex-ai#evaluators).
## Advanced use
-`eval:flow` is a convenient way to quickly evaluate the flow, but sometimes you
-might need more control over evaluation steps. This may occur if you are using a
-different framework and already have some output you would like to evaluate. You
-can perform all the steps that `eval:flow` performs semi-manually.
+### Evaluation using the CLI
-You can batch run your Genkit flow and add a unique label to the run which then
-will be used to extract an evaluation dataset (a set of inputs, outputs, and
-contexts).
+The Genkit CLI provides a rich API for performing evaluation. This is especially
+useful in environments where the Dev UI is not available (e.g., in a CI/CD
+workflow).
+
+The Genkit CLI provides three main evaluation commands: `eval:flow`,
+`eval:extractData`, and `eval:run`.
+
+#### `eval:flow` command
-Run the flow over your test inputs:
+The `eval:flow` command runs inference-based evaluation on an input dataset.
+This dataset may be provided either as a JSON file or by referencing an existing
+dataset in your Genkit runtime.
```posix-terminal
-genkit flow:batchRun myRagFlow test_inputs.json --output flow_outputs.json --label customLabel
-```
+# Referencing an existing dataset
+genkit eval:flow qaFlow --input myFactsQaDataset
-Extract the evaluation data:
+# or, using a dataset from a file
+genkit eval:flow qaFlow --input testInputs.json
+```
+Note: Make sure that you start your Genkit app before running these CLI
+commands.
```posix-terminal
-genkit eval:extractData myRagFlow --label customLabel --output customLabel_dataset.json
+genkit start --
```
-The exported data will be output as a JSON file with each testCase in the
-following format:
+Here, `testInputs.json` should be an array of objects containing an `input`
+field and an optional `reference` field, as shown below:
```json
[
{
- "testCaseId": string,
- "input": string,
- "output": string,
- "context": array of strings,
- "traceIds": array of strings,
+ "input": "What is the French word for Cheese?",
+ },
+ {
+ "input": "What green vegetable looks like cauliflower?",
+ "reference": "Broccoli"
}
]
```
-The data extractor will automatically locate retrievers and add the produced
-docs to the context array. By default, `eval:run` will run against all
-configured evaluators, and like `eval:flow`, results for `eval:run` will appear
-in the evaluation page of Developer UI, located at `localhost:4000/evaluate`.
+If your flow requires auth, you may specify it using the `--auth` argument:
+
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json --auth "{\"email_verified\": true}"
+```
+
+By default, the `eval:flow` and `eval:run` commands use all available metrics
+for evaluation. To run on a subset of the configured evaluators, use the
+`--evaluators` flag and provide a comma-separated list of evaluators by name:
+
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
+```
+You can view the results of your evaluation run in the Dev UI at
+`localhost:4000/evaluate`.
+
+#### `eval:extractData` and `eval:run` commands
+
+To support *raw evaluation*, Genkit provides tools to extract data from traces
+and run evaluation metrics on extracted data. This is useful, for example, if
+you are using a different framework for evaluation or if you are collecting
+inferences from a different environment to test locally for output quality.
+
+You can batch run your Genkit flow and add a unique label to the run, which can
+then be used to extract an *evaluation dataset*. A raw evaluation dataset is a
+collection of inputs for evaluation metrics, *without* running any prior
+inference.
+
+Run your flow over your test inputs:
+
+```posix-terminal
+genkit flow:batchRun qaFlow testInputs.json --label firstRunSimple
+```
+
+Extract the evaluation data:
+
+```posix-terminal
+genkit eval:extractData qaFlow --label firstRunSimple --output factsEvalDataset.json
+```
+
+The exported data has a different format from the dataset format presented
+earlier. This is because this data is intended to be used directly with
+evaluation metrics, without any inference step. Here is the schema of the
+extracted data:
+
+```json
+Array<{
+ "testCaseId": string,
+ "input": any,
+ "output": any,
+ "context": any[],
+ "traceIds": string[],
+}>;
+```
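+
+For instance, a single extracted test case from `qaFlow` might look like the
+following (the values shown are illustrative):
+
+```json
+[
+  {
+    "testCaseId": "1234-5678",
+    "input": "Who is man's best friend?",
+    "output": "Dogs are man's best friend.",
+    "context": [
+      "Dog is man's best friend",
+      "Dogs have evolved and were domesticated from wolves"
+    ],
+    "traceIds": ["abcd-efgh"]
+  }
+]
+```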
+
+The data extractor automatically locates retrievers and adds the produced docs
+to the context array. You can run evaluation metrics on this extracted dataset
+using the `eval:run` command.
+
+```posix-terminal
+genkit eval:run factsEvalDataset.json
+```
+
+By default, `eval:run` runs against all configured evaluators, and as with
+`eval:flow`, results for `eval:run` appear in the evaluation page of Developer
+UI, located at `localhost:4000/evaluate`.
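+
+As with `eval:flow`, you can restrict `eval:run` to a subset of the configured
+evaluators with the `--evaluators` flag:
+
+```posix-terminal
+genkit eval:run factsEvalDataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
+```
+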
### Custom extractors
-You can also provide custom extractors to be used in `eval:extractData` and
-`eval:flow` commands. Custom extractors allow you to override the default
-extraction logic giving you more power in creating datasets and evaluating them.
+Genkit provides reasonable default logic for extracting the necessary fields
+(`input`, `output` and `context`) while running an evaluation. However, you may
+find that you need more control over the extraction logic for these fields.
+Genkit supports custom extractors to achieve this. You can provide custom
+extractors to be used in the `eval:extractData` and `eval:flow` commands.
-To configure custom extractors, add a tools config file named
-`genkit-tools.conf.js` to your project root if you don't have one already.
+First, as a preparatory step, introduce an auxiliary step in the `qaFlow`
+example:
+
+```js
+// `run` records a named step in the flow's trace
+import { run } from "genkit";
+
+export const qaFlow = ai.defineFlow({
+ name: 'qaFlow',
+ inputSchema: z.string(),
+ outputSchema: z.string(),
+ },
+ async (query) => {
+ const factDocs = await ai.retrieve({
+ retriever: dummyRetriever,
+ query,
+ options: { k: 2 },
+ });
+ const factDocsModified = await run('factModified', async () => {
+ // Let us use only facts that are considered silly. This is a
+ // hypothetical step for demo purposes, you may perform any
+ // arbitrary task inside a step and reference it in custom
+ // extractors.
+ //
+ // Assume you have a method that checks if a fact is silly
+ return factDocs.filter(d => isSillyFact(d.text));
+ });
+
+ const llmResponse = await ai.generate({
+ model: gemini15Flash,
+ prompt: `Answer this question with the given context ${query}`,
+ docs: factDocs,
+ });
+ return llmResponse.text;
+ }
+);
+```
+
+Next, configure a custom extractor to use the output of the `factModified` step
+when evaluating this flow.
+
+If you don't already have a tools config file to configure custom extractors,
+add one named `genkit-tools.conf.js` to your project root.
```posix-terminal
-cd $GENKIT_PROJECT_HOME
+cd /path/to/your/genkit/app
touch genkit-tools.conf.js
```
@@ -184,21 +435,26 @@ In the tools config file, add the following code:
module.exports = {
evaluators: [
{
- actionRef: '/flow/myFlow',
+ actionRef: '/flow/qaFlow',
extractors: {
- context: { outputOf: 'foo-step' },
- output: 'bar-step',
+ context: { outputOf: 'factModified' },
},
},
],
};
```
-In this sample, you configure an extractor for `myFlow` flow. The config
-overrides the extractors for `context` and `output` fields and uses the default
-logic for the `input` field.
+This config overrides the default extractors of Genkit's tooling, specifically
+changing what is considered `context` when evaluating this flow.
+
+Running the evaluation again reveals that the context is now populated with the
+output of the `factModified` step.
-The specification of the evaluation extractors is as follows:
+```posix-terminal
+genkit eval:flow qaFlow --input testInputs.json
+```
+
+Evaluation extractors are specified as follows:
* `evaluators` field accepts an array of EvaluatorConfig objects, which are
scoped by `flowName`
@@ -212,59 +468,46 @@ The specification of the evaluation extractors is as follows:
inputOf: 'foo-step' }` would extract the input of step `foo-step` for
this key.
* `(trace) => string;` - For further flexibility, you can provide a
- function that accepts a Genkit trace and returns a `string`, and specify
- the extraction logic inside this function. Refer to
+ function that accepts a Genkit trace and returns an `any`-type value,
+ and specify the extraction logic inside this function. Refer to
`genkit/genkit-tools/common/src/types/trace.ts` for the exact TraceData
schema.
-**Note:** The extracted data for all these steps will be a JSON string. The
-tooling will parse this JSON string at the time of evaluation automatically. If
-providing a function extractor, make sure that the output is a valid JSON
-string. For example: `"Hello, world!"` is not valid JSON; `"\"Hello, world!\""`
-is valid.
-
-### Running on existing datasets
-
-To run evaluation over an already extracted dataset:
-
-```posix-terminal
-genkit eval:run customLabel_dataset.json
-```
-
-To output to a different location, use the `--output` flag.
-
-```posix-terminal
-genkit eval:flow menuSuggestionFlow --input testInputs.json --output customLabel_evalresult.json
-```
-
-To run on a subset of the configured evaluators, use the `--evaluators` flag and
-provide a comma-separated list of evaluators by name:
-
-```posix-terminal
-genkit eval:run customLabel_dataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
-```
+**Note:** The data produced by these extractors keeps the type returned by the
+extractor. For example, if you use `context: { outputOf: 'foo-step' }` and
+`foo-step` returns an array of objects, the extracted context is also an
+array of objects.
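+
+For the function-based extractor, here is a minimal sketch of a
+`genkit-tools.conf.js` entry. It assumes the `TraceData` shape referenced above,
+where `spans` is a map of span objects that each carry a `displayName`:
+
+```js
+module.exports = {
+  evaluators: [
+    {
+      actionRef: '/flow/qaFlow',
+      extractors: {
+        // Hypothetical extractor: collect the display names of all spans in
+        // the trace and use them as the evaluation context.
+        context: (trace) =>
+          Object.values(trace.spans).map((span) => span.displayName),
+      },
+    },
+  ],
+};
+```
+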
### Synthesizing test data using an LLM
-Here's an example flow that uses a PDF file to generate possible questions users
-might be asking about it.
+Here is an example flow that uses a PDF file to generate potential user
+questions.
```ts
import { genkit, run, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
-import { chunk } from "llm-chunk";
-import path from 'path';
+import { chunk } from "llm-chunk"; // npm i llm-chunk
+import path from "path";
+import { readFile } from "fs/promises";
+import pdf from "pdf-parse"; // npm i pdf-parse
const ai = genkit({ plugins: [googleAI()] });
const chunkingConfig = {
minLength: 1000, // number of minimum characters into chunk
maxLength: 2000, // number of maximum characters into chunk
- splitter: 'sentence', // paragraph | sentence
+ splitter: "sentence", // paragraph | sentence
overlap: 100, // number of overlap chracters
- delimiters: '', // regex for base split method
+ delimiters: "", // regex for base split method
} as any;
+async function extractText(filePath: string) {
+ const pdfFile = path.resolve(filePath);
+ const dataBuffer = await readFile(pdfFile);
+ const data = await pdf(dataBuffer);
+ return data.text;
+}
+
export const synthesizeQuestions = ai.defineFlow(
{
name: "synthesizeQuestions",
@@ -274,7 +517,6 @@ export const synthesizeQuestions = ai.defineFlow(
async (filePath) => {
filePath = path.resolve(filePath);
// `extractText` loads the PDF and extracts its contents as text.
- // See our RAG documentation for more details.
const pdfTxt = await run("extract-text", () => extractText(filePath));
const chunks = await run("chunk-it", async () =>
diff --git a/docs/plugin-authoring-evaluator.md b/docs/plugin-authoring-evaluator.md
index d7ca8a7bd..732d855f8 100644
--- a/docs/plugin-authoring-evaluator.md
+++ b/docs/plugin-authoring-evaluator.md
@@ -1,16 +1,24 @@
# Writing a Genkit Evaluator
-Firebase Genkit can be extended to support custom evaluation of test case output, either by using an LLM as a judge, or purely programmatically.
+Firebase Genkit can be extended to support custom evaluation, using either an
+LLM as a judge or programmatic (heuristic) evaluation.
## Evaluator definition
-Evaluators are functions that assess the content given to and generated by an LLM. There are two main approaches to automated evaluation (testing): heuristic assessment and LLM-based assessment. In the heuristic approach, you define a deterministic function like those of traditional software development. In an LLM-based assessment, the content is fed back to an LLM and the LLM is asked to score the output according to criteria set in a prompt.
+Evaluators are functions that assess an LLM's response. There are two main
+approaches to automated evaluation: heuristic evaluation and LLM-based
+evaluation. In the heuristic approach, you define a deterministic function,
+whereas in the LLM-based approach, the content is fed back to an LLM, which is
+asked to score the output according to criteria set in a prompt.
-Regardless of the approach you take, you need to use the `ai.defineEvaluator` method to define an evaluator action in Genkit. We will see a couple of examples of how to use this method in this document.
+Both approaches use the `ai.defineEvaluator` method to define an evaluator
+action in Genkit. This document explores a couple of examples of how to use
+this method for heuristic and LLM-based evaluation.
### LLM based Evaluators
-An LLM-based evaluator leverages an LLM to evaluate the input, context, or output of your generative AI feature.
+An LLM-based evaluator leverages an LLM to evaluate the `input`, `context`, and
+`output` of your generative AI feature.
LLM-based evaluators in Genkit are made up of 3 components:
@@ -20,49 +28,66 @@ LLM-based evaluators in Genkit are made up of 3 components:
#### Define the prompt
-For this example, the prompt is going to ask the LLM to judge how delicious the output is. First, provide context to the LLM, then describe what you want it to do, and finally, give it a few examples to base its response on.
+For this example, the evaluator leverages an LLM to determine whether an
+`output` is delicious or not. First, provide context to the LLM, then describe
+what you want it to do, and finally, give it a few examples to base its response
+on.
-Genkit’s `definePrompt` utility provides an easy way to define prompts with input and output validation. Here’s how you can set up an evaluation prompt with `definePrompt`.
+Genkit’s `definePrompt` utility provides an easy way to define prompts with
+input and output validation. You can set up an evaluation prompt with
+`definePrompt` as follows:
```ts
+import { Genkit, z } from "genkit";
+
const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;
const DeliciousnessDetectionResponseSchema = z.object({
reason: z.string(),
verdict: z.enum(DELICIOUSNESS_VALUES),
});
-type DeliciousnessDetectionResponse = z.infer;
-const DELICIOUSNESS_PROMPT = ai.definePrompt(
- {
- name: 'deliciousnessPrompt',
- inputSchema: z.object({
- output: z.string(),
- }),
- outputSchema: DeliciousnessDetectionResponseSchema,
- },
- `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.
+function getDeliciousnessPrompt(ai: Genkit) {
+ return ai.definePrompt({
+ name: 'deliciousnessPrompt',
+ input: {
+ schema: z.object({
+ responseToTest: z.string(),
+ }),
+ },
+ output: {
+ schema: DeliciousnessDetectionResponseSchema,
+ }
+ },
+ `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.
- Examples:
- Output: Chicken parm sandwich
- Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }
+ Examples:
+ Output: Chicken parm sandwich
+ Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }
- Output: Boston Logan Airport tarmac
- Response: { "reason": "Not edible.", "verdict": "no" }
+ Output: Boston Logan Airport tarmac
+ Response: { "reason": "Not edible.", "verdict": "no" }
- Output: A juicy piece of gossip
- Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }
+ Output: A juicy piece of gossip
+ Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }
- New Output:
- {{output}}
- Response:
- `
-);
+ New Output:
+ {% verbatim %}
+ {{ responseToTest }}
+ {% endverbatim %}
+ Response:
+ `
+ );
+}
```
#### Define the scoring function
-Now, define the function that will take an example which includes `output` as is required by the prompt and score the result. Genkit test cases include `input` as required a required field, with optional fields for `output` and `context`. It is the responsibility of the evaluator to validate that all fields required for evaluation are present.
+Define a function that takes an example that includes `output`, as required by
+the prompt, and scores the result. Genkit test cases include `input` as a
+required field, with `output` and `context` as optional fields. It is the
+responsibility of the evaluator to validate that all fields required for
+evaluation are present.
```ts
import { ModelArgument, z } from 'genkit';
@@ -84,17 +109,17 @@ export async function deliciousnessScore<
throw new Error('Output is required for Deliciousness detection');
}
- //Hydrate the prompt
- const finalPrompt = DELICIOUSNESS_PROMPT.renderText({
- output: d.output as string,
- });
-
- // Call the LLM to generate an evaluation result
- const response = await ai.generate({
- model: judgeLlm,
- prompt: finalPrompt,
- config: judgeConfig,
- });
+ // Hydrate the prompt and generate an evaluation result
+ const deliciousnessPrompt = getDeliciousnessPrompt(ai);
+ const response = await deliciousnessPrompt(
+ {
+ responseToTest: d.output as string,
+ },
+ {
+ model: judgeLlm,
+ config: judgeConfig,
+ }
+ );
// Parse the output
const parsedResponse = response.output;
@@ -112,10 +137,10 @@ export async function deliciousnessScore<
#### Define the evaluator action
-The final step is to write a function that defines the evaluator action itself.
+The final step is to write a function that defines the `EvaluatorAction`.
```ts
-import { Genkit, ModelReference, z } from 'genkit';
+import { Genkit, ModelArgument, z } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';
/**
@@ -125,12 +150,12 @@ export function createDeliciousnessEvaluator<
ModelCustomOptions extends z.ZodTypeAny,
>(
ai: Genkit,
- judge: ModelReference,
- judgeConfig: z.infer
+  judge: ModelArgument<ModelCustomOptions>,
+  judgeConfig?: z.infer<ModelCustomOptions>
): EvaluatorAction {
return ai.defineEvaluator(
{
- name: `myAwesomeEval/deliciousness`,
+ name: `myCustomEvals/deliciousnessEvaluator`,
displayName: 'Deliciousness',
definition: 'Determines if output is considered delicous.',
isBilled: true,
@@ -146,7 +171,12 @@ export function createDeliciousnessEvaluator<
}
```
-The `defineEvaluator` method is similar to other Genkit constructors like `defineFlow`, `defineRetriever` etc. The user should provide an `EvaluatorFn` to the `defineEvaluator` callback. The `EvaluatorFn` accepts a `BaseEvalDataPoint` which corresponds to a single entry in a dataset under evaluation, along with an optional custom options parameter if specified. The function, should process the datapoint and return an `EvalResponse` object.
+The `defineEvaluator` method is similar to other Genkit constructors like
+`defineFlow`, `defineRetriever`, etc. This method requires an `EvaluatorFn` to
+be provided as a callback method. The `EvaluatorFn` accepts a
+`BaseEvalDataPoint` which corresponds to a single entry in a dataset under
+evaluation, along with an optional custom options parameter if specified. The
+function processes the datapoint and returns an `EvalResponse` object.
Here are the Zod Schemas for `BaseEvalDataPoint` and `EvalResponse`:
@@ -185,11 +215,18 @@ const ScoreSchema = z.object({
});
```
-`defineEvaluator` lets the user provide a name and user-readable display name and a definition for the evaluator. The display name and definiton will be displayed in evaluation runs in the Dev UI. It also has an optional `isBilled` option which marks whether this evaluator may result in billing (eg: if it uses a billed LLM or API). If an evaluator is billed, the user is prompted for a confirmation in the CLI before they can run an evaluation, to help guard from unintended expenses.
+`defineEvaluator` lets the user provide a name, a user-readable display name,
+and a definition for the evaluator. The display name and definition are
+displayed along with evaluation results in the Dev UI. It also has an optional
+`isBilled` option, which marks whether this evaluator may result in billing
+(e.g., it uses a billed LLM or API). If an evaluator is billed, the user is
+prompted for a confirmation in the CLI before they can run an evaluation, to
+help guard against unintended expenses.
### Heuristic Evaluators
-A heuristic evaluator can be any function used to evaluate the input, context, or output of your generative AI feature.
+A heuristic evaluator can be any function used to evaluate the `input`, `context`,
+or `output` of your generative AI feature.
Heuristic evaluators in Genkit are made up of 2 components:
@@ -198,17 +235,18 @@ Heuristic evaluators in Genkit are made up of 2 components:
#### Define the scoring function
-Just like the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need to know about the judge LLM or its config.
+Similar to the LLM-based evaluator, define the scoring function. In this case,
+the scoring function does not need a judge LLM.
```ts
import { EvalResponses } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';
const US_PHONE_REGEX =
- /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/i;
+ /[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}/i;
/**
- * Scores whether an individual datapoint matches a US Phone Regex.
+ * Scores whether a datapoint output contains a US Phone number.
*/
export async function usPhoneRegexScore(
dataPoint: BaseEvalDataPoint
@@ -219,23 +257,13 @@ export async function usPhoneRegexScore(
}
const matches = US_PHONE_REGEX.test(d.output as string);
const reasoning = matches
- ? `Output matched regex ${US_PHONE_REGEX.source}`
- : `Output did not match regex ${US_PHONE_REGEX.source}`;
+ ? `Output matched US_PHONE_REGEX`
+ : `Output did not match US_PHONE_REGEX`;
return {
score: matches,
details: { reasoning },
};
}
-
-/**
- * Create an EvalResponses from an individual scored datapoint.
- */
-function fillScores(dataPoint: BaseEvalDataPoint, score: Score): EvalResponses {
- return {
- testCaseId: dataPoint.testCaseId,
- evaluation: score,
- };
-}
```
#### Define the evaluator action
@@ -247,141 +275,89 @@ import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';
/**
* Configures a regex evaluator to match a US phone number.
*/
-export function createUSPhoneRegexEvaluator(
- ai: Genkit,
- metric: MyAwesomeMetric
-): EvaluatorAction {
+export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
return ai.defineEvaluator(
{
- name: `myAwesomeEval/${metric.toLocaleLowerCase()}`,
- displayName: 'Regex Match',
- definition:
- 'Runs the output against a regex and responds with true if a match is found and false otherwise.',
+ name: `myCustomEvals/usPhoneRegexEvaluator`,
+ displayName: "Regex Match for US PHONE NUMBER",
+ definition: "Uses Regex to check if output matches a US phone number",
isBilled: false,
},
async (datapoint: BaseEvalDataPoint) => {
const score = await usPhoneRegexScore(datapoint);
- return fillScores(datapoint, score);
+ return {
+ testCaseId: datapoint.testCaseId,
+ evaluation: score,
+ };
}
);
}
-
```
-## Configuration
-
-### Plugin Options
-
-Define the `PluginOptions` that the custom evaluator plugin will use. This object has no strict requirements and is dependent on the types of evaluators that are defined.
-
-At a minimum it will need to take the definition of which metrics to register.
-
-```ts
-export enum MyAwesomeMetric {
- WORD_COUNT = 'WORD_COUNT',
- DELICIOUSNESS = 'DELICIOUSNESS',
- US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
-}
-
-export interface PluginOptions {
- metrics?: Array;
-}
-```
-
-If this new plugin uses an LLM as a judge and the plugin supports swapping out which LLM to use, define additional parameters in the `PluginOptions` object.
-
-```ts
-export interface PluginOptions {
- judge: ModelReference;
- judgeConfig?: z.infer;
- metrics?: Array;
-}
-```
+## Putting it together
### Plugin definition
-Plugins are registered with the framework via the `genkit.config.ts` file in a project. To be able to configure a new plugin, define a function that defines a `GenkitPlugin` and configures it with the `PluginOptions` defined above.
+Plugins are registered with the framework by installing them at the time of
+initializing Genkit. To define a new plugin, use the `genkitPlugin` helper
+method to instantiate all Genkit actions within the plugin context.
-In this case we have two evaluators `DELICIOUSNESS` and `US_PHONE_REGEX_MATCH`. This is where those evaluators are registered with the plugin and with Firebase Genkit.
+Here, we have two evaluators: the LLM-based deliciousness evaluator and the
+regex-based US phone number evaluator. Instantiating these evaluators within
+the plugin context registers them with the plugin.
```ts
import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';
-export function myAwesomeEval(
- options: PluginOptions
-): GenkitPlugin {
+export function myCustomEvals<
+ ModelCustomOptions extends z.ZodTypeAny
+>(options: {
+  judge: ModelArgument<ModelCustomOptions>;
+  judgeConfig?: z.infer<ModelCustomOptions>;
+}): GenkitPlugin {
// Define the new plugin
- return genkitPlugin(
- 'myAwesomeEval',
- async (ai: Genkit) => {
- const { judge, judgeConfig, metrics } = options;
- const evaluators: EvaluatorAction[] = metrics.map((metric) => {
- switch (metric) {
- case MyAwesomeMetric.DELICIOUSNESS:
- // This evaluator requires an LLM as judge
- return createDeliciousnessEvaluator(ai, judge, judgeConfig);
- case MyAwesomeMetric.US_PHONE_REGEX_MATCH:
- // This evaluator does not require an LLM
- return createUSPhoneRegexEvaluator(ai, metric);
- }
- });
- }
- );
+ return genkitPlugin("myCustomEvals", async (ai: Genkit) => {
+ const { judge, judgeConfig } = options;
+
+    // The plugin instantiates our custom evaluators within the context
+ // of the `ai` object, making them available
+ // throughout our Genkit application.
+ createDeliciousnessEvaluator(ai, judge, judgeConfig);
+ createUSPhoneRegexEvaluator(ai);
+ });
}
-export default myAwesomeEval;
+export default myCustomEvals;
```
### Configure Genkit
-Add the newly defined plugin to your Genkit configuration.
+Add the `myCustomEvals` plugin to your Genkit configuration.
-For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.
+For evaluation with Gemini, disable safety settings so that the evaluator can
+accept, detect, and score potentially harmful content.
```ts
-import { gemini15Flash } from '@genkit-ai/googleai';
+import { gemini15Pro } from '@genkit-ai/googleai';
const ai = genkit({
plugins: [
+ vertexAI(),
...
- myAwesomeEval({
- judge: gemini15Flash,
- judgeConfig: {
- safetySettings: [
- {
- category: 'HARM_CATEGORY_HATE_SPEECH',
- threshold: 'BLOCK_NONE',
- },
- {
- category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
- threshold: 'BLOCK_NONE',
- },
- {
- category: 'HARM_CATEGORY_HARASSMENT',
- threshold: 'BLOCK_NONE',
- },
- {
- category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
- threshold: 'BLOCK_NONE',
- },
- ],
- },
- metrics: [
- MyAwesomeMetric.DELICIOUSNESS,
- MyAwesomeMetric.US_PHONE_REGEX_MATCH
- ],
+ myCustomEvals({
+ judge: gemini15Pro,
}),
],
...
});
```
-## Testing
+## Using your custom evaluators
-The same issues that apply to evaluating the quality of the output of a generative AI feature apply to evaluating the judging capacity of an LLM-based evaluator.
+Once you instantiate your custom evaluators within the Genkit app context (either
+through a plugin or directly), they are ready to be used. Let us try out the
+deliciousness evaluator with a few sample inputs and outputs.
-To get a sense of whether the custom evaluator performs at the expected level, create a set of test cases that have a clear right and wrong answer.
-
-As an example for deliciousness, that might look like a json file `deliciousness_dataset.json`:
+Create a JSON file `deliciousness_dataset.json` with the following content:
```json
[
@@ -398,15 +374,18 @@ As an example for deliciousness, that might look like a json file `deliciousness
]
```
-These examples can be human generated or you can ask an LLM to help create a set of test cases that can be curated. There are many available benchmark datasets that can be used as well.
-
-Then use the Genkit CLI to run the evaluator against these test cases.
+Use the Genkit CLI to run the evaluator against these test cases.
```posix-terminal
# Start your genkit runtime
genkit start --
-genkit eval:run deliciousness_dataset.json
+genkit eval:run deliciousness_dataset.json --evaluators=myCustomEvals/deliciousnessEvaluator
```
Navigate to `localhost:4000/evaluate` to view your results in the Genkit UI.
+
+It is important to note that confidence in custom evaluators increases as you
+benchmark them against standard datasets or approaches. Iterate on the results
+of such benchmarks to improve your evaluators' performance until it reaches the
+desired quality.