From 0e2759ca95ddb07ae0d0f360e739a5e8500a68c0 Mon Sep 17 00:00:00 2001 From: William Guss Date: Mon, 16 Dec 2024 12:50:47 -0800 Subject: [PATCH] Wguss/evaluation docs (#406) * partial * . --- ...ion.rst.partial => eval_usage.rst.partial} | 5 +- docs/src/core_concepts/evaluations.rst | 224 ++++++++++++++++++ docs/src/core_concepts/evaluations.rst.sample | 196 +++++++++++++++ docs/src/index.rst | 1 + 4 files changed, 424 insertions(+), 2 deletions(-) rename docs/src/core_concepts/{evaluation.rst.partial => eval_usage.rst.partial} (96%) create mode 100644 docs/src/core_concepts/evaluations.rst create mode 100644 docs/src/core_concepts/evaluations.rst.sample diff --git a/docs/src/core_concepts/evaluation.rst.partial b/docs/src/core_concepts/eval_usage.rst.partial similarity index 96% rename from docs/src/core_concepts/evaluation.rst.partial rename to docs/src/core_concepts/eval_usage.rst.partial index f2bda0ed..6c0f36ca 100644 --- a/docs/src/core_concepts/evaluation.rst.partial +++ b/docs/src/core_concepts/eval_usage.rst.partial @@ -1,8 +1,7 @@ Evaluations =========== -Evaluations in ELL provide a powerful framework for assessing and analyzing Language Model Programs (LMPs). This guide covers the core concepts and features of the evaluation system. - + transcriptions Basic Usage ---------- @@ -208,3 +207,5 @@ Best Practices - Use meaningful commit messages - Track major changes - Maintain evaluation history + + diff --git a/docs/src/core_concepts/evaluations.rst b/docs/src/core_concepts/evaluations.rst new file mode 100644 index 00000000..445a7e36 --- /dev/null +++ b/docs/src/core_concepts/evaluations.rst @@ -0,0 +1,224 @@ +==================================== +Evaluations (New) +==================================== + +Evaluations represent a crucial component in the practice of prompt engineering. They provide the quantitative and qualitative signals necessary to understand whether a language model program achieves the desired objectives. Without evaluations, the process of refining prompts often devolves into guesswork, guided only by subjective impressions rather than structured evidence. Although many developers default to an ad hoc process—manually reviewing a handful of generated outputs and deciding by intuition whether one version of a prompt is better than another—this approach quickly becomes untenable as tasks grow more complex, as teams grow larger, and as stakes get higher. + +The premise of ell’s evaluation feature is that prompt engineering should mirror, where possible, the rigor and methodology of modern machine learning. In machine learning, progress is measured against validated benchmarks, metrics, and datasets. Even as one tunes parameters or tries novel architectures, the question “Did we do better?” can be answered systematically. Similarly, evaluations in ell offer a structured and reproducible way to assess prompts. They transform the process from an ephemeral art into a form of empirical inquiry. In doing so, they also introduce the notion of eval engineering, whereby evaluations themselves become first-class entities that are carefully constructed, versioned, and refined over time. + + +The Problem of Prompt Engineering by Intuition +---------------------------------------------- + +Prompt engineering without evaluations is often characterized by subjective assessments that vary from day to day and person to person. In simple projects, this might suffice. 
For example, when producing a handful of short marketing texts, a developer might be content to trust personal taste as the measure of success. However, as soon as the problem grows beyond a few trivial examples, this style of iterative tweaking collapses. With more complex tasks, larger data distributions, and subtle constraints—such as maintaining a specific tone or meeting domain-specific requirements—subjective judgments no longer yield consistent or reliable improvements. + +Without evaluations, there is no systematic way to ensure that a revised prompt actually improves performance on the desired tasks. There is no guarantee that adjusting a single detail in the prompt to improve outputs on one example does not degrade outputs elsewhere. Over time, as prompt engineers read through too many model responses, they become either desensitized to quality issues or hypersensitive to minor flaws. This miscalibration saps productivity and leads to unprincipled prompt tuning. Subjective judgment cannot scale, fails to capture statistical performance trends, and offers no verifiable path to satisfy external stakeholders who demand reliability, accuracy, or compliance with given standards. + +.. note:: + + The intuitive, trial-and-error style of prompt engineering can be visually depicted. Imagine a simple diagram in ell Studio (ell’s local, version-controlled dashboard) that shows a single prompt evolving over time, each modification recorded and compared. Without evaluations, this “diff” of prompt versions tells us only that the code changed—not whether it changed for the better. + + +The Concept of Evals +-------------------- + +An eval is a structured evaluation suite that measures a language model program’s performance quantitatively and, when necessary, qualitatively. It consists of three essential elements. First, there is a dataset that represents a distribution of inputs over which the prompt must perform. Second, there are criteria that define what constitutes a successful output. Third, there are metrics that translate the model’s raw outputs into a measurable quantity. + +Below is a minimal example showing how these pieces fit together in ell. Assume we have a dataset of simple classification tasks and a language model program (LMP) that attempts to answer them: + +.. code-block:: python + + import ell + ell.init(store="./logdir") # Enable versioning and storage + + # 1. Define an LMP: + @ell.simple(model="gpt-4o", max_tokens=10) + def classify_sentiment(text: str): + """You are a sentiment classifier. Return 'positive' or 'negative'.""" + return f"Classify sentiment: {text}" + + # 2. A small dataset: + dataset = [ + {"input": {"text": "I love this product!"}, "expected_output": "positive"}, + {"input": {"text": "This is terrible."}, "expected_output": "negative"} + ] + + # 3. A metric function that checks correctness: + def accuracy_metric(datapoint, output): + return float(datapoint["expected_output"].lower() in output.lower()) + + # 4. Constructing the eval: + eval = ell.evaluation.Evaluation( + name="sentiment_eval", + dataset=dataset, + metrics={"accuracy": accuracy_metric} + ) + + # Run the eval: + result = eval.run(classify_sentiment) + print("Average accuracy:", result.results.metrics["accuracy"].mean()) + +Here, the dataset provides two test cases, the LMP attempts to solve them, and the metric quantifies how well it performed. As the LMP changes over time, rerunning this eval yields comparable, reproducible scores. 
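+
+Because `metrics` is a dictionary of named functions, a single eval can report several signals at once. The sketch below reuses the `dataset` and `classify_sentiment` LMP defined above and adds an illustrative `brevity` heuristic alongside `accuracy`; the heuristic itself is only an example of a custom metric, not part of ell's API.
+
+.. code-block:: python
+
+    # An additional, purely heuristic metric: reward short, direct answers.
+    def brevity_metric(datapoint, output):
+        # 1.0 for one-word answers, decaying as the output grows longer.
+        return 1.0 / max(len(output.split()), 1)
+
+    multi_metric_eval = ell.evaluation.Evaluation(
+        name="sentiment_eval_multi",
+        dataset=dataset,
+        metrics={"accuracy": accuracy_metric, "brevity": brevity_metric},
+    )
+
+    result = multi_metric_eval.run(classify_sentiment)
+    print("Accuracy:", result.results.metrics["accuracy"].mean())
+    print("Brevity:", result.results.metrics["brevity"].mean())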
+ +In many cases, constructing an eval means assembling a carefully chosen set of input examples along with ground-truth labels or ideal reference outputs. For tasks that resemble classification, defining metrics is straightforward. For more open-ended tasks, evals may rely on heuristic functions, human annotations, or even other language model programs (critics) to rate outputs. + + +Eval Engineering +---------------- + +Defining a single eval and sticking to it blindly can be as problematic as never evaluating at all. In practice, an eval is never perfect on the first try. As the prompt engineer tests models against the eval, new edge cases and overlooked criteria emerge. Perhaps the chosen metric saturates too easily, or perhaps the dataset fails to represent the complexity of real inputs. Updating and refining the eval in response to these insights is what we call eval engineering. + +Consider a scenario where our first eval always returns a perfect score. Maybe our criteria are too lenient. With eval engineering, we revise and strengthen the eval: + +.. code-block:: python + + # A new, more complex metric that penalizes incorrect formatting: + def stricter_accuracy(datapoint, output): + # Now we require the output to match exactly 'positive' or 'negative' + # to count as correct, making the eval more discriminative. + return float(output.strip().lower() == datapoint["expected_output"].lower()) + + # Revised eval: + eval_strict = ell.evaluation.Evaluation( + name="sentiment_eval_stricter", + dataset=dataset, + metrics={"accuracy": stricter_accuracy} + ) + + # Run on the same LMP: + result_strict = eval_strict.run(classify_sentiment) + print("Average accuracy (stricter):", result_strict.results.metrics["accuracy"].mean()) + +If the original eval gave an average accuracy of 1.0, the stricter eval might yield a lower score, prompting further improvements to the LMP. Over time, eval engineering leads to evaluations that genuinely reflect the underlying goals. + + +Model-Based Evaluation +-------------------------------- + +In many real-world scenarios, an eval cannot be reduced to a fixed set of rules or ground-truth answers. Consider a task like producing compelling outreach emails. Quality is subjective, and the notion of success might be tied to subtle attributes. In these cases, one can incorporate human judgments or another LMP as a critic: + +.. code-block:: python + + @ell.simple(model="gpt-4o") + def write_invitation(name: str): + """Invite the given person to an event in a friendly, concise manner.""" + return f"Write an invitation for {name} to our annual gala." + + # A critic that uses an LMP to check if the invitation is friendly enough: + @ell.simple(model="gpt-4o", temperature=0.1) + def invitation_critic(invitation: str): + """Return 'yes' if the invitation is friendly, otherwise 'no'.""" + return f"Is this invitation friendly? {invitation}" + + def friendly_score(datapoint, output): + # Run the critic on the output + verdict = invitation_critic(output).lower() + return float("yes" in verdict) + + dataset_invites = [ + {"input": {"name": "Alice"}}, + {"input": {"name": "Bob"}}, + ] + + eval_invites = ell.evaluation.Evaluation( + name="friendly_invitation_eval", + dataset=dataset_invites, + metrics={"friendliness": friendly_score}, + ) + + result_invites = eval_invites.run(write_invitation) + print("Average friendliness:", result_invites.results.metrics["friendliness"].mean()) + +Here, we rely on a second LMP to measure friendlier invitations. 
If its judgments are too lenient or too strict, we can “eval engineer” the critic as well, refining its instructions or training a reward model when human-labeled data is available. Over time, these improvements yield more robust and meaningful evaluations.
+
+In fact, one can construct an eval for the eval itself. To build a critic that reliably mirrors human judgments, first collect a dataset of your own qualitative assessments of representative LLM outputs; the critic can then be scored against those judgments just like any other LMP.
+
+
+Connecting Evals to Prompt Optimization
+----------------------------------------
+
+By placing evaluations at the center of prompt engineering, the entire process becomes more efficient and credible. Instead of repeatedly scanning outputs and making guesswork judgments, the prompt engineer tweaks the prompt, runs the eval, and compares the scores. This cycle can happen at scale and against large datasets, providing statistically meaningful insights.
+
+For example, suppose we want to improve the `classify_sentiment` LMP. We make a change to the prompt, then rerun the eval:
+
+.. code-block:: python
+
+    # Original prompt in classify_sentiment:
+    # "You are a sentiment classifier. Return 'positive' or 'negative'."
+    # Suppose we revise it to include a stricter definition:
+
+    @ell.simple(model="gpt-4o", max_tokens=10)
+    def classify_sentiment_improved(text: str):
+        """You are a sentiment classifier. If the text shows positive feelings, return exactly 'positive'.
+        Otherwise, return exactly 'negative'."""
+        return f"Check sentiment: {text}"
+
+    # Re-run the stricter eval:
+    result_strict_improved = eval_strict.run(classify_sentiment_improved)
+    print("Stricter accuracy after improvement:", result_strict_improved.results.metrics["accuracy"].mean())
+
+If the new score surpasses the old one, we know we have made a meaningful improvement. Over time, multiple runs of these evals are recorded in ell's store. They can be visualized in ell Studio (a local, dashboard-like interface) to track progress, identify regressions, and compare versions at a glance.
+
+
+Versioning and Storing Evals in ell
+------------------------------------
+
+Just as prompt engineering benefits from version control and provenance tracking, so does eval engineering. An eval changes over time: new datasets, new metrics, new criteria. ell captures these changes automatically when `ell.init()` is called with a storage directory. Each run of an eval stores results, metrics, and associated prompts for future reference.
+
+You can open ell Studio with:
+
+.. code-block:: bash
+
+    ell-studio --storage ./logdir
+
+Here, you will see your evals listed alongside their version histories, their datasets, and the results produced by various LMP runs. This environment allows both prompt engineers and eval engineers to iterate confidently, knowing that any improvement or regression can be traced back to a specific version of the prompt and the eval.
+
+
+Accessing and Interpreting Evaluation Results
+----------------------------------------------
+
+After running an eval, ell provides an `EvaluationRun` object, which stores both raw outputs and computed metrics. You can access these as follows:
+
+..
code-block:: python + + run = eval_strict.run(classify_sentiment_improved) + # Access raw metrics: + metrics = run.results.metrics + print("All metrics:", metrics.keys()) + print("Accuracy scores per datapoint:", metrics["accuracy"].values) + + # Access raw outputs: + print("Model outputs:", run.results.outputs) + +This structured data makes it straightforward to integrate evaluations into CI pipelines, automatic regression checks, or advanced statistical analyses. + + +The Underlying API for Evaluations +---------------------------------- + +The `Evaluation` class in ell is flexible yet straightforward. It handles dataset iteration, calling the LMP, collecting outputs, and applying metric and annotation functions. Its interface is designed so that, as your tasks and methodology evolve, you can easily incorporate new data, new metrics, or new eval configurations. + +A simplified version of the `Evaluation` class conceptually looks like this: + +.. code-block:: python + + class Evaluation: + def __init__(self, name: str, dataset=None, n_evals=None, samples_per_datapoint=1, metrics=None, annotations=None, criterion=None): + # Initialization and validation logic + self.name = name + self.dataset = dataset + self.n_evals = n_evals + self.samples_per_datapoint = samples_per_datapoint + # Wrap metrics and criteria and store them internally + # ... + + def run(self, lmp, n_workers=1, use_api_batching=False, api_params=None, verbose=False, **additional_lmp_params): + # 1. Prepare dataset and parameters + # 2. Invoke the LMP on each datapoint + # 3. Compute metrics and store results + # 4. Return EvaluationRun with all information + return EvaluationRun(...) + +This API, combined with ell’s built-in tracing, versioning, and visualization, provides a complete solution for rigorous prompt engineering and eval engineering workflows. + +As evals grow and mature, they provide the stable foundation on which to stand when refining prompts. Combined with ell’s infrastructure for versioning and tracing, evaluations make it possible to bring principled, data-driven methodologies to prompt engineering. The result is a process that can scale in complexity and ambition, confident that improvements are real, documented, and reproducible. \ No newline at end of file diff --git a/docs/src/core_concepts/evaluations.rst.sample b/docs/src/core_concepts/evaluations.rst.sample new file mode 100644 index 00000000..97278d97 --- /dev/null +++ b/docs/src/core_concepts/evaluations.rst.sample @@ -0,0 +1,196 @@ +Evaluations +=========== + +Prompt engineering often resembles an optimization process without a clear, quantifiable objective function. Engineers tweak prompts based on intuition or "vibes," hoping to improve the model's outputs. While this approach can yield short-term results, it presents several significant challenges. + +Firstly, relying on intuition makes it difficult to quantify improvements or regressions in the model's performance. Without clear metrics, determining whether changes to prompts are genuinely beneficial becomes speculative. This lack of quantitative feedback can lead to inefficient iterations and missed opportunities for optimization. + +Secondly, the process is inherently subjective. Different prompt engineers may have varying opinions on what constitutes a "good" output, leading to inconsistent optimizations. This subjectivity hampers collaboration and makes it challenging to build upon each other's work effectively. 
+ +Moreover, manually evaluating outputs is time-consuming and doesn't scale well, especially with large datasets or diverse use cases. As the number of inputs grows, it's impractical to assess each output individually. This limitation hampers the ability to guarantee that the language model will perform reliably across all desired scenarios. + +In high-stakes applications—such as legal, healthcare, or domains requiring stringent compliance—stakeholders demand assurances about model performance. Providing such guarantees is virtually impossible without quantitative assessments. The inability to measure and demonstrate performance can hinder the deployment of language models in critical areas where they could offer significant benefits. + +Additionally, when working with complex prompt chains involving multiple language model programs (LMPs), optimizing one component may inadvertently degrade the performance of others. Without systematic evaluation methods, identifying and rectifying these issues becomes a formidable challenge. This interdependency underscores the need for a holistic approach to prompt optimization. + +These challenges highlight the necessity for a more rigorous, objective, and scalable approach to prompt engineering. + +Introducing Evals +----------------- + +An **Eval** is a systematic method for evaluating language model programs using quantitative metrics over a dataset of inputs. It serves as a programmatic means to assess whether your prompt engineering efforts have successfully optimized the model's performance for your specific use case. + +### What Are Evals? + +Evals consist of three main components: + +- **Dataset**: A collection of inputs representative of the use cases you care about. This dataset should be large and varied to ensure statistical significance and to capture the diversity of scenarios your model will encounter. + +- **Metrics**: Quantitative criteria that measure how well the LMP performs on the dataset. Metrics could include accuracy, precision, recall, or custom functions that reflect specific aspects of performance relevant to your application. + +- **Qualitative Annotations**: Optional assessments providing additional context or insights into the model's outputs. These annotations can help interpret quantitative results and guide further refinements. + +By running an LMP against an Eval, you obtain scores that reflect the model's performance according to your defined metrics. + +### Benefits of Using Evals + +The use of Evals offers several key advantages: + +- **Statistical Significance**: Evaluating the model over a large and varied dataset yields meaningful performance statistics. This approach reduces the influence of outliers and provides a more accurate picture of the model's capabilities. + +- **Quantitative Analysis**: Replacing subjective judgments with objective metrics reduces cognitive load and enables more focused improvements. Quantitative feedback accelerates the optimization process by highlighting specific areas for enhancement. + +- **Reproducibility**: Consistent and comparable evaluations over time allow you to track progress and ensure that changes lead to genuine improvements. Reproducibility is essential for debugging, auditing, and maintaining confidence in the model. + +- **Scalability**: Evals facilitate efficient assessment of model performance across thousands of examples without manual intervention. 
This scalability is crucial for deploying language models in production environments where they must handle diverse and extensive input. + +The Necessity of Eval Engineering +--------------------------------- + +While Evals provide a systematic framework for assessment, creating effective Evals is an engineering process in itself—this is where **Eval Engineering** becomes crucial. + +### Why Eval Engineering Is Crucial + +An Eval that lacks discriminative power may saturate too early, showing perfect or near-perfect scores even when the model has significant room for improvement. This saturation typically results from metrics or criteria that are insufficiently sensitive to variations in output quality. + +Conversely, misaligned Evals—where the metrics do not align with the true objectives—can lead to optimizing the model in the wrong direction. The model may perform well on the Eval but fail to deliver the desired outcomes in real-world applications. + +Eval Engineering involves carefully designing and iteratively refining the dataset, metrics, and criteria to ensure that the Eval accurately reflects the qualities you desire in the model's outputs. This process mirrors prompt engineering but focuses on crafting robust evaluations rather than optimizing prompts. + +### The Process of Eval Engineering + +Eval Engineering encompasses several key activities: + +- **Defining Clear Criteria**: Establish explicit, measurable criteria that align with your goals. Clarity in what constitutes success is essential for both the prompt and Eval. + +- **Ensuring Statistical Power**: Collect sufficient and diverse data to make meaningful assessments. A well-constructed dataset captures the range of inputs the model will encounter and provides a solid foundation for evaluation. + +- **Iteratively Refining Metrics**: Adjust metrics and criteria as needed to maintain alignment with objectives and improve discriminative ability. This refinement is an ongoing process as you discover new insights or as requirements evolve. + +- **Versioning and Documentation**: Keep detailed records of Eval versions, changes made, and reasons for those changes. Proper documentation ensures transparency and facilitates collaboration among team members. + +### Turning Qualitative Evaluations into Quantitative Ones + +In scenarios where you lack ground truth labels or have open-ended generative tasks, transforming qualitative assessments into quantitative metrics is challenging. Several approaches can help bridge this gap: + +#### Using Language Models as Critics + +Language models can serve as evaluators by acting as critics of other models' outputs. By providing explicit criteria, you can prompt a language model to assess outputs and generate scores. This method leverages the language model's understanding to provide consistent evaluations. + +#### Human Evaluations + +Human evaluators can assess model outputs against defined criteria, offering qualitative annotations that convert into quantitative scores. While effective, this approach can be resource-intensive and may not scale well for large datasets. + +#### Training Reward Models + +By collecting a dataset of human evaluations, you can train a reward model—a specialized machine learning model that predicts human judgments. This reward model can then provide quantitative assessments, enabling scalable evaluations that approximate human feedback. 
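+
+As a concrete illustration, the sketch below fits a tiny reward model on a handful of human-labeled outputs using scikit-learn and exposes it through the same `(datapoint, output)` metric signature used elsewhere on this page. The data, library choice, and helper names are illustrative assumptions, not part of ell itself.
+```python
+# A minimal reward-model sketch: predict "would a human rate this output as good?"
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.linear_model import LogisticRegression
+from sklearn.pipeline import make_pipeline
+
+# Human-labeled outputs: 1 = judged good, 0 = judged bad.
+labeled_outputs = [
+    ("Thanks for reaching out! We'd love to see you at the gala.", 1),
+    ("Attend event. Details attached.", 0),
+]
+texts = [text for text, _ in labeled_outputs]
+labels = [label for _, label in labeled_outputs]
+
+reward_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
+reward_model.fit(texts, labels)
+
+def reward_metric(datapoint, output):
+    # Probability that a human rater would judge the output as good.
+    return float(reward_model.predict_proba([output])[0, 1])
+```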
+
+Implementing Evals in ell
+-------------------------
+
+**ell** introduces built-in support for Evals, integrating evaluation directly into your prompt engineering workflow.
+
+### Creating an Eval in ell
+
+An Eval in ell is defined using the `Evaluation` class:
+```python
+from ell.eval import Evaluation
+
+# A simple metric: 1.0 if the expected output appears in the model's output.
+def accuracy_metric(datapoint, output):
+    return float(datapoint['expected_output'].lower() in output.lower())
+
+# Define your Eval
+my_eval = Evaluation(
+    name='example_eval',
+    data=[{'input': 'sample input', 'expected_output': 'desired output'}],
+    metrics=[accuracy_metric],
+    description='An example Eval for demonstration purposes.'
+)
+```
+
+- **Name**: A unique identifier for the Eval.
+
+- **Data**: A list of dictionaries containing input data and, optionally, expected outputs.
+
+- **Metrics**: A list of functions that compute performance metrics.
+
+- **Description**: A textual description of the Eval's purpose and contents.
+
+### Running an Eval
+
+To run an Eval on an LMP:
+```python
+# Run the Eval
+results = my_eval.run(your_language_model_program)
+
+# Access the results
+print(results.metrics)
+```
+
+The `run` method executes the LMP on the Eval's dataset and computes the specified metrics, returning a `RunResult` object with detailed performance data.
+
+### Viewing Eval Results in ell Studio
+
+When you run an Eval, the results are automatically stored and can be viewed using ell Studio.
+
+ell Studio provides an interactive dashboard where you can visualize Eval scores, track performance over time, and compare different versions of your LMPs and Evals.
+
+Versioning Evals with ell
+-------------------------
+
+Just as prompts require versioning, Evals need version control to manage changes and ensure consistency.
+
+### Automatic Versioning
+
+ell automatically versions your Evals by hashing the following components (a conceptual sketch of such content hashing appears near the end of this page):
+
+- **Dataset**: Changes to the input data result in a new Eval version.
+
+- **Metric Functions**: Modifications to the evaluation metrics produce a new version.
+
+Each Eval version is stored with metadata, including:
+
+- **Eval ID**: A unique hash representing the Eval version.
+
+- **Creation Date**: Timestamp of when the Eval was created.
+
+- **Change Log**: Automatically generated commit messages describing changes between versions.
+
+### Benefits of Versioning Evals
+
+Versioning Evals offers significant benefits:
+
+- **Reproducibility**: Reproduce past evaluations exactly as they were conducted.
+
+- **Comparison Over Time**: Compare model performance across different Eval versions to track progress or identify regressions.
+
+- **Rollback Capability**: Revert to previous Eval versions if new changes negatively affect evaluations.
+
+- **Transparency**: Clearly document how and why Evals have changed over time, enhancing collaboration and accountability.
+
+Benefits of Eval Engineering
+----------------------------
+
+Implementing Eval Engineering provides numerous advantages:
+
+- **Enhanced Rigor**: Introduce scientific methods into prompt engineering, making the process more objective and reliable.
+
+- **Improved Collaboration**: Separate concerns by having team members focus on prompt engineering or Eval engineering, promoting specialization and efficiency.
+
+- **Faster Iterations**: Reduce the time spent on manual evaluations, allowing for quicker optimization cycles.
+
+- **Scalable Evaluations**: Efficiently handle large datasets, enabling comprehensive assessments of model performance.
+
+- **Alignment with Objectives**: Ensure that the model's outputs closely match stakeholder needs by defining explicit evaluation criteria.
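+
+Returning to the automatic versioning described above: conceptually, an Eval version ID behaves like a content hash over the dataset and the source of the metric functions. The sketch below illustrates that idea in plain Python; it is a simplified stand-in, not ell's actual implementation.
+```python
+import hashlib
+import inspect
+import json
+
+def eval_version_id(data, metric_fns):
+    """Illustrative only: derive a stable ID from the dataset and metric source code."""
+    h = hashlib.sha256()
+    h.update(json.dumps(data, sort_keys=True).encode())
+    for fn in metric_fns:
+        h.update(inspect.getsource(fn).encode())
+    return h.hexdigest()[:16]
+
+def exact_match(datapoint, output):
+    return float(output == datapoint.get('expected_output'))
+
+# Editing either the data or a metric's source changes the resulting ID,
+# which is the behaviour the automatic versioning above relies on.
+print(eval_version_id([{'input': 'sample input', 'expected_output': 'ok'}], [exact_match]))
+```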
+ +Evaluations and the Future of Prompt Engineering +----------------------------------------------- + +As language models continue to advance, the importance of robust evaluation methods will grow. Models will increasingly saturate existing Evals, meaning they perform near-perfectly on current evaluations. At this point, further improvements require constructing new Evals with greater discriminative power. + +Eval Engineering will be pivotal in pushing the boundaries of model performance. By continuously refining Evals, you can identify subtle areas for enhancement even when models appear to have plateaued. This ongoing process ensures that models remain aligned with evolving objectives and adapt to new challenges. + +Moreover, Eval Engineering is not just about immediate gains. Developing expertise in this area prepares teams for future developments in the field, positioning them to leverage advancements effectively. + +Conclusion +---------- + +Evals and Eval Engineering represent significant steps toward making prompt engineering a more systematic, reliable, and scalable process. By integrating Evals into your workflow with ell, you move beyond subjective assessments, introducing scientific rigor into the optimization of language model programs. + +The adoption of Eval Engineering practices not only improves current outcomes but also future-proofs your workflows. As language models evolve, the ability to design and implement robust evaluations will be increasingly valuable. + +To get started with Evals in ell, consult the API documentation and explore examples. By embracing Eval Engineering, you enhance your prompt engineering efforts and contribute to the advancement of the field. diff --git a/docs/src/index.rst b/docs/src/index.rst index 26f3e6ef..b71f9991 100644 --- a/docs/src/index.rst +++ b/docs/src/index.rst @@ -265,6 +265,7 @@ To get started with ``ell``, see the :doc:`Getting Started ` se core_concepts/ell_simple core_concepts/versioning_and_storage core_concepts/ell_studio + core_concepts/evaluations core_concepts/message_api core_concepts/ell_complex