LM-Eval evaluation job

The LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job and is monitored by the TrustyAI Kubernetes operator.

To run an evaluation job, you create an LMEvalJob object that specifies the model, the model arguments, the task, and any secrets the job requires.

After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object update when the information is available.

Note

Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.

Sample LMEvalJob object

The following sample LMEvalJob object evaluates the google/flan-t5-base model against the wnli card from the Unitxt catalog:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
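To create the evaluation job, apply the object to the cluster. For example, assuming the manifest above is saved as lmevaljob-sample.yaml (a hypothetical file name) and your current project is the one where TrustyAI is installed:

oc apply -f lmevaljob-sample.yaml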

After you apply the sample LMEvalJob, check its state by using the following command:

oc get lmevaljob evaljob-sample

Output similar to the following appears:

NAME             STATE
evaljob-sample   Running

Evaluation results are available when the object’s state changes to Complete. Because both the model and the dataset in this example are small, the evaluation job should finish within about 10 minutes on a CPU-only node.
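To wait for the job to finish, you can watch the object instead of polling it; the -w flag streams state updates until you interrupt the command:

oc get lmevaljob evaljob-sample -w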

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

Output similar to the following appears:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}
Notes on the results
  • The f1_micro, f1_macro, and accuracy scores are approximately 0.56, 0.36, and 0.56, respectively.

  • The full results are stored in the .status.results field of the LMEvalJob object as a JSON document. The command above retrieves only the results field of that document.
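To keep the complete document, including per-sample logs when logSamples is enabled, redirect the unfiltered output to a file instead of piping it through jq, for example:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} > results.json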

LMEvalJob properties

The following table lists each property in the LMEvalJob and its usage:

Table 1. LM-Eval properties

model

Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. Supported model types and providers include:

* hf: HuggingFace models

* openai-completions: OpenAI Completions API models

* openai-chat-completions: Chat Completions API models

* local-completions and local-chat-completions: OpenAI API-compatible servers

* textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Each model type or provider supports different arguments:

* hf (HuggingFace models): see huggingface.py

* local-completions (OpenAI API-compatible servers): see openai_completions.py and tapi_models.py

* local-chat-completions (OpenAI API-compatible servers): see openai_completions.py and tapi_models.py

* openai-completions (OpenAI Completions API models): see openai_completions.py and tapi_models.py

* openai-chat-completions (OpenAI Chat Completions API models): see openai_completions.py and tapi_models.py

* textsynth (TextSynth APIs): see textsynth.py
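For example, a minimal sketch of the model and modelArgs fields for an OpenAI API-compatible server. The served model name and endpoint URL are hypothetical placeholders, and the argument names are taken from the lm-evaluation-harness source files referenced above, so verify them there before use:

model: local-completions
modelArgs:
- name: model
  value: my-served-model                        # hypothetical served model name
- name: base_url
  value: http://my-server:8000/v1/completions   # hypothetical endpoint
- name: tokenizer_backend
  value: huggingface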

taskList.taskNames

Specifies a list of tasks supported by lm-evaluation-harness.
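For example, a sketch that runs the arc_easy task from the lm-evaluation-harness task registry (any task name the harness supports can be listed here):

taskList:
  taskNames:
  - "arc_easy"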

taskList.taskRecipes

Specifies the task using the Unitxt recipe format:

* card: Specifies the Unitxt card to use. Takes either a name or a custom field:

** name: Specifies a Unitxt card from the Unitxt catalog. Use the card’s ID as the value. For example, the ID of the Wnli card is cards.wnli.

** custom: Defines and uses a custom card. The value is a JSON object that contains the custom dataset. For more information about creating a custom card, see the Unitxt documentation. If the dataset used by the custom card needs an API key from an environment variable or a persistent volume, set up the corresponding resources in the pod field.

* template: Specifies a Unitxt template from the Unitxt catalog. Use the template’s ID as the value.

* task (optional): Specifies a Unitxt task from the Unitxt catalog. Use the task’s ID as the value. A Unitxt card has a predefined task; only specify a value here if you want to run a different task.

* metrics (optional): Specifies Unitxt metrics from the Unitxt catalog. Use each metric’s ID as the value. A Unitxt task has a set of predefined metrics; only specify a set of metrics if you need different ones.

* format (optional): Specifies a Unitxt format from the Unitxt catalog. Use the format’s ID as the value.

* loaderLimit (optional): Specifies the maximum number of instances per stream returned from the loader. You can use this parameter to reduce loading time with large datasets.

* numDemos (optional): The number of few-shot examples to use.

* demosPoolSize (optional): The size of the few-shot pool.
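For example, a sketch that extends the sample recipe above with few-shot settings; the numDemos and demosPoolSize values are arbitrary:

taskList:
  taskRecipes:
  - card:
      name: "cards.wnli"
    template: "templates.classification.multi_class.relation.default"
    numDemos: 3
    demosPoolSize: 10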

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use numDemos under the taskRecipes instead.

limit

Sets a limit on the number of examples per task instead of running the entire dataset. Accepts either an integer (an absolute number of instances) or a float between 0.0 and 1.0 (a fraction of the dataset).

genArgs

Maps to the --gen_kwargs parameter for the lm-evaluation-harness. For more information, see the lm-evaluation-harness documentation.

logSamples

If this flag is set to true, the model’s outputs and the text fed into the model are saved at per-document granularity.

batchSize

Specifies the batch size for the evaluation as an integer. API models do not support the auto:N batch size; use a numeric batch size for APIs.

pod

Specifies extra information for the lm-eval job’s pod:

* container: Specifies additional container settings for the lm-eval container.

** env: Specifies environment variables. This parameter uses the EnvVar data structure of Kubernetes.

** volumeMounts: Mounts the volumes into the lm-eval container.

** resources: Specifies the resources for the lm-eval container.

* volumes: Specifies the volume information for the lm-eval and other containers. This parameter uses the Volume data structure of Kubernetes.

* sideCars: A list of containers that run along with the lm-eval container. It uses the Container data structure of Kubernetes.
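For example, a minimal sketch that injects a Hugging Face token from a pre-existing secret; the variable name, secret name, and key are hypothetical:

pod:
  container:
    env:
    - name: HF_TOKEN              # hypothetical variable for gated models
      valueFrom:
        secretKeyRef:
          name: hf-token-secret   # hypothetical secret created beforehand
          key: token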

outputs

This parameter defines a custom output location to store the evaluation results. Only Persistent Volume Claims (PVCs) are supported.

outputs.pvcManaged

Creates an operator-managed PVC to store this job’s results. The PVC is named <job-name>-pvc and is owned by the LMEvalJob. After the job finishes, the PVC remains available, but it is deleted when the LMEvalJob is deleted. Supports the following fields:

* size: The PVC’s size, compatible with standard PVC syntax (for example, 5Gi)
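For example, a minimal outputs section that asks the operator to create a 5Gi PVC for this job’s results:

outputs:
  pvcManaged:
    size: 5Gi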

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.
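For example, assuming a PVC named evaluation-results already exists in the job’s namespace:

outputs:
  pvcName: "evaluation-results"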