LM-Eval evaluation job

The LM-Eval service defines a new Custom Resource Definition (CRD) called LMEvalJob. An LMEvalJob object represents an evaluation job and is monitored by the TrustyAI Kubernetes operator.

To run an evaluation job, you create an LMEvalJob object that specifies the model, the model arguments, the task, and any secrets the job requires.

After the LMEvalJob is created, the LM-Eval service runs the evaluation job. The status and results of the LMEvalJob object update when the information is available.

Note

Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the TrustyAIService custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.

Sample LMEvalJob object

The following sample LMEvalJob object evaluates the google/flan-t5-base model against the wnli card from the Unitxt catalog:

apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
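To create the evaluation job, apply the object to the cluster. For example, assuming the manifest above is saved as lmevaljob-sample.yaml (a hypothetical file name) and your current project is the one where TrustyAI is installed:

oc apply -f lmevaljob-sample.yaml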

After you apply the sample LMEvalJob, check its state by using the following command:

oc get lmevaljob evaljob-sample

Output similar to the following appears:

NAME             STATE
evaljob-sample   Running

Evaluation results are available when the object’s state changes to Complete. Because both the model and the dataset in this example are small, the evaluation job should finish within about 10 minutes on a CPU-only node.
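To wait for the job to finish, you can watch the object instead of polling it; the -w flag streams state updates until you interrupt the command:

oc get lmevaljob evaljob-sample -w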

Use the following command to get the results:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} | jq '.results'

Output similar to the following appears:

{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}
Notes on the results
  • The f1_micro, f1_macro, and accuracy scores are approximately 0.56, 0.36, and 0.56, respectively.

  • The full results are stored in the .status.results field of the LMEvalJob object as a JSON document. The command above retrieves only the results field of that document.
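To keep the complete document, including per-sample logs when logSamples is enabled, redirect the unfiltered output to a file instead of piping it through jq, for example:

oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
  -o template --template={{.status.results}} > results.json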

LMEvalJob properties

The following table lists each property in the LMEvalJob and its usage:

Table 1. LM-Eval properties

model

Specifies which model type or provider is evaluated. This field directly maps to the --model argument of the lm-evaluation-harness. Supported model types and providers include:

* hf: HuggingFace models

* openai-completions: OpenAI Completions API models

* openai-chat-completions: Chat Completions API models

* local-completions and local-chat-completions: OpenAI API-compatible servers

* textsynth: TextSynth APIs

modelArgs

A list of paired name and value arguments for the model type. Each model type or provider supports different arguments:

* hf (HuggingFace models): see huggingface.py

* local-completions (OpenAI API-compatible servers): see openai_completions.py and tapi_models.py

* local-chat-completions (OpenAI API-compatible servers): see openai_completions.py and tapi_models.py

* openai-completions (OpenAI Completions API models): see openai_completions.py and tapi_models.py

* openai-chat-completions (OpenAI Chat Completions API models): see openai_completions.py and tapi_models.py

* textsynth (TextSynth APIs): see textsynth.py
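For example, a minimal sketch of the model and modelArgs fields for an OpenAI API-compatible server. The served model name and endpoint URL are hypothetical placeholders, and the argument names are taken from the lm-evaluation-harness source files referenced above, so verify them there before use:

model: local-completions
modelArgs:
- name: model
  value: my-served-model                        # hypothetical served model name
- name: base_url
  value: http://my-server:8000/v1/completions   # hypothetical endpoint
- name: tokenizer_backend
  value: huggingface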

taskList.taskNames

Specifies a list of tasks supported by lm-evaluation-harness.
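For example, a sketch that runs the arc_easy task from the lm-evaluation-harness task registry (any task name the harness supports can be listed here):

taskList:
  taskNames:
  - "arc_easy"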

taskList.taskRecipes

Specifies the task using the Unitxt recipe format:

* card: Specifies the Unitxt card to use. Takes either a name or a custom field:

** name: Specifies a Unitxt card from the Unitxt catalog. Use the card’s ID as the value. For example, the ID of the Wnli card is cards.wnli.

** custom: Defines and uses a custom card. The value is a JSON object that contains the custom dataset. For more information about creating a custom card, see the Unitxt documentation. If the dataset used by the custom card needs an API key from an environment variable or a persistent volume, set up the corresponding resources in the pod field.

* template: Specifies a Unitxt template from the Unitxt catalog. Use the template’s ID as the value.

* task (optional): Specifies a Unitxt task from the Unitxt catalog. Use the task’s ID as the value. A Unitxt card has a predefined task; only specify a value here if you want to run a different task.

* metrics (optional): Specifies Unitxt metrics from the Unitxt catalog. Use each metric’s ID as the value. A Unitxt task has a set of predefined metrics; only specify a set of metrics if you need different ones.

* format (optional): Specifies a Unitxt format from the Unitxt catalog. Use the format’s ID as the value.

* loaderLimit (optional): Specifies the maximum number of instances per stream returned from the loader. You can use this parameter to reduce loading time with large datasets.

* numDemos (optional): The number of few-shot examples to use.

* demosPoolSize (optional): The size of the few-shot pool.
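For example, a sketch that extends the sample recipe above with few-shot settings; the numDemos and demosPoolSize values are arbitrary:

taskList:
  taskRecipes:
  - card:
      name: "cards.wnli"
    template: "templates.classification.multi_class.relation.default"
    numDemos: 3
    demosPoolSize: 10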

numFewShot

Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use numDemos under the taskRecipes instead.

limit

Sets a limit on the number of examples per task instead of running the entire dataset. Accepts either an integer (an absolute number of instances) or a float between 0.0 and 1.0 (a fraction of the dataset).

genArgs

Maps to the --gen_kwargs parameter for the lm-evaluation-harness. For more information, see the lm-evaluation-harness documentation.

logSamples

If this flag is set to true, the model’s outputs and the text fed into the model are saved at per-document granularity.

batchSize

Specifies the batch size for the evaluation as an integer. API models do not support the auto:N batch size; use a numeric batch size for APIs.

pod

Specifies extra information for the lm-eval job’s pod:

* container: Specifies additional container settings for the lm-eval container.

** env: Specifies environment variables. This parameter uses the EnvVar data structure of Kubernetes.

** volumeMounts: Mounts the volumes into the lm-eval container.

** resources: Specifies the resources for the lm-eval container.

* volumes: Specifies the volume information for the lm-eval and other containers. This parameter uses the Volume data structure of Kubernetes.

* sideCars: A list of containers that run along with the lm-eval container. It uses the Container data structure of Kubernetes.
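For example, a minimal sketch that injects a Hugging Face token from a pre-existing secret; the variable name, secret name, and key are hypothetical:

pod:
  container:
    env:
    - name: HF_TOKEN              # hypothetical variable for gated models
      valueFrom:
        secretKeyRef:
          name: hf-token-secret   # hypothetical secret created beforehand
          key: token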

outputs

This parameter defines a custom output location to store the evaluation results. Only Persistent Volume Claims (PVCs) are supported.

outputs.pvcManaged

Creates an operator-managed PVC to store this job’s results. The PVC is named <job-name>-pvc and is owned by the LMEvalJob. After the job finishes, the PVC remains available, but it is deleted when the LMEvalJob is deleted. Supports the following fields:

* size: The PVC’s size, compatible with standard PVC syntax (for example, 5Gi)
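For example, a minimal outputs section that asks the operator to create a 5Gi PVC for this job’s results:

outputs:
  pvcManaged:
    size: 5Gi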

outputs.pvcName

Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job.
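For example, assuming a PVC named evaluation-results already exists in the job’s namespace:

outputs:
  pvcName: "evaluation-results"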