The LM-Eval service defines a new Custom Resource Definition (CRD) called `LMEvalJob`. An `LMEvalJob` object represents an evaluation job. `LMEvalJob` objects are monitored by the TrustyAI Kubernetes operator.
To run an evaluation job, you create an `LMEvalJob` object with the following information: model, model arguments, task, and secret.
After the `LMEvalJob` object is created, the LM-Eval service runs the evaluation job. The status and results of the `LMEvalJob` object are updated as the information becomes available.
Note: Other TrustyAI features (such as bias and drift metrics) do not support non-tabular models (including LLMs). Deploying the `TrustyAIService` custom resource (CR) in a namespace that contains non-tabular models (such as the namespace where an evaluation job is being executed) can cause errors within the TrustyAI service.
The sample `LMEvalJob` object contains the following features:
- The `google/flan-t5-base` model from Hugging Face.
- The dataset from the `wnli` subset of the General Language Understanding Evaluation (GLUE). For more information about the `wnli` TaskCard, see the Unitxt website.
- The following default parameters for the `multi_class.relation` template Unitxt task: `f1_micro`, `f1_macro`, and `accuracy`.
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: evaljob-sample
spec:
  model: hf
  modelArgs:
  - name: pretrained
    value: google/flan-t5-base
  taskList:
    taskRecipes:
    - card:
        name: "cards.wnli"
      template: "templates.classification.multi_class.relation.default"
  logSamples: true
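
If you generate evaluation jobs from a script, a manifest with the same shape as the sample can be built as a plain dictionary. The following is a minimal sketch, not part of the LM-Eval service: the `build_lmevaljob` helper and its parameter names are hypothetical, and only the fields shown in the sample are covered. Because `oc` accepts JSON manifests as well as YAML, the printed output can be saved to a file and applied.

```python
import json


def build_lmevaljob(name, model, model_args, card, template, log_samples=True):
    """Return an LMEvalJob manifest as a dict (hypothetical helper).

    Only the fields that appear in the sample object are filled in.
    """
    return {
        "apiVersion": "trustyai.opendatahub.io/v1alpha1",
        "kind": "LMEvalJob",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            # modelArgs is a list of name/value pairs, as in the sample.
            "modelArgs": [{"name": k, "value": v} for k, v in model_args.items()],
            "taskList": {
                "taskRecipes": [{"card": {"name": card}, "template": template}]
            },
            "logSamples": log_samples,
        },
    }


manifest = build_lmevaljob(
    name="evaljob-sample",
    model="hf",
    model_args={"pretrained": "google/flan-t5-base"},
    card="cards.wnli",
    template="templates.classification.multi_class.relation.default",
)

# Print the manifest as JSON so it can be redirected to a file.
print(json.dumps(manifest, indent=2))
```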
After you apply the sample `LMEvalJob`, check its state by using the following command:
oc get lmevaljob evaljob-sample
Output similar to the following appears:
NAME             STATE
evaljob-sample   Running
Evaluation results are available when the object's state changes to `Complete`. Both the model and dataset in this example are small. The evaluation job should finish within 10 minutes on a CPU-only node.
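
Instead of re-running the state check by hand, you can poll until the job reports `Complete`. The following is a minimal sketch under stated assumptions: it assumes that `oc` is on the `PATH` and that the state shown in the STATE column can be read from `.status.state`, which this document does not confirm. The function names, polling interval, and timeout are hypothetical.

```python
import subprocess
import time


def oc_get_state(name):
    """Read the job state by calling `oc`.

    Assumption: the STATE column is backed by `.status.state`.
    """
    result = subprocess.run(
        ["oc", "get", "lmevaljob", name, "-o", "jsonpath={.status.state}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def wait_for_complete(name, get_state=oc_get_state, interval=15.0,
                      timeout=900.0, sleep=time.sleep):
    """Poll until the job reports Complete.

    Returns True when the state is Complete, or False once `timeout`
    seconds of waiting have elapsed without reaching it.
    """
    waited = 0.0
    while True:
        if get_state(name) == "Complete":
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
```

For the sample job, calling `wait_for_complete("evaljob-sample")` blocks until the state is `Complete` or the timeout expires. The `get_state` and `sleep` parameters exist so that the loop can be exercised without a cluster.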
Use the following command to get the results:
oc get lmevaljobs.trustyai.opendatahub.io evaljob-sample \
-o template --template={{.status.results}} | jq '.results'
Output similar to the following appears:
{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}
- The `f1_micro`, `f1_macro`, and `accuracy` scores are 0.56, 0.36, and 0.56.
- The full results are stored in the `.status.results` of the `LMEvalJob` object as a JSON document. The command above only retrieves the `results` field of the JSON document.
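
The rounded scores quoted above can be derived from the raw output with a few lines of Python. This is a minimal sketch: the `summarize` helper is hypothetical, and it assumes that every metric key has the `name,suffix` form shown in the sample output (here the suffix is `none`) and that `_stderr` entries and non-numeric values can be skipped.

```python
import json

# Sample output copied from the results shown above.
raw = """
{
  "tr_0": {
    "alias": "tr_0",
    "f1_micro,none": 0.5633802816901409,
    "f1_micro_stderr,none": "N/A",
    "accuracy,none": 0.5633802816901409,
    "accuracy_stderr,none": "N/A",
    "f1_macro,none": 0.36036036036036034,
    "f1_macro_stderr,none": "N/A"
  }
}
"""


def summarize(results, digits=2):
    """Return {task: {metric: rounded score}} for the numeric metrics."""
    summary = {}
    for task, metrics in results.items():
        scores = {}
        for key, value in metrics.items():
            # Keep only the part of the key before the comma.
            name = key.partition(",")[0]
            # Skip stderr entries and non-numeric values such as "alias".
            if name.endswith("_stderr") or not isinstance(value, (int, float)):
                continue
            scores[name] = round(value, digits)
        summary[task] = scores
    return summary


summary = summarize(json.loads(raw))
print(summary)
# {'tr_0': {'f1_micro': 0.56, 'accuracy': 0.56, 'f1_macro': 0.36}}
```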
The following table lists each property in the `LMEvalJob` and its usage:
| Parameter | Description |
|---|---|
| `model` | Specifies which model type or provider is evaluated. This field directly maps to the |
| `modelArgs` | A list of paired name and value arguments for the model type. Each model type or provider supports different arguments: `openai-completions` (OpenAI Completions API models): check `openai_completions.py` and `tapi_models.py`; `openai-chat-completions` (ChatCompletions API models): check `openai_completions.py` and `tapi_models.py`; `textsynth` (TextSynth APIs): check `textsynth.py`. |
| | Specifies a list of tasks supported by |
| `taskList.taskRecipes` | Specifies the task using the Unitxt recipe format: |
| | Sets the number of few-shot examples to place in context. If you are using a task from Unitxt, do not use this field. Use |
| | Sets a limit on the number of examples to run, instead of running the entire dataset. Accepts either an integer or a float between 0.0 and 1.0. |
| | Maps to the |
| `logSamples` | If this flag is passed, the model's outputs and the text fed into the model are saved at per-document granularity. |
| | Specifies the batch size for the evaluation in integer format. The |
| | Specifies extra information for the |
| | Defines a custom output location to store the evaluation results. Only Persistent Volume Claims (PVC) are supported. |
| | Creates an operator-managed PVC to store this job's results. The PVC is named |
| | Binds an existing PVC to a job by specifying its name. The PVC must be created separately and must already exist when creating the job. |
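
The table row about limiting a run accepts either an integer or a float between 0.0 and 1.0. The following sketch shows one plausible reading of those two forms: an integer is taken as an absolute number of examples, and a float as a fraction of the dataset. This interpretation is an assumption, not confirmed by this document, and the `resolve_limit` helper and the dataset size of 200 are hypothetical. The actual behavior is determined by the evaluation backend.

```python
import math


def resolve_limit(limit, dataset_size):
    """Return how many examples would run under the assumed semantics.

    None         -> the entire dataset
    int          -> that many examples, capped at the dataset size
    float <= 1.0 -> that fraction of the dataset, rounded up
    """
    if limit is None:
        return dataset_size
    if isinstance(limit, bool) or limit <= 0:
        raise ValueError("limit must be a positive integer or fraction")
    if isinstance(limit, int):
        return min(limit, dataset_size)
    if limit <= 1.0:
        return math.ceil(dataset_size * limit)
    raise ValueError("float limits must be between 0.0 and 1.0")


print(resolve_limit(50, 200))    # 50 examples
print(resolve_limit(0.25, 200))  # a quarter of 200 examples: 50
print(resolve_limit(None, 200))  # no limit: all 200 examples
```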