Labels: Needs Triage (Need team to review and classify), bug (Something isn't working)
Description
Version
1.4.0rc2.dev18+g34744ab9
Which installation method(s) does this occur on?
Source
Describe the bug.
When using nat eval --skip_workflow to re-run evaluators on existing workflow output, evaluators lose access to custom fields from the original dataset.
Only the workflow output fields are available in item.full_dataset_entry.
Example:
- My dataset has difficulty and category fields. When running normally, my evaluator can access them via item.full_dataset_entry["difficulty"]. When using --skip_workflow, these fields are missing - only the workflow output fields (id, question, generated_answer, etc.) are available.
Expected: --skip_workflow should merge the original dataset with workflow output, preserving all original fields so evaluators can still access them.
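For illustration, here is a minimal sketch (a hypothetical helper, not actual toolkit code) of the merge I would expect --skip_workflow to perform, assuming entries are joined on the configured id_key:

# Hypothetical sketch, not NAT code: join each workflow output entry with its
# original dataset entry by id so custom fields survive a --skip_workflow run.
def merge_dataset_with_workflow_output(dataset: list[dict],
                                       workflow_output: list[dict],
                                       id_key: str = "id") -> list[dict]:
    original_by_id = {entry[id_key]: entry for entry in dataset}
    merged = []
    for out_entry in workflow_output:
        combined = dict(original_by_id.get(out_entry[id_key], {}))  # e.g. 'difficulty', 'category'
        combined.update(out_entry)  # workflow fields ('generated_answer', ...) win on conflicts
        merged.append(combined)
    return merged

With a merge like this, item.full_dataset_entry would contain both the original dataset fields and the workflow output fields.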
Minimum reproducible example
=== SETUP ===
Create a directory called "reproduce_issue" with these 5 files:
=== FILE: pyproject.toml ===
[project]
name = "reproduce_issue"
version = "0.1.0"
[tool.setuptools]
py-modules = ["register", "custom_evaluator"]
[project.entry-points.'nat.components']
reproduce_issue = "register"
=== FILE: register.py ===
"""Register custom evaluator with NAT."""
from custom_evaluator import DifficultyAwareEvaluator # noqa: F401
=== FILE: custom_evaluator.py ===
"""Custom evaluator that needs 'difficulty' and 'category' fields from original dataset."""
from nat.builder.builder import EvalBuilder
from nat.builder.evaluator import EvaluatorInfo
from nat.cli.register_workflow import register_evaluator
from nat.data_models.evaluator import EvaluatorBaseConfig
from nat.eval.evaluator.base_evaluator import BaseEvaluator
from nat.eval.evaluator.evaluator_model import EvalInputItem, EvalOutputItem
class DifficultyAwareEvaluatorConfig(EvaluatorBaseConfig, name="difficulty_aware"):  # type: ignore[call-arg]
    pass


class DifficultyAwareEvaluator(BaseEvaluator):

    def __init__(self, max_concurrency: int = 5):
        super().__init__(max_concurrency=max_concurrency, tqdm_desc="Checking difficulty")

    async def evaluate_item(self, item: EvalInputItem) -> EvalOutputItem:
        difficulty = item.full_dataset_entry.get("difficulty")
        category = item.full_dataset_entry.get("category")
        if difficulty is None or category is None:
            return EvalOutputItem(
                id=item.id,
                score=0.0,
                reasoning={
                    "error": "BUG: 'difficulty' and/or 'category' fields are missing!",
                    "available_keys": list(item.full_dataset_entry.keys()),
                },
            )
        return EvalOutputItem(
            id=item.id,
            score=1.0,
            reasoning={"difficulty": difficulty, "category": category},
        )


@register_evaluator(config_type=DifficultyAwareEvaluatorConfig)
async def difficulty_aware(config: DifficultyAwareEvaluatorConfig, builder: EvalBuilder):
    evaluator = DifficultyAwareEvaluator(max_concurrency=builder.get_max_concurrency())
    yield EvaluatorInfo(
        config=config,
        evaluate_fn=evaluator.evaluate,
        description="Difficulty Aware Evaluator",
    )
=== FILE: config.yml ===
llms:
  llm:
    _type: openai
    model_name: gpt-4o-mini

workflow:
  _type: test_echo

eval:
  general:
    max_concurrency: 1
    output:
      dir: ./output
    dataset:
      _type: json
      file_path: ./dataset.json
      id_key: id
      structure:
        question_key: question
  evaluators:
    difficulty_check:
      _type: difficulty_aware
=== FILE: dataset.json ===
[
{"id": "q1", "question": "What is 2+2?", "difficulty": "easy", "category": "math"},
{"id": "q2", "question": "What is the capital of France?", "difficulty": "easy", "category": "geography"}
]
=== REPRODUCE ===
cd reproduce_issue
uv pip install -e .
# Step 1: Normal run - WORKS (score=1.0)
rm -rf output output2
nat eval --config_file config.yml --dataset dataset.json
# Step 2: Re-run evaluators only - FAILS (score=0.0)
nat eval --config_file config.yml --dataset output/workflow_output.json --skip_workflow --override eval.general.output.dir ./output2

Relevant log output
Step 1 output (output/difficulty_check_output.json) - evaluator finds the fields:
{
"average_score": 1.0,
"eval_output_items": [
{"id": "q1", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "math"}},
{"id": "q2", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "geography"}}
]
}
Step 2 output (output2/difficulty_check_output.json) - fields are MISSING:
{
"average_score": 0.0,
"eval_output_items": [
{"id": "q1", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}},
{"id": "q2", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}}
]
}
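As a temporary workaround (assuming output/workflow_output.json is a plain JSON list of entries keyed by id, as the available_keys above suggest), the missing fields can be copied back in manually before the --skip_workflow run. This is a sketch, not toolkit functionality:

# workaround_merge.py - hypothetical script, not part of the toolkit.
# Copies fields such as 'difficulty' and 'category' from dataset.json into the
# workflow output so evaluators can see them again under --skip_workflow.
import json

with open("dataset.json") as f:
    original_by_id = {entry["id"]: entry for entry in json.load(f)}

with open("output/workflow_output.json") as f:
    workflow_output = json.load(f)

for entry in workflow_output:
    for key, value in original_by_id.get(entry["id"], {}).items():
        entry.setdefault(key, value)  # keep workflow values, add missing original fields

with open("output/workflow_output_merged.json", "w") as f:
    json.dump(workflow_output, f, indent=2)

Then point --dataset at the merged file:

nat eval --config_file config.yml --dataset output/workflow_output_merged.json --skip_workflow --override eval.general.output.dir ./output2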
Other/Misc.
No response
Code of Conduct
- I agree to follow the NeMo Agent toolkit Code of Conduct
- I have searched the open bugs and have found no duplicates for this bug report