
--skip_workflow loses original dataset fields - evaluators can't access custom fields #1385

@maxjeblick

Description

Version

1.4.0rc2.dev18+g34744ab9

Which installation method(s) does this occur on?

Source

Describe the bug.

When using nat eval --skip_workflow to re-run evaluators on existing workflow output, evaluators lose access to custom fields from the original dataset.
Only the workflow output fields are available in item.full_dataset_entry.

Example:

  • My dataset has difficulty and category fields. When running normally, my evaluator can access them via item.full_dataset_entry["difficulty"]. When using --skip_workflow, these fields are missing; only the workflow output fields (id, question, generated_answer, etc.) are available.

Expected: --skip_workflow should merge the original dataset with the workflow output, preserving all original fields so evaluators can still access them.
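
A minimal sketch of the expected behavior for a single item (illustrative only; the field values come from the example dataset below, and the merge precedence is an assumption, not the toolkit's actual implementation):

# Illustrative only: how --skip_workflow could rebuild full_dataset_entry per item.
original_entry = {"id": "q1", "question": "What is 2+2?", "difficulty": "easy", "category": "math"}
workflow_entry = {"id": "q1", "question": "What is 2+2?", "generated_answer": "4"}

# Workflow output wins on shared keys; original-only fields ("difficulty", "category") are preserved.
merged = {**original_entry, **workflow_entry}
assert merged["difficulty"] == "easy" and merged["category"] == "math"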

Minimum reproducible example

=== SETUP ===

Create a directory called "reproduce_issue" with these 5 files:

=== FILE: pyproject.toml ===
[project]
name = "reproduce_issue"
version = "0.1.0"

[tool.setuptools]
py-modules = ["register", "custom_evaluator"]

[project.entry-points.'nat.components']
reproduce_issue = "register"

=== FILE: register.py ===
"""Register custom evaluator with NAT."""
from custom_evaluator import DifficultyAwareEvaluator  # noqa: F401

=== FILE: custom_evaluator.py ===
"""Custom evaluator that needs 'difficulty' and 'category' fields from original dataset."""

from nat.builder.builder import EvalBuilder
from nat.builder.evaluator import EvaluatorInfo
from nat.cli.register_workflow import register_evaluator
from nat.data_models.evaluator import EvaluatorBaseConfig
from nat.eval.evaluator.base_evaluator import BaseEvaluator
from nat.eval.evaluator.evaluator_model import EvalInputItem, EvalOutputItem


class DifficultyAwareEvaluatorConfig(EvaluatorBaseConfig, name="difficulty_aware"):  # type: ignore[call-arg]
    pass


class DifficultyAwareEvaluator(BaseEvaluator):
    def __init__(self, max_concurrency: int = 5):
        super().__init__(max_concurrency=max_concurrency, tqdm_desc="Checking difficulty")

    async def evaluate_item(self, item: EvalInputItem) -> EvalOutputItem:
        difficulty = item.full_dataset_entry.get("difficulty")
        category = item.full_dataset_entry.get("category")

        if difficulty is None or category is None:
            return EvalOutputItem(
                id=item.id,
                score=0.0,
                reasoning={
                    "error": "BUG: 'difficulty' and/or 'category' fields are missing!",
                    "available_keys": list(item.full_dataset_entry.keys()),
                },
            )

        return EvalOutputItem(
            id=item.id,
            score=1.0,
            reasoning={"difficulty": difficulty, "category": category},
        )


@register_evaluator(config_type=DifficultyAwareEvaluatorConfig)
async def difficulty_aware(config: DifficultyAwareEvaluatorConfig, builder: EvalBuilder):
    evaluator = DifficultyAwareEvaluator(max_concurrency=builder.get_max_concurrency())
    yield EvaluatorInfo(
        config=config,
        evaluate_fn=evaluator.evaluate,
        description="Difficulty Aware Evaluator",
    )

=== FILE: config.yml ===
llms:
  llm:
    _type: openai
    model_name: gpt-4o-mini

workflow:
  _type: test_echo

eval:
  general:
    max_concurrency: 1
    output:
      dir: ./output
    dataset:
      _type: json
      file_path: ./dataset.json
      id_key: id
      structure:
        question_key: question
  evaluators:
    difficulty_check:
      _type: difficulty_aware

=== FILE: dataset.json ===
[
  {"id": "q1", "question": "What is 2+2?", "difficulty": "easy", "category": "math"},
  {"id": "q2", "question": "What is the capital of France?", "difficulty": "easy", "category": "geography"}
]

=== REPRODUCE ===

cd reproduce_issue
uv pip install -e .

# Step 1: Normal run - WORKS (score=1.0)
rm -rf output output2
nat eval --config_file config.yml --dataset dataset.json

# Step 2: Re-run evaluators only - FAILS (score=0.0)
nat eval --config_file config.yml --dataset output/workflow_output.json --skip_workflow --override eval.general.output.dir ./output2
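
A possible interim workaround (a sketch only; the script name merge_outputs.py, the merged file path, and the merge precedence are assumptions): re-attach the original dataset fields to the workflow output by id, then point --dataset at the merged file in step 2.

=== FILE: merge_outputs.py (hypothetical workaround script, not part of the toolkit) ===
"""Re-attach original dataset fields to the workflow output before a --skip_workflow run."""
import json

with open("dataset.json", encoding="utf-8") as f:
    original = {row["id"]: row for row in json.load(f)}
with open("output/workflow_output.json", encoding="utf-8") as f:
    workflow_rows = json.load(f)

# Keep every original field; let the workflow output win on shared keys (e.g. generated_answer).
merged = [{**original.get(row["id"], {}), **row} for row in workflow_rows]

with open("output/merged_workflow_output.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, indent=2)

# Then:
python merge_outputs.py
nat eval --config_file config.yml --dataset output/merged_workflow_output.json --skip_workflow --override eval.general.output.dir ./output2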

Relevant log output

Step 1 output (difficulty_check_output.json) - evaluator finds fields:

{
  "average_score": 1.0,
  "eval_output_items": [
    {"id": "q1", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "math"}},
    {"id": "q2", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "geography"}}
  ]
}
Step 2 output (output2/difficulty_check_output.json) - fields are MISSING:

{
  "average_score": 0.0,
  "eval_output_items": [
    {"id": "q1", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}},
    {"id": "q2", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}}
  ]
}

Other/Misc.

No response

Code of Conduct

  • I agree to follow the NeMo Agent toolkit Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Labels: Needs Triage, bug
