
--skip_workflow loses original dataset fields - evaluators can't access custom fields #1385

@maxjeblick

Description

Version

1.4.0rc2.dev18+g34744ab9

Which installation method(s) does this occur on?

Source

Describe the bug.

When using nat eval --skip_workflow to re-run evaluators on existing workflow output, evaluators lose access to custom fields from the original dataset.
Only the workflow output fields are available in item.full_dataset_entry.

Example:

  • My dataset has difficulty and category fields. When running normally, my evaluator can access them via item.full_dataset_entry["difficulty"]. When using --skip_workflow, these fields are missing; only the workflow output fields (id, question, generated_answer, etc.) are available.

Expected: --skip_workflow should merge the original dataset with the workflow output, preserving all original fields so evaluators can still access them.
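
A minimal sketch of the expected behavior for a single item (illustrative only; the field values come from the example dataset below, and the merge precedence is an assumption, not the toolkit's actual implementation):

# Illustrative only: how --skip_workflow could rebuild full_dataset_entry per item.
original_entry = {"id": "q1", "question": "What is 2+2?", "difficulty": "easy", "category": "math"}
workflow_entry = {"id": "q1", "question": "What is 2+2?", "generated_answer": "4"}

# Workflow output wins on shared keys; original-only fields ("difficulty", "category") are preserved.
merged = {**original_entry, **workflow_entry}
assert merged["difficulty"] == "easy" and merged["category"] == "math"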

Minimum reproducible example

=== SETUP ===

Create a directory called "reproduce_issue" with these 5 files:

=== FILE: pyproject.toml ===
[project]
name = "reproduce_issue"
version = "0.1.0"

[tool.setuptools]
py-modules = ["register", "custom_evaluator"]

[project.entry-points.'nat.components']
reproduce_issue = "register"

=== FILE: register.py ===
"""Register custom evaluator with NAT."""
from custom_evaluator import DifficultyAwareEvaluator  # noqa: F401

=== FILE: custom_evaluator.py ===
"""Custom evaluator that needs 'difficulty' and 'category' fields from original dataset."""

from nat.builder.builder import EvalBuilder
from nat.builder.evaluator import EvaluatorInfo
from nat.cli.register_workflow import register_evaluator
from nat.data_models.evaluator import EvaluatorBaseConfig
from nat.eval.evaluator.base_evaluator import BaseEvaluator
from nat.eval.evaluator.evaluator_model import EvalInputItem, EvalOutputItem


class DifficultyAwareEvaluatorConfig(EvaluatorBaseConfig, name="difficulty_aware"):  # type: ignore[call-arg]
    pass


class DifficultyAwareEvaluator(BaseEvaluator):
    def __init__(self, max_concurrency: int = 5):
        super().__init__(max_concurrency=max_concurrency, tqdm_desc="Checking difficulty")

    async def evaluate_item(self, item: EvalInputItem) -> EvalOutputItem:
        difficulty = item.full_dataset_entry.get("difficulty")
        category = item.full_dataset_entry.get("category")

        if difficulty is None or category is None:
            return EvalOutputItem(
                id=item.id,
                score=0.0,
                reasoning={
                    "error": "BUG: 'difficulty' and/or 'category' fields are missing!",
                    "available_keys": list(item.full_dataset_entry.keys()),
                },
            )

        return EvalOutputItem(
            id=item.id,
            score=1.0,
            reasoning={"difficulty": difficulty, "category": category},
        )


@register_evaluator(config_type=DifficultyAwareEvaluatorConfig)
async def difficulty_aware(config: DifficultyAwareEvaluatorConfig, builder: EvalBuilder):
    evaluator = DifficultyAwareEvaluator(max_concurrency=builder.get_max_concurrency())
    yield EvaluatorInfo(
        config=config,
        evaluate_fn=evaluator.evaluate,
        description="Difficulty Aware Evaluator",
    )

=== FILE: config.yml ===
llms:
  llm:
    _type: openai
    model_name: gpt-4o-mini

workflow:
  _type: test_echo

eval:
  general:
    max_concurrency: 1
    output:
      dir: ./output
    dataset:
      _type: json
      file_path: ./dataset.json
      id_key: id
      structure:
        question_key: question
  evaluators:
    difficulty_check:
      _type: difficulty_aware

=== FILE: dataset.json ===
[
  {"id": "q1", "question": "What is 2+2?", "difficulty": "easy", "category": "math"},
  {"id": "q2", "question": "What is the capital of France?", "difficulty": "easy", "category": "geography"}
]

=== REPRODUCE ===

cd reproduce_issue
uv pip install -e .

# Step 1: Normal run - WORKS (score=1.0)
rm -rf output output2
nat eval --config_file config.yml --dataset dataset.json

# Step 2: Re-run evaluators only - FAILS (score=0.0)
nat eval --config_file config.yml --dataset output/workflow_output.json --skip_workflow --override eval.general.output.dir ./output2
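
A possible interim workaround (a sketch only; the script name merge_outputs.py, the merged file path, and the merge precedence are assumptions): re-attach the original dataset fields to the workflow output by id, then point --dataset at the merged file in step 2.

=== FILE: merge_outputs.py (hypothetical workaround script, not part of the toolkit) ===
"""Re-attach original dataset fields to the workflow output before a --skip_workflow run."""
import json

with open("dataset.json", encoding="utf-8") as f:
    original = {row["id"]: row for row in json.load(f)}
with open("output/workflow_output.json", encoding="utf-8") as f:
    workflow_rows = json.load(f)

# Keep every original field; let the workflow output win on shared keys (e.g. generated_answer).
merged = [{**original.get(row["id"], {}), **row} for row in workflow_rows]

with open("output/merged_workflow_output.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, indent=2)

# Then:
python merge_outputs.py
nat eval --config_file config.yml --dataset output/merged_workflow_output.json --skip_workflow --override eval.general.output.dir ./output2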

Relevant log output

Step 1 output (difficulty_check_output.json) - evaluator finds fields:

{
  "average_score": 1.0,
  "eval_output_items": [
    {"id": "q1", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "math"}},
    {"id": "q2", "score": 1.0, "reasoning": {"difficulty": "easy", "category": "geography"}}
  ]
}
Step 2 output (output2/difficulty_check_output.json) - fields are MISSING:

{
  "average_score": 0.0,
  "eval_output_items": [
    {"id": "q1", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}},
    {"id": "q2", "score": 0.0, "reasoning": {"error": "BUG: 'difficulty' and/or 'category' fields are missing!", "available_keys": ["id", "question", "answer", "generated_answer", "intermediate_steps", "expected_intermediate_steps"]}}
  ]
}

Other/Misc.

No response

Code of Conduct

  • I agree to follow the NeMo Agent toolkit Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

Labels: Needs Triage, bug
