json: add optional return_file_name parameter #7948

Open
Sachin-0001 wants to merge 1 commit into huggingface:main from Sachin-0001:add-return-file-name

@Sachin-0001

This PR adds an optional return_file_name parameter to the JSON dataset loader.

When enabled, a new file_name column is added containing the source file name
for each row. Default behavior is unchanged.

Changes:

  • Add return_file_name to JsonConfig
  • Append file name during JSON table generation
  • Add tests covering the default and enabled behavior and confirming that other functionality is not affected
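
For readers who want a concrete picture of the behavior, here is a minimal standalone sketch. It is not the code in this PR: `iter_json_rows` is a hypothetical helper that reads JSON Lines files with the standard library only, and it assumes the `file_name` column holds the file's basename rather than its full path.

```python
import json
import os
from typing import Any, Dict, Iterable, Iterator


def iter_json_rows(
    paths: Iterable[str], return_file_name: bool = False
) -> Iterator[Dict[str, Any]]:
    """Yield one dict per JSON Lines record, optionally tagged with its source file."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                row = json.loads(line)
                if return_file_name:
                    # Assumption: store the basename, not the full path.
                    row["file_name"] = os.path.basename(path)
                yield row
```

With `return_file_name=False` (the default) the rows come back unchanged, which mirrors the "default behavior is unchanged" guarantee described above.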

Motivation:
This helps resume training from checkpoints by identifying already-consumed data shards.
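
As a rough illustration of that use case, the sketch below filters out rows whose shard was already consumed before a checkpoint. `skip_consumed_shards` and the shard names are hypothetical; the only thing assumed from this PR is that each row carries a `file_name` value.

```python
from typing import Any, Dict, Iterable, Iterator, Set


def skip_consumed_shards(
    rows: Iterable[Dict[str, Any]], consumed_files: Set[str]
) -> Iterator[Dict[str, Any]]:
    """Yield only rows whose source file is not in the consumed set."""
    for row in rows:
        if row["file_name"] not in consumed_files:
            yield row


# Example: shard-0 was fully read before the checkpoint, so only shard-1 rows remain.
rows = [
    {"text": "a", "file_name": "shard-0.jsonl"},
    {"text": "b", "file_name": "shard-1.jsonl"},
]
remaining = list(skip_consumed_shards(rows, {"shard-0.jsonl"}))
```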

Fixes #5806

dhruvildarji added a commit to dhruvildarji/datasets that referenced this pull request Feb 23, 2026
Add an optional `return_file_name` parameter to `CsvConfig` that, when
set to `True`, appends a `file_name` column containing the source file
basename to every batch yielded by `_generate_tables`. Default is
`False` to preserve backward compatibility.

Part of huggingface#5806. Extends the file_name feature (already implemented for
the JSON builder in huggingface#7948) to the CSV packaged builder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Return the name of the currently loaded file in the load_dataset function.
