WIP - improve dataset fetchers #1852

rcap107 · 2026-01-19T13:51:29Z

This PR improves the dataset fetcher functions. It addresses #1422 by adding the path to the dataset file to the Bunch object returned by each fetcher.

I am also adding skrub's default data folder to the configuration file, and deprecating the original environment variable name SKRUB_DATA_DIRECTORY in favor of SKB_DATA_DIRECTORY, which follows the same format as the other environment variables set by skrub.
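To make the deprecation concrete, here is a minimal sketch of how the new name could take precedence while the old one keeps working with a warning. This is not the code in this PR: the helper `resolve_data_directory`, the `~/skrub_data` default, and the choice of `FutureWarning` are all assumptions made for illustration.

```python
import os
import warnings
from pathlib import Path

NEW_VAR = "SKB_DATA_DIRECTORY"
OLD_VAR = "SKRUB_DATA_DIRECTORY"  # deprecated name


def resolve_data_directory(environ=None, default="~/skrub_data"):
    """Hypothetical helper: pick the data folder from the environment.

    The default folder and the warning category are assumptions,
    not taken from the PR.
    """
    environ = os.environ if environ is None else environ
    # The new variable wins whenever it is set.
    if NEW_VAR in environ:
        return Path(environ[NEW_VAR]).expanduser()
    # Fall back to the deprecated variable, but tell the user to migrate.
    if OLD_VAR in environ:
        warnings.warn(
            f"{OLD_VAR} is deprecated, use {NEW_VAR} instead.",
            FutureWarning,
            stacklevel=2,
        )
        return Path(environ[OLD_VAR]).expanduser()
    # Neither variable is set: use the default folder.
    return Path(default).expanduser()
```

Passing a mapping as `environ` keeps the helper easy to test without touching the real process environment.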


rcap107 · 2026-01-20T15:13:59Z

Something I did not consider is that some datasets have multiple paths (such as fetch_plane_delays).

For the moment I am returning a list of paths, but as a result single-file datasets become clunky to load, for example:

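import pandas as pd
from skrub.datasets import fetch_employee_salaries

data = fetch_employee_salaries()
df = pd.read_csv(data["paths"][0])  # index 0 is needed even with a single file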

I'm not sure what the best way to deal with this is.
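One possible option, sketched only to make the trade-off concrete: always expose the list under `paths`, and add a singular `path` key when the dataset has exactly one file. The helper `make_bunch` and the `path` key are hypothetical, a plain dict stands in for the Bunch object, and the file names are made up.

```python
from pathlib import Path


def make_bunch(name, paths):
    """Hypothetical helper: build a dict standing in for the Bunch."""
    paths = [Path(p) for p in paths]
    bunch = {"name": name, "paths": paths}
    # Single-file datasets also get a direct "path" entry,
    # so callers do not need to index into the list.
    if len(paths) == 1:
        bunch["path"] = paths[0]
    return bunch


# Made-up file names, for illustration only.
single = make_bunch("employee_salaries", ["employee_salaries.csv"])
multi = make_bunch("plane_delays", ["flights.csv", "airports.csv"])
```

With this shape, a single-file dataset can be read through `single["path"]`, while multi-file datasets keep using `multi["paths"]`. The cost is that the set of keys then depends on the dataset.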

rcap107 added 4 commits January 19, 2026 14:47

WIP - improve dataset fetchers

41184ed

iter

b78226b

Merge remote-tracking branch 'upstream/HEAD' into enh-improve-dataset…

5267a51

…-fetchers

Adding paths to bunch

e7df43e
