Add a way to recover original and transformed features from OpenML #67

Open
Tracked by #63
Innixma opened this issue Jul 18, 2024 · 2 comments

@Innixma
Collaborator

Innixma commented Jul 18, 2024

From @geoalgo:

One thing that I think could be quite useful is adding a way to recover the original and transformed features from OpenML.

Something like this:

df, y = repo.openml_dataframe(dataset="airplane", fold=2) # gets the raw columns from the dataset
X, y = repo.openml_transformed_features(dataset="airplane", fold=2)  # gets the features as provided to the model

This would allow using TabRepo to train TabPFN models (probably at a larger scale than what they currently use). It would also make it easier to train new models and add them to TabRepo.
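
For illustration, here is a minimal sketch of what fetching the raw columns for one fold could look like with the `openml` Python package. It is not TabRepo code: the name `load_openml_fold`, the use of an OpenML task ID instead of a dataset name, and the injectable `get_task` argument are all assumptions made here, since the mapping from a name like "airplane" to an OpenML task is not specified above.

```python
from typing import Any, Callable, Optional, Tuple


def load_openml_fold(
    task_id: int,
    fold: int,
    repeat: int = 0,
    get_task: Optional[Callable[[int], Any]] = None,
) -> Tuple[Any, Any, Any, Any]:
    """Return (X_train, y_train, X_test, y_test) with the raw OpenML columns.

    Hypothetical helper, not part of TabRepo. `get_task` is injectable so the
    function can be exercised without network access.
    """
    if get_task is None:
        import openml  # imported lazily; needs the `openml` package installed

        get_task = openml.tasks.get_task

    task = get_task(task_id)
    dataset = task.get_dataset()
    # get_data returns (X, y, categorical_indicator, attribute_names)
    X, y, _, _ = dataset.get_data(target=task.target_name)
    # Row positions of the train/test split that OpenML defines for this fold
    train_idx, test_idx = task.get_train_test_split_indices(fold=fold, repeat=repeat)
    return X.iloc[train_idx], y.iloc[train_idx], X.iloc[test_idx], y.iloc[test_idx]
```

Returning both halves of the split is a design choice of this sketch; the proposed `openml_dataframe` call does not say whether it would return the train rows, the test rows, or both.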

@Innixma Innixma added this to the TabRepo 2.0 milestone Jul 18, 2024
@Innixma
Collaborator Author

Innixma commented Jul 18, 2024

A note on this:

Getting the original features is easy, but getting the transformed ones is a bit nuanced.

The transformation logic in AutoGluon could change between versions. We could either accept this and warn the user, or cache the transformed features as part of the Repo creation process while fitting the original models.
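
The "accept this and warn the user" option could be sketched roughly as follows. This assumes the AutoGluon version used when the original models were fit is stored somewhere alongside the repo artifacts; that storage, the function name `check_featurizer_version`, and the choice of the `autogluon.features` distribution as the package to inspect are hypothetical and only for illustration.

```python
import warnings
from importlib import metadata
from typing import Optional


def check_featurizer_version(
    recorded_version: str,
    installed_version: Optional[str] = None,
    package: str = "autogluon.features",  # assumed distribution name
) -> bool:
    """Warn when the installed version differs from the recorded one.

    Returns True if the versions match, False otherwise.
    """
    if installed_version is None:
        # Raises PackageNotFoundError if the package is not installed
        installed_version = metadata.version(package)
    if installed_version != recorded_version:
        warnings.warn(
            f"Features were generated with {package}=={recorded_version}, "
            f"but {installed_version} is installed; transformed features "
            "may differ from those the original models saw.",
            UserWarning,
            stacklevel=2,
        )
        return False
    return True
```

The alternative, caching the transformed features during Repo creation, avoids the mismatch entirely at the cost of extra storage.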

Additionally, we ran the models via AutoMLBenchmark, which might have non-standard handling of the data, such as converting dtypes prior to sending to AutoGluon. I would need to double check if loading the data through OpenML is identical to loading it via AMLB.
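
One possible way to run that double check is to compare the column dtypes produced by the two loading paths. The helper below is a hypothetical sketch that works on plain `{column: dtype}` dicts (for a pandas DataFrame, `df.dtypes.astype(str).to_dict()` produces one), so it assumes nothing about how AMLB or OpenML actually load the data.

```python
from typing import Dict, Optional, Tuple


def diff_schemas(
    openml_schema: Dict[str, str],
    amlb_schema: Dict[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return columns whose dtype differs between the two schemas.

    Each value is (dtype_via_openml, dtype_via_amlb); None marks a column
    that is missing on that side. An empty dict means the schemas agree.
    """
    diffs: Dict[str, Tuple[Optional[str], Optional[str]]] = {}
    for col in sorted(set(openml_schema) | set(amlb_schema)):
        a = openml_schema.get(col)
        b = amlb_schema.get(col)
        if a != b:
            diffs[col] = (a, b)
    return diffs
```

Matching dtypes would not prove that the values are identical, so a value-level comparison would still be needed on top of this.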

@geoalgo
Collaborator

geoalgo commented Jul 18, 2024

Thanks, that makes a lot of sense!
What we could do then is support something like repo.openml_dataframe(dataset="airplane", fold=2) in the repository.

For the feature matrix, I see your point that it may change. We could perhaps just add a simple util (outside of EvaluationRepository) that casts this dataframe to a feature matrix by calling the AG featurizer. I believe this would also be valuable for quickly getting a matrix to fit a model.
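
Such a util might look roughly like the sketch below. It assumes that "AG featurizer" means AutoGluon's `AutoMLPipelineFeatureGenerator`; the function name `featurize` and the injectable `feature_generator` argument are hypothetical. Any model-specific preprocessing that AutoGluon applies inside individual models is not reproduced here.

```python
from typing import Any, Optional, Tuple


def featurize(
    X_train: Any,
    X_test: Any,
    y_train: Any = None,
    feature_generator: Optional[Any] = None,
) -> Tuple[Any, Any]:
    """Fit a feature generator on the train rows and apply it to both splits.

    Hypothetical helper living outside EvaluationRepository.
    """
    if feature_generator is None:
        # Imported lazily; needs AutoGluon installed
        from autogluon.features.generators import AutoMLPipelineFeatureGenerator

        feature_generator = AutoMLPipelineFeatureGenerator()

    # Fit on the training split only, so no test information leaks in
    X_train_out = feature_generator.fit_transform(X=X_train, y=y_train)
    X_test_out = feature_generator.transform(X_test)
    return X_train_out, X_test_out
```

Because the generator is fit at call time, the output depends on the installed AutoGluon version and may not match the features the original models were trained on.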

If we have those, we should ping TabPFN folks as this may be quite useful for their training and evaluations.
