TabRepo 2.0 Feature Tracker #63

Open
4 of 23 tasks
Innixma opened this issue Jul 10, 2024 · 4 comments

@Innixma
Collaborator

Innixma commented Jul 10, 2024

For TabRepo 2.0, several quality of life changes should be made for ease of use. This list will evolve over time.

P0 (Critical)

P1

P2 (Nice-to-have)

P3

  • Could in theory reduce multiclass task storage and memory cost by a factor of num_classes by storing only the predicted probability of the ground-truth class. This might be brittle to extensions, though, and would only work for log_loss.
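
A minimal sketch of the idea in the P3 item above, in plain Python and not tied to TabRepo's actual storage format. The helper names (`true_class_proba`, `log_loss_from_true_class`, `log_loss_full`) are hypothetical and exist only for this illustration. It shows that log loss computed from the stored ground-truth probability matches log loss computed from the full probability matrix, while keeping one value per row instead of num_classes values.

```python
import math


def true_class_proba(pred_proba, y_true):
    """Keep only the predicted probability of the ground-truth class per row."""
    return [row[label] for row, label in zip(pred_proba, y_true)]


def log_loss_from_true_class(p_true, eps=1e-15):
    """Log loss using only the stored ground-truth-class probabilities."""
    clipped = [min(max(p, eps), 1.0) for p in p_true]
    return -sum(math.log(p) for p in clipped) / len(clipped)


def log_loss_full(pred_proba, y_true, eps=1e-15):
    """Reference log loss from the full matrix via one-hot cross-entropy."""
    total = 0.0
    for row, label in zip(pred_proba, y_true):
        for cls, p in enumerate(row):
            one_hot = 1.0 if cls == label else 0.0
            total += one_hot * math.log(min(max(p, eps), 1.0))
    return -total / len(y_true)


# Toy example: 3 rows, 3 classes (illustrative values only).
pred_proba = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
y_true = [0, 1, 2]

compact = true_class_proba(pred_proba, y_true)  # one float per row
loss_compact = log_loss_from_true_class(compact)
loss_full = log_loss_full(pred_proba, y_true)
```

The compact form drops the probabilities of every other class, so metrics that need them (for example accuracy via argmax, or multiclass ROC AUC) can no longer be computed, which is the brittleness noted above.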
@Innixma Innixma added this to the TabRepo 2.0 milestone Jul 10, 2024
@geoalgo
Collaborator

geoalgo commented Jul 11, 2024

Another thing that has been on my radar for some time is to host TabRepo on Hugging Face. It would speed up downloads by ~8x (downloading is very slow from outside) and would make the dataset more visible.

@geoalgo
Collaborator

geoalgo commented Jul 11, 2024

Having an example or an API that allows one to "join" two repositories would also be quite useful. One could do:

from tabrepo import load_repository
from tabrepo.utils import merge_repositories
repo = load_repository("D244_F3_C1530_30")
repo_with_new_method = load_repository("D244_F3_C1530_30")
# builds a repository from the two, keeping only models that appear in all tasks; underneath, it just calls the repo that contains a given model
repo_union = merge_repositories([repo, repo_with_new_method], force_dense=True)
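
To make the intended semantics concrete, here is a rough sketch on plain dictionaries, not on TabRepo objects. The function `merge_results` and the `{model: {task: metric}}` layout are assumptions made for this illustration only, and the reading of `force_dense=True` as "keep only models that have results on every task" is one interpretation of the comment above.

```python
def merge_results(repos, force_dense=True):
    """Merge several {model: {task: metric}} mappings into one.

    Hypothetical stand-in for the proposed repository join; the real
    TabRepo API may look quite different.
    """
    merged = {}
    for repo in repos:
        for model, per_task in repo.items():
            # Later repositories override earlier ones on overlapping tasks.
            merged.setdefault(model, {}).update(per_task)

    if force_dense:
        # Collect every task seen in any repository.
        all_tasks = set()
        for per_task in merged.values():
            all_tasks.update(per_task)
        # Keep only models that have a result for every task.
        merged = {m: pt for m, pt in merged.items() if set(pt) == all_tasks}
    return merged


# Toy data: model names and metric values are illustrative only.
repo_a = {"LightGBM": {"t1": 0.10, "t2": 0.20}, "CatBoost": {"t1": 0.12}}
repo_b = {
    "NewMethod": {"t1": 0.09, "t2": 0.18},
    "CatBoost": {"t2": 0.21},
    "Partial": {"t1": 0.30},
}

dense = merge_results([repo_a, repo_b], force_dense=True)
sparse = merge_results([repo_a, repo_b], force_dense=False)
```

In this toy case `CatBoost` becomes dense only after the merge, since each repository holds one of its two tasks, while `Partial` is dropped from the dense result because it is missing `t2`.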

@Innixma
Collaborator Author

Innixma commented Jul 11, 2024

@geoalgo Yes, repo joining is something I plan to implement. I added a tracking GitHub issue: #65

@geoalgo
Collaborator

geoalgo commented Jul 18, 2024

Sorry for the delay again :-) The list you made sounds great!

One thing I want to mention that I think could be quite useful is adding a way to recover the original and transformed features from OpenML.

Something like this:

df, y = repo.openml_dataframe(dataset="airplane", fold=2) # gets the raw columns from the dataset
X, y = repo.openml_transformed_features(dataset="airplane", fold=2)  # gets the features as provided to the model

This would make it possible to use TabRepo to train TabPFN models (probably at larger scales than what they currently use). It would also make it easier to train new models and add them to TabRepo.
