Improve estimated row count #3055
The current estimator works using this formula, where we take a sample from the dataset by streaming the first 5 GB of in-memory data. It was made to work for arbitrary file formats. For WebDataset, as far as I know, there is no strict rule requiring a fixed number of samples per shard, so I don't know how often your method would be more or less accurate. Unless this rule is enforced somewhere?
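The size-based extrapolation described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual dataset-viewer implementation; the function name and the 5 GB cap constant are assumptions taken from the comment:

```python
SAMPLE_BYTES_CAP = 5 * 2**30  # the comment mentions streaming the first 5 GB

def estimate_num_rows(sample_rows: int, sample_bytes: int, total_bytes: int) -> int:
    """Extrapolate a row count from a streamed prefix of the dataset.

    Assumes rows in the sampled prefix are, on average, the same size
    as rows in the rest of the dataset -- the assumption the issue
    says breaks down for WebDataset.
    """
    if sample_bytes == 0:
        return 0
    return int(sample_rows * total_bytes / sample_bytes)
```

For example, finding 1,000 rows in a 5 GB sample of a 50 GB dataset yields an estimate of 10,000 rows; the estimate is only as good as the assumption that row sizes are uniform across the whole set.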
Yes, I am aware of how the current estimator works. As stated in the issue, it assumes a consistent file size across the entire set. There may be no strict rule requiring a fixed number of samples per shard, just as there is no rule that the number of samples in the first 5 GB matches the rest of the dataset. Nevertheless, a fixed number of samples per shard is the typical usage, and WebDataset's ShardWriter enforces a maximum number of samples per shard.
Oh, great to see that the ShardWriter does enforce this. Since it's the official implementation and most people use it, we can probably rely on this assumption :) I'd be happy to provide some guidance if you want to look into how to implement this!
Currently, the estimated row count assumes file sizes are consistent across the entire set; from what I've seen, this results in wildly inaccurate estimates for WebDataset. Typically, WebDatasets are created with a set number of samples per file, so a simpler and more accurate estimate can be calculated as the row count of one shard multiplied by the total number of shards.
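The proposed shard-based estimate could look roughly like the sketch below. The helper names are hypothetical, and it assumes every shard holds the same number of samples (as shards written by ShardWriter with a fixed sample count per shard would). In a WebDataset tar, a sample is the group of files sharing the same key, i.e. the member name up to the first dot:

```python
import tarfile


def count_samples(shard_path: str) -> int:
    """Count distinct sample keys in one WebDataset tar shard.

    Files like "000.jpg" and "000.txt" belong to the same sample "000".
    """
    keys = set()
    with tarfile.open(shard_path) as tar:
        for member in tar:
            if member.isfile():
                keys.add(member.name.split(".")[0])
    return len(keys)


def estimate_rows(first_shard_path: str, num_shards: int) -> int:
    # Assumes all shards hold the same number of samples as the first one.
    return count_samples(first_shard_path) * num_shards
```

Only one shard needs to be read, so this avoids streaming 5 GB of data and is exact whenever the fixed-samples-per-shard convention holds (modulo a possibly smaller final shard).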