Usage of H5py Virtual Datasets for concat_on_disk
#2032
base: main
Conversation
Codecov Report
❌ Patch coverage is …
❌ Your project check has failed because the head coverage (31.80%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

@@            Coverage Diff             @@
##             main    #2032       +/-   ##
===========================================
- Coverage   87.58%   31.80%   -55.79%
===========================================
  Files          46       46
  Lines        7064     7103       +39
===========================================
- Hits         6187     2259     -3928
- Misses        877     4844     +3967
ilan-gold left a comment:
- I think this should be put behind an argument so it is opt-in
- A single test should be enough to ensure results match
- When the dataset is read back in, and a backing file has been deleted, does `h5` raise an error? Or will it error weirdly somehow within `anndata` once you try to access the data?
These are done
According to the requirements documentation: https://support.hdfgroup.org/releases/hdf5/documentation/rfc/HDF5-VDS-requirements-use-cases-2014-12-10.pdf
From my experience it tries hard to use a cache or something like that, because when I delete the original files it doesn't throw an error and the result file looks as if it didn't change. This behaviour isn't discussed much in https://docs.h5py.org.
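(For reference, a minimal sketch of h5py's documented virtual dataset API and the missing-source behaviour discussed here. The file names are invented for the example, and the fill-value outcome reflects HDF5's default handling of unavailable sources, not anything guaranteed by this PR.)

```python
import os

import h5py
import numpy as np

# Two small source files standing in for the inputs of concat_on_disk.
for i in range(2):
    with h5py.File(f"part{i}.h5", "w") as f:
        f.create_dataset("x", data=np.arange(5, dtype="i8") + 10 * i)

# Stitch them together with a virtual dataset (no data is copied).
layout = h5py.VirtualLayout(shape=(10,), dtype="i8")
for i in range(2):
    layout[i * 5 : (i + 1) * 5] = h5py.VirtualSource(f"part{i}.h5", "x", shape=(5,))

with h5py.File("combined.h5", "w") as f:
    f.create_virtual_dataset("x", layout, fillvalue=-1)

# Delete one source, then read the virtual dataset again: by default HDF5
# substitutes the fill value for the unavailable region instead of raising.
os.remove("part1.h5")
with h5py.File("combined.h5", "r") as f:
    print(f["x"][:])  # -> [ 0  1  2  3  4 -1 -1 -1 -1 -1]
```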
@selmanozleyen Do you want my review on this again or are the comments still somewhat unaddressed?
I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs. Sometimes it errors, sometimes it doesn't, as in the CI tests. If we are fine with just documenting this and merging, I will remove the fail assertions and then it will be ready to merge.
ilan-gold left a comment:
> I found the behavior undefined/unpredictable when the source files are deleted. I couldn't find a way to overcome that because it's not stated clearly in the docs.
Can we ask the developers of h5py?
> If we are fine with just documenting this and merging I will remove the fail assertions and then it will be ready to merge
I would like to at least open an issue with the h5py people and give it a day or two (I have some changes requested here anyway)
def test_anndatas_virtual_concat_missing_file(
We shouldn't have multiple different functions that do almost identical things. Please refactor
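(One possible shape for that refactor, sketched rather than taken from the PR: a single test parametrized over the new flag, with the missing-file case folded in the same way via another `parametrize`. The `use_virtual_concat` argument is the rename proposed below, not an existing parameter of `concat_on_disk`.)

```python
import anndata as ad
import numpy as np
import pytest
from scipy import sparse


@pytest.mark.parametrize("use_virtual_concat", [True, False])
def test_concat_on_disk_matches_in_memory(tmp_path, use_virtual_concat):
    # Write a few small sparse AnnDatas to disk as inputs.
    adatas = [
        ad.AnnData(X=sparse.random(50, 20, density=0.1, format="csr"))
        for _ in range(3)
    ]
    paths = []
    for i, a in enumerate(adatas):
        p = tmp_path / f"adata_{i}.h5ad"
        a.write_h5ad(p)
        paths.append(p)

    out = tmp_path / "result.h5ad"
    # `use_virtual_concat` is the flag proposed in this PR, not a released argument.
    ad.experimental.concat_on_disk(paths, out, use_virtual_concat=use_virtual_concat)

    expected = ad.concat(adatas)
    result = ad.read_h5ad(out)
    np.testing.assert_allclose(result.X.toarray(), expected.X.toarray())
```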
path,
*,
max_loaded_elems,
virtual_concat,
Typing! And since `virtual_concat` is a boolean, let's prefix it everywhere with a verb like "use", i.e., `use_virtual_concat`.
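(Something along these lines for the signature, as a sketch of the suggested rename and annotations; the function name is invented, and only the parameters shown in the diff above come from the PR.)

```python
from pathlib import Path


def write_concat_sparse_sketch(  # hypothetical name, for illustration only
    path: Path,
    *,
    max_loaded_elems: int,
    use_virtual_concat: bool = False,  # boolean prefixed with a verb, per the comment
) -> None:
    """Only the parameters shown in the diff above are real; the rest is omitted."""
```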
init_elem,  # TODO: user should be able to specify dataset kwargs
dataset_kwargs=dict(indptr_dtype=indptr_dtype),
)
I see this as the TODO, but I'm not sure of its relevance to

> Added a TODO for being able to specify compression args since when I used the default approach the output file size grew too much compared to the size of the inputs.

or in general, why the resultant dataset is 12 GB. Is this obs and var, or?
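(On the compression point, a hedged sketch of what forwarding user dataset kwargs down to h5py could look like; `write_elem` is anndata's public writer and, as far as I know, it forwards `dataset_kwargs` to each `create_dataset` call, but the surrounding wiring here is illustrative rather than the PR's code.)

```python
import h5py
from scipy import sparse

try:  # the import location moved between anndata releases
    from anndata.io import write_elem
except ImportError:
    from anndata.experimental import write_elem

X = sparse.random(1000, 200, density=0.05, format="csr")

with h5py.File("out.h5ad", "w") as f:
    # dataset_kwargs is passed through to h5py's create_dataset calls,
    # so the compression settings apply to data/indices/indptr.
    write_elem(
        f,
        "X",
        X,
        dataset_kwargs={"compression": "gzip", "compression_opts": 4},
    )
```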
For sparse arrays, if the backend is hdf5 and there is no reindexing and
`virtual_concat` is True,
the virtual concatenation is used using the `h5py` virtual dataset support.
Suggested change:

For sparse arrays, if the backend is hdf5 and there is no reindexing and
`virtual_concat` is True,
virtual concatenation is used via docs.h5py.org/en/stable/vds.html.
Also maybe see if we can intersphinx this link? I suspect so but am not sure
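(On the intersphinx question: h5py builds its documentation with Sphinx, so registering its inventory in docs/conf.py along these lines should let the reference resolve, if it isn't already configured; the exact role used in the docstring, e.g. :doc:`h5py:vds` or :class:`h5py.VirtualLayout`, is an open choice.)

```python
# docs/conf.py (sketch): register h5py's Sphinx object inventory.
intersphinx_mapping = {
    "h5py": ("https://docs.h5py.org/en/stable/", None),
}
```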
When there is no need for reindexing and the backend is hdf5, we can just use virtual datasets instead. This requires no in-memory copies; the result simply links to the original file locations. I was able to concat the tahoe datasets (314 GB in total) in a few minutes and the result was a 12 GB .h5ad file.

Other notes:
- The indptr arrays were 780 MB in total for all the tahoe files, so I just concatenate them in memory instead.

TODOs:
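(To make the indptr note concrete, here is a rough sketch of the overall idea rather than the PR's implementation: link `data` and `indices` via h5py virtual datasets, and build the small `indptr` in memory with cumulative nnz offsets. The group and dataset names follow AnnData's on-disk CSR layout; the function itself is invented, and a real `.h5ad` group would also need the encoding attributes that `write_elem` normally adds.)

```python
import h5py
import numpy as np


def virtual_concat_csr(paths, group, out_path):
    """Row-wise concat of on-disk CSR groups (data/indices/indptr) via VDS."""
    srcs = [h5py.File(p, "r")[group] for p in paths]
    try:
        nnzs = [g["data"].shape[0] for g in srcs]
        total_nnz = sum(nnzs)

        # indptr is tiny compared to data/indices, so concatenate it in
        # memory: each subsequent indptr is shifted by the accumulated nnz.
        indptrs = [g["indptr"][:] for g in srcs]
        offsets = np.cumsum([0, *nnzs[:-1]])
        indptr = np.concatenate(
            [indptrs[0]]
            + [p[1:] + off for p, off in zip(indptrs[1:], offsets[1:])]
        )

        with h5py.File(out_path, "w") as out:
            og = out.create_group(group)
            og.create_dataset("indptr", data=indptr)
            # data and indices are only linked, never copied.
            for name in ("data", "indices"):
                layout = h5py.VirtualLayout((total_nnz,), dtype=srcs[0][name].dtype)
                start = 0
                for path, n in zip(paths, nnzs):
                    layout[start : start + n] = h5py.VirtualSource(
                        str(path), f"{group}/{name}", shape=(n,)
                    )
                    start += n
                og.create_virtual_dataset(name, layout)
            # NOTE: a real .h5ad group would also carry AnnData's
            # encoding-type / encoding-version / shape attributes.
    finally:
        for g in srcs:
            g.file.close()
```

Called as, say, `virtual_concat_csr(paths, "X", "combined.h5ad")`, the output stays small because only `indptr` is materialized; everything else is a mapping back to the source files.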