Use of Zstd compression #2937

Open
gsjaardema opened this issue Jun 20, 2024 · 34 comments
@gsjaardema (Contributor)

gsjaardema commented Jun 20, 2024

I'm a little late to the party, but have been starting to look at using the Zstd library for compression in netCDF.

Am I misunderstanding something, or do I have to explicitly set the environment variable HDF5_PLUGIN_DIR to the location of the directory containing the filter for zstd prior to running an application that wants to use Zstd compression?

It seems like using Zstd instead of zlib is making me jump through lots of hoops instead of "just working" like zlib has/does... Thinking that maybe I am missing something...

@edwardhartnett (Contributor)

I'm also not happy with the extra steps needed. There is now an HDF5 function that allows us to control the filter path programmatically, which means we can solve this whole problem.

But for now, set HDF5_PLUGIN_DIR, or else you can accept the default plugin install and then you don't have to set anything. Unfortunately, I don't know the details for CMake, but for autoconf, I think you use --with-plugin-dir with no argument, and that will use the default location.
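For reference, a sketch of the build-time options mentioned above (paths are hypothetical placeholders; the CMake variable name is taken from later in this thread and may differ between releases):

```shell
# Autoconf: install plugins to the default location (no argument)
./configure --with-plugin-dir

# Autoconf: install plugins to an explicit directory
./configure --with-plugin-dir=/usr/local/hdf5/lib/plugin

# CMake variable discussed later in this thread (name may vary by release)
cmake -DNETCDF_PLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin ..
```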

I will try to swing around to this code and make this easier in a future release.

Make sure you also take a look at the quantize feature, which can really improve compression sizes and speeds.
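As a sketch of the API side (not compiled here; assumes a netcdf-c 4.9.x built with zstd support, and that `ncid`/`varid` come from the usual `nc_create`/`nc_def_var` calls -- header locations may vary by release):

```c
/* Sketch: enable quantization plus zstd compression on one variable.
 * Quantize must be set while the variable is in define mode, before data
 * is written; compression then operates on the quantized values. */
#include <netcdf.h>
#include <netcdf_filter.h>   /* nc_def_var_zstandard() in 4.9.x builds */

static int enable_zstd_with_quantize(int ncid, int varid)
{
    int rc;
    /* Keep 3 significant digits (BitGroom algorithm). */
    if ((rc = nc_def_var_quantize(ncid, varid, NC_QUANTIZE_BITGROOM, 3)))
        return rc;
    /* Then apply zstd at compression level 4. */
    return nc_def_var_zstandard(ncid, varid, 4);
}
```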

@DennisHeimbigner (Collaborator)

The underlying problem is that we would need to create a list of compressors
that are to be "built in" to both libnetcdf and possibly also libhdf5.
Zstd is certainly a candidate for that status, but others -- libblosc, for example --
are also possible candidates. I frankly have no criteria on which to decide.

@edwardhartnett (Contributor)

We decide collectively, based on our judgement of what is most useful and can be sustainably supported. Just as we decided to include zstandard. Criteria include:

  • FOSS, available everywhere
  • significant compression improvement (size or speed of compress/uncompress).
  • compatibility with netcdf-java.

I was at the HDF5 workshop for particle physics teams, and they were all using lz4 because it was so much faster. So that's the next one I'll look at. Fortunately John provided a lz4 class for netcdf-java.

Recall that the CCR project exists to prototype and explore. So I would suggest that we deal with zstandard today, and if more compressors are to be built-in, we deal with that on a case-by-case basis, having thoroughly tested our ideas in CCR.

But the problems with zstandard are not in the API, but in the configure and initialization. We need to make that easier for Greg and others like him...

@edwardhartnett (Contributor)

@DennisHeimbigner are you already using BLOSC or LZ4 for Zarr stuff?

@DennisHeimbigner (Collaborator)

We use BLOSC for zarr/nczarr.
It is available but unused for HDF5.

@gsjaardema (Contributor, Author)

or else you can accept the default plugin install and then you don't have to set anything.

Yes, I thought I was doing that, but still had to define the environment variable. Will look again to see what I was missing during build process to make this work.

Make sure you also take a look at the quantize feature, which can really improve compression sizes and speeds.

Yes, that is working and was very simple to get going. Thanks for all the work you all are doing on netCDF.

@edwardhartnett (Contributor)

@gsjaardema I would be really interested in any final results you get for the new compression methods - that is, percent faster, or percent improvement in compressed size...

@gsjaardema (Contributor, Author)

I haven't been able to get netCDF/HDF5 to find the plugins unless I specify HDF5_PLUGIN_DIR at runtime. My build uses CMake. I will continue to try different variations of the build to see if I can get it to work...

@DennisHeimbigner (Collaborator)

What value do you assume for the default plugin location?

@gsjaardema (Contributor, Author)

What value do you assume for the default plugin location?

The library is configured with the location of the local HDF5 plugin directory, and that is correctly echoed from nc-config and appears in `libnetcdf.settings`:

 --plugindir     -> /root/src/seacas/lib/hdf5/lib/plugin

Plugin Install Prefix:  /root/src/seacas/lib/hdf5/lib/plugin

Multi-Filter Support:	yes
Quantization:		yes
Logging:     		no
SZIP Write Support:     yes
Standard Filters:       deflate szip zstd bz2
ZSTD Support:           yes
Parallel Filters:       yes

If I just run my executable, it doesn't find the Zstd compression filter with 4.9.2 or with main. If I define:

HDF5_PLUGIN_PATH=/root/src/seacas/lib/hdf5/lib/plugin ../bin/io_shell --in_type generated --compress 4 --zstd  100x100x100 tmp-z04.g

Then it correctly finds the Zstd compression filter...

@edwardhartnett (Contributor)

OK, so the way it should work (but apparently does not) is that if you keep your plugins in the directory you told configure, you should not have to set the environment var...

@gsjaardema (Contributor, Author)

gsjaardema commented Jun 27, 2024

Based on a reading of docs/filters.md, it looks like you need to set the environment variable at runtime: (my highlighting)

The important thing to note is that at run-time, there are several cases to consider:

  1. HDF5_PLUGIN_PATH is defined and is the same value as it was at build time -- no action needed
  2. HDF5_PLUGIN_PATH is defined and has a different value from build time -- the user is responsible for ensuring that the run-time path includes the same directory used at build time, otherwise this case will fail.
  3. HDF5_PLUGIN_PATH is not defined at either run-time or build-time -- no action needed
  4. HDF5_PLUGIN_PATH is not defined at run-time but was defined at build-time -- this will probably fail
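Given case 4 above, one workaround is a launcher wrapper that exports the build-time plugin directory before starting the real application (a minimal sketch; the path is a hypothetical placeholder to be replaced with your install location):

```shell
#!/bin/sh
# Export the build-time plugin directory unless the caller already set one.
# /usr/local/hdf5/lib/plugin is a placeholder; substitute your install path.
: "${HDF5_PLUGIN_PATH:=/usr/local/hdf5/lib/plugin}"
export HDF5_PLUGIN_PATH
echo "HDF5_PLUGIN_PATH=$HDF5_PLUGIN_PATH"
# exec "$@"   # then hand off to the real application
```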

I can somewhat control this from my application, which does the writing, but then (based on minimal attempts) it looks like any downstream application that wants to read my file with zstd compression in it will also have to set the plugin path environment variable, or it will fail to read the file with `NetCDF: Filter error: undefined filter encountered`.

I would like to use zstandard or even some of the other filters, but currently I think I will be setting myself up for lots of complaints from people who write files with compressed variables and then try to read the file a day/week/month later and have no idea why it fails.

Ideally:

  • zstd works like zlib does today (I realize will take time for distributions to update to the version that has this)
    • I can somewhat control my toolchain and the netcdf that is used, but I can't make users define an environment variable for the tools they will be using. (Although since we use modules, I could add the setting to the module files...)
  • A 4.9.3 installation or later would say that zstd compression not supported or something similar if it tried to read a file and it didn't support zstd.
    • The "undefined filter encountered" message isn't the best; even outputting the id of the filter that couldn't be found would be an improvement, since you can't always know the name of a filter you don't have installed. I could then grep an include file or the netCDF source to map the filter id back to what the filter actually is.
  • If the plugin path is specified at build/install time, then it should be searched at runtime without defining HDF5_PLUGIN_PATH
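On the diagnostics point: until the error message improves, the filter ids recorded in a file can be inspected directly (a sketch; `tmp-z04.g` is the example file from earlier in the thread, and 32015 is zstd's registered HDF5 filter id, per my understanding):

```shell
# Show special per-variable attributes, including _Filter ids, for a file
# written with a non-built-in filter. Uses the ncdump shipped with netcdf-c.
ncdump -hs tmp-z04.g | grep _Filter

# Alternatively, h5dump lists filters in the dataset creation properties:
h5dump -p -H tmp-z04.g | grep -i -A2 filter
```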

This is not a complaint, I really appreciate the work that has gone into this and definitely want to use it...

@gsjaardema (Contributor, Author)

Maybe I am misreading / misinterpreting item 4 in the docs/filters.md I quoted above, but that seems to require setting HDF5_PLUGIN_PATH at runtime no matter what...

@DennisHeimbigner (Collaborator)

Closely related issues: #2753

@DennisHeimbigner (Collaborator)

One problem we are up against is that if we use the H5PLxxx API
then we have to conform to its loading semantics, which frankly I
have not investigated. For example, if we add a new location to the HDF5
internal path list, one would assume it would be used the next time a plugin is required
(i.e. the equivalent of nc_def_var_filter). It presumably has no effect on already-loaded
plugins, even if the new location is, say, at the front of the list and overrides some
already-loaded plugin.

@gsjaardema (Contributor, Author)

Thanks for the reminder about H5PLprepend and others. I think Ed or Ward had also told me about that. I did a quick test and it does work and eliminates the need for the environment variable.
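For others finding this thread, the call in question looks roughly like this (a sketch, not compiled here; the path is a placeholder, and H5PLprepend() is available in HDF5 1.10.1 and later):

```c
/* Sketch: prepend a directory to HDF5's plugin search path from
 * application code, before any dataset using the filter is opened. */
#include <hdf5.h>

static int add_plugin_dir(void)
{
    /* Placeholder path; use the directory netCDF/HDF5 was configured with. */
    if (H5PLprepend("/usr/local/hdf5/lib/plugin") < 0)
        return -1;   /* search path could not be updated */
    return 0;
}
```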

@DennisHeimbigner (Collaborator)

In Issue #2753, I appear to have promised
a proposal, but it was never implemented.

@DennisHeimbigner (Collaborator)

I also note that we may have an option to completely bypass the HDF5 dynamic loading
algorithm altogether. The H5Zregister function just tells HDF5 to be aware of some plugin
and assumes that the caller (libnetcdf in this case) is responsible for loading it. Not sure how
this interacts with the H5PLxxx API.

@edwardhartnett (Contributor)

@gsjaardema I agree this has to be fixed. But how?

@gsjaardema (Contributor, Author)

@edwardhartnett I'm not sure what the best solution is. Just looking at Zstd, the "easy" solution seems to be to treat it the same as zlib, quantize, shuffle, and szip -- compile it directly into the library and not rely on any plugin paths or other runtime loading. If it is there at build time, it is there at runtime.

This doesn't scale well since then how do you handle blosc or Z123 or the next five ultimate compression libraries... So I think there is also the issue of how to handle plugins in general...

The difficulty for my usage is that I want to be able to query at my build time what capabilities are available in netCDF, HDF5, CGNS, matio, and maybe the other libraries I use and then decide in my code how to build my libraries and what capabilities to expose/support and then my libraries are used in other applications. If Zstd is advertised as supported by netCDF, then I should just be able to link with netCDF and support Zstd instead of having to wonder if something will happen at runtime that will cause Zstd to not be available.

There is enough difficulty in making sure the entire tool chain on multiple hosts will all have netCDF libraries that support Zstd, quantize, and other features without adding on the issue that this could all be tested at build/install time, but then fail at runtime...

If plugins are the way to do it, then I would like to have the plugin directory that I specify at build time to be searched at runtime without me having to specify anything at runtime. If something does change, then specifying HDF5_PLUGIN_PATH or some other environment variable is helpful and the capability to be able to add new capabilities through plugins is nice to have...

I think for Zstd since there is an explicit nc_def_var_zstandard(exoid, varid, file->compression_level); API function, it seems like it should be part of the netCDF library and if that function exists at build time, then it exists.

There is still the difficulty of using a new feature that does not exist in older versions of the library. Quantize is nice since it is done at write time and does not need to be supported by the applications that read the file. Compression is harder since it is needed at both write and read time, so the entire toolchain needs to be updated to support this once it becomes available for writing files, and that is difficult since an older library doesn't even know what Zstd is, so it can't give a meaningful error message about a feature created after the reading library was installed... (We still get some random failures at times when a user's path points to an older netCDF application that doesn't know about netcdf-4 or some other feature that has existed for an eternity...)

So I'm not sure my rambling has any solutions or recommendations in it. It is a hard problem, and for read/write libraries the problem is even harder since the need for the capability (zstd, netcdf-4) follows the file, which can move among multiple hosts and be consumed by applications that are not always under our control...

@DennisHeimbigner (Collaborator)

I want to be able to query at my build time what capabilities are available in netCDF, HDF5, CGNS, matio, and maybe the other libraries

Currently, the HDF5 API is not very good at exporting that info. NetCDF is doable. Do not know about the other libraries you mention.
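On the netCDF side, one way to do that query at run time is sketched below. This assumes `nc_inq_filter_avail()` (added in netcdf-c 4.9.0) and uses 32015 as zstd's registered HDF5 filter id -- both per my understanding, so verify against your release:

```c
/* Sketch: check whether the zstd filter can actually be loaded at run time
 * for an open file, rather than trusting build-time settings. */
#include <netcdf.h>
#include <netcdf_filter.h>

#define ZSTD_FILTER_ID 32015u  /* registered HDF5 filter id for zstd */

static int zstd_available(int ncid)
{
    /* Returns NC_NOERR if the filter is found, NC_ENOFILTER otherwise. */
    return nc_inq_filter_avail(ncid, ZSTD_FILTER_ID) == NC_NOERR;
}
```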

@DennisHeimbigner (Collaborator)

I am going to try to start tackling this piecemeal.
First, I want to see why the HDF5 default directory is not being used (re: comment #2937 (comment))

@DennisHeimbigner (Collaborator)

Question: when you built libhdf5, did you set the option

--with-default-plugindir=/root/src/seacas/lib/hdf5/lib/plugin

@WardF WardF self-assigned this Aug 5, 2024
@WardF WardF added this to the 4.9.3 milestone Aug 5, 2024
@edwardhartnett (Contributor)

@gsjaardema did you get this resolved?

I have just made a bunch of changes for next release to make this work a little better, and to document it. Hopefully that will make it easier for future users.

If there's no remaining problem, please close this issue.

@gsjaardema (Contributor, Author)

I will try to look at it this week. Thanks for all the work you and Dennis did on this.

@gsjaardema (Contributor, Author)

OK, I am trying main and just looking at the configuration currently. The first configure seems to give the correct plugin directory:

Plugins Enabled:        yes
Plugin Install Dir:     /Users/gdsjaar/src/seacas-plugin/lib/hdf5/lib/plugin

I then simply did a `touch CMakeCache.txt; make` and the resulting reconfigure gives:

✔ ~/src/seacas-plugin/TPL/netcdf/netcdf-c/build [main {origin/main}|✔]
16:11 $ touch CMakeCache.txt
✔ ~/src/seacas-plugin/TPL/netcdf/netcdf-c/build [main {origin/main}|✔]
16:12 $ make
-- Checking for Deprecated Options
CMake Warning at CMakeLists.txt:482 (message):
  NETCDF_ENABLE_NETCDF4 is deprecated; please use NETCDF_ENABLE_HDF5


-- Defaulting to -DPLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- Final value of-DPLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- ENABLE_PLUGIN_INSTALL=YES PLUGIN_INSTALL_DIR=/usr/local/hdf5/lib/plugin
-- NETCDF_ENABLE_PLUGINS: ON
-- Found HDF5 version: 1.14.4

... deleted lines...

-- Installing: lib__nch5bzip2.dylib into /usr/local/hdf5/lib/plugin
-- Installing: lib__nch5zstd.dylib into /usr/local/hdf5/lib/plugin

And then the Configuration Summary:

Configuration Summary:

... deleted lines ...

Plugins Enabled:        yes
Plugin Install Dir:     /usr/local/hdf5/lib/plugin

So it looks like the plugin installation directory is not being persisted and gets reset on a subsequent reconfigure.

@WardF (Member)

WardF commented Aug 19, 2024

This is correct and expected behavior; expected in that I've just been working in that part of the code and, indeed, that is what the logic dictates should happen. I'm open to the discussion about having the cached value used if it is set!

@edwardhartnett (Contributor)

Should it be cached?

@gsjaardema (Contributor, Author)

gsjaardema commented Aug 20, 2024

If I have configured the CMake build and edit the CMakeCache.txt to, for example, change the build from RELEASE to DEBUG or enable testing, I don't expect my plugin directory to change when I build the code following that change, but that is what is happening now...

The NETCDF_PLUGIN_INSTALL_DIR CMake variable is also somewhat confusing. It seems to get its value from checking an environment variable HDF5_PLUGIN_PATH if set, but has a default value of "YES" and a somewhat confusing doc string:

set(NETCDF_PLUGIN_INSTALL_DIR "YES" CACHE STRING "Whether and where we should install plugins; defaults to yes")

I have ended up with a directory named "YES" on some of the builds.

If I explicitly set NETCDF_PLUGIN_INSTALL_DIR to a location and don't set the HDF5_PLUGIN_PATH, then the build doesn't use my value...

Appreciate all the work being done in this area, but just giving some feedback on some non-intuitive (at least to me) behavior I am seeing.

@gsjaardema (Contributor, Author)

Question: when you built libhdf5, did you set the option

--with-default-plugindir=/root/src/seacas/lib/hdf5/lib/plugin

I use the CMake build, so use:

         -DH5_DEFAULT_PLUGINDIR:PATH=${INSTALL_PATH}/lib/hdf5/lib/plugin

I was not using this at the beginning, but started at some point in this process...

@edwardhartnett (Contributor)

OK, but what specifically can I do to make it better? Cache HDF5_DEFAULT_PLUGINDIR?

@gsjaardema (Contributor, Author)

Something related to the plugindir should be cached such that if I configure, then touch CMakeCache.txt, and then remake the code, the plugin directory does not change. I think there should be some clarification/distinction/unification between HDF5_DEFAULT_PLUGINDIR and NETCDF_PLUGIN_INSTALL_DIR.

It is unclear what NETCDF_PLUGIN_INSTALL_DIR=YES is supposed to mean, since (I think) all other *_DIR variables are typically paths and not booleans. Because it is currently both, it has to be a STRING, which makes it difficult to validate its value.
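One way to make that split explicit would be something like the following (a sketch of the suggestion, not the current netcdf-c code; the variable names here are hypothetical):

```cmake
# Separate the boolean from the path so each can be validated on its own.
option(NETCDF_ENABLE_PLUGIN_INSTALL "Install filter plugins" ON)
set(NETCDF_PLUGIN_INSTALL_DIR "" CACHE PATH
    "Directory to install plugins into (empty = HDF5 default location)")
```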

@gsjaardema (Contributor, Author)

I also see that there are two listings of the plugin directory in the libnetcdf.settings:

../libnetcdf.settings.in:Plugin Install Prefix:  @PLUGIN_INSTALL_DIR_SETTING@
../libnetcdf.settings.in:Plugin Install Dir:     @NETCDF_PLUGIN_INSTALL_DIR@

In my configurations/builds, both of those show the same value, which I assume is correct, but I'm unclear what to do if they ever differ. If they are always supposed to be the same, one of them should perhaps be removed, hopefully simplifying some logic somewhere.

@edwardhartnett (Contributor)

OK, let me take a look at this...
