Replies: 6 comments
-
Hi Andrea:
1. Using RAF against a remote file is surely a bad idea. A THREDDS server
would help a lot.
2. In addition, I did notice recently how badly coded the HDF5/NetCDF IOSP
is, exactly as you describe: iterating over all the file chunks to see if
they intersect. That does strike me as a likely reason the C library does
so much better. I have a background project
<https://github.com/JohnLCaron/cdm-kotlin> that I think fixed that problem,
but unless you want to switch to an alpha-quality Kotlin library, it
probably doesn't help you.
3. I have built experimental versions of RAF on top of NIO memory mapping;
in general the performance isn't that much better. All mmap gives you is
maybe better caching, so it won't solve #2. But maybe over NFS it might be
worth it. If you want to try it, and manage the changes and PRs etc., I
might be able to find my old code to get you started. See the sketch after
this list for what such an experiment could look like.
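For reference, a minimal sketch of an NIO-backed read path (the class and its methods are hypothetical illustrations, not the netcdf-java RandomAccessFile API):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: serve reads from one memory-mapped region instead of
// issuing a seek() + read() pair per access. A real implementation would need
// one mapping per segment for files over 2 GB, and care around unmapping.
class MMapReader implements AutoCloseable {
  private final FileChannel channel;
  private final MappedByteBuffer buffer;

  MMapReader(Path path) throws IOException {
    channel = FileChannel.open(path, StandardOpenOption.READ);
    buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
  }

  // Read len bytes at an absolute file offset; the kernel pages data in lazily.
  void read(long pos, byte[] dst, int off, int len) {
    ByteBuffer dup = buffer.duplicate(); // independent position, shared mapping
    dup.position((int) pos);             // single-mapping sketch: offsets < 2 GB
    dup.get(dst, off, len);
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
```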
Regards, John
-
Hi John, also thanks for sharing the Kotlin library. I'm having some trouble understanding it; Kotlin seems to be a bit harder to grok than I thought. I guess the interesting part would be going from a Section, to the address of a chunk in the section, to the offset on disk of said chunk. I'm guessing it's the Btree1 class?
-
Hi Andrea: I think the gist of the improvement is in com.sunya.cdm.layout.Tiling, in conjunction with com.sunya.netchdf.hdf5.H5Tiling and Btree1. The idea is that each variable has a fixed tile size that tessellates its dataspace, so you can determine which tiles are needed for any given data subset without looking at disk at all. Then you use Btree1 to efficiently find those tiles. If any are missing, you can return an array filled with the fill value. The old code just used the btree to find the needed tiles, so it had to touch every chunk. (So embarrassing to look back at your old code with new eyes.) Regards, John.
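To make the arithmetic concrete, here is a small Java sketch of the idea (my own illustration, not code from cdm-kotlin): because the tile grid is regular, the chunk indices covering any subset follow from the origin and shape alone, with no disk access.

```java
// Sketch, not library code: compute which chunk (tile) indices intersect a
// requested hyperslab, given a regular chunk grid.
class ChunkRange {
  // origin/shape describe the wanted section; chunkShape is the fixed tile size.
  static int[][] intersectingChunks(int[] origin, int[] shape, int[] chunkShape) {
    int rank = origin.length;
    int[] first = new int[rank]; // first chunk index along each dimension
    int[] last = new int[rank];  // last chunk index (inclusive)
    for (int d = 0; d < rank; d++) {
      first[d] = origin[d] / chunkShape[d];
      last[d] = (origin[d] + shape[d] - 1) / chunkShape[d];
    }
    return new int[][] {first, last};
  }

  public static void main(String[] args) {
    // A 700-step time series at one (lat, lon) point, with 32x32x32 chunks:
    int[][] r = intersectingChunks(new int[] {0, 100, 200},
                                   new int[] {700, 1, 1},
                                   new int[] {32, 32, 32});
    // first = [0, 3, 6], last = [21, 3, 6]: only 22 chunks are needed,
    // so the btree only has to be searched for those.
    System.out.println(java.util.Arrays.toString(r[0]) + " .. "
        + java.util.Arrays.toString(r[1]));
  }
}
```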
-
Would you be interested in using netchdf-kotlin in GeoSolutions/GeoServer? Contact me off-line if so.
-
I've finally managed to get my hands dirty with the code; here are some findings.

I have confirmed that the current BTree reading code is inefficient, reading too many DataBTree.Node structures from disk. Profiling confirms this is the hot spot for the execution of my test code (extracting a time series at a particular point, out of a file in COARDS convention, with 700 different times). It seems a bit crazy that moving along the tree structure is more expensive than reading the actual data, but the code ends up reading more than 500 nodes, while the data reading code is probably smart enough to read only the bits needed.

The trick is to avoid opening, and reading from disk (which does happen in the Node constructor), the nodes that we can predict are not going to be useful. This leads to opening only around 40 Node objects instead (the DataBTree.debugDataBtree came in handy here), significantly cutting down the read time. I have a rough proof of concept in this commit. The idea is to have an iterator that is given not just the start origin, but the full wanted section, and to use that to filter down the reads, using a subclass of Node, SearchNode, that does the job.

And then there is memory mapping, and the deprecated class you referenced. I can confirm it provides no significant speedup... until it does! Making the code use MMapRandomAccessFile seems to provide no significant speedup at first, but then I noticed the FileCache. Enabling the FileCache, in combination with MMapRandomAccessFile being used and cached in there, leads to another 4x speedup if the time series is extracted again. It seems memory mapping helps, but only if one has a chance to reuse the memory map more than once (which I guess would be quite useful for desktop apps like ToolsUI or Panoply, or long-lived servers like THREDDS).

My experience with memory mapping in the GeoTools library also tells me this approach is not without thorns though: it's great on Linux, very poor on Windows. On Linux the mapping is virtual and comes with no actual memory usage; it's all managed transparently by the kernel. On Windows it's physical, the memory is actually allocated, so it cannot be used for large files.
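For the curious, the core of the filtering is just an intersection test that can be evaluated on the btree entry keys (which carry each chunk's origin) before a child Node is constructed and read from disk. A simplified sketch of the test, with my own names (the actual proof of concept applies it inside the node iterator via SearchNode):

```java
// Simplified illustration (my naming, not the actual patch): a chunk's origin
// is stored in the btree entry key, so whether the chunk can contribute to the
// wanted section is pure arithmetic, computable before any further disk read.
class ChunkFilter {
  static boolean chunkIntersects(int[] chunkOffset, int[] chunkShape,
                                 int[] wantOrigin, int[] wantShape) {
    for (int d = 0; d < chunkOffset.length; d++) {
      int chunkLast = chunkOffset[d] + chunkShape[d] - 1;
      int wantLast = wantOrigin[d] + wantShape[d] - 1;
      // Disjoint along any dimension means the chunk cannot intersect.
      if (chunkLast < wantOrigin[d] || chunkOffset[d] > wantLast) return false;
    }
    return true;
  }
}
```

At interior levels the same idea applies: an entry's key and the next entry's key bound the chunk origins in that subtree, so whole subtrees can be skipped without ever reading their nodes.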
-
I'm trying to understand a significant performance difference between a Java-based application using the NetCDF library and a C++ one using the C library. The C one performs quite a bit better (2-3 times faster) over network file systems when getting a time series of data for a particular point, out of a file with 32x32x32 chunking (time, lat, lon), using NetCDF 4 with compression. Mind, if the data is local the difference is minimal instead, which in my experience indicates an I/O chattiness problem (too many small accesses).
Checking with strace, I can see that the Java program is doing a massive number of reads and seeks, while the C one is using memory mapping and performing the job with a much smaller number of kernel calls. Both programs are extracting only the "Section" of data required; the Java one does so by giving the target Section to Variable.read(section).
Debugging, I can see the NetCDF Java library iterating over the file chunks and asking itself "is this chunk intersecting the target Section?", ending up visiting all chunks (even if only the header of each chunk is read), which seems inefficient. Is there no header indicating the offset of each chunk, so that only the desired chunks can be read? (Similar to the tile directory in a GeoTIFF, for comparison.)
The C program seems to be reading only what it needs somehow, but I'm not as adept at reading it as the Java one. Or it could be that memory mapping is just hiding the extra calls, but it's really doing between one and two orders of magnitude fewer I/O calls.
Ideas? Also wondering: the Java program is using RandomAccessFile, but Java also supports memory mapping through NIO. Is there any way to make the Java library use it? I've been exploring the code a bit, and also looked at the "ng" repository (version 6 and 7) without much luck.
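For context, the access pattern in question looks roughly like this (assuming netcdf-java 5.x; the file path, variable name and point indices are made up, while Variable.read(Section) is the actual call being discussed):

```java
import ucar.ma2.Array;
import ucar.ma2.Section;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;
import ucar.nc2.Variable;

public class TimeSeriesRead {
  public static void main(String[] args) throws Exception {
    // Hypothetical file over NFS; the variable is chunked 32x32x32 (time, lat, lon).
    try (NetcdfFile ncfile = NetcdfFiles.open("/mnt/nfs/data/temperature.nc")) {
      Variable v = ncfile.findVariable("temperature");
      // All 700 times at a single grid point: only ~22 chunks contain this
      // column, yet the library ends up visiting every chunk in the btree.
      Section section = new Section("0:699,100,100");
      Array data = v.read(section);
      System.out.println(data.getSize() + " values read");
    }
  }
}
```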