Replies: 6 comments
-
Hi Andrea:
1. Using RAF against a remote file is surely a bad idea. A THREDDS server
would help a lot.
2. In addition, I did notice recently how badly coded the HDF5/NetCDF IOSP
is, exactly as you describe: iterating over all the file chunks to see if
they intersect. That does strike me as a likely reason the C library does
so much better. I have a background project
<https://github.com/JohnLCaron/cdm-kotlin> that I think fixed that problem,
but unless you want to switch to an alpha-quality Kotlin library, it
probably doesn't help you.
3. I have built experimental versions of RAF on top of NIO memory mapping;
in general the performance isn't that much better. All mmap gives you is
maybe better caching, so it won't solve #2. But maybe over NFS it might be
worth it. If you want to try it, and manage the changes and PRs etc., I
might be able to find my old code to get you started. See the sketch after
this list for what such an experiment could look like.
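For reference, a minimal sketch of an NIO-backed read path (the class and its methods are hypothetical illustrations, not the netcdf-java RandomAccessFile API):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: serve reads from one memory-mapped region instead of
// issuing a seek() + read() pair per access. A real implementation would need
// one mapping per segment for files over 2 GB, and care around unmapping.
class MMapReader implements AutoCloseable {
  private final FileChannel channel;
  private final MappedByteBuffer buffer;

  MMapReader(Path path) throws IOException {
    channel = FileChannel.open(path, StandardOpenOption.READ);
    buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
  }

  // Read len bytes at an absolute file offset; the kernel pages data in lazily.
  void read(long pos, byte[] dst, int off, int len) {
    ByteBuffer dup = buffer.duplicate(); // independent position, shared mapping
    dup.position((int) pos);             // single-mapping sketch: offsets < 2 GB
    dup.get(dst, off, len);
  }

  @Override
  public void close() throws IOException {
    channel.close();
  }
}
```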
Regards, John
-
Hi John, also thanks for sharing the Kotlin library. I'm having some trouble understanding it; Kotlin seems to be a bit harder to grok than I thought. I guess the interesting part would be going from a Section, to the address of a chunk in the section, to the offset on disk of said chunk. I'm guessing it's the Btree1 class?
-
Hi Andrea: I think the gist of the improvement is in com.sunya.cdm.layout.Tiling, in conjunction with com.sunya.netchdf.hdf5.H5Tiling and Btree1. The idea is that each variable has a fixed tile size that tessellates its dataspace, so you can determine which tiles are needed for any given data subset without looking at disk at all. Then you use Btree1 to efficiently find those tiles. If any are missing, you can return an array filled with the fill value. The old code just used the btree to find the needed tiles, so it had to touch every chunk. (So embarrassing to look back at your old code with new eyes.) Regards, John.
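To make the arithmetic concrete, here is a small Java sketch of the idea (my own illustration, not code from cdm-kotlin): because the tile grid is regular, the chunk indices covering any subset follow from the origin and shape alone, with no disk access.

```java
// Sketch, not library code: compute which chunk (tile) indices intersect a
// requested hyperslab, given a regular chunk grid.
class ChunkRange {
  // origin/shape describe the wanted section; chunkShape is the fixed tile size.
  static int[][] intersectingChunks(int[] origin, int[] shape, int[] chunkShape) {
    int rank = origin.length;
    int[] first = new int[rank]; // first chunk index along each dimension
    int[] last = new int[rank];  // last chunk index (inclusive)
    for (int d = 0; d < rank; d++) {
      first[d] = origin[d] / chunkShape[d];
      last[d] = (origin[d] + shape[d] - 1) / chunkShape[d];
    }
    return new int[][] {first, last};
  }

  public static void main(String[] args) {
    // A 700-step time series at one (lat, lon) point, with 32x32x32 chunks:
    int[][] r = intersectingChunks(new int[] {0, 100, 200},
                                   new int[] {700, 1, 1},
                                   new int[] {32, 32, 32});
    // first = [0, 3, 6], last = [21, 3, 6]: only 22 chunks are needed,
    // so the btree only has to be searched for those.
    System.out.println(java.util.Arrays.toString(r[0]) + " .. "
        + java.util.Arrays.toString(r[1]));
  }
}
```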
-
Would you be interested in using netchdf-kotlin in GeoSolutions/GeoServer? Contact me off-line if so.
-
I've finally managed to get my hands dirty with the code; here are some findings.

I have confirmed that the current BTree reading code is inefficient, reading too many DataBTree.Node structures from disk. Profiling confirms this is the hot spot for the execution of my test code (extracting a time series at a particular point, out of a file in COARDS convention, with 700 different times). It seems a bit crazy that moving along the tree structure is more expensive than reading the actual data, but the code ends up reading more than 500 nodes, while the data reading code is probably smart enough to read only the bits needed.

The trick is to avoid opening, and reading from disk (which does happen in the Node constructor), the nodes that we can predict are not going to be useful. This leads to opening only around 40 Node objects instead (the DataBTree.debugDataBtree came in handy here), significantly cutting down the read time. I have a rough proof of concept in this commit. The idea is to have an iterator that is given not just the start origin, but the full wanted section, and to use that to filter down the reads, using a subclass of Node, SearchNode, that does the job.

And then there is memory mapping, and the deprecated class you referenced. I can confirm it provides no significant speedup... until it does! Making the code use MMapRandomAccessFile seems to provide no significant speedup at first, but then I noticed the FileCache. Enabling the FileCache, in combination with MMapRandomAccessFile being used and cached in there, leads to another 4x speedup if the time series is extracted again. It seems memory mapping helps, but only if one has a chance to reuse the memory map more than once (which I guess would be quite useful for desktop apps like ToolsUI or Panoply, or long-lived servers like THREDDS).

My experience with memory mapping in the GeoTools library also tells me this approach is not without thorns though: it's great on Linux, very poor on Windows. On Linux the mapping is virtual and comes with no actual memory usage; it's all managed transparently by the kernel. On Windows it's physical, the memory is actually allocated, so it cannot be used for large files.
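For the curious, the core of the filtering is just an intersection test that can be evaluated on the btree entry keys (which carry each chunk's origin) before a child Node is constructed and read from disk. A simplified sketch of the test, with my own names (the actual proof of concept applies it inside the node iterator via SearchNode):

```java
// Simplified illustration (my naming, not the actual patch): a chunk's origin
// is stored in the btree entry key, so whether the chunk can contribute to the
// wanted section is pure arithmetic, computable before any further disk read.
class ChunkFilter {
  static boolean chunkIntersects(int[] chunkOffset, int[] chunkShape,
                                 int[] wantOrigin, int[] wantShape) {
    for (int d = 0; d < chunkOffset.length; d++) {
      int chunkLast = chunkOffset[d] + chunkShape[d] - 1;
      int wantLast = wantOrigin[d] + wantShape[d] - 1;
      // Disjoint along any dimension means the chunk cannot intersect.
      if (chunkLast < wantOrigin[d] || chunkOffset[d] > wantLast) return false;
    }
    return true;
  }
}
```

At interior levels the same idea applies: an entry's key and the next entry's key bound the chunk origins in that subtree, so whole subtrees can be skipped without ever reading their nodes.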
-
I'm trying to understand a significant performance difference between a Java-based application using the NetCDF library and a C++ one using the C library. The C one performs quite a bit better (2-3 times faster) over network file systems when getting a time series of data for a particular point, out of a file with 32x32x32 chunking (time, lat, lon), using NetCDF 4 with compression. Mind, if the data is local the difference is minimal instead, which in my experience indicates an I/O chattiness problem (too many small accesses).
Checking with strace, I can see that the Java program is doing a massive number of reads and seeks, while the C one is using memory mapping and performing the job with a much smaller number of kernel calls. Both programs are extracting only the "Section" of data required; the Java one does so by giving the target Section to Variable.read(section).
Debugging, I can see the NetCDF Java library iterating over the file chunks and asking itself "is this chunk intersecting the target Section?", ending up visiting all chunks (even if only the header of each chunk is read), which seems inefficient. Is there no header indicating the offset of each chunk, so that only the desired chunks can be read? (Similar to the tile directory in a GeoTIFF, for comparison.)
The C program seems to be reading only what it needs somehow, but I'm not as adept at reading it as the Java one. Or it could be that memory mapping is just hiding the extra calls, but it's really doing between one and two orders of magnitude fewer I/O calls.
Ideas? Also wondering: the Java program is using RandomAccessFile, but Java also supports memory mapping through NIO. Is there any way to make the Java library use it? I've been exploring the code a bit, and also looked at the "ng" repository (version 6 and 7) without much luck.
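For context, the access pattern in question looks roughly like this (assuming netcdf-java 5.x; the file path, variable name and point indices are made up, while Variable.read(Section) is the actual call being discussed):

```java
import ucar.ma2.Array;
import ucar.ma2.Section;
import ucar.nc2.NetcdfFile;
import ucar.nc2.NetcdfFiles;
import ucar.nc2.Variable;

public class TimeSeriesRead {
  public static void main(String[] args) throws Exception {
    // Hypothetical file over NFS; the variable is chunked 32x32x32 (time, lat, lon).
    try (NetcdfFile ncfile = NetcdfFiles.open("/mnt/nfs/data/temperature.nc")) {
      Variable v = ncfile.findVariable("temperature");
      // All 700 times at a single grid point: only ~22 chunks contain this
      // column, yet the library ends up visiting every chunk in the btree.
      Section section = new Section("0:699,100,100");
      Array data = v.read(section);
      System.out.println(data.getSize() + " values read");
    }
  }
}
```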