The role of Kerchunk in PVGIS 6 #581
NikosAlexandris started this conversation in Show and tell
Dear all @fsspec/kerchunk,
For the past four years I’ve worked on the PVGIS[1] project at the Joint Research Centre, European Commission. PVGIS provides on-demand, instant solar-energy estimates. The secret of its speed lies not in the programming language, but in a handcrafted, chunked time-series data strategy pioneered almost two decades ago: files of 25 × 25 pixel chunks with values stored contiguously in time. Any increase in latency directly affects its users.
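The addressing idea behind such a layout can be sketched in a few lines. This is a hypothetical reconstruction, not the actual in-house file format: it only illustrates why "25 × 25 pixel chunks with values stored contiguously in time" makes a point query a single seek and a single contiguous read.

```python
# Sketch of chunk addressing for a grid tiled into 25 x 25 pixel chunks,
# where each pixel's full time series is stored contiguously.
# (Hypothetical layout and names; the real PVGIS format is in-house.)

CHUNK = 25          # chunk edge, in pixels
T = 8760            # e.g. hourly values for one year
ITEMSIZE = 4        # e.g. float32

def series_offset(row: int, col: int, grid_cols: int) -> int:
    """Byte offset of pixel (row, col)'s full time series."""
    chunk_row, chunk_col = row // CHUNK, col // CHUNK
    chunks_per_row = grid_cols // CHUNK
    chunk_index = chunk_row * chunks_per_row + chunk_col
    # position of the pixel inside its 25 x 25 chunk
    local = (row % CHUNK) * CHUNK + (col % CHUNK)
    pixels_per_chunk = CHUNK * CHUNK
    return (chunk_index * pixels_per_chunk + local) * T * ITEMSIZE

# One pixel's whole series is a single contiguous read of T * ITEMSIZE bytes.
print(series_offset(0, 0, grid_cols=100))   # first pixel, offset 0
```

Because consecutive time steps of a pixel are adjacent on disk, serving a point query never scans across the spatial dimensions.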
My vision was to replace the legacy C/C++ codebase with a future-proof engine that still delivers results in under a second. The result is PVGIS 6[2][3], an all-Python + NumPy prototype that demonstrates the Python ecosystem can support high-performance scientific computing. The chunk-oriented thinking embodied in Kerchunk, Xarray, Zarr & friends has both validated the existing scheme behind PVGIS and inspired the transition to Python. A forthcoming PVGIS service based on the new engine will read its time series from appropriately chunked Zarr[4] stores.
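What "appropriately chunked" buys can be shown with back-of-the-envelope arithmetic. The helper below is hypothetical and the numbers are illustrative (not the actual PVGIS grid): it only counts how many chunks a single-pixel, full-length time-series read must open under different time-chunk sizes, assuming 25 × 25 spatial chunking.

```python
# Why chunk shape matters for a point-query service: count the chunks
# that one pixel's full time-series read must touch.
# Illustrative numbers only; not the actual PVGIS grid or chunking.
import math

def chunks_touched(t_len: int, chunk_t: int, chunk_y: int = 25, chunk_x: int = 25) -> int:
    """Chunks read for one pixel's full time series of length t_len."""
    # the pixel lies in exactly one spatial chunk, so only the time
    # dimension determines how many chunks must be opened
    return math.ceil(t_len / chunk_t)

t_len = 20 * 8760                              # twenty years of hourly data
print(chunks_touched(t_len, chunk_t=t_len))    # time-contiguous: 1 chunk
print(chunks_touched(t_len, chunk_t=24))       # day-sized time chunks: 7300
```

One open-decompress-read cycle versus thousands is the difference between a sub-second answer and a slow one.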
What's the relation with `kerchunk`, however?

`kerchunk` & friends, including @martindurant, have been more than simple inspiration for PVGIS 6. The single most important question to answer before embarking on a complete rewrite of the old PVGIS 5.x codebase[5] was:

Is it possible?

Yes, it is. By appropriately chunking the input time series, of course!

Is it necessary?

It depends on who you ask, is my humble view. But let's say that replicating the speed of the current service (PVGIS 5.x) was mandatory.
So, what about Kerchunk?

PVGIS 6 does not need `kerchunk` to do this. Nonetheless, `kerchunk` was the entry gate to experimental work and a learning process that eventually led to PVGIS 6. In the end, it's all about chunking, isn't it?

Have you thought of using Kerchunk to...?
I spent a fairly large amount of time experimenting with Kerchunk. Aiming at a most economical data-storage solution, I was hoping to read, fast, large time series[7] that are originally split across multiple NetCDF files, by feeding a single Kerchunk-generated index, exposed as a Zarr store, to Xarray's `open_dataset` function. I thought this to be possible with the existing powers of Kerchunk. However, there was a catch: data in NetCDF files are usually compressed, hence by definition chunked[8]. Therefore, time needs to be spent decompressing before reading and loading values into memory. It is likely impossible to achieve split-second speed when reading large time series from multiple input files. PVGIS 6 was getting more challenging, more exciting.

I’m grateful to the Kerchunk authors and community! 🙏🏼
Nikos
ps - For the record, the first hint about `kerchunk` came to me from @pmav. Reading, learning, and asking many questions (mine, in this repository[3]) came next. So did writing https://github.com/NikosAlexandris/rekx (which I am not happy I could not merge into `kerchunk` itself, yet?). Of course, many posts in the Pangeo forum helped, more than a bit. The answer was out there already! The work to do was to piece together different software and the right data structure.

ps2 - I've also posted about PVGIS 6 on LinkedIn.
Footnotes

1. https://joint-research-centre.ec.europa.eu/photovoltaic-geographical-information-system-pvgis_en
2. https://code.europa.eu/pvgis/pvgis
3. https://github.com/fsspec/kerchunk/issues?q=involves%3Anikosalexandris
4. Zarr gets smarter and more efficient: https://discourse.pangeo.io/t/new-cloud-tensor-i-o-benchmarks-zarr-is-fast-now/5459
5. in-house "proprietary" code in C/C++[6]
6. the larger part of it was published as a GRASS GIS module called r.sun
7. say, yearly data of hourly time series
8. a chunk is then the atomic unit of compressed data, is my understanding