Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Dec 10, 2024
1 parent 8f32936 commit d6e15a5
Showing 9 changed files with 415 additions and 90 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
06259119
b3698749
180 changes: 138 additions & 42 deletions _tex/paper.tex
@@ -118,15 +118,17 @@
\makesavenoteenv{longtable}
\usepackage{graphicx}
\makeatletter
\def\maxwidth{\ifdim\Gin@nat@width>\linewidth\linewidth\else\Gin@nat@width\fi}
\def\maxheight{\ifdim\Gin@nat@height>\textheight\textheight\else\Gin@nat@height\fi}
\makeatother
% Scale images if necessary, so that they will not overflow the page
% margins by default, and it is still possible to overwrite the defaults
% using explicit options in \includegraphics[width, height, ...]{}
\setkeys{Gin}{width=\maxwidth,height=\maxheight,keepaspectratio}
\newsavebox\pandoc@box
\newcommand*\pandocbounded[1]{% scales image to fit in text height/width
\sbox\pandoc@box{#1}%
\Gscale@div\@tempa{\textheight}{\dimexpr\ht\pandoc@box+\dp\pandoc@box\relax}%
\Gscale@div\@tempb{\linewidth}{\wd\pandoc@box}%
\ifdim\@tempb\p@<\@tempa\p@\let\@tempa\@tempb\fi% select the smaller of both
\ifdim\@tempa\p@<\p@\scalebox{\@tempa}{\usebox\pandoc@box}%
\else\usebox{\pandoc@box}%
\fi%
}
% Set default figure placement to htbp
\makeatletter
\def\fps@figure{htbp}
\makeatother
% definitions for citeproc citations
@@ -210,9 +212,6 @@
\@ifpackageloaded{subcaption}{}{\usepackage{subcaption}}
\makeatother

\ifLuaTeX
\usepackage{selnolig} % disable illegal ligatures
\fi
\usepackage{bookmark}

\IfFileExists{xurl.sty}{\usepackage{xurl}}{} % add URL line breaks if available
@@ -336,7 +335,7 @@ \subsubsection{\texorpdfstring{\textbf{Global API

\centering{

\includegraphics{figures/figure-1.png}
\pandocbounded{\includegraphics[keepaspectratio]{figures/figure-1.png}}

}

@@ -392,7 +391,7 @@ \section{Methodology}\label{methodology}

\centering{

\includegraphics{figures/figure-2.png}
\pandocbounded{\includegraphics[keepaspectratio]{figures/figure-2.png}}

}

@@ -419,7 +418,7 @@ \section{Methodology}\label{methodology}

\centering{

\includegraphics{figures/figure-3.png}
\pandocbounded{\includegraphics[keepaspectratio]{figures/figure-3.png}}

}

@@ -442,13 +441,13 @@ \section{Methodology}\label{methodology}
table describes the different configurations used in our tests.

\begin{longtable}[]{@{}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.1589}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.3444}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.1457}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.0993}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.0728}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.0728}}
>{\raggedright\arraybackslash}p{(\columnwidth - 12\tabcolsep) * \real{0.1060}}@{}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.1589}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.3444}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.1457}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.0993}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.0728}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.0728}}
>{\raggedright\arraybackslash}p{(\linewidth - 12\tabcolsep) * \real{0.1060}}@{}}
\toprule\noalign{}
\begin{minipage}[b]{\linewidth}\raggedright
prefix
@@ -501,7 +500,7 @@ \section{Results}\label{results}

\centering{

\includegraphics{figures/figure-4.png}
\pandocbounded{\includegraphics[keepaspectratio]{figures/figure-4.png}}

}

@@ -525,7 +524,7 @@ \section{Results}\label{results}

\centering{

\includegraphics{figures/figure-5.png}
\pandocbounded{\includegraphics[keepaspectratio]{figures/figure-5.png}}

}

@@ -539,9 +538,11 @@ \section{Results}\label{results}

\section{Recommendations}\label{recommendations}

We have split our recommendations for the ATL03 product into 3 main
categories, creating the files, accessing the files, and future tool
development.
Based on the benchmarks from our tests, we have split our
recommendations for the ATL03 product into three main categories:
creating the files, accessing the files, and future tool development.
These recommendations aim to streamline HDF5 workflows in cloud
environments, enhancing performance and reducing costs.

\subsection{Recommended cloud
optimizations}\label{recommended-cloud-optimizations}
@@ -596,15 +597,77 @@ \subsubsection{Reasoning}\label{reasoning}
ideal scenario all the space will be filled, but that is not the case
and we will end up with unused space; see~\ref{fig-2}.
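
A minimal sketch of how a data producer could enable paged aggregation
when writing a file with \texttt{h5py} follows; the file name, page
size, and dataset layout are illustrative assumptions, not the actual
ATL03 configuration:

\begin{verbatim}
import h5py

# Illustrative values only; the page size should be tuned against the
# request sizes measured for the target object store.
PAGE_SIZE = 8 * 1024 * 1024  # 8 MiB file-space pages

with h5py.File("atl03_like_example.h5", "w",
               fs_strategy="page",      # paged file-space aggregation
               fs_persist=True,         # keep free-space tracking in the file
               fs_page_size=PAGE_SIZE) as f:
    # ~10 MB chunks of float32 (2,500,000 values * 4 bytes each)
    f.create_dataset("photon_height", shape=(100_000_000,),
                     dtype="f4", chunks=(2_500_000,))
\end{verbatim}

Existing files can usually be rewritten with the same strategy using the
\texttt{h5repack} utility and its file-space options.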

\subsection{Recommended access
patterns}\label{recommended-access-patterns}

In progress
\subsection{Recommended Access
Patterns}\label{recommended-access-patterns}

\subsection{Recommended tooling
development}\label{recommended-tooling-development}
As we saw in our benchmarks, efficient access to cloud-optimized HDF5
files in cloud storage requires that we also optimize our access
patterns. The following recommendations focus on workflows for Python
users, but they should be applicable across programming languages; a
short Python sketch follows the list below. It is also worth mentioning
that the HDF Group aims to include some of these features in its
roadmap.

In progress
\begin{itemize}
\tightlist
\item
\textbf{Efficient Reads}: Efficiently reading cloud-hosted HDF5 files
involves minimizing network requests and prioritizing large sequential
reads. Configure chunk sizes between 1--10 MB to match the block sizes
used in cloud object storage systems, ensuring meaningful data
retrieval in each read. Avoid small chunks, as they cause excessive
HTTP overhead and slower access speeds.
\item
\textbf{Parallel Access}: Use parallel computing frameworks like
\texttt{Dask} or multiprocessing to divide read operations across
multiple processes or nodes. This alleviates the sequential access
bottleneck caused by the HDF5 global lock, particularly in workflows
accessing multiple datasets.
\item
  \textbf{Cache Management}: Implement caching for metadata to avoid
  repetitive fetches. Tools like \texttt{fsspec} or \texttt{h5coro}
  allow in-memory or on-disk caching for frequently accessed data,
  reducing latency during high-frequency access.
\item
\textbf{Regional Access}: Operate workflows in the same cloud region
as the data to minimize costs and latency. Cross-region data transfer
is expensive and introduces significant delays. Where possible, deploy
virtual machines close to the data storage region.
\end{itemize}
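
To make these recommendations concrete, the following is a minimal
Python sketch, assuming \texttt{fsspec}/\texttt{s3fs} and \texttt{h5py}
are available; the bucket URL and dataset path are illustrative, not the
actual ATL03 location:

\begin{verbatim}
import fsspec
import h5py

url = "s3://example-bucket/ATL03_example.h5"  # illustrative location

# One filesystem object per region/credential set; anonymous access here.
fs = fsspec.filesystem("s3", anon=True)

# Large blocks favour a few sequential range requests over many small
# ones; "blockcache" keeps already-fetched blocks around so metadata is
# not re-read repeatedly.
with fs.open(url, mode="rb",
             cache_type="blockcache",
             block_size=8 * 1024 * 1024) as remote_file:
    with h5py.File(remote_file, mode="r") as f:
        heights = f["gt1l/heights/h_ph"][:1_000_000]
\end{verbatim}

The same pattern can be distributed with \texttt{Dask}, giving each
worker process its own file handle to sidestep the HDF5 global lock.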

\subsection{Recommended Tooling
Development}\label{recommended-tooling-development}

To enable widespread and efficient use of HDF5 files in cloud
environments, it is crucial to develop robust tools across all major
programming languages. The HDF Group has expressed intentions to include
these features in their roadmap, ensuring seamless compatibility with
emerging cloud storage and computing standards. This section highlights
tooling strategies to support metadata indexing, driver enhancements,
and diagnostics, applicable to Python and other languages.

\begin{itemize}
\tightlist
\item
  \textbf{Enhanced HDF5 Drivers:} Improve libraries and drivers such as
  \texttt{h5py} and the \texttt{ROS3} virtual file driver to better
  handle the characteristics of cloud object storage, for example
  through intelligent request batching and speculative reads. This
  mitigates inefficiencies caused by high-latency networks.
\item
\textbf{Metadata Indexing:} Develop tools for pre-indexing metadata,
similar to Kerchunk. These tools should enable clients to retrieve
only necessary data offsets, avoiding full metadata reads and
improving access times.
\item
  \textbf{Kerchunk-like Integration:} Extend Kerchunk to integrate
  seamlessly with analysis libraries like Xarray. This includes building
  robust sidecar files that efficiently map hierarchical datasets,
  enabling faster partial reads and enhancing cloud-native workflows;
  see the sketch after this list.
\item
\textbf{Diagnostic Tools:} Create tools for diagnostics and
performance profiling tailored to cloud-optimized HDF5 files. These
tools should identify bottlenecks in access patterns and recommend
adjustments in configurations or chunking strategies.
\end{itemize}
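
As an illustration of the Kerchunk-style integration described above,
the sketch below builds a reference index once and then opens the file
lazily through Xarray's Zarr engine. The URL is an assumption and the
exact Kerchunk API may evolve:

\begin{verbatim}
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr

url = "s3://example-bucket/ATL03_example.h5"  # illustrative location

# Scan the HDF5 metadata once and record every chunk's byte offset in a
# lightweight JSON-style sidecar (the "reference" set).
with fsspec.open(url, mode="rb", anon=True) as f:
    refs = SingleHdf5ToZarr(f, url, inline_threshold=100).translate()

# Later reads go straight to the data chunks, with no HDF5 metadata
# traversal over the network.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": refs,
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)
\end{verbatim}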

\subsection{Mission implementation}\label{mission-implementation}

@@ -635,18 +698,51 @@ \subsection{Mission implementation}\label{mission-implementation}
utility (with no options) to create a ``defragmented'' final product.
\end{enumerate}

\section{Discussion}\label{discussion}

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\subsection{Discussion and Further
Work}\label{discussion-and-further-work}

We believe that implementing cloud-optimized HDF5 will greatly improve
downstream workflows and unlock science in the cloud. We also recognize
that, in order to get there, some key factors in the ecosystem need to
be addressed: chunking strategies, adaptive caching, and automatic
driver configuration should be developed to optimize performance.

Efforts should expand multi-language support, creating universal
interfaces and libraries for broader adoption beyond Python.
Cloud-native enhancements must focus on optimizing HDF5 for distributed
systems and object storage, addressing egress costs, ease of use and
scalability. Finally, advancing ecosystem interoperability involves
setting integration standards and aligning with emerging trends such as
serverless and edge computing. These efforts, combined with community
collaboration, will modernize HDF5 to meet the challenges of evolving
data-intensive applications.

\subsubsection{Chunking Shapes and
Sizes}\label{chunking-shapes-and-sizes}

Optimizing chunk shapes and sizes is essential for efficient HDF5 usage,
especially in cloud environments; a short \texttt{h5py} sketch follows
the list below:

\begin{itemize}
\tightlist
\item
Chunking shapes and sizes
\textbf{Chunk Shape:} Align chunk dimensions with anticipated access
patterns. For example, row-oriented queries benefit from row-aligned
chunks.
\item
Paged aggregation vs User block
\item
Side effects on different access patterns, e.g.~Kerchunk
\end{enumerate}
\textbf{Chunk Size:} Use chunk sizes between 1--10 MB to match cloud
storage block sizes. Larger chunks improve sequential access but
require more memory. Smaller chunks support granular reads but may
increase network overhead.
\end{itemize}
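
For instance, a minimal \texttt{h5py} sketch of how the same array could
be chunked for two different access patterns (the shapes and dataset
names are illustrative, not the actual ATL03 layout):

\begin{verbatim}
import h5py

# Hypothetical 2-D array: 1,000,000 segments x 200 samples of float32.
with h5py.File("chunk_shape_demo.h5", "w") as f:
    # Row-oriented access (all samples for a range of segments):
    # 5,000 x 200 x 4 bytes = ~4 MB per chunk.
    f.create_dataset("row_friendly", shape=(1_000_000, 200),
                     dtype="f4", chunks=(5_000, 200))

    # Column-oriented access (one sample index across all segments):
    # 1,000,000 x 1 x 4 bytes = ~4 MB per chunk.
    f.create_dataset("col_friendly", shape=(1_000_000, 200),
                     dtype="f4", chunks=(1_000_000, 1))
\end{verbatim}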

Finally, we recognize that this study has not been as extensive as it
could have been (cross-language, multiple datasets), yet we think we ran
into the key scenarios data producers will face when they start
producing cloud-optimized HDF5 files. We think there is room for
improvement, and experimentation with various configurations based on
real-world scenarios is crucial to determine the best performance.

\section{References}\label{references}
