| 1 | +--- |
| 2 | +title: "RNTuple: Where are we now and what's next?" |
| 3 | +layout: archive |
| 4 | +author: Florine de Geus |
| 5 | +--- |
| 6 | + |
| 7 | +Hello, this is Florine from the ROOT team! Over the past year, I’ve been working |
| 8 | +as a technical student funded by ATLAS to evaluate and help further develop |
| 9 | +RNTuple. As you may already be aware, RNTuple [[1]](#rntuple-evolution) is |
| 10 | +currently being developed as the successor to TTree, and is projected to be |
| 11 | +used in Run 4. I imagine you might be wondering why there is a need for a |
| 12 | +completely new (TTree-incompatible) system, and what this looks like. That’s why |
| 13 | +in this blog post, I will try to answer these questions, as well as give you an |
| 14 | +overview of the current status of RNTuple, what we’re still working on before |
| 15 | +its first production release (and what we will work on beyond this), and |
| 16 | +finally how you can already try it out! |
| 17 | + |
| 18 | +## Why do we need RNTuple? |
| 19 | + |
| 20 | +At this point, ROOT has been around for more than a quarter of a century – and |
| 21 | +TTree for just as long. And as you might imagine, the computing landscape today |
| 22 | +looks vastly different compared to 25 years ago. Just to set the scene: when |
| 23 | +ROOT was first released, there was no C++ standard yet and parallel (let alone |
| 24 | +distributed) computing really wasn’t a thing yet. On the hardware side, modern |
| 25 | +storage technologies such as SSDs and object stores were still unheard of, and |
| 26 | +let’s not forget to mention the evolution of networking technologies! |
| 27 | +Naturally, TTree wasn’t designed and implemented with these things in mind. Now |
| 28 | +of course, over the years a lot of effort has been put into improving the |
| 29 | +performance and stability of TTree to make it compatible with modern computing |
| 30 | +practices as much as possible. However, there are limits to what is possible in |
| 31 | +this regard, especially given the fact that backwards- and |
| 32 | +forwards-compatibility are two major requirements for ROOT’s I/O system. With |
| 33 | +the High-Luminosity LHC on the horizon, during which 90% |
| 34 | +of the total amount of LHC data is expected to be produced [[2]](#atlas-roadmap), |
| 35 | +we therefore need to think about more optimized ways to store physics |
| 36 | +data. The challenge here is that this data is unique in the sense that events |
| 37 | +(or, in computer science terms, “entries” or “rows”) are statistically independent |
| 38 | +of each other. At the same time one event typically contains many (complex) data |
| 39 | +structures, of which we often only need a small subset at a time, and we found |
| 40 | +out that standard technologies are not well-tuned for this type of data |
| 41 | +storage [[3]](#hep-data-formats). That is why we decided to combine the years |
| 42 | +of experience with TTree and various industry best-practices and invest in the |
| 43 | +next generation of high-energy physics data storage. Enter RNTuple! |
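
To make the access pattern above more concrete, here is a small toy sketch in plain C++ (no ROOT required). It contrasts a row-wise layout, where all properties of an event sit together, with a column-wise layout, where each property gets its own contiguous array. Column-wise storage is the general idea behind both TTree and RNTuple, but this sketch is only an illustration: the names `Event`, `EventColumns`, `ToColumns` and `SumPt` are made up for this example and are not part of RNTuple's API or on-disk format.

```cpp
#include <cstddef>
#include <vector>

// Row-wise layout: each event keeps all of its properties together, so
// reading a single property still walks over every full event record.
struct Event {
   float pt;
   float eta;
   float phi;
   int nTracks;
};

// Column-wise layout: one contiguous array per property. An analysis that
// only needs `pt` only has to touch the `pt` array.
struct EventColumns {
   std::vector<float> pt;
   std::vector<float> eta;
   std::vector<float> phi;
   std::vector<int> nTracks;

   std::size_t size() const { return pt.size(); }
};

// Convert a row-wise collection of events into the column-wise layout.
EventColumns ToColumns(const std::vector<Event> &events)
{
   EventColumns cols;
   cols.pt.reserve(events.size());
   cols.eta.reserve(events.size());
   cols.phi.reserve(events.size());
   cols.nTracks.reserve(events.size());
   for (const auto &e : events) {
      cols.pt.push_back(e.pt);
      cols.eta.push_back(e.eta);
      cols.phi.push_back(e.phi);
      cols.nTracks.push_back(e.nTracks);
   }
   return cols;
}

// Sum a single property over all events; only one column is read.
double SumPt(const EventColumns &cols)
{
   double sum = 0.0;
   for (float pt : cols.pt)
      sum += pt;
   return sum;
}
```

In this toy version, `SumPt` never looks at `eta`, `phi` or `nTracks`. With real events that contain hundreds of properties, skipping the unused ones is where a columnar format can save most of its I/O.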
| 44 | + |
| 45 | +## Where are we now? |
| 46 | + |
| 47 | +For the past four years, a lot of effort has been put into making RNTuple the |
| 48 | +best it can be. We are working closely with the experiments to make sure that |
| 49 | +RNTuple can support their data models across all relevant stages in the |
| 50 | +production pipeline. Simultaneously, we want to make sure that it is as |
| 51 | +optimized as possible. This means making sure that the data stored in RNTuple |
| 52 | +is as compact as possible, and at the same time coming up with ways in which we |
| 53 | +can make reading and writing RNTuples to and from memory as fast as possible. |
| 54 | +To give you an idea of where we’re currently at, the plot below shows the |
| 55 | +average on-disk event size for ATLAS’s DAOD_PHYS data model [[4]](#atlas-edm), |
| 56 | +comparing TTree and RNTuple. With RNTuple, we could potentially |
| 57 | +save 20-35% of storage space, and in turn reduce the consumed network bandwidth |
| 58 | +when reading the data from a remote location. When we’re talking about exabytes |
| 59 | +of event data, this is quite significant! |
| 60 | + |
| 61 | +<center> |
| 62 | + <img |
| 63 | + src="{{'/assets/images/rntuple_blog_size_mc.png' | relative_url}}" |
| 64 | + alt="DAOD_PHYS storage efficiency" |
| 65 | + style="width: 80%" width="1600" height="749" /> |
| 66 | +</center> |
| 67 | + |
| 68 | +Besides storage efficiency, we’re also seeing very promising results when it |
| 69 | +comes to read throughput. The two plots below show the number of events |
| 70 | +processed per second for two different types of tasks, comparing ATLAS |
| 71 | +DAOD_PHYSLITE data sets stored in TTree and RNTuple (stored on an SSD). As you |
| 72 | +can see, RNTuple is remarkably faster than TTree, and similar observations are |
| 73 | +made for other data sets [[1]](#rntuple-evolution), [[5]](#rntuple-status). |
| 74 | + |
| 75 | +<center> |
| 76 | + <img |
| 77 | + src="{{'/assets/images/rntuple_blog_adl.png' | relative_url}}" |
| 78 | + alt="PHYSLITE ADL benchmarks" |
| 79 | + style="width: 80%" width="1600" height="749" /> |
| 80 | +</center> |
| 81 | + |
| 82 | +Beyond performance, we have also been working hard on RNTuple’s interface and |
| 83 | +supported features. This includes compatibility with RDataFrame, being able to |
| 84 | +read and write C++ STL types as well as user-defined types and various other |
| 85 | +features to support existing experiment frameworks. |
| 86 | + |
| 87 | +## Can I try it out? |
| 88 | + |
| 89 | +Yes! To be able to read and write RNTuples, the first thing you’ll need is a ROOT |
| 90 | +installation that has the |
| 91 | +[ROOT 7 experimental features enabled](https://root.cern/install/build_from_source/#enabling-experimental-features-aka-root7). |
| 92 | +This is the case for the default LXPLUS installation, which runs ROOT’s (at the |
| 93 | +time of writing) latest release, [6.30.02](https://root.cern/releases/release-63002/)! |
| 94 | +If you are running ROOT in a different way, you can easily check if ROOT 7 is |
| 95 | +enabled for your installation by running `root-config --has-root7` in your terminal. |
| 96 | +If this returns `yes`, you’re all set! If you get a `no`, you will need to use a different |
| 97 | +installation of ROOT that has ROOT 7 enabled. Check out the [ROOT installation page](https://root.cern/install) |
| 98 | +to get it. We strongly recommend using the most recent release in order to get |
| 99 | +the latest and greatest from RNTuple. |
| 100 | + |
| 101 | +Now, on to the fun part: using RNTuple! Of course, you could write a new RNTuple |
| 102 | +completely from scratch, using fields and data that you come up with. This is |
| 103 | +done using the [`RNTupleWriter`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleWriter.html) |
| 104 | +interface. Reading an RNTuple is then naturally done through the |
| 105 | +[`RNTupleReader`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleReader.html). |
| 106 | +To get an idea of what this looks like in practice, check out for example |
| 107 | +[this tutorial](https://root.cern/doc/master/ntpl001__staff_8C.html). |
| 108 | + |
| 109 | +Of course, it would be more interesting to try out RNTuple with real data, for |
| 110 | +example with data from an analysis ntuple that is currently stored as a TTree. |
| 111 | +Well, good news! RNTuple also comes with an [`RNTupleImporter`](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleImporter.html) |
| 112 | +class that allows you to automatically convert your TTrees to RNTuples. This |
| 113 | +can be as simple as executing the following two lines in the ROOT prompt. The |
| 114 | +input file containing the source TTree is read remotely, meaning you can |
| 115 | +directly copy-paste these lines into your ROOT prompt. Of course, it’s entirely |
| 116 | +possible to use your own existing TTrees. |
| 117 | + |
| 118 | +```cpp |
| 119 | +root [0] auto importer = ROOT::Experimental::RNTupleImporter::Create( |
| 120 | + "http://root.cern/files/HiggsTauTauReduced/GluGluToHToTauTau.root", |
| 121 | + "Events", |
| 122 | + "my_rntuple.root") |
| 123 | +root [1] importer->Import() |
| 124 | +``` |
| 125 | + |
| 126 | +This will convert your TTree (called `Events` here) into an RNTuple also called |
| 127 | +`Events` and write it to `my_rntuple.root`. Easy enough, but maybe you want more |
| 128 | +control over this newly created RNTuple. For example, you might want to change |
| 129 | +its name, or set the compression settings to something other than the default. |
| 130 | +This (and more) can all be tweaked! Check out |
| 131 | +[the reference](https://root.cern/doc/master/classROOT_1_1Experimental_1_1RNTupleImporter.html) |
| 132 | +or [this tutorial](https://root.cern/doc/master/ntpl008__import_8C.html) to see |
| 133 | +what options are possible. |
| 134 | + |
| 135 | +Now, I already mentioned that we have been working on RNTuple compatibility with |
| 136 | +RDataFrame. Currently, with just a one-line change, you can already use |
| 137 | +your existing analysis code with data stored in RNTuple: |
| 138 | + |
| 139 | +```cpp |
| 140 | +// Change this: |
| 141 | +ROOT::RDataFrame df("Events", "http://root.cern/files/HiggsTauTauReduced/GluGluToHToTauTau.root"); |
| 142 | + |
| 143 | +// To this to use the RNTuple you just imported into "my_rntuple.root": |
| 144 | +ROOT::RDataFrame df = ROOT::RDF::Experimental::FromRNTuple("Events", "my_rntuple.root"); |
| 145 | + |
| 146 | +// Use your existing analysis as-is! |
| 147 | +``` |
| 148 | +
|
| 149 | +> 💡 _The automatic detection of RNTuples in RDataFrame is currently available in |
| 150 | +ROOT’s `master` branch and will be available in ROOT 6.32.00!_ |
| 151 | +
|
| 152 | +## Next steps for RNTuple |
| 153 | +
|
| 154 | +So, what’s next? Performance is always one of our main concerns. We are |
| 155 | +currently working on parallelizing the writing of RNTuples. In addition, we are |
| 156 | +working on what we like to call “interface ergonomics”, i.e. the way developers |
| 157 | +will interact with RNTuple. Be aware that this means that the RNTuple |
| 158 | +interfaces might still change a little in the coming months! Next to all of |
| 159 | +this, we are preparing for larger-scale performance testing to see in what |
| 160 | +areas we could further improve. Another area of work for the near future will |
| 161 | +be in the direction of _data set combinatorics_ – that is, finding smart(er) |
| 162 | +ways of accessing and combining existing RNTuple data. And of course, we |
| 163 | +will continue to work with the experiments to make sure the transition to |
| 164 | +RNTuple will be as smooth as possible. |
| 165 | +
|
| 166 | +To wrap things up, things are looking good for RNTuple, and while there is still |
| 167 | +plenty of work to be done, we’re excited and eager to make RNTuple as good as it
| 168 | +can be! If you want to know more about the evolution and performance of |
| 169 | +RNTuple, be sure to check out the references below, as well as |
| 170 | +[our other publications](https://root.cern/about/publications/#io). |
| 171 | +If you are eager to dive deeper into the specifics of the RNTuple binary format, you can |
| 172 | +[read the specification here](https://github.com/root-project/root/blob/368dd4999733efe3eb9eb171bfc1b57560259443/tree/ntuple/v7/doc/specifications.md#L1). |
| 173 | +Finally, reach out to us on the [forum](https://root-forum.cern.ch/) if you have |
| 174 | +any questions or if you would like to |
| 175 | +[contribute to RNTuple or ROOT](https://root.cern/contribute/) in general! |
| 176 | +
|
| 177 | +## References |
| 178 | +
|
| 179 | +<a name="rntuple-evolution">[1]</a> J. Blomer, P. Canal, A. Naumann, and D. Piparo, “Evolution of the ROOT Tree I/O,” EPJ Web Conf., vol. 245, 2020, doi: [10.1051/epjconf/202024502030](https://doi.org/10.1051/epjconf/202024502030). |
| 180 | +
|
| 181 | +<a name="atlas-roadmap">[2]</a> ATLAS Collaboration, “ATLAS Software and Computing HL-LHC Roadmap,” CERN, Geneva, CERN-LHCC-2022-005, LHCC-G-182, 2022. Accessed: May 02, 2023. [Online]. Available: <http://cds.cern.ch/record/2802918>. |
| 182 | +
|
| 183 | +<a name="hep-data-formats">[3]</a> J. Blomer, “A quantitative review of data formats for HEP analyses,” J. Phys. Conf. Ser., vol. 1085, p. 032020, Sep. 2018, doi: [10.1088/1742-6596/1085/3/032020](https://doi.org/10.1088/1742-6596/1085/3/032020). |
| 184 | +
|
| 185 | +<a name="atlas-edm">[4]</a> J. Elmsheuser et al., “Evolution of the ATLAS analysis model for Run-3 and prospects for HL-LHC,” EPJ Web Conf., vol. 245, 2020, doi: [10.1051/epjconf/202024506014](https://doi.org/10.1051/epjconf/202024506014). |
| 186 | +
|
| 187 | +<a name="rntuple-status">[5]</a> J. Lopez-Gomez and J. Blomer, “RNTuple performance: Status and Outlook.” arXiv, Apr. 07, 2022. doi: [10.48550/arXiv.2204.09043](https://doi.org/10.48550/arXiv.2204.09043). |
| 188 | +
|