Replies: 2 comments 8 replies
-
Laziness can be a performance problem for low-latency networks, but these tests are being performed on local disk, which can respond to many requests for subsets of the data about as fast as one request for all the data, as long as those subsets are reasonably sequential. Checking to see if a cache is still valid also shouldn't be expensive (or it would undermine the value of the cache), but which cache are you talking about? The OS's virtual memory or something in the Uproot or UnROOT implementation?
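The local-disk claim is easy to sanity-check outside either library: compare one large read against many sequential subset reads of the same file. This is a generic Python sketch (the scratch file and 1 MiB chunk size are made up for illustration), not a measurement of Uproot or UnROOT themselves:

```python
import os
import tempfile
import time

# Scratch file (~16 MiB) standing in for a local data file.
path = os.path.join(tempfile.mkdtemp(), "scratch.bin")
with open(path, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))

def read_all(p):
    # One request for all the data.
    with open(p, "rb") as f:
        return len(f.read())

def read_chunks(p, chunk=1024 * 1024):
    # Many sequential requests for subsets, as a lazy reader would issue.
    total = 0
    with open(p, "rb") as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total += len(buf)
    return total

t0 = time.perf_counter()
n1 = read_all(path)
t1 = time.perf_counter()
n2 = read_chunks(path)
t2 = time.perf_counter()
print(f"one read: {t1 - t0:.4f}s, chunked: {t2 - t1:.4f}s, bytes equal: {n1 == n2}")
```

On a warm page cache the two times are typically close, which is the point: sequential subset reads from local disk are not where lazy reading loses time.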
-
The lock has a limited impact in a realistic application:

```
# without lock
> julia --project -t4 UnROOT_loop.jl
[ Info: 1st run
 10.942170 seconds (26.97 M allocations: 17.499 GiB, 25.29% gc time, 1.51% compilation time)
[ Info: 2nd run
 10.022244 seconds (26.16 M allocations: 17.458 GiB, 22.72% gc time)

# with lock
> julia --project -t4 UnROOT_loop.jl
[ Info: 1st run
 11.743019 seconds (26.97 M allocations: 17.499 GiB, 23.01% gc time, 1.40% compilation time)
[ Info: 2nd run
 11.145259 seconds (26.16 M allocations: 17.458 GiB, 21.00% gc time)
```

At most it's about 10% (I tried a few times; sometimes it's smaller).

As for the overhead of reading a new basket: possibly the cost of "reading a new basket" is large enough to dominate if you look at the entire workload:

```julia
julia> f() = LazyTree("./Run2012BC_DoubleMuParked_Muons.root", "Events");

julia> const tt = f();

julia> function g()
           tt[8000].Muon_pt # flush cache
           tt
       end

julia> function h(tt)
           tt[9200].Muon_pt # trigger basket I/O
       end

julia> @be g() h evals=1
Benchmark: 385 samples with 1 evaluation
 min    94.098 μs (82 allocs: 122.547 KiB)
 median 96.783 μs (82 allocs: 122.547 KiB)
 mean   123.332 μs (82 allocs: 122.547 KiB, 1.76% gc time)
 max    4.658 ms (82 allocs: 122.547 KiB, 96.20% gc time)
```
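For context on why a lock can cost ~10% of a whole event loop yet look much bigger in isolation: an uncontended acquire/release is cheap in absolute terms, but large relative to a trivial operation. A generic Python micro-benchmark of that effect (not UnROOT's actual lock, just an illustration of the principle):

```python
import threading
import timeit

lock = threading.Lock()
counter = [0]

def without_lock():
    counter[0] += 1

def with_lock():
    # Same trivial work, wrapped in an uncontended lock acquire/release.
    with lock:
        counter[0] += 1

n = 1_000_000
t_plain = timeit.timeit(without_lock, number=n)
t_locked = timeit.timeit(with_lock, number=n)
print(f"plain: {t_plain:.3f}s, locked: {t_locked:.3f}s")
```

Relative to a no-op increment the lock overhead is large, but once each iteration also does real work (decompression, basket I/O), it shrinks to a small fraction of the total, consistent with the ~10% seen above.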
-
I also tried running

before each timing and don't see a significant change:

I'm curious what @jpivarski thinks -- my time is roughly consistent with uproot. I'd like to think these numbers are reasonable, but then I don't understand why the lazy loop is so fast -- because I also know it has a pretty big overhead just from the "check if the cache is still valid" step. Any other benchmark ideas?
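One way to size the "check if the cache is still valid" overhead on its own is to micro-benchmark a cache-hit access against a raw access. This is a toy single-basket cache in Python (the basket size and index are made up; this is not UnROOT's implementation, just the shape of the check):

```python
import timeit

data = list(range(1_000_000))

class BasketCache:
    """Toy single-basket cache: every access first checks whether the
    requested index falls inside the currently cached range."""
    def __init__(self, data, basket_size=1000):
        self.data = data
        self.basket_size = basket_size
        self.lo = self.hi = -1   # cached range, initially invalid
        self.basket = None

    def __getitem__(self, i):
        if not (self.lo <= i < self.hi):  # the validity check
            # Cache miss: "read" the basket that contains index i.
            self.lo = (i // self.basket_size) * self.basket_size
            self.hi = self.lo + self.basket_size
            self.basket = self.data[self.lo:self.hi]
        return self.basket[i - self.lo]

cached = BasketCache(data)
raw = timeit.timeit(lambda: data[123_456], number=1_000_000)
hit = timeit.timeit(lambda: cached[123_456], number=1_000_000)
print(f"raw access: {raw:.3f}s, cache-hit access: {hit:.3f}s")
```

If the per-access gap here is tiny next to the ~95 μs basket-read time measured above, that would support the idea that basket I/O, not the validity check, dominates the lazy loop.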