Awkward Queries #307

BenGalewsky · 2023-09-13T14:15:14Z

As an analyzer, I want to specify my ServiceX queries using awkward syntax so that I can perform row-level cuts without learning a new language.

Description

We will use dask-awkward to build a task graph for the selections, together with the necessary_columns method to determine which properties to include in the results. The task graph will then be translated into Qastle and passed on to the code generators.
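To make the translation step more concrete, here is a minimal sketch. It does not walk a dask task graph as proposed above; as a simpler stand-in it parses a cut written as Python source with the standard `ast` module and renders it as a Lisp-style expression. The output format is only loosely inspired by Qastle and should not be read as the real Qastle grammar. The names `to_sexpr` and `cut_to_sexpr` are hypothetical.

```python
import ast

# Operator symbols for the small subset of expressions handled here.
_CMP = {ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<", ast.LtE: "<=",
        ast.Eq: "==", ast.NotEq: "!="}
_BIN = {ast.BitAnd: "and", ast.BitOr: "or"}


def to_sexpr(node):
    """Render a restricted Python expression as a Lisp-style string.

    Assumption: this is an illustrative format, not real Qastle.
    """
    if isinstance(node, ast.Expression):
        return to_sexpr(node.body)
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _CMP):
        op = _CMP[type(node.ops[0])]
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.comparators[0])})"
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN:
        op = _BIN[type(node.op)]
        return f"({op} {to_sexpr(node.left)} {to_sexpr(node.right)})"
    if isinstance(node, ast.Attribute):
        return f"(attr {to_sexpr(node.value)} '{node.attr}')"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise NotImplementedError(f"unsupported expression: {ast.dump(node)}")


def cut_to_sexpr(source):
    """Parse a cut given as Python source and return its S-expression."""
    return to_sexpr(ast.parse(source, mode="eval"))


if __name__ == "__main__":
    print(cut_to_sexpr("events.Muon.pt > 30"))
```

Anything outside the supported subset raises NotImplementedError, which is one natural place to decide that a selection goes beyond what can be pushed down.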

We can add annotations to the task graph to indicate where the select goes beyond what ServiceX can handle.
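As a rough illustration of how such annotations could be consumed, the sketch below models the task graph as a plain list of layer names in topological order plus a dict of per-layer annotations. This is a toy model, not the dask HighLevelGraph API, and it assumes a simple linear chain of layers. The function name `split_pushdown` and the layer names are hypothetical.

```python
def split_pushdown(layers, annotations, backend="servicex"):
    """Split topologically ordered layer names into (pushed_down, local).

    Assumption: the layers form a linear chain. Pushdown stops at the first
    layer that is not annotated for `backend`, because every later layer
    depends on data that was computed locally.
    """
    pushed, local = [], []
    for name in layers:
        ann = annotations.get(name) or {}
        if not local and ann.get("pushdown") == backend:
            pushed.append(name)
        else:
            local.append(name)
    return pushed, local


if __name__ == "__main__":
    # Hypothetical layer names for a muon pt cut followed by a histogram fill.
    layers = ["from-root", "muon-pt", "greater", "any", "getitem", "hist-fill"]
    annotations = {name: {"pushdown": "servicex"} for name in layers[:5]}
    print(split_pushdown(layers, annotations))
```

The boundary between the two lists marks the point where the selection goes beyond what the backend is asked to handle.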

Assumptions

- It will only do row-level filtering.
- For the first pass, we won't attempt to unify the selections between the ServiceX parts and the coffea parts.

BenGalewsky · 2023-09-13T14:15:58Z

This code snippet was submitted by Lindsey Gray:

from coffea.nanoevents import NanoEventsFactory, NanoAODSchema
from distributed import Client
import dask
import dask_awkward
import awkward as ak
import hist.dask as hda

def extract_pushdown(coll):
    # Layer names of the collection's high-level graph, in topological order.
    hlg_sorted = coll.dask._toposort_layers()
    # Collect the layers annotated for pushdown. Prepending reverses the
    # order, so the most downstream layer comes first.
    pushdown_deps = []
    for key in hlg_sorted:
        annotations = coll.dask.layers[key].annotations
        if annotations is not None and "pushdown" in annotations:
            #print(key, coll.dask.layers[key].annotations)
            pushdown_deps = [key] + pushdown_deps
    for dep in pushdown_deps:
        layer = coll.dask.layers[dep]
        # First element of the layer's first task, typically the callable it runs.
        fcn = list(layer.dsk.values())[0][0]
        if isinstance(layer, dask_awkward.layers.AwkwardBlockwiseLayer) and not isinstance(layer, dask_awkward.layers.AwkwardInputLayer):
            # Debug output: inspect the layer and the function it wraps.
            print(dir(layer))
            print(layer.dsk)
            print(list(layer.keys()))
            print(dep, fcn.fn)
            print(dir(fcn))
            print(fcn.arg_repackers[0])
        else:
            print(dep, fcn)

if __name__ == "__main__":
    #client = Client()

    dask.config.set({
        "awkward.optimization.enabled": True,
        "awkward.raise-failed-meta": True,
        "awkward.optimization.on-fail": "raise",
    })

    with dask.annotate(pushdown="servicex"):
        events = NanoEventsFactory.from_root(
            {"tests/samples/nano_dy.root": "Events"},
            metadata={"dataset": "nano_dy"},
            schemaclass=NanoAODSchema,
            permit_dask=True,
        ).events()

        mask = events.Muon.pt > 30
        events = events[ak.any(mask, axis=1)]
        
    myhist = hda.Hist.new.Regular(50, -2.5, 2.5, name="abseta").Double()

    myhist.fill(abseta=abs(events.Muon.eta))

    extract_pushdown(myhist)

ponyisi · 2024-06-05T17:02:21Z

We now have significant support for expressions and filtering with awkward syntax, using the uproot-raw codegen.

ponyisi · 2024-09-05T00:30:08Z

Following some discussion with Jim Pivarski, a thought about a first way of tying ServiceX and dask-awkward together:

- We need to provide dask-awkward with the schema of at least one input file. I would imagine a separate microservice that used the DID finder to look up the dataset files and extract metadata from one of them, then returning the schema to the user. (It could also determine the number of files, I guess.)
- dask-awkward can then compute the columns that are necessary for its operations.
- At some point dask-awkward might be smart enough to come up with a cut expression that can be interpreted with uproot.open, but as a zeroth-order approach we might just ask users to specify the cut as an argument to their servicex.dask_awkward() call.
- We then submit the ServiceX transformation and the dask tasks. Some magic is needed to allow the dask inputs to be created once the corresponding per-file transformations are done.
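The last step, starting each downstream task as soon as its own input file is ready, can be sketched with the standard library. This example uses threads from `concurrent.futures` rather than dask or ServiceX, purely to show the scheduling pattern. `transform_file`, `analyze`, and `run_pipeline` are hypothetical stand-ins, not functions of either project.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def transform_file(path):
    """Hypothetical stand-in for one per-file ServiceX transformation."""
    return f"{path}.transformed"


def analyze(transformed_path):
    """Hypothetical stand-in for the dask task that reads one transformed file."""
    return len(transformed_path)


def run_pipeline(paths, max_workers=4):
    """Run transform then analyze per file, without a global barrier."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        transforms = {pool.submit(transform_file, p): p for p in paths}
        analyses = {}
        # Schedule each downstream task as soon as its own input is ready,
        # instead of waiting for the whole dataset to finish transforming.
        for fut in as_completed(transforms):
            analyses[pool.submit(analyze, fut.result())] = transforms[fut]
        for fut in as_completed(analyses):
            results[analyses[fut]] = fut.result()
    return results


if __name__ == "__main__":
    print(run_pipeline(["a.root", "b.root"]))
```

In a real integration the second submission would presumably create a dask input partition rather than call a local function, but the dependency structure would be the same.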

BenGalewsky · 2024-09-05T13:21:52Z

> I would imagine a separate microservice that used the DID finder to look up the dataset files and extract metadata from one of them, then returning the schema to the user.

The return of the preflight check! We used to have a service that would review a sample file to decide if the transform would work before committing the rest of the workers. We decided it wasn't worth the effort and removed that functionality.

BenGalewsky assigned gordonwatts Sep 13, 2023

ponyisi added this to the 3.2 New milestone Oct 2, 2024
