class: middle, center, title-slide
count: false

pyhf: pure-Python implementation of HistFactory with tensors and automatic differentiation

.huge[Lukas Heinrich], .huge[Matthew Feickert],[Giordon Stark]

.huge[(SCIPP, UC Santa Cruz)]

[email protected]

ATLAS Statistics Forum

December 10th, 2020

pyhf team

.grid[[ .circle.width-80[Lukas]

Lukas Heinrich

CERN ][ .circle.width-80[Matthew]

Matthew Feickert

Illinois ][ .circle.width-75[Giordon]

Giordon Stark


pyhf: HistFactory in pure Python


  • First non-ROOT implementation of the HistFactory p.d.f. template
    • .width-50[DOI]
  • pure-Python library with Python and CLI API
  • Open source tool for all of HEP



Required dependencies

Core libraries (though all lightweight installs):

  • SciPy - Scientific Python (optimization routines)
  • click - Command line interface
  • tqdm - Progress bars
  • jsonschema - HistFactory JSON specification
  • jsonpatch - Signal reinterpretation
  • PyYAML - Command line niceties ] .kol-1-2[

Optional dependencies

Depending on what users want to do:

.kol-1-1[ Getting "extras" is easy:

$ python -m pip install --upgrade pyhf[xmlio] # Gets uproot
$ python -m pip install --upgrade pyhf[backends] # Gets all backends
$ python -m pip install --upgrade pyhf[jax,xmlio,minuit] # Gets JAX, uproot, and iminuit


Open Source Industry Tools for Computation

.grid[ .kol-2-3[

  • All numerical operations implemented in .bold[tensor backends] through an API of $n$-dimensional array operations
  • Using deep learning frameworks as computational backends allows for .bold[exploitation of autodiff and GPU acceleration]
  • With huge buy-in from industry we benefit for free, as these frameworks are .bold[continually improved] by professional software engineers (physicists are not)[ .width-90[scaling_hardware] ] .kol-1-2[

  • Hardware acceleration gives an .bold[order of magnitude speedup] for some models!
  • Improvements over traditional
    • 10 hrs to 30 min; 20 min to 10 sec ] ][ .width-85[NumPy] .width-85[PyTorch] .width-85[Tensorflow]

.width-50[![JAX](figures/logos/JAX_logo.png)] ] ]

Current Features

.smaller[Note: the pyhf API is meant to allow for higher-level frameworks to build on top, such as cabinetry.

  • Missing a meta-language (DSL, metadata) that describes the data that can be passed to plotting utilities
  • cabinetry is meant to help with plotting things "correctly"
  • All of this work is openly developed with extensive feedback ]

See our roadmap to get an idea of where we're going!

Automatic Differentiation of pyhf Models

With tensor library backends we gain access to exact (higher order) derivatives: accuracy is limited only by floating point precision

$$ \frac{\partial L}{\partial \mu}, \frac{\partial L}{\partial \theta_{i}} $$

.grid[ .kol-1-2[ .large[Exploit .bold[full gradient of the likelihood] with .bold[modern optimizers] to help speed up the fit!]

.large[Gain this through the frameworks creating computational directed acyclic graphs and then applying the chain rule (to the operations)] ] .kol-1-2[ .center.width-80[DAG] ] ]
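To make the chain-rule idea concrete, here is a minimal sketch of automatic differentiation in pure Python. It uses forward-mode dual numbers, which is a simpler stand-in for the reverse-mode graphs that the deep learning backends build, and it is not how pyhf itself is implemented. The `Dual` class, the `nll` function, and the numbers `n`, `s`, `b` are all hypothetical, chosen only to differentiate a one-bin Poisson negative log-likelihood with respect to the signal strength $\mu$.

```python
import math


class Dual:
    """A value paired with its derivative with respect to one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        # Plain numbers are constants: their derivative is zero
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        # Product rule
        other = Dual.lift(other)
        return Dual(
            self.value * other.value,
            self.deriv * other.value + self.value * other.deriv,
        )

    __rmul__ = __mul__


def log(x):
    # Chain rule for the logarithm: d/dx log(x) = 1 / x
    return Dual(math.log(x.value), x.deriv / x.value)


def nll(mu, n=12, s=5.0, b=10.0):
    """Negative log-likelihood of one Poisson bin, dropping the constant log(n!)."""
    nu = mu * s + b  # expected rate
    return nu - n * log(nu)


def dnll_dmu(mu):
    # Seed the input with derivative 1 and read the derivative off the output
    return nll(Dual(mu, 1.0)).deriv


# Compare with the hand-derived result s * (1 - n / nu) at mu = 1
print(dnll_dmu(1.0), 5.0 * (1 - 12 / 15.0))
```

The derivative is exact up to floating point rounding, with no step size to tune, which is the property the slide refers to.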

HEP Example: Likelihood Gradients

.footnote[Example adapted from Lukas Heinrich's PyHEP 2020 tutorial][ .width-90[carbon_plot_MLE_grads] ][ .width-90[MLE_grad_map_full] ][Having access to the gradients makes the fit orders of magnitude faster than with finite differences]

Documentation and Development

.grid[[All documentation can be found at] .kol-1-2[ In this documentation you can find a list of:

All of our documentation is tested nightly, against our software, as well as updates to software and tools we depend on. In addition to this, we've made full use of:

  • Sphinx - main documentation
  • Jupyter - fundamentals and tutorials

] .kol-1-2[ We most recently gave a successful, in-depth tutorial at the ATLAS SUSY+Exotics workshop.

.width-100[atlas susy exotics workshop] ] ]

Why do users choose us?

.grid[[ Out of all the toolkits, why do you think your users choose to use yours? ] .kol-1-1[

  • Easy to use and install: PyPI, TestPyPI (bleeding edge), conda-forge, and Docker
  • Fast code, fast development cycle, fast feedback
  • Well-documented Python implementations, clear communication channels to devs and community
  • Command line complements the Pythonic API
    • We really love our CLI, it plays nicely with shell "behavior" such as piping
      $ pyhf prune --sample ttbar BkgOnly.json | pyhf inspect
  • Significant test-driven development (underlies all of our work) with 1000+ tests!
    $ pytest --collect-only | grep "<Function\|<Class" -c
  • Every commit tested in CI across Python 3.6, 3.7, 3.8 on Linux and macOS systems, with nightly CI/CD

But we believe the biggest reason users choose pyhf is that .center.huge[pyhf is developed openly and freely] ] ]

Common scripts/macros/functions?

.grid[[ Is your toolkit using some external packages / common scripts / macros / functions to perform some of the operations like fit, limit setting, significance computation, Asimov-creation, ranking plot? ] .kol-1-1[

  • Fits, limit setting: SciPy and minuit
  • Test statistics are implemented in pyhf
  • Asimov creation: just a fit in pyhf to generate the Asimov dataset ] ]
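As a rough sketch of what these operations involve, the following pure-Python toy computes a one-sided profile likelihood ratio test statistic and an Asimov dataset for a single-bin counting model. It assumes no nuisance parameters, so the Asimov dataset is simply the expected rate and no fit is needed; with nuisance parameters they would first be fitted. The function names and numbers are hypothetical and this is not pyhf's API.

```python
import math


def log_poisson(n, nu):
    # lgamma lets n be non-integer, as Asimov data generally is
    return n * math.log(nu) - nu - math.lgamma(n + 1)


def asimov_data(mu, s, b):
    """Expected event count for signal strength mu (no nuisance parameters)."""
    return mu * s + b


def q_mu(n, mu, s, b):
    """One-sided test statistic for upper limits on mu; requires n > 0."""
    mu_hat = (n - b) / s  # maximum likelihood estimate for a single bin
    if mu_hat > mu:
        return 0.0
    return -2.0 * (log_poisson(n, mu * s + b) - log_poisson(n, mu_hat * s + b))


s, b = 5.0, 50.0
# Test mu = 1 against background-only Asimov data
print(q_mu(asimov_data(0.0, s, b), 1.0, s, b))
```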

Common software for ATLAS?

.grid[[ Which pieces of your toolkit could be factorized out into a package that would be developed/supported/distributed by ATLAS? ] .kol-1-1[
We don't necessarily believe any particular piece needs to be factorized out into a package maintained by ATLAS.

  • pure-Python implementation of HistFactory (a mathematical model)
  • pyhf is a low(er)-level library to interact with the HistFactory JSON workspaces
  • Higher-level tools are encouraged to build on top of pyhf to extend the functionality into plots, limit setting, and other debugging utilities

Additional common software?

.grid[[ What additional common software could your toolkit take advantage of? ] .kol-1-1[

  • Not sure
  • We are willing to try out new ideas all the time
  • If you have ideas, get in touch with us! ] ]

Contributing to central toolkit?

.grid[[ Would you be willing to contribute to the development of a centrally distributed toolkit that provides functionality for providing common statistical operations (e.g. calculating a $p$-value)? ] .kol-1-1[

  • Cannot make any promises at this time
  • All core developers are very busy with convener roles and contact roles in ATLAS and IRIS-HEP ] ]

The Bigger Picture

.kol-2-3[ pyhf fits into the "open science" ecosystem:

.center.width-100.tiny[ [![cranmer talk](figures/two_tastes.png)](
(stolen from Kyle Cranmer) ] ]


.kol-2-3[ .large[pyhf provides:]

  • .large[.bold[Accelerated] fitting library]
    • reducing time to insight/inference!
    • Hardware acceleration on GPUs and vectorized operations
    • Backend agnostic Python API and CLI
  • .large[Flexible .bold[declarative] schema]
    • JSON: ubiquitous, universal support, versionable
  • .large[Enabling technology for .bold[reinterpretation]]
    • JSON Patch files for efficient computation of new signal models
    • Unifying tool for theoretical and experimental physicists
  • .large[Project in growing .bold[Pythonic HEP ecosystem]]

.center.width-100[[![pyhf_logo](](] ]
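A minimal sketch of the reinterpretation idea: a background-only workspace is kept fixed and a small JSON Patch adds a new signal sample to it. The `apply_patch` helper below is a hand-rolled, simplified stand-in for the `jsonpatch` library (it handles only `add`, `replace`, and `remove`, with no pointer escapes), and the workspace fragment and sample values are illustrative assumptions rather than a complete, schema-valid workspace.

```python
import copy


def apply_patch(document, patch):
    """Apply simplified JSON Patch operations to a deep copy of document."""
    result = copy.deepcopy(document)
    for operation in patch:
        op, value = operation["op"], operation.get("value")
        # Split the JSON Pointer into the parent path and the final key
        *parents, last = operation["path"].lstrip("/").split("/")
        target = result
        for key in parents:
            target = target[int(key)] if isinstance(target, list) else target[key]
        if isinstance(target, list):
            last = int(last)
        if op == "add" and isinstance(target, list):
            target.insert(last, value)  # "add" on a list inserts at the index
        elif op in ("add", "replace"):
            target[last] = value
        elif op == "remove":
            del target[last]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return result


# Hypothetical background-only workspace fragment
bkg_only = {
    "channels": [
        {
            "name": "SR",
            "samples": [
                {"name": "background", "data": [50.0, 60.0], "modifiers": []}
            ],
        }
    ]
}

# Hypothetical patch describing one signal model
signal_patch = [
    {
        "op": "add",
        "path": "/channels/0/samples/0",
        "value": {
            "name": "signal",
            "data": [5.0, 10.0],
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}],
        },
    }
]

patched = apply_patch(bkg_only, signal_patch)
print([sample["name"] for sample in patched["channels"][0]["samples"]])
# ['signal', 'background']
```

Each new signal model needs only its own small patch, while the background-only workspace is reused unchanged.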

class: middle


Thanks for listening!

Come talk with us!

.large[] ] .grid[[ .width-90[scikit-hep_logo] ][
.width-90[pyhf_logo] ][

.width-100[iris-hep_logo] ] ]

class: end-slide, center


External dependencies

Required dependencies from our setup.cfg:

.grid[ .kol-2-3[

install_requires =
  • SciPy - Scientific Python (optimization routines)
  • click - Command line interface
  • tqdm - Progress bars
  • jsonschema - HistFactory JSON specification
  • jsonpatch - Signal reinterpretation
  • pyyaml - Command line niceties ][ .width-50[scipy logo]

    .width-50[click logo]

    .width-25[tqdm logo] ] ]

Optional dependencies

We have lots of optional dependencies depending on what users want to do:

HistFactory Model

  • A flexible probability density function (p.d.f.) template to build statistical models in high energy physics
  • Developed in 2011 during work that led to the Higgs discovery [CERN-OPEN-2012-016]
  • Widely used by the HEP community for .bold[measurements of known physics] (Standard Model) and
    .bold[searches for new physics] (beyond the Standard Model)[ .width-90[HIGG-2016-25] .bold[Standard Model] ][ .width-100[SUSY-2016-16] .bold[Beyond the Standard Model] ]

HistFactory Template

$$ f\left(\mathrm{data}\middle|\mathrm{parameters}\right) = f\left(\vec{n}, \vec{a}\middle|\vec{\eta}, \vec{\chi}\right) = \color{blue}{\prod_{c \,\in\, \textrm{channels}} \prod_{b \,\in\, \textrm{bins}_c} \textrm{Pois} \left(n_{cb} \middle| \nu_{cb}\left(\vec{\eta}, \vec{\chi}\right)\right)} \,\color{red}{\prod_{\chi \,\in\, \vec{\chi}} c_{\chi} \left(a_{\chi}\middle|\chi\right)} $$

.bold[Use:] Multiple disjoint channels (or regions) of binned distributions with multiple samples contributing to each with additional (possibly shared) systematics between sample estimates

.kol-1-2[ .bold[Main pieces:]

  • .blue[Main Poisson p.d.f. for simultaneous measurement of multiple channels]
  • .katex[Event rates] $\nu_{cb}$ (nominal rate $\nu_{scb}^{0}$ with rate modifiers)
  • .red[Constraint p.d.f. (+ data) for "auxiliary measurements"]
    • encode systematic uncertainties (e.g. normalization, shape)
  • $\vec{n}$: events, $\vec{a}$: auxiliary data, $\vec{\eta}$: unconstrained pars, $\vec{\chi}$: constrained pars ] .kol-1-2[ .center.width-100[SUSY-2016-16_annotated] .center[Example: .bold[Each bin] is a separate (1-bin) channel,
    each .bold[histogram] (color) is a sample, and the samples share
    a .bold[normalization systematic] uncertainty] ]

HistFactory Template

$$ f\left(\vec{n}, \vec{a}\middle|\vec{\eta}, \vec{\chi}\right) = \color{blue}{\prod_{c \,\in\, \textrm{channels}} \prod_{b \,\in\, \textrm{bins}_c} \textrm{Pois} \left(n_{cb} \middle| \nu_{cb}\left(\vec{\eta}, \vec{\chi}\right)\right)} \,\color{red}{\prod_{\chi \,\in\, \vec{\chi}} c_{\chi} \left(a_{\chi}\middle|\chi\right)} $$

Mathematical grammar for a simultaneous fit with
  • .blue[multiple "channels"] (analysis regions, (stacks of) histograms)
  • each region can have .blue[multiple bins]
  • coupled to a set of .red[constraint terms]

.center[.bold[This is a _mathematical_ representation!] Nowhere is any software spec defined] .center[.bold[Until recently] (2018), the only implementation of HistFactory has been in [`ROOT`](]

HistFactory Template (in more detail)

$$ f\left(\vec{n}, \vec{a}\middle|\vec{\eta}, \vec{\chi}\right) = \color{blue}{\prod_{c \,\in\, \textrm{channels}} \prod_{b \,\in\, \textrm{bins}_c} \textrm{Pois} \left(n_{cb} \middle| \nu_{cb}\left(\vec{\eta}, \vec{\chi}\right)\right)} \,\color{red}{\prod_{\chi \,\in\, \vec{\chi}} c_{\chi} \left(a_{\chi}\middle|\chi\right)} $$

$$ \nu_{cb}(\vec{\eta}, \vec{\chi}) = \sum_{s \,\in\, \textrm{samples}} \underbrace{\left(\prod_{\kappa \,\in\, \vec{\kappa}} \kappa_{scb}(\vec{\eta}, \vec{\chi})\right)}_{\textrm{multiplicative}} \Bigg(\nu_{scb}^{0}(\vec{\eta}, \vec{\chi}) + \underbrace{\sum_{\Delta \,\in\, \vec{\Delta}} \Delta_{scb}(\vec{\eta}, \vec{\chi})}_{\textrm{additive}}\Bigg) $$

.bold[Use:] Multiple disjoint channels (or regions) of binned distributions with multiple samples contributing to each with additional (possibly shared) systematics between sample estimates

.bold[Main pieces:]

  • .blue[Main Poisson p.d.f. for simultaneous measurement of multiple channels]
  • .katex[Event rates] $\nu_{cb}$ from nominal rate $\nu_{scb}^{0}$ and rate modifiers $\kappa$ and $\Delta$
  • .red[Constraint p.d.f. (+ data) for "auxiliary measurements"]
    • encoding systematic uncertainties (normalization, shape, etc)
  • $\vec{n}$: events, $\vec{a}$: auxiliary data, $\vec{\eta}$: unconstrained pars, $\vec{\chi}$: constrained pars
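As a rough illustration of the template above, this pure-Python sketch evaluates the log-likelihood of a hypothetical one-channel, two-bin model with one signal and one background sample. The signal strength `mu` plays the role of an unconstrained parameter, and a background normalization `gamma` with a Gaussian constraint plays the role of a constrained parameter. All names and numbers are illustrative assumptions, not pyhf's API.

```python
import math

# Hypothetical nominal rates per bin for each sample
SIGNAL = [5.0, 10.0]
BACKGROUND = [50.0, 60.0]
BKG_NORM_UNCERTAINTY = 0.1  # assumed width of the constraint on gamma
AUX_DATA = 1.0  # auxiliary measurement that constrains gamma


def poisson_logpmf(n, nu):
    return n * math.log(nu) - nu - math.lgamma(n + 1)


def gaussian_logpdf(x, mean, sigma):
    return -0.5 * ((x - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def expected_rates(mu, gamma):
    # Each sample's nominal rate is scaled by its multiplicative modifier,
    # then the samples are summed per bin
    return [mu * s + gamma * b for s, b in zip(SIGNAL, BACKGROUND)]


def log_likelihood(observed, mu, gamma):
    # Main term: product of Poissons over bins (a sum in log space)
    main = sum(
        poisson_logpmf(n, nu)
        for n, nu in zip(observed, expected_rates(mu, gamma))
    )
    # Constraint term: auxiliary measurement of the constrained parameter
    constraint = gaussian_logpdf(AUX_DATA, gamma, BKG_NORM_UNCERTAINTY)
    return main + constraint


observed = [55, 70]
print(expected_rates(1.0, 1.0))  # [55.0, 70.0]
print(log_likelihood(observed, 1.0, 1.0) > log_likelihood(observed, 0.0, 1.0))  # True
```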

Why is the likelihood important?


  • High information-density summary of analysis
  • Almost everything we do in the analysis ultimately affects the likelihood and is encapsulated in it
    • Trigger
    • Detector
    • Combined Performance / Physics Object Groups
    • Systematic Uncertainties
    • Event Selection
  • Unique representation of the analysis to reuse and preserve ] .kol-1-2.width-100[

    likelihood_connections ]


