draft: Revise HelloWorld notebook for PyHEP Module of the Month (#2)
* Update pre-commit hooks to add isort and apply isort to notebooks
* Revise HelloWorld notebook
* Update .gitignore
* Update jupyter-book version
matthewfeickert authored Apr 7, 2021
1 parent 7c57c66 commit 8857f4d
Showing 19 changed files with 182 additions and 85 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -127,3 +127,9 @@ dmypy.json

# Pyre type checker
.pyre/

# build artifacts
_build
*.tar.gz
book/data/
book/1Lbb-likelihoods/
21 changes: 20 additions & 1 deletion .pre-commit-config.yaml
@@ -1,4 +1,21 @@
repos:

- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v3.4.0
hooks:
- id: check-added-large-files
- id: check-case-conflict
- id: check-merge-conflict
- id: check-symlinks
- id: check-json
- id: check-yaml
- id: check-toml
- id: check-xml
- id: debug-statements
- id: end-of-file-fixer
- id: mixed-line-ending
- id: trailing-whitespace

- repo: https://github.com/psf/black
rev: 20.8b1
hooks:
@@ -10,9 +27,11 @@ repos:
- id: pyupgrade

- repo: https://github.com/nbQA-dev/nbQA
rev: 0.5.9
rev: 0.6.0
hooks:
- id: nbqa-black
additional_dependencies: [black==20.8b1]
- id: nbqa-pyupgrade
additional_dependencies: [pyupgrade==2.11.0]
- id: nbqa-isort
additional_dependencies: [isort==5.8.0]
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
# `pyhf` Tutorial


Tutorial last given at the [March 2021 PyHEP topical meeting](https://indico.cern.ch/event/985425/).
Tutorial last given at the [April 2021 PyHEP topical meeting](https://indico.cern.ch/event/985425/).

**The tutorial is based off of [`pyhf` `v0.6.1`](https://pypi.org/project/pyhf/0.6.1/)**

3 changes: 3 additions & 0 deletions binder/requirements.txt
@@ -3,3 +3,6 @@ pyhf[xmlio,minuit,contrib]==0.6.1
ipywidgets~=7.5
pandas~=1.0
altair~=4.1
# jupyter notebooks
jupyter
jupyterlab
1 change: 1 addition & 0 deletions book/Combinations.ipynb
@@ -20,6 +20,7 @@
"outputs": [],
"source": [
"import json\n",
"\n",
"import pyhf"
]
},
165 changes: 97 additions & 68 deletions book/HelloWorld.ipynb
@@ -26,9 +26,9 @@
"outputs": [],
"source": [
"import json\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pyhf\n",
"from pyhf.contrib.viz import brazil # not imported by default!"
]
@@ -100,16 +100,37 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Whoa, hold up! What's with the auxiliary data? Recall the HistFactory definition\n",
"\n",
"$$\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}} \\qquad \\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"Whoa, hold up! What's with the auxiliary data? Recall the HistFactory definition"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"p(n,a|\\theta) = \\underbrace{\\prod_{\\mathrm{channel}\\ c}\\prod_{\\mathrm{bin}\\ b} \\mathrm{Pois}(n_{cb} | \\lambda_{cb}(\\theta))}_{\\text{main}} \\underbrace{\\prod_{\\mathrm{constraint}\\ \\chi} p_\\chi(a_\\chi | \\chi)}_{\\text{auxiliary}}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"for\n",
"\n",
"and the auxiliary data here is passed into a constraint function? There's two things going on here:\n",
"$$\n",
"\\lambda_{cb}(\\theta) = \\sum_{\\mathrm{sample}\\ s} \\lambda_{cbs}(\\theta)\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and the auxiliary data here is passed into a constraint function. There's two things going on here:\n",
"\n",
"1. the uncorrelated shapesys modifier is constrained by a Poisson\n",
"2. the auxiliary data is fully determined by the shapesys 'data' and the 'background' data\n",
"1. The uncorrelated shapesys modifier is constrained by a Poisson.\n",
"2. The auxiliary data is fully determined by the shapesys 'data' and the 'background' data.\n",
"\n",
"So if we were to explicitly write out the likelihood we were seeing symbolically, it would be:\n",
"\n",
@@ -157,8 +178,13 @@
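(A minimal sketch of where that auxiliary data comes from for a model like this; the numbers are illustrative rather than the ones used in the notebook, and `hepdata_like` is the `pyhf` `v0.6.1` helper for a single-channel model with an uncorrelated per-bin background uncertainty.)

import pyhf

# illustrative two-bin model: signal plus background with an uncorrelated
# (per-bin) background uncertainty, i.e. a shapesys modifier
model = pyhf.simplemodels.hepdata_like(
    signal_data=[5.0, 10.0], bkg_data=[50.0, 60.0], bkg_uncerts=[5.0, 12.0]
)

# the auxiliary data are the observed counts of the Poisson constraint terms and are
# fully determined by the background estimate and its uncertainty: (b / delta_b) ** 2
print(model.config.auxdata)  # [100.0, 25.0] for the numbers above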
"cell_type": "markdown",
"metadata": {},
"source": [
"Whew. Ok, I guess that makes sense.\n",
"\n",
"Everything looks good!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data and Parameters\n",
"\n",
"What about the actual data for the model then?"
@@ -197,7 +223,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This returns the data for the entire likelihood, the main model as well as the constraint (or auxiliary) model. We can also drop the auxdata to get the actual data."
"This returns the data for the entire likelihood for the 2 bin model, the main model as well as the constraint (or auxiliary) model. We can also drop the auxdata to get the actual data."
]
},
{
@@ -245,7 +271,7 @@
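(A quick sketch of the distinction being described, continuing with the illustrative `model` from the sketch above.)

init_pars = model.config.suggested_init()

# expected data for the full likelihood: the main-model bin counts followed by the auxiliary data
print(model.expected_data(init_pars))

# the main-model part only, with the auxiliary data dropped
print(model.expected_actualdata(init_pars))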
"cell_type": "markdown",
"metadata": {},
"source": [
"So how do we know what these parameters correspond to? The way `pyhf` works under the hood, you can imagine an N-dimensional tensor that is purely bookkeeping all the different expected rates and modifications (read: systematics) involved that we can apply tensor algebra calculations to. This means that the ordering of the entries in each dimension is important, so we need to keep track of the order of the channels, the samples, the parameters. In `pyhf`, you can ask for the parameters or modifiers in a model:"
"So how do we know what these parameters correspond to? The way `pyhf` works under the hood, you can imagine an `n`-dimensional tensor that is purely bookkeeping all the different expected rates and modifications (read: systematics) involved that we can apply tensor algebra calculations to. This means that the ordering of the entries in each dimension is important, so we need to keep track of the order of the channels, the samples, the parameters. In `pyhf`, you can ask for the parameters or modifiers in a model:"
]
},
{
@@ -270,7 +296,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"However, these might not necessarily correspond with the order of the parameters when we build this tensor internally, and we keep this order under a different variable"
"However, these might **not necessarily** correspond with the order of the parameters when we build this tensor internally, and we keep this order under a different variable"
]
},
{
@@ -309,7 +335,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"and one of the parameters (the uncorrelated shape systematic) will be configured by $\\gamma_b$ - that is, one parameter for each bin of the sample in that channel. How does `pyhf` know that the parameter set is associated with two parameters? We can ask for information about the `paramset`:"
"and one of the parameters (the uncorrelated shape systematic) will be configured by $\\gamma_b$ — that is, one parameter for each bin of the sample in that channel. How does `pyhf` know that the parameter set is associated with two parameters? We can ask for information about the `paramset`:"
]
},
{
@@ -364,7 +390,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"... and their bounds..."
"...and their bounds"
]
},
{
@@ -380,7 +406,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"... and if the parameter should be held constant in a fit or not ..."
"... and if the parameter should be held constant in a fit or not"
]
},
{
@@ -413,7 +439,7 @@
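(A sketch of this bookkeeping introspection, again using the illustrative `model` from above; `mu` and `uncorr_bkguncrt` are the parameter names that `hepdata_like` happens to use.)

# human-readable list of parameter names ...
print(model.config.parameters)

# ... versus the order in which they are laid out in the internal tensor
print(model.config.par_order)

# the uncorrelated shape systematic carries one gamma parameter per bin
print(model.config.param_set("uncorr_bkguncrt").n_parameters)  # 2 for a two-bin channel

# suggested starting values, bounds, and whether each parameter is held constant in a fit
print(model.config.suggested_init())
print(model.config.suggested_bounds())
print(model.config.suggested_fixed())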
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the expected actual data from the model (signal + background). We could turn off the signal, corresponding with the normalization factor. We just need to know the index of the parameter:"
"To get the expected actual data from the model (signal + background), we could turn off the signal, corresponding with the normalization factor. We just need to know the index of the parameter:"
]
},
{
@@ -438,7 +464,7 @@
"metadata": {},
"outputs": [],
"source": [
"bkg_pars = [*init_pars] # my clever way to \"copy\" the list by value\n",
"bkg_pars = init_pars.copy()\n",
"bkg_pars[model.config.poi_index] = 0\n",
"model.expected_actualdata(bkg_pars)"
]
@@ -488,9 +514,14 @@
"\n",
"$$\n",
"\\hat{\\theta}_\\text{MLE} = \\text{argmax}_\\theta L(\\theta | x)\n",
"$$\n",
"\n",
"Let's perform a fit on this model to the provided observations we just made up."
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's perform a unconstrained maximum likelihood fit on this model to the provided observations we just made up."
]
},
{
@@ -519,15 +550,25 @@
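(A sketch of such an unconstrained maximum likelihood fit, using the illustrative `model` from above; the observed counts here are made up, and the data handed to the fit are the main-model counts with the auxiliary data appended.)

# made-up observed counts for the two bins, plus the constraint ("auxiliary") data
observations = [52.0, 48.0] + model.config.auxdata

best_fit_pars = pyhf.infer.mle.fit(observations, model)
print(best_fit_pars)  # MLE parameter values, ordered as model.config.par_order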
"\n",
"Great, so we can fit. But that's not usually the interesting part, since we cannot make any statements about this. What we prefer to do is perform a hypothesis test to either:\n",
"\n",
"* reject Standard Model hypothesis - a discovery fit\n",
"* reject Beyond the Standard Model hypothesis - an exclusion fit\n",
"\n",
"So we need to compute test statistics in order to evaluate the hypothesis. If we know the distribution of the test statistic under two different hypotheses, we can compute a [CLs](https://en.wikipedia.org/wiki/CLs_method_(particle_physics)) value. This is a modified pseudo-frequentist *p*-value.\n",
"* Reject Standard Model hypothesis — a discovery fit.\n",
"* Reject Beyond the Standard Model hypothesis — an exclusion fit.\n",
"\n",
"So we need to compute test statistics in order to evaluate the hypothesis. If we know the distribution of the test statistic under two different hypotheses, we can compute a [CLs](https://en.wikipedia.org/wiki/CLs_method_(particle_physics)) value. This is a modified pseudo-frequentist $p\\,$-value."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\text{CL}_\\text{s} = \\frac{\\text{CL}_\\text{s+b}}{\\text{CL}_\\text{b}}\n",
"$$\n",
"\n",
"\\text{CL}_\\text{s} = \\frac{\\text{CL}_\\text{s+b}}{\\text{CL}_\\text{b}} = \\frac{p_{\\text{s+b}}}{1-p_{\\text{b}}}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"with\n",
"\n",
"$$\n",
@@ -536,7 +577,7 @@
"\n",
"This is done by splitting the model parameters up into two groups:\n",
"\n",
"* parameters of interest (POI - can have multiple!)\n",
"* parameters of interest (POI — can have multiple!)\n",
"* nuisance parameters (NP)\n",
"\n",
"Recall that additionally, all parameters are either constrained or unconstrained. Parameters of interest happen to always be unconstrained parameters.\n",
@@ -558,8 +599,8 @@
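(A sketch of how this split shows up for the illustrative `model` from above: the POI is the unconstrained normalization factor, while the uncorrelated shape systematic is a constrained nuisance parameter set.)

# the parameter of interest and where it sits in the parameter tensor
print(model.config.poi_name, model.config.poi_index)

# the shapesys nuisance parameters carry a Poisson constraint term
print(model.config.param_set("uncorr_bkguncrt").pdf_type)  # 'poisson'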
"\n",
"So let's run a hypothesis test for\n",
"\n",
"* null hypothesis ($\\mu = 1$) - \"SUSY is real\"\n",
"* alternate hypothesis ($\\mu = 0$) - \"Standard Model explains it all\""
"* null hypothesis ($\\mu = 1$) — \"SUSY is real\"\n",
"* alternate hypothesis ($\\mu = 0$) — \"Standard Model explains it all\""
]
},
{
@@ -627,33 +668,30 @@
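(A sketch of such a hypothesis test, reusing the illustrative `model` and `observations` from the sketches above; asking for the tail probabilities also returns the CLs+b and CLb pieces that the CLs ratio is built from.)

CLs_obs, (CLsb, CLb), CLs_exp_band = pyhf.infer.hypotest(
    1.0,  # null hypothesis: BSM signal strength mu = 1
    observations,
    model,
    test_stat="qtilde",
    return_tail_probs=True,
    return_expected_set=True,
)
print(f"Observed CLs: {float(CLs_obs):.4f} (CLs+b = {float(CLsb):.4f}, CLb = {float(CLb):.4f})")
print("Expected CLs band (-2σ to +2σ):", [float(x) for x in CLs_exp_band])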
"source": [
"## Simple Upper Limit\n",
"\n",
"To get upper limits, we just need to run multiple hypothesis tests for a lot of different null hypotheses of BSM with $\\mu \\in [0, \\ldots, 5.0]$ and then find the value of $\\mu$ for which the null hypothesis is rejected (a 95% $\\text{CL}_\\text{s}$)."
"To get upper limits, we just need to run multiple hypothesis tests for a lot of different null hypotheses of BSM with $\\mu \\in [0, \\ldots, 5.0]$ and then find the value of $\\mu$ for which the null hypothesis is rejected (a 95% $\\text{CL}_\\text{s}$). We can do all of this very easily just using the [`upperlimit` API](https://pyhf.readthedocs.io/en/v0.6.1/_generated/pyhf.infer.intervals.upperlimit.html), which also can calculate the upper limit by interpolating"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"poi_values = np.linspace(0.1, 5, 50)\n",
"results = [\n",
" pyhf.infer.hypotest(\n",
" poi_value,\n",
" observations,\n",
" model,\n",
" test_stat=\"qtilde\",\n",
" return_expected_set=True,\n",
" )\n",
" for poi_value in poi_values\n",
"]"
"obs_limit, exp_limits, (scan, results) = pyhf.infer.intervals.upperlimit(\n",
" observations, model, poi_values, level=0.05, return_results=True\n",
")\n",
"print(f\"Upper limit (obs): μ = {obs_limit:.4f}\")\n",
"print(f\"Upper limit (exp): μ = {exp_limits[2]:.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So this ran pretty fast. We can plot the \"standard Brazil band\" using the `pyhf.contrib` module (which needs `pyhf[contrib]`):"
"We can plot the standard \"Brazil band\" of the observed and expected $\\text{CL}_\\text{s}$ values using the `pyhf.contrib` module (which needs `pyhf[contrib]`):"
]
},
{
@@ -665,8 +703,6 @@
"fig, ax = plt.subplots()\n",
"fig.set_size_inches(10.5, 7)\n",
"ax.set_title(\"Hypothesis Tests\")\n",
"ax.set_xlabel(r\"$\\mu$\")\n",
"ax.set_ylabel(r\"$\\mathrm{CL}_{s}$\")\n",
"\n",
"brazil.plot_results(ax, poi_values, results)"
]
Expand All @@ -675,7 +711,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"And then just calculate the upper limit by interpolating:"
"Note that if you wnated to do all of this \"by hand\" you still could pretty easily with the `pyhf` APIs"
]
},
{
@@ -684,31 +720,24 @@
"metadata": {},
"outputs": [],
"source": [
"# Perform a hypothesis test scan across POIs\n",
"results = [\n",
" pyhf.infer.hypotest(\n",
" poi_value,\n",
" observations,\n",
" model,\n",
" test_stat=\"qtilde\",\n",
" return_expected_set=True,\n",
" )\n",
" for poi_value in poi_values\n",
"]\n",
"\n",
"# Calculate upper limit\n",
"observed = np.asarray([h[0] for h in results]).ravel()\n",
"expected = np.asarray([h[1][2] for h in results]).ravel()\n",
"print(f\"Upper limit (obs): μ = {np.interp(0.05, observed[::-1], poi_values[::-1]):.4f}\")\n",
"print(f\"Upper limit (exp): μ = {np.interp(0.05, expected[::-1], poi_values[::-1]):.4f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can also just use the [`upperlimit` API](https://pyhf.readthedocs.io/en/v0.6.1/_generated/pyhf.infer.intervals.upperlimit.html)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"obs_limit, exp_limits, (scan, results) = pyhf.infer.intervals.upperlimit(\n",
" observations, model, poi_values, level=0.05, return_results=True\n",
")\n",
"print(f\"Upper limit (obs): μ = {obs_limit:.4f}\")\n",
"print(f\"Upper limit (exp): μ = {exp_limits[2]:.4f}\")"
]
}
],
"metadata": {
