
feat: cms ttbar training notebook #127

Closed · wants to merge 40 commits

Conversation
Conversation

ekauffma (Collaborator)

Add a notebook to train a BDT that assigns jets to their parent partons. The models trained in this notebook can be used in the ttbar_analysis_pipeline.ipynb notebook.
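
(As an illustrative aside: a minimal sketch of this kind of training, assuming per-combination jet features and binary labels marking whether a combination matches the true parton assignment. The arrays and hyperparameters below are placeholders, not the notebook's actual inputs.)

import numpy as np
import xgboost as xgb

# placeholder inputs: in practice, features are built per jet combination and
# labels mark whether a combination matches the true parton assignment
rng = np.random.default_rng(0)
features = rng.random((1000, 20))
labels = rng.integers(0, 2, size=1000)

# train a BDT on the labeled combinations
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(features, labels)

# BDT score per combination; higher means a more likely correct assignment
scores = model.predict_proba(features)[:, 1]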

@ekauffma ekauffma changed the title Add training feat: cms ttbar training notebook Apr 27, 2023
@alexander-held alexander-held mentioned this pull request Apr 30, 2023
# <img src="utils/jetcombinations.png" alt="jetcombinations" width="700"/>
#
# The combination with the highest BDT score will be selected for each event.
# ____
Member:

This doesn't seem to render as a line, at least not on GitHub.

Member:

I think it needs a blank line above and below to work.

Member:

Suggested change
- # ____
+ #
+ # ____
+ #
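
(As an aside on the quoted cell: a minimal sketch of the per-event selection it describes, assuming a 2D array of BDT scores with one row per event and one column per jet combination; names and shapes here are illustrative.)

import numpy as np

# illustrative scores: one row per event, one column per jet combination
rng = np.random.default_rng(0)
bdt_scores = rng.random((5, 12))

# select the combination with the highest BDT score in each event
best_combination = np.argmax(bdt_scores, axis=1)  # index per event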

Comment on lines 419 to 422
fileset_keys = list(fileset.keys())
for key in fileset_keys:
    if key != "ttbar__nominal":
        fileset.pop(key)  # keep only the nominal ttbar sample
Member:

A bit more compact suggestion:

fileset = {"ttbar__nominal": fileset["ttbar__nominal"]}

Comment on lines 442 to 444
output, metrics = run(fileset,
                      "Events",
                      processor_instance=JetClassifier(permutations_dict, labels_dict))
Member:

It would probably be useful to save the output after this step, allowing the subsequent training to be re-run from just the saved output. Having to re-run coffea every time might not be super convenient. We can do that in a follow-up though.
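
(A minimal sketch of what that could look like, assuming the coffea output is picklable; the file name is arbitrary.)

import pickle

# persist the processor output so the training steps can restart from here
with open("training_output.pkl", "wb") as f:
    pickle.dump(output, f)

# in a later session, reload instead of re-running coffea
with open("training_output.pkl", "rb") as f:
    output = pickle.load(f)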

@ekauffma ekauffma marked this pull request as ready for review May 3, 2023 04:48
ekauffma and others added 16 commits May 3, 2023 06:49
* add notebooks for correctionlib, Triton and MLFlow demos
…p#132)

* changed label top_hadron and top_lepton to b_tophad and b_toplep
* added descriptions of files in cms-open-data-ttbar directory to readme
* delay the import of tritonclient.grpc to the point where it is actually being used
* link to new root-project repository for RDF AGC implementation
* Remove unused imports (asyncio, uproot)
* Add missing import (logging)
* Remove unused variable
* update Dask client & cluster creation for the EAF at Fermilab
* revert to the 2.5% effect for W+jets scale variations used before the correctionlib migration
* added description of statistics to docs
* added description of systematics
* added binning info
iris-hep#145)

* add new local_data_cache argument to utils.construct_fileset
* allow for locally caching data by first downloading it before use
…ence (iris-hep#149)

* add utility tool to validate histogram contents
* add reference file histos_1_file_per_process.json
* fix errors flagged by ruff
* remove broken code from interpolate.py
* sync plotEvents.{ipynb,py}
* remove unused imports from plotEvents.{py,ipynb}
eguiraud and others added 11 commits June 7, 2023 22:43
* removed unnecessary particle dependency
* moved ML model fitting into USE_INFERENCE wrapping
* updated func_adl get_query method
* turn initial ML feature plots into grid
…ris-hep#157)

* fix existing reference file to correspond to correct effect of scale variation
* sort keys in json dumps of reference yields
* added reference histogram yields for various N_FILES_MAX_PER_SAMPLE settings
* fix binning for analysis task description
* moved a lot of functionality from the notebook to utility modules
* cloudpickle ensures that Dask workers still have access to relevant functionality
* moved additional config from YAML to a python config file
* updated default chunk size to 200k
@ekauffma ekauffma marked this pull request as draft June 30, 2023 14:58
@alexander-held alexander-held self-requested a review July 17, 2023 13:34
@alexander-held (Member)

It looks like something went wrong here and the diff now shows a ton of files as changed, any idea why?

@ekauffma (Collaborator, Author)

> It looks like something went wrong here and the diff now shows a ton of files as changed, any idea why?

I have no idea. I did not see this when I made the pull request. I'll take some time to look into this.

@alexander-held (Member)

Maybe rebasing would fix it? No idea either why this happened.


# %% tags=[]
# remove model directory after uploading to triton
# !rm -r reconstruction_bdt_xgb
Member:

It would be good to remove this step by default so that the model files remain available after running the notebook. Perhaps we can delete the folder before creating new files to avoid any kind of collisions, but even that is maybe optional.
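
(A minimal sketch of the delete-before-create variant; the directory name matches the quoted cell.)

import shutil
from pathlib import Path

# clear out any stale export before writing new model files,
# so the exported model remains available after the notebook finishes
model_dir = Path("reconstruction_bdt_xgb")
if model_dir.exists():
    shutil.rmtree(model_dir)
model_dir.mkdir(parents=True)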

@alexander-held (Member)

Please also have a look at the config that is written out; there are some issues with brackets not being closed.

@ekauffma ekauffma closed this Sep 10, 2023