
feat: cms ttbar training notebook #127

Closed · wants to merge 40 commits

Conversation
Conversation

ekauffma (Collaborator)

Add a notebook to train a BDT that assigns jets to their parent partons. The models trained in this notebook can be used in the ttbar_analysis_pipeline.ipynb notebook.
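
(As an illustrative aside: a minimal sketch of this kind of training, assuming per-combination jet features and binary labels marking whether a combination matches the true parton assignment. The arrays and hyperparameters below are placeholders, not the notebook's actual inputs.)

import numpy as np
import xgboost as xgb

# placeholder inputs: in practice, features are built per jet combination and
# labels mark whether a combination matches the true parton assignment
rng = np.random.default_rng(0)
features = rng.random((1000, 20))
labels = rng.integers(0, 2, size=1000)

# train a BDT on the labeled combinations
model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
model.fit(features, labels)

# BDT score per combination; higher means a more likely correct assignment
scores = model.predict_proba(features)[:, 1]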

@ekauffma ekauffma changed the title Add training feat: cms ttbar training notebook Apr 27, 2023
@alexander-held alexander-held mentioned this pull request Apr 30, 2023
# <img src="utils/jetcombinations.png" alt="jetcombinations" width="700"/>
#
# The combination with the highest BDT score will be selected for each event.
# ____
Member:

This doesn't seem to render as a line, at least not on GitHub.

Member:

I think it needs a blank line above and below to work.

Member:

Suggested change
- # ____
+ #
+ # ____
+ #
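
(As an aside on the quoted cell: a minimal sketch of the per-event selection it describes, assuming a 2D array of BDT scores with one row per event and one column per jet combination; names and shapes here are illustrative.)

import numpy as np

# illustrative scores: one row per event, one column per jet combination
rng = np.random.default_rng(0)
bdt_scores = rng.random((5, 12))

# select the combination with the highest BDT score in each event
best_combination = np.argmax(bdt_scores, axis=1)  # index per event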

Comment on lines 419 to 422
fileset_keys = list(fileset.keys())
for key in fileset_keys:
    if key != "ttbar__nominal":
        fileset.pop(key)  # keep only the nominal ttbar sample
Member:

A bit more compact suggestion:

fileset = {"ttbar__nominal": fileset["ttbar__nominal"]}

Comment on lines 442 to 444
output, metrics = run(fileset,
                      "Events",
                      processor_instance=JetClassifier(permutations_dict, labels_dict))
Member:

It would probably be useful to save the output after this step, allowing the subsequent training to be re-run from just the saved output. Having to re-run coffea every time might not be super convenient. We can do that in a follow-up though.
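
(A minimal sketch of what that could look like, assuming the coffea output is picklable; the file name is arbitrary.)

import pickle

# persist the processor output so the training steps can restart from here
with open("training_output.pkl", "wb") as f:
    pickle.dump(output, f)

# in a later session, reload instead of re-running coffea
with open("training_output.pkl", "rb") as f:
    output = pickle.load(f)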

@ekauffma ekauffma marked this pull request as ready for review May 3, 2023 04:48
ekauffma and others added 16 commits May 3, 2023 06:49
* add notebooks for correctionlib, Triton and MLFlow demos
…p#132)

* changed label top_hadron and top_lepton to b_tophad and b_toplep
* added descriptions of files in cms-open-data-ttbar directory to readme
* delay the import of tritonclient.grpc to the point where it is actually being used
* link to new root-project repository for RDF AGC implementation
* Remove unused imports (asyncio, uproot)
* Add missing import (logging)
* Remove unused variable
* update Dask client & cluster creation for the EAF at Fermilab
* revert to the 2.5% effect for W+jets scale variations used before the correctionlib migration
* added description of statistics to docs
* added description of systematics
* added binning info
iris-hep#145)

* add new local_data_cache argument to utils.construct_fileset
* allow for locally caching data by first downloading it before use
…ence (iris-hep#149)

* add utility tool to validate histogram contents
* add reference file histos_1_file_per_process.json
* fix errors flagged by ruff
* remove broken code from interpolate.py
* sync plotEvents.{ipynb,py}
* remove unused imports from plotEvents.{py,ipynb}
eguiraud and others added 11 commits June 7, 2023 22:43
* removed unnecessary particle dependency
* moved ML model fitting into USE_INFERENCE wrapping
* updated func_adl get_query method
* turn initial ML feature plots into grid
…ris-hep#157)

* fix existing reference file to correspond to correct effect of scale variation
* sort keys in json dumps of reference yields
* added reference histogram yields for various N_FILES_MAX_PER_SAMPLE settings
* fix binning for analysis task description
* moved a lot of functionality from the notebook to utility modules
* cloudpickle ensures that Dask workers still have access to relevant functionality
* moved additional config from YAML to a python config file
* updated default chunk size to 200k
@ekauffma ekauffma marked this pull request as draft June 30, 2023 14:58
@alexander-held alexander-held self-requested a review July 17, 2023 13:34
@alexander-held (Member)

It looks like something went wrong here and the diff now shows a ton of files as changed, any idea why?

@ekauffma (Collaborator, Author)

> It looks like something went wrong here and the diff now shows a ton of files as changed, any idea why?

I have no idea. I did not see this when I made the pull request. I'll take some time to look into this.

@alexander-held (Member)

Maybe rebasing would fix it? No idea either why this happened.


# %% tags=[]
# remove model directory after uploading to triton
# !rm -r reconstruction_bdt_xgb
Member:

It would be good to remove this step by default so that the model files remain available after running the notebook. Perhaps we can delete the folder before creating new files to avoid any kind of collisions, but even that is maybe optional.
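
(A minimal sketch of the delete-before-create variant; the directory name matches the quoted cell.)

import shutil
from pathlib import Path

# clear out any stale export before writing new model files,
# so the exported model remains available after the notebook finishes
model_dir = Path("reconstruction_bdt_xgb")
if model_dir.exists():
    shutil.rmtree(model_dir)
model_dir.mkdir(parents=True)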

@alexander-held (Member)

Please also have a look at the config that is written out; there are some issues with brackets not being closed.

@ekauffma ekauffma closed this Sep 10, 2023