Introduce functionality for chunking and breaking IID experiments #20

Draft · wants to merge 35 commits into main
Conversation

@mcw92 (Member) commented Dec 9, 2024

This PR introduces functionality for the chunking and breaking IID experiments. In particular, the evaluation has been extended to compute and save local and global confusion matrices, enabling the calculation of arbitrary metrics for the breaking IID experiments.

The following changes have been made:

  • Make synthetic data generation consistent throughout the code: in the serial case, the dataset generated by generate_and_distribute_synthetic_dataset without local or global imbalances equals the completely balanced dataset generated by make_classification_dataset when using the same random state. This ensures that the strong scaling experiment series with and without chunking are comparable, as identical datasets are created for the same random state.
  • Fix passing additional keyword arguments in both train_parallel_on_synthetic_data and train_parallel_on_balanced_synthetic_data; this was completely missing in the former. In addition, the argument parser lacked some of the keyword arguments of sklearn's make_classification and train_test_split, which are used under the hood.
  • Introduce job script generation scripts for both chunking and breaking IID experiments.
  • Add calculation and saving of local and global confusion matrices, including tests.
  • Add evaluation from checkpoints for breaking IID experiments.
  • Refactor train module into train_serial and train_parallel.
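The comparability point above rests on the determinism of sklearn's data generation. The repo's own generate_and_distribute_synthetic_dataset is not shown here; this minimal sketch (with arbitrary example parameters) only demonstrates the underlying mechanism, namely that two independent calls to make_classification with the same random state produce bit-identical datasets:

```python
import numpy as np
from sklearn.datasets import make_classification

# Two independent calls with the same random state yield identical datasets,
# which is what makes runs with and without chunking comparable.
X1, y1 = make_classification(
    n_samples=1000, n_features=20, n_classes=4, n_informative=8, random_state=9
)
X2, y2 = make_classification(
    n_samples=1000, n_features=20, n_classes=4, n_informative=8, random_state=9
)
assert np.array_equal(X1, X2) and np.array_equal(y1, y2)
```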

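For the local/global confusion matrices, one way to keep local matrices summable even when a rank's partition misses some classes is to pass an explicit label set to sklearn's confusion_matrix. This is a sketch of that idea with made-up toy data, not the code from this PR:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

n_classes = 4
# Toy local predictions on two ranks; rank 1's partition never sees class 3.
y_true_rank0, y_pred_rank0 = [0, 1, 2, 3], [0, 1, 2, 2]
y_true_rank1, y_pred_rank1 = [0, 0, 1, 2], [0, 1, 1, 2]

# An explicit label set keeps every local matrix at n_classes x n_classes,
# so the global matrix is just the element-wise sum of the local ones.
labels = np.arange(n_classes)
local0 = confusion_matrix(y_true_rank0, y_pred_rank0, labels=labels)
local1 = confusion_matrix(y_true_rank1, y_pred_rank1, labels=labels)
global_cm = local0 + local1  # in parallel code, this sum would be a reduction
assert global_cm.shape == (n_classes, n_classes)
assert global_cm.sum() == 8  # one count per sample across both ranks
```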
Notes to self:

  • sklearn's RandomForestClassifier internally uses weighted voting in its predict() method, i.e., the predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. The predicted class thus is the one with the highest mean probability estimate across the trees. Since the DistributedRandomForest class in specialcouscous only implements plain voting, I also implemented plain voting for calculating the local confusion matrices instead of using predict(), to ensure consistency and comparability.
  • A possible problem with the confusion matrix might occur when the local data does not contain all classes, e.g., for extremely imbalanced datasets or data partitionings. However, I am not sure about this.
  • As building a globally shared model turned out to be infeasible for most of our use cases / experiments, the functionality for calculating the confusion matrix and also evaluating breaking IID experiments from checkpoints mainly focuses on the case where the global model is not shared but distributed. That is why a shared test set is required in all our experiments.

@mcw92 mcw92 added the enhancement New feature or request label Dec 9, 2024
@mcw92 mcw92 requested a review from fluegelk December 9, 2024 12:11
@mcw92 mcw92 self-assigned this Dec 9, 2024
github-actions bot (Contributor) commented Dec 9, 2024

Name Stmts Miss Cover Missing
specialcouscous/__init__.py 0 0 100%
specialcouscous/evaluation_metrics.py 66 1 98% 139
specialcouscous/rf_parallel.py 139 10 93% 93-97, 202, 206, 279-281, 479, 541
specialcouscous/synthetic_classification_data.py 215 49 77% 88-90, 185, 304-324, 358, 469, 471, 561-567, 585, 871-885, 1095-1151, 1224-1246
specialcouscous/train/__init__.py 0 0 100%
specialcouscous/train/train_parallel.py 262 1 99% 117
specialcouscous/train/train_serial.py 75 1 99% 189
specialcouscous/utils/__init__.py 61 33 46% 31, 81-82, 106-287
specialcouscous/utils/plot.py 136 74 46% 152, 277-302, 319-405, 421-547
specialcouscous/utils/result_handling.py 22 1 95% 79
specialcouscous/utils/slurm.py 79 72 9% 22-116, 133-149, 166-177
specialcouscous/utils/timing.py 35 0 100%
TOTAL 1090 242 78%

@codecov-commenter commented Dec 9, 2024

Codecov Report

Attention: Patch coverage is 98.12030% with 5 lines in your changes missing coverage. Please review.

Project coverage is 77.79%. Comparing base (5c7ce6a) to head (d079d5c).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
specialcouscous/utils/__init__.py 0.00% 2 Missing ⚠️
specialcouscous/rf_parallel.py 96.66% 1 Missing ⚠️
specialcouscous/train/train_parallel.py 99.29% 1 Missing ⚠️
specialcouscous/train/train_serial.py 98.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #20      +/-   ##
==========================================
+ Coverage   75.30%   77.79%   +2.48%     
==========================================
  Files           9       10       +1     
  Lines         972     1090     +118     
==========================================
+ Hits          732      848     +116     
- Misses        240      242       +2     

