
Add weighted aggregation, early stopping and persistent logging #9

TensorSpd

Description

Summary of Changes:

I have implemented several key enhancements to the Federated Learning (FL) Aggregator component that improve the efficiency, fairness, and transparency of the training process. The main changes include:

  1. Weighted Aggregation by Dataset Size:

    • What: The aggregator now performs a weighted average of client models based on each client's dataset size. This ensures that clients contributing more data have a proportionally greater influence on the global model (see the aggregation sketch after this list).
    • Why: This approach makes the FL process fairer and more representative, especially when participants have varying amounts of data.
  2. Early Stopping Mechanism:

    • What: Introduced an early stopping feature that halts the training process if the global model's accuracy does not improve for a specified number of consecutive rounds (early_stopping_patience); see the early-stopping sketch after this list.
    • Why: This optimization saves computational resources by preventing unnecessary training once improvements plateau.
  3. Aggregator State Persistence (agg_state.json):

    • What: Added functionality to persist the aggregator's state, including the best accuracy achieved and the count of consecutive rounds without improvement, into agg_state.json.
    • Why: This allows the training process to resume seamlessly after interruptions without losing progress, enhancing reliability.
  4. Dashboard/Metric Setup:

    • What: Configured the aggregator to set up a public folder structure for each project (public/fl/<project_name>), copy dashboard files, and track accuracy metrics in accuracy_metrics.json (see the dashboard sketch after this list).
    • Why: This facilitates easy access to training logs and model performance metrics post-training, improving transparency and monitoring.
  5. Bug Fixes and Miscellaneous Improvements:

    • What: Resolved a NameError by ensuring the json module is imported where necessary.
    • Why: Ensures smooth execution without runtime errors related to missing imports.
    • What: Ensured that training_loss_round_{round_num}.json is correctly generated and moved to the public folder, making training loss logs accessible after project completion.
    • Why: Provides persistent and structured logs for better analysis and visualization of training progress.
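
To make the weighting concrete, here is a minimal sketch of dataset-size-weighted averaging over PyTorch state dicts. The function and variable names are illustrative, not necessarily those used in this PR:

```python
import torch

def weighted_aggregate(client_states, dataset_sizes):
    """Weighted average of client state_dicts by dataset size.

    client_states: list of model state_dicts, one per client.
    dataset_sizes: list of sample counts, in the same order.
    """
    total = sum(dataset_sizes)
    weights = [n / total for n in dataset_sizes]

    aggregated = {}
    for key in client_states[0]:
        # Weighted sum of this parameter tensor across all clients.
        aggregated[key] = sum(
            w * state[key].float() for w, state in zip(weights, client_states)
        )
    return aggregated
```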
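The early stopping mechanism and the agg_state.json persistence work together: the same two counters (best accuracy seen, consecutive rounds without improvement) both drive the stop decision and survive restarts. A minimal sketch, assuming the state file holds exactly those two fields (field names here are illustrative):

```python
import json
from pathlib import Path

STATE_FILE = Path("agg_state.json")  # path taken from the PR description

def load_state():
    """Resume from a previous run if the state file exists."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"best_accuracy": 0.0, "rounds_without_improvement": 0}

def update_and_check(state, round_accuracy, patience):
    """Update the persisted counters and report whether to stop early."""
    if round_accuracy > state["best_accuracy"]:
        state["best_accuracy"] = round_accuracy
        state["rounds_without_improvement"] = 0
    else:
        state["rounds_without_improvement"] += 1
    STATE_FILE.write_text(json.dumps(state, indent=2))  # persist every round
    return state["rounds_without_improvement"] >= patience
```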
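The dashboard/metric setup reduces to creating the per-project public folder, copying the dashboard assets, and appending one record per round to accuracy_metrics.json. A sketch under those assumptions (the dashboard source path and the JSON schema are placeholders):

```python
import json
import shutil
from pathlib import Path

def setup_public_folder(project_name, dashboard_src):
    """Create public/fl/<project_name> and copy dashboard files into it."""
    public_dir = Path("public") / "fl" / project_name
    public_dir.mkdir(parents=True, exist_ok=True)
    shutil.copytree(dashboard_src, public_dir, dirs_exist_ok=True)
    return public_dir

def record_accuracy(public_dir, round_num, accuracy):
    """Append one record per round to accuracy_metrics.json."""
    metrics_file = public_dir / "accuracy_metrics.json"
    metrics = json.loads(metrics_file.read_text()) if metrics_file.exists() else []
    metrics.append({"round": round_num, "accuracy": accuracy})
    metrics_file.write_text(json.dumps(metrics, indent=2))
```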

Motivation:

These enhancements aim to make the federated learning pipeline more robust, fair, and efficient. Weighted aggregation ensures that models are influenced appropriately based on the data contributed, while early stopping conserves computational resources. Persisting the aggregator state enhances reliability, and the improved logging and dashboard setup aids in monitoring and analyzing the training process effectively.

Additional Context:

  • These changes have been implemented locally and are ready for integration.
  • The early stopping mechanism relies on the aggregator's ability to track and compare model accuracy across rounds.
  • The dashboard setup provides a centralized location for viewing training metrics and logs, essential for ongoing monitoring and evaluation of the FL process.


Affected Dependencies

  • No new external dependencies were introduced; all changes rely on the Python standard library (notably the json module) and PyTorch.

How has this been tested?

  • Unit Tests:

    • Verified that training_loss_round_{round_num}.json is correctly created and updated during each training round.
    • Confirmed that dataset_size.json is accurately generated and used for weighted aggregation.
  • Integration Tests:

    • Conducted end-to-end training sessions to ensure the aggregator correctly performs weighted aggregation based on dataset sizes.
    • Tested the early stopping mechanism by simulating rounds with and without accuracy improvements to verify that training stops as expected.
  • Manual Testing:

    • Ran the modified aggregator and client scripts locally to ensure that logs are correctly copied to the public folder.
    • Checked that agg_state.json properly records and updates the best accuracy and no-improvement round count.
    • Accessed the dashboard in the public/fl/<project_name> folder to verify that training loss and accuracy metrics are displayed correctly.

Instructions to Reproduce:

  1. Set Up:

    • Ensure that both the aggregator and client scripts are updated with the changes from this PR.
    • Initialize a federated learning project in syftbox as usual.
  2. Run Training:

    • Let the aggregator and client processes run.
    • Dropping the config files into the launch folder, and every step after that, works exactly as described in the main documentation.
    • Monitor the logs to verify that training_loss_round_{round_num}.json is being generated and copied to the public folder.
  3. Verify Aggregation:

    • Check that the aggregator correctly reads dataset_size.json from each client and performs weighted aggregation.
    • Observe that training stops early if the accuracy does not improve for the specified patience rounds.
  4. Review Logs and Metrics:

    • Access the public/fl/<project_name> folder to review the training loss JSON files and the accuracy metrics dashboard (a quick inspection snippet follows).
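
As a quick sanity check, the persisted files can be read back directly. A minimal sketch, assuming the project name and field layout shown here (both are placeholders, not guaranteed to match the PR exactly):

```python
import json
from pathlib import Path

project = "my_fl_project"  # substitute your project name

# Best accuracy and no-improvement count recorded by the aggregator.
print(json.loads(Path("agg_state.json").read_text()))

# Per-round accuracy records backing the dashboard.
print(json.loads((Path("public/fl") / project / "accuracy_metrics.json").read_text()))
```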
