Add Datasize Tracking, persistent training loss json and share model … #4

Open
wants to merge 1 commit into main
Conversation

TensorSpd

Description

Summary of Changes:

I have implemented several key enhancements to the Federated Learning (FL) client component to improve the efficiency, fairness, and transparency of the training process. The main changes include:

  1. Per-Epoch Loss Logging to JSON:

    • What: Modified the train_model() function to log the average training loss for each epoch into a training_loss_round_{round_num}.json file in addition to the existing text logs.
    • Why: Structured JSON logging facilitates easier visualization and analysis of training progress, enabling better monitoring and debugging.
  2. Dataset Size Tracking:

    • What: After loading and concatenating local datasets, the client now generates a dataset_size.json file that records the total number of samples used in training.
    • Why: This information is essential for the aggregator to perform weighted aggregation, ensuring that participants with larger datasets have a proportionally greater influence on the global model.
  3. Persistence of Training Loss JSON:

    • What: After each training round, the training_loss_round_{round_num}.json file is copied to the client's public folder (public/fl/<project_name>/), ensuring that the training logs remain accessible even after the project concludes.
    • Why: Ensures that valuable training insights are preserved for future reference and analysis, enhancing transparency and accountability.
  4. Sharing Model and Dataset Info:

    • What: Enhanced existing functions to copy both the trained model (trained_model_round_{round_num}.pt) and dataset_size.json to the aggregator's side.
    • Why: Enables the aggregator to use dataset sizes for weighted aggregation, improving the fairness and effectiveness of the global model updates (a minimal sketch of these client-side changes follows this list).
  5. Bug Fixes and Miscellaneous Improvements:

    • What: Resolved a FileNotFoundError by ensuring the fl_config.json is correctly placed in the running/<project_name>/ folder on the client side.
    • Why: Prevents runtime errors related to missing configuration files, ensuring smooth execution of the training process.
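
For concreteness, the sketch below outlines the client-side changes described above. train_model() and the file names come from this PR; the helper names (record_dataset_size, share_round_artifacts), the directory arguments, and the training-loop details are illustrative assumptions, not the actual implementation.

```python
import json
import shutil
from pathlib import Path

import torch


def train_model(model, dataloader, criterion, optimizer, epochs, round_num, out_dir: Path):
    """Local training loop that also logs the average loss of each epoch to JSON."""
    epoch_losses = []
    for epoch in range(epochs):
        running_loss, n_batches = 0.0, 0
        for inputs, targets in dataloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
            n_batches += 1
        epoch_losses.append({"epoch": epoch + 1, "avg_loss": running_loss / max(n_batches, 1)})

    # (1) Per-epoch loss logging to JSON, alongside the existing text logs.
    loss_file = out_dir / f"training_loss_round_{round_num}.json"
    loss_file.write_text(json.dumps(epoch_losses, indent=2))
    return model


def record_dataset_size(dataset, out_dir: Path) -> Path:
    # (2) Dataset size tracking, consumed by the aggregator for weighted aggregation.
    size_file = out_dir / "dataset_size.json"
    size_file.write_text(json.dumps({"num_samples": len(dataset)}))
    return size_file


def share_round_artifacts(model, round_num, out_dir: Path, public_dir: Path, aggregator_dir: Path):
    # (3) Persist the loss log to the client's public folder (public/fl/<project_name>/).
    shutil.copy(out_dir / f"training_loss_round_{round_num}.json", public_dir)

    # (4) Share the trained model and the dataset size with the aggregator.
    model_file = out_dir / f"trained_model_round_{round_num}.pt"
    torch.save(model.state_dict(), model_file)
    shutil.copy(model_file, aggregator_dir)
    shutil.copy(out_dir / "dataset_size.json", aggregator_dir)
```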

Motivation:

These enhancements aim to make the federated learning pipeline more robust, fair, and efficient. Logging training loss in a structured format allows for better monitoring and analysis, while tracking dataset sizes ensures that the aggregation process fairly represents each participant's contribution. Persisting logs in the public folder enhances transparency, and the bug fixes ensure reliability and smooth operation.

Additional Context:

  • These changes have been implemented locally and are ready for integration.
  • The early stopping mechanism on the aggregator side complements these client-side enhancements by optimizing the overall training process.
  • The structured logging and dataset size tracking lay the groundwork for more sophisticated monitoring and analysis tools in the future.

Affected Dependencies

  • Uses the json module from the Python standard library for JSON serialization.
  • No new external dependencies were introduced; all changes rely on the standard library and PyTorch.

How has this been tested?

  • Unit Tests:

    • Verified that training_loss_round_{round_num}.json is correctly created and updated during each training round.
    • Confirmed that dataset_size.json accurately reflects the number of samples used in training (a sketch of this kind of check follows this list).
  • Integration Tests:

    • Conducted end-to-end training sessions to ensure that the client correctly logs training losses and dataset sizes.
    • Verified that training_loss_round_{round_num}.json is copied to the public folder upon training completion.
    • Ensured that both the trained model and dataset_size.json are successfully shared with the aggregator.
  • Manual Testing:

    • Ran the modified client scripts locally to ensure that training_loss_round_{round_num}.json is generated and copied to the public folder.
    • Checked the integrity of the dataset_size.json and verified that it contains accurate dataset size information.
    • Confirmed that the aggregator successfully reads the dataset_size.json files for weighted aggregation.
    • Ensured that the FileNotFoundError related to fl_config.json is resolved by correctly placing the configuration file in the running/<project_name>/ folder.
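
As an illustration of the unit-level checks above, here is a minimal pytest-style sketch. It assumes the hypothetical record_dataset_size() helper from the earlier client-side sketch; the real test suite may be structured differently.

```python
import json
from pathlib import Path


def test_round_artifacts(tmp_path: Path):
    # record_dataset_size() is the hypothetical helper from the client-side
    # sketch above; the actual tests may be organised differently.
    dataset = list(range(120))  # stand-in for the concatenated local dataset
    size_file = record_dataset_size(dataset, tmp_path)
    assert json.loads(size_file.read_text()) == {"num_samples": 120}

    # The loss log written by train_model() should be a list of per-epoch records.
    loss_file = tmp_path / "training_loss_round_1.json"
    loss_file.write_text(json.dumps([{"epoch": 1, "avg_loss": 0.9}]))
    records = json.loads(loss_file.read_text())
    assert all({"epoch", "avg_loss"} <= rec.keys() for rec in records)
```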

Instructions to Reproduce:

  1. Set Up:

    • Ensure that both the aggregator and client scripts are updated with the changes from this PR.
    • Initialize a federated learning project as usual, ensuring that the fl_config.json, model.py, and global_model_weights.pt are correctly placed in the launch folder.
  2. Run Training:

    • Let the aggregator and client processes run in syftbox.
    • Monitor the client logs to verify that training_loss_round_{round_num}.json is being generated and copied to the public folder.
  3. Verify Aggregation:

    • Check that the aggregator correctly reads dataset_size.json from each client and performs weighted aggregation (a sketch of this step appears after these instructions).
    • Observe that training stops early if the accuracy does not improve for the specified number of patience rounds.
  4. Review Logs and Metrics:

    • Access the public/fl/<project_name> folder to review the training loss JSON files and verify their persistence after training completion.
    • Ensure that the dataset_size.json is present and accurate.
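
For reference, the weighted aggregation verified in step 3 amounts to standard FedAvg: each client's parameters are scaled by its share of the total sample count. A minimal sketch, assuming each client directory holds the trained_model_round_{round_num}.pt and dataset_size.json files produced by the client; the aggregator's actual loop may differ.

```python
import json
from pathlib import Path

import torch


def weighted_aggregate(client_dirs: list[Path], round_num: int) -> dict:
    """FedAvg-style aggregation: weight each client's parameters by its
    share of the total sample count reported in dataset_size.json."""
    sizes, states = [], []
    for d in client_dirs:
        sizes.append(json.loads((d / "dataset_size.json").read_text())["num_samples"])
        states.append(torch.load(d / f"trained_model_round_{round_num}.pt"))
    total = sum(sizes)

    global_state = {}
    for key in states[0]:
        global_state[key] = sum(
            (n / total) * state[key].float() for n, state in zip(sizes, states)
        )
    return global_state
```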
