Luna 16 is a machine learning project focused on developing and training models for medical image analysis. The project aims to leverage advanced techniques in deep learning to improve the accuracy and efficiency of image-based diagnostics.
The Luna 16 project is dedicated to the development and training of machine learning models specifically designed for medical image analysis. By utilizing state-of-the-art deep learning techniques, the project aims to enhance the precision and speed of diagnostic processes based on medical images such as CT scans.
- PyTorch: For building and training deep learning models.
- MLFlow: For experiment tracking, model management, and system metrics logging.
- TensorBoard: For visualizing training metrics and model performance.
- Docker: To create reproducible environments for development and deployment.
- Ray: For distributed hyperparameter tuning.
By integrating these technologies, Luna 16 aims to push the boundaries of medical image analysis, providing tools and models that can significantly aid in the diagnostic process.
This project primarily serves educational purposes, guiding users through the entire process of model development. It emphasizes the importance of infrastructure components like MLFlow and TensorBoard for monitoring and logging, which are beneficial for any machine learning project. Additionally, it incorporates Docker for reproducibility and Ray for hyperparameter tuning. The project is designed with modularity in mind, allowing easy replacement of components such as models, datasets, trainers, and loss functions. As a result, it can serve as a starting point for any other machine learning project, particularly those focused on experimentation. Essentially, it acts as a template for any machine learning project, providing a robust foundation for future work.
- Advanced Deep Learning Models: Luna 16 employs cutting-edge neural network architectures to analyze and interpret medical images.
- Automated Data Processing: The project includes robust data processing pipelines to handle large volumes of medical imaging data efficiently.
- Comprehensive Training Framework: A well-structured training module is provided to facilitate the training and fine-tuning of models.
- Monitoring and Logging: Integrated tools for monitoring model performance and system metrics, including MLFlow and TensorBoard, ensure that the training process is transparent and well-documented.
- Cloud Experimentation: The project supports running experiments on cloud platforms like RunPod, enabling scalable and flexible model training.
This project is still in progress and was initially prepared for experimenting with different models. Currently, it only implements the most basic "real" model. Several key components are missing, including:
- Segmentation of CT scans. The project currently focuses on classification tasks, using cutouts of CT scans. However, adding segmentation capabilities to extract nodules from CT scans directly is planned.
- Baseline and other models. I would particularly like to add a Zero Rule Baseline and non-neural network models. Additionally, incorporating other neural network models with different architectures and convolution techniques is planned.
- Hyperparameter tuning. The project aims to integrate Ray for distributed hyperparameter tuning to optimize model performance.
- Cloud deployment. The project will include tools and scripts for deploying models to cloud platforms for inference and production use.
- Additional datasets. The project will include more datasets for training and validation, covering a wider range of medical imaging tasks and modalities.
- Try Torchrun. Use
torchrunto run training scripts. - Add gradient clipping.
The Luna 16 project is designed with a modular architecture that separates concerns and promotes reusability and extensibility.
The Model Module in Luna 16 is designed to provide a structured and flexible interface for building and training machine learning models, specifically tailored for medical image analysis.
The base model interface defines the essential methods that any model implementation should have. These include methods for fitting the model to training data, validating the model, and obtaining the model's signature. The interface ensures that all models adhere to a consistent structure, making it easier to integrate and manage different models within the project.
Key methods include:
fit_epoch: Trains the model for one epoch.get_module: Returns the underlying PyTorch module.get_signature: Generates a model signature for MLFlow logging.
The NoduleClassificationModel class implements the base model interface and provides the specific logic for training and validating a model designed to classify nodules in medical images. This class includes methods for:
- Initializing the model with necessary components such as the neural network module, optimizer, batch iterator, and logger.
- Training the model (
do_training) and validating it (do_validation). - Computing the loss for each batch (
compute_batch_loss). - Logging metrics to monitor the model's performance (
log_metrics).
The NoduleClassificationModel is designed to be flexible and extensible, allowing for fine-tuning and adaptation to different datasets and training requirements. It leverages PyTorch for model building and training, and integrates with MLFlow for experiment tracking and logging.
To use the Model Module, instantiate the NoduleClassificationModel with the appropriate neural network module, optimizer, and other dependencies. Then, use the fit_epoch method to train the model for the desired number of epochs, and monitor the training process using the integrated logging and monitoring tools.
By following this structured approach, the Model Module ensures that the development and training of machine learning models in Luna 16 are efficient, reproducible, and easy to manage.
DataModule provides a structured and efficient way to handle datasets, making it easier to train and validate models. It manages the creation of data loaders, which are essential for batching and shuffling data during training and validation processes.
The DataModule takes in training and validation datasets and creates corresponding data loaders. These data loaders are responsible for batching the data and ensuring that it is fed into the model in an optimal manner. The class also handles device configuration, ensuring that data is loaded onto the appropriate device (CPU or GPU) for training.
Key methods include:
__init__: Initializes theDataModulewith batch size, training dataset, and validation dataset. It also sets up the device configuration and creates data loaders.get_dataloader: Returns the appropriate data loader (training or validation) based on the input flag.
By using the DataModule, Luna 16 ensures that data handling is efficient and consistent, allowing for smooth and effective training and validation of machine learning models.
The Training Module in Luna 16 is responsible for orchestrating the training process of machine learning models. By using the Trainer and BaseTrainer, the Luna 16 project ensures that the training process is modular, extensible, and easy to manage.
The BaseTrainer is a protocol that defines the essential methods required for training a model. It ensures that any trainer implementation adheres to a consistent interface, making it easier to manage and extend the training process. The key methods defined in BaseTrainer include:
fit: Trains the model for a specified number of epochs.fit_epoch: Trains the model for a single epoch.
The Trainer class implements the BaseTrainer protocol and provides the concrete logic for training models. It handles the entire training lifecycle, including:
- Initializing the trainer with a name, version, and logger.
- Managing the training loop, including epoch-wise training and logging.
- Profiling the training process to optimize performance.
The Trainer class ensures that the training process is well-organized, efficient, and easy to monitor. It integrates with logging and profiling tools to provide detailed insights into the training process.
The Launcher Modules in Luna 16 serve as the entry points for training and fine-tuning machine learning models. They encapsulate the logic required to set up, train, and evaluate models, making it easier to manage the entire training lifecycle.
The LunaClassificationLauncher is designed to handle the training of classification models. It integrates various services and components to streamline the training process. Key responsibilities include:
- Initialization: Sets up the training environment, including logging and batch iterators.
- Model Training: Provides methods to train models from scratch (
fit) or continue training from a saved state (load_fit). - Hyperparameter Tuning: Supports distributed hyperparameter tuning using Ray (
tune_parameters).
The LunaMalignantClassificationLauncher is similar to the LunaClassificationLauncher but is specifically tailored for training models that classify malignant nodules in medical images. It shares the same core functionalities but is configured to handle the specific requirements of malignant classification tasks.
In the CLI, these launcher modules act as "endpoints" that bring together various services and components to execute training tasks. For example, the train_luna_classification command uses the LunaClassificationLauncher to initiate the training process.
Augmentations are techniques used to artificially expand the size and variability of a training dataset by applying random transformations to the input data. This helps improve the robustness and generalization of machine learning models by exposing them to a wider range of variations during training.
- Flip: Randomly flips the image along specified dimensions to introduce variability in orientation.
- Offset: Applies random translations to the image, shifting it by a specified offset to simulate different positions.
- Scale: Randomly scales the image by a factor to introduce variability in size.
- Rotate: Rotates the image by a random angle to simulate different orientations.
- Noise: Adds random noise to the image to simulate variations in image quality and improve model robustness.
These augmentations help ensure that the models trained in Luna 16 are more resilient to variations in the input data, leading to better performance on unseen data.
The BaseIteratorProvider class defines the interface for batch iterators in the Luna 16 project.
Key method:
enumerate_batches: An abstract method that must be implemented by any subclass to iterate over batches of data, providing the necessary logging and handling during the iteration process.
The BatchIteratorProvider class is responsible for managing the iteration over batches of data during the training and validation processes. It logs the start and end of batch processing, as well as periodic updates during the iteration. This class ensures that the training loop is well-documented and that progress can be monitored effectively.
The ServiceContainer is a core component in Luna 16 that manages the lifecycle of various services used throughout the project.
The ServiceContainer class allows you to register services directly or through creator functions. It also supports registering cleanup functions that are called when the services are closed. This ensures that resources are properly managed and released when they are no longer needed.
Key methods include:
register_service: Registers a service instance directly.register_creator: Registers a function that creates a service instance.call_all_creators: Calls all registered creator functions to instantiate services.get_service: Retrieves a registered service instance.close_all_services: Calls all registered cleanup functions to close services.
By using the ServiceContainer, Luna 16 ensures that services are managed in a consistent and efficient manner, reducing the risk of resource leaks and making the codebase easier to maintain.
CtCutoutService is a service class responsible for creating cutouts from CT scans for candidate nodules.
It manages the process of reading CT scan data, extracting relevant cutouts, and saving them for further analysis.
The CtCutoutService uses the Ct class to read CT scans and extract cutouts for each candidate nodule. It then saves these cutouts to disk and updates the list of processed candidates.
The CutoutsDataset class is a custom dataset designed for handling cutouts of CT scans. These cutouts are small, focused sections of larger CT scans that contain candidate nodules. The primary purpose of this dataset is to facilitate the training and validation of machine learning models by providing a structured way to access and manipulate these cutouts.
- Initialization: The dataset is initialized with parameters such as the ratio of positive to negative samples, whether the dataset is for training or validation, and any transformations or filters to be applied to the data.
- Data Splitting: The dataset splits the candidates into nodules and non-nodules, as well as malignant and non-malignant categories. This allows for balanced sampling during training and validation.
- Data Loading: The
__getitem__method retrieves a specific cutout based on the index, applies any specified transformations and filters, and returns the processed data along with the corresponding labels.
The MalignantCutoutsDataset class extends the CutoutsDataset to specifically handle datasets focused on malignant nodules. It overrides methods to ensure that the dataset includes a balanced mix of malignant, benign, and non-nodule samples.
- Initialization: Inherits initialization parameters from
CutoutsDatasetand ensures that the dataset is tailored for malignant classification tasks. - Data Loading: The
__getitem__method is overridden to handle the specific requirements of malignant classification, ensuring that the correct labels are assigned to each cutout.
Both CutoutsDataset and MalignantCutoutsDataset utilize the cutout files generated by the CtCutoutService. The CtCutoutService processes CT scans to create small, focused cutouts around candidate nodules and saves these cutouts as .npz files. These files contain the CT chunk data and metadata such as the center coordinates of the cutout.
- File Loading: When a dataset instance retrieves a cutout, it loads the corresponding
.npzfile generated by theCtCutoutService. - Data Processing: The loaded data is then processed through any specified transformations and filters to prepare it for model training or validation.
TBD
Monitoring and logging are essential components of the Luna 16 project, providing visibility into the training process, model performance, and system metrics.
The BaseMessageHandler and MessageHandler classes in Luna 16 are used to log, monitor, and track metrics during the training and validation processes. These handlers facilitate communication between different components of the system through a structured messaging protocol.
The BaseMessageHandler is a protocol that defines the essential method handle_message which any message handler implementation should have. This method takes a message of type Message and a registry of type services.ServiceContainer. The protocol ensures that all message handlers adhere to a consistent interface, making it easier to manage and extend the logging and monitoring functionalities.
The MessageHandler class implements the BaseMessageHandler protocol and provides the concrete logic for handling messages. The handle_message method processes incoming messages by invoking the appropriate handlers based on the message type.
- Message Dataclasses: The project defines various message dataclasses in
messages.py, such asLogMetrics,LogStart,LogEpoch,LogBatchStart,LogBatch,LogBatchEnd,LogResult,LogParams, andLogModel. These dataclasses encapsulate different types of information that need to be logged and monitored during the training process. - Logging and Monitoring: When a specific event occurs (e.g., the start of training, the end of a batch, or the computation of metrics), an instance of the corresponding message dataclass is created and passed to the
handle_messagemethod of theMessageHandler. The handler then processes the message by invoking the appropriate registered handlers, which log the information to the desired logging and monitoring tools (e.g., MLFlow, TensorBoard). - Usage in Models: In the
BaseModelclasses, theMessageHandleris used to log metrics and results during training and validation. For example, thelog_metricsmethod creates instances ofLogMetricsandLogResultmessages and passes them to thehandle_messagemethod of theMessageHandler. This ensures that all relevant information is logged and monitored consistently throughout the training process.
By using the MessageHandler classes, Luna 16 ensures that logging and monitoring are modular, extensible, and easy to manage. The structured messaging protocol allows for clear and consistent communication between different components, facilitating effective tracking of metrics and system performance.
Luna 16 provides integrated dashboards for monitoring and visualizing training metrics, model performance, and system statistics. These dashboards leverage tools such as MLFlow and TensorBoard to provide real-time insights into the training process.
To install Luna16, you can use uv:
For CPU version:
uv sync --extra cpuFor CUDA version:
uv sync --extra cu124To run the tests, use the following command:
PYTHONPATH=. uv run pytest luna16/tests/Luna 16 provides a command-line interface (CLI) for running various tasks related to model training, evaluation, and experimentation.
TBD
PYTHONPATH=. uv run python -m luna16.cli create_cutoutsLuna 16 supports training models from scratch as well as continuing training from a saved state. The training process can be customized with parameters such as the number of epochs, batch size, and learning rate.
To train a model (set to version 0.0.1) from scratch, use the following command:
PYTHONPATH=. uv run python -m luna16.cli \
train_luna_classification \
0.0.1 \
--epochs=5 \
--batch-size=64The same variant can be used for malignant classification:
PYTHONPATH=. uv run python -m luna16.cli \
train_luna_malignant_classification \
0.0.1 \
--epochs=5 \
--batch-size=64To continue training a model (set to version 0.0.2) from a saved state (file loader will load classification model with version 0.0.1), use the following command:
PYTHONPATH=. uv run python -m luna16.cli \
load_train_luna_classification \
0.0.1 \
file \
classification \
0.0.1 \
--epochs=5 \
--batch-size=64The same variant can be used for malignant classification:
PYTHONPATH=. uv run python -m luna16.cli \
load_train_luna_malignant_classification \
0.0.1 \
file \
classification \
0.0.1 \
--epochs=5 \
--batch-size=64Luna 16 provides integrated monitoring and dashboards for tracking training metrics, model performance, and system statistics. The dashboards can be accessed through tools such as MLFlow and TensorBoard.
To start the MLFlow server and access the tracking UI, use the following command:
make mlflow-migrate
make mlflow-serverTo start the TensorBoard server and access the visualization dashboard, use the following command:
make tensorboard-classificationThe same variant can be used for malignant classification:
make tensorboard-malignant-classificationLuna 16 supports profiling the training process to optimize performance and resource utilization. The profiling tools can be used to identify bottlenecks, optimize hyperparameters, and improve training efficiency.
To profile the training process using PyTorch Profiler, use the following command:
PYTHONPATH=. uv run python -m luna16.cli \
train_luna_classification \
0.0.0-profile \
--epochs=10 \
--batch-size=64 \
--profileThe same variant can be used for malignant classification:
PYTHONPATH=. uv run python -m luna16.cli \
train_luna_malignant_classification \
0.0.0-profile \
--epochs=10 \
--batch-size=64 \
--profileHyperparameter tuning using Ray is not yet fully functional.
In this project, I can optimize several hyperparameters.
Here are some key hyperparameters:
- Learning Rate: The step size used by the optimizer to update the model weights.
- Batch Size: The number of samples processed before the model is updated.
- Number of Epochs: The number of times the entire dataset is passed through the model.
- Number of Convolutional Blocks (n_blocks): The number of LunaBlock layers in the model.
- Number of Convolutional Channels (conv_channels): The number of output channels for the convolutional layers.
- Dropout Rate: The probability of dropping out neurons during training to prevent overfitting.
- Optimizer Type: The type of optimizer used (e.g., Adam, SGD).
TBD: Write notes on experiment running on RunPod.
TBD
TBD
TBD: Mention Books and Papers
- Dive into Deep Learning
- Designing Machine Learning Systems
- Deep Learning with PyTorch