diff --git a/manuscript/02-Ops_Overview.Rmd b/manuscript/02-Ops_Overview.Rmd index eaa5c1c..2db32e8 100644 --- a/manuscript/02-Ops_Overview.Rmd +++ b/manuscript/02-Ops_Overview.Rmd @@ -1,7 +1,165 @@
-# Ops Tools and Principles
+
+# Ops Tools & Principles
 
 MLOps integrates a range of DevOps techniques and tools to enhance the development and deployment of machine learning models. By promoting cooperation between development and operations teams, MLOps strives to improve communication, enhance efficiency, and reduce delays in the development process. Advanced version control systems can be employed to achieve these objectives.
 
 Automation plays a significant role in achieving these goals. For instance, CI/CD pipelines streamline repetitive tasks like building, testing, and deploying software. The management of infrastructure can also be automated, by using infrastructure as code to facilitate an automated provisioning, scaling, and management of infrastructure.
 
-To enhance flexibility and scalability in the operational process, containers and microservices are used to package and deploy software. Finally, monitoring and logging tools are used to track the performance of deployed and containerized software and address any issues that arise.
\ No newline at end of file
+To enhance flexibility and scalability in the operational process, containers and microservices are used to package and deploy software. Finally, monitoring and logging tools are used to track the performance of deployed and containerized software and address any issues that arise.
+
+
+## Containerization
+
+Containerization is an essential component in operations, as it enables deploying and running applications in a standardized, portable, and scalable way. This is achieved by packaging an application and its dependencies into a container image, which contains all the code, runtime, system tools, libraries, and settings needed to run the application, isolated from the host operating system. Containers are lightweight, portable, and can run on any platform that supports containerization, such as Docker or Kubernetes.
+
+All of this makes containers advantageous compared to deploying an application on a virtual machine or directly on a physical machine. Virtual machines emulate an entire computer system and require a hypervisor to run, which introduces additional overhead. A traditional deployment, in turn, involves installing software directly onto a physical or virtual machine without containers or virtualization. In addition, both approaches lack the portability that containers provide.
+
+![](./images/01-Introduction/ops-containerization.drawio.svg){ width=100% }
+
+The concept of container images is analogous to shipping containers in the physical world. Just as shipping containers can be loaded with different types of cargo, a container image can be used to create different containers with various applications and configurations. Both physical containers and container images are standardized, much like blueprints, enabling multiple operators to work with them. This allows applications to be deployed and managed in various environments and cloud platforms, making containerization a versatile solution.
+
+Containerization offers several benefits for MLOps teams. By packaging the machine learning application and its dependencies into a container image, reproducibility is achieved, ensuring consistent results across different environments and facilitating troubleshooting.
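+
+To make this packaging step more tangible, the following sketch uses the Docker SDK for Python to build an image from a project's Dockerfile and to start a container from it. It is a minimal illustration rather than part of this project: the build path, image tag, and port mapping are placeholders, and a Dockerfile plus a running local Docker daemon are assumed.
+
+\footnotesize
+```python
+import docker  # Docker SDK for Python, assumed to be installed via `pip install docker`
+
+client = docker.from_env()
+
+# Build an image from the Dockerfile in the current directory (tag is a placeholder)
+image, build_logs = client.images.build(path=".", tag="ml-app:0.1.0")
+
+# Start a container from the image and expose port 8000 of the application
+container = client.containers.run("ml-app:0.1.0", detach=True, ports={"8000/tcp": 8000})
+print(container.short_id)
+```
+\normalsize
+
+In day-to-day work, the same two steps are usually performed with the `docker build` and `docker run` commands or by a CI/CD pipeline, as shown later in this chapter.
+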
+Containers are also portable, which enables easy movement of machine learning applications between environments such as development, testing, and production. Scalability is another significant advantage of containerization, as compute resources can be scaled up or down with little effort to handle large-scale machine learning workloads and to adjust quickly to changing demand. Additionally, containerization enables version control of machine learning applications and their dependencies, making it easier to track changes, roll back to previous versions, and maintain consistency across different environments. To effectively manage model versions, simply saving the code into a version control system is insufficient. It is crucial to also include an accurate description of the environment, which encompasses Python libraries, their versions, system dependencies, and more. Virtual machines (VMs) can provide such a description, but container images have become the preferred industry standard due to their lightweight nature.
+Finally, containerization facilitates integration with other DevOps tools and processes, such as CI/CD pipelines, enhancing the efficiency and effectiveness of MLOps operations.
+
+
+## Version Control
+
+Version control is a system that records changes to a file or set of files over time so that specific versions can be recalled later. It is an essential tool for any software development project, as it allows multiple developers to work together, track changes, and easily roll back in case of errors. There are two main types of version control systems: centralized and distributed.
+
+1. *Centralized Version Control Systems (CVCS)*: In a centralized version control system, there is a single central repository that contains all the versions of the files, and developers must check out files from the repository in order to make changes. Examples of CVCS include Subversion and Perforce.
+
+2. *Distributed Version Control Systems (DVCS)*: In a distributed version control system, each developer has a local copy of the entire repository, including all the versions of the files. This allows developers to work offline, and it makes it easy to share changes with other developers. Examples of DVCS include Git, Mercurial, and Bazaar.
+
+Version control is a vital component of software development that offers several benefits. First, it keeps track of changes made to files, enabling developers to revert to a previous version in case something goes wrong. Collaboration is also made easier, as multiple developers can work on a project simultaneously and share changes with each other. In addition, version control provides backup capabilities by keeping a history of all changes, allowing lost files to be retrieved. It also allows auditing of changes, tracking who made a specific change, when, and why. Finally, it enables developers to create branches of a project, facilitating simultaneous work on different features without affecting the main project; the branches can be merged back later.
+
+Versioning all components of a machine learning project, such as code, data, and models, is essential for reproducibility and for managing models in production. While versioning code-based components is similar to typical software engineering projects, versioning machine learning models and data requires specific version control systems. There is no universal standard for versioning machine learning models, and the definition of "a model" can vary depending on the exact setup and tools used.
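+
+To make data versioning more concrete, the following sketch reads a dataset exactly as it existed at a given revision of a project that tracks its data with Data Version Control (DVC), one of the tools mentioned below. The repository URL, file path, and revision tag are placeholders, and the snippet assumes the `dvc` package is installed.
+
+\footnotesize
+```python
+import dvc.api  # Python API of Data Version Control (DVC)
+
+# Open the dataset as it was recorded at the Git revision "v1.2.0"
+# (repository URL, path, and revision are hypothetical)
+with dvc.api.open(
+    "data/train/images.csv",
+    repo="https://github.com/example-org/ml-project",
+    rev="v1.2.0",
+) as f:
+    header = f.readline()
+```
+\normalsize
+
+Pinning data to a revision in this way makes it possible to reproduce a training run long after the dataset itself has evolved.
+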
+
+Popular tools such as Azure ML, AWS SageMaker, Kubeflow, and MLflow offer their own mechanisms for model versioning. For data versioning, there are tools like Data Version Control (DVC), sketched above, and Git Large File Storage (LFS). The de facto standard for code versioning is Git. GitHub is used as the code versioning platform for this project and is described in more detail in the following.
+
+### GitHub
+
+GitHub provides a variety of branching options to enable flexible collaboration workflows. Each branch serves a specific purpose in the development process, and using them effectively helps teams collaborate more efficiently.
+
+![](./images/01-Introduction/ops-version-control.drawio.svg){ width=100% }
+
+*Main Branch:* The main branch is the default branch in a repository. It represents the latest stable, production-ready state of a codebase, and changes to the code are merged into the main branch as they are completed and tested.
+
+*Feature Branch:* A feature branch is used to develop a new feature or functionality. It is typically created off the main branch, and once the feature is completed, it can be merged back into the main branch.
+
+*Hotfix Branch:* A hotfix branch is used to quickly fix critical issues in the production code. It is typically created off the main branch, and once the hotfix is completed, it can be merged back into the main branch.
+
+*Release Branch:* A release branch is created specifically for preparing a new version of the software for release. Once all the features and fixes for the new release have been added and tested, the release branch is merged back into the main branch, and a new version of the software is tagged and released.
+
+### Git Lifecycle
+
+After programmers have made changes to their code, they typically use Git to manage those changes through a series of steps. First, they use the command `git status` to see which files have been changed and are ready to be committed. They then stage the changes they want to include in the commit using the command `git add <FILE>`, followed by creating a new commit with a message describing the changes using `git commit -m "MESSAGE"`.
+
+After committing changes locally, the programmer may want to share those changes with others. This is done by pushing the local commits to a remote repository using the command `git push`. Once the changes are pushed, others can pull them down to their local machines and continue working on the project using the command `git pull`.
+
+![](./images/01-Introduction/ops-git-commands.png){ width=100% }
+
+If the programmer is collaborating with others, they may need to merge their changes with changes made by others. This can be done using the `git merge <BRANCH>` command, which combines two branches of development history. The programmer may need to resolve any conflicts that arise during the merge.
+
+If the programmer encounters any issues or bugs after pushing their changes, they can use Git to revert to a previous version of the code by checking out an older commit with the command `git checkout <COMMIT>`. Git's ability to track changes and revert to previous versions makes it an essential tool for managing code in collaborative projects.
+
+While automating the code review process is generally viewed as advantageous, it is still typical to have a manual code review as the final step before a pull or merge request is approved and merged into the main branch.
+It is considered a best practice to mandate a manual approval from one or more reviewers who are not the authors of the code changes.
+
+
+## CI/CD
+
+Continuous Integration (CI) and Continuous Delivery / Continuous Deployment (CD) are related software development practices that work together to automate and streamline the development and deployment of code changes to production. Deploying new software and models without CI/CD often requires a lot of implicit knowledge and manual steps.
+
+![](./images/01-Introduction/ops-ci-cd.drawio.svg){ width=100% }
+
+1. *Continuous Integration (CI)* is a software development practice that involves frequently integrating code changes into a shared central repository. The goal of CI is to catch and fix integration errors as soon as they are introduced, rather than waiting for them to accumulate over time. This is typically done by running automated tests and builds to catch any errors that might have been introduced with new code changes, for example when merging a Git feature branch into the main branch.
+
+2. *Continuous Delivery (CD)* is the practice of automating the process of building, testing, and deploying software to a production-like environment. The goal is to ensure that code changes can be safely and quickly deployed to production. This is typically done by automating the deployment process and by testing the software in a staging environment before deploying it to production.
+
+3. *Continuous Deployment (CD)* is the practice of automatically deploying code changes to production once they pass automated tests and checks. The goal is to minimize the time it takes to get new features and bug fixes into the hands of end users. In this process, the software is delivered directly to the end user without manual testing and verification.
+
+The terms *Continuous Delivery* and *Continuous Deployment* are often used interchangeably, but they have distinct meanings. Continuous delivery refers to the process of building, testing, and running software on a production-like environment, while continuous deployment refers specifically to the process of running the new version on the production environment itself. However, fully automated deployments may not always be desirable or feasible, depending on the organization's business needs and the complexity of the software being deployed. While continuous deployment builds on continuous delivery, the latter can offer significant value on its own.
+
+CI/CD integrates the principles of continuous integration and continuous delivery in a seamless workflow, allowing teams to catch and fix issues early and to deliver new features to users quickly. The pipeline is often triggered by a code commit. Ideally, a Data Scientist would push the changes made to the code at each incremental step of development to a shared repository, including metadata and documentation. This code commit would trigger the CI/CD pipeline to build, test, package, and deploy the model software. In contrast to local development, the CI/CD steps test the model changes on the full dataset and aim to deploy the result to production.
+
+CI and CD practices help to increase the speed and quality of software development by automating repetitive tasks, catching errors early, reducing the time and effort required to release new features, and increasing the stability of the deployed software.
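+
+To make the automated testing in such a pipeline concrete, the snippet below sketches a minimal pytest unit test. It is a much-simplified stand-in for a file like the `test_app.py` executed in the pipeline shown further below; the function under test is illustrative and not part of this project.
+
+\footnotesize
+```python
+# test_app.py: a minimal unit test that a CI pipeline would run on every commit
+
+def normalize_pixels(values):
+    """Scale raw 8-bit pixel values to the range [0, 1]."""
+    return [v / 255.0 for v in values]
+
+def test_normalize_pixels_stays_in_range():
+    normalized = normalize_pixels([0, 128, 255])
+    assert min(normalized) >= 0.0
+    assert max(normalized) <= 1.0
+```
+\normalsize
+
+A build server runs such tests automatically on every commit or pull request, so a failing test blocks the faulty change before it reaches the main branch.
+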
+Examples of CI/CD tools that provide ready-to-use build servers for automated testing include GitHub Actions, GitLab CI/CD, AWS CodeBuild, and Azure DevOps.
+
+The following code snippet shows an exemplary GitHub Actions pipeline that tests, builds, and pushes a Docker image to the Docker Hub registry. The code is structured in three parts.
+At first, the environment variables are defined under `env`. Two variables are defined here, which are later referenced via `${{ env.VARIABLE }}`.
+The second part defines when the pipeline is triggered. The example shows three possibilities to trigger the pipeline: on a push to the master branch (`push`), when a pull request targeting the master branch is opened or updated (`pull_request`), or manually via the GitHub interface (`workflow_dispatch`).
+The third part introduces the actual jobs and steps performed by the pipeline. The pipeline consists of two jobs, `pytest` and `docker`. The first represents the CI part of the pipeline: the job's runtime environment is set up, the necessary requirements are installed, and the unit tests are run using the pytest library. If the `pytest` job succeeds, the `docker` job is triggered. This job builds the Docker image from the Dockerfile and automatically pushes it to the Docker Hub repository specified in `tags`. The step introduces another kind of variable alongside `env.VARIABLE`: `secrets.SECRET_NAME`. Secrets are GitHub's mechanism for safely storing sensitive information such as usernames and passwords. They can be set up in the GitHub interface and referenced in GitHub Actions via `${{ secrets.SECRET_NAME }}`.
+
+\footnotesize
+```yaml
+name: Docker CI base
+
+env:
+  DIRECTORY: base
+  DOCKERREPO: seblum/mlops-public
+
+on:
+  push:
+    branches: master
+    paths: $DIRECTORY/**
+  pull_request:
+    branches: [ master ]
+  workflow_dispatch:
+
+jobs:
+  pytest:
+    runs-on: ubuntu-latest
+    defaults:
+      run:
+        working-directory: ./${{ env.DIRECTORY }}
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.x'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+          pip install pytest
+          pip install pytest-cov
+      - name: Test with pytest
+        run: |
+          pytest test_app.py --doctest-modules --junitxml=junit/test-results.xml --cov=com --cov-report=xml --cov-report=html
+  docker:
+    needs: pytest
+    runs-on: ubuntu-latest
+    steps:
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v2
+      - name: Login to DockerHub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKERHUB_USERNAME }}
+          password: ${{ secrets.DOCKERHUB_TOKEN }}
+      - name: Build and push
+        uses: docker/build-push-action@v3
+        with:
+          file: ./${{ env.DIRECTORY }}/Dockerfile
+          push: true
+          tags: ${{ env.DOCKERREPO }}:${{ env.DIRECTORY }}
+```
+\normalsize
+
+## Infrastructure as Code
+
+Infrastructure as Code (IaC) is a software engineering approach that enables the automation of infrastructure provisioning and management using machine-readable configuration files rather than manual processes or interactive interfaces.
+
+This means that the infrastructure is defined using code, instead of manually setting up servers, networks, and other infrastructure components. This code can be version controlled, tested, and deployed just like any other software code.
It also allows to automate the process of building and deploying infrastructure resources, enabling faster and more reliable delivery of services, as well as ensuring to provide the same environment every time. It also comes with the benefit of an increased scalability, improved security, and better visibility into infrastructure changes. + +It is recommended to utilize infrastructure-as-code to deploy an ML platform. Popular tools for implementing IaC are for example Terraform, CloudFormation, and Ansible. Chapter 6 gives a more detailed description and a tutorial on how to use Infrastructure as code using *Terraform*. diff --git a/manuscript/09.2-Deployment-Usage_Pipeline-Workflow.Rmd b/manuscript/09.2-Deployment-Usage_Pipeline-Workflow.Rmd index 9fb0d2e..a0b1a54 100644 --- a/manuscript/09.2-Deployment-Usage_Pipeline-Workflow.Rmd +++ b/manuscript/09.2-Deployment-Usage_Pipeline-Workflow.Rmd @@ -121,7 +121,6 @@ model_params = { "pooling": "avg", # needed for resnet50 "verbose": 2, } - ``` \normalsize diff --git a/manuscript/09.3-Deployment-Usage_Training-Model-Pipeline.Rmd b/manuscript/09.3-Deployment-Usage_Training-Model-Pipeline.Rmd index 2be797f..14b263c 100644 --- a/manuscript/09.3-Deployment-Usage_Training-Model-Pipeline.Rmd +++ b/manuscript/09.3-Deployment-Usage_Training-Model-Pipeline.Rmd @@ -1,4 +1,354 @@ -## Pipeline Workflow Steps +## Training Pipeline Steps As mentioned previously, the machine learning pipeline for this particular use case comprises three primary stages: preprocessing, training, and serving. Furthermore, only the model that achieves the highest accuracy is chosen for deployment, which introduces an additional step for model comparison. Each of these steps will be further explained in the upcoming sections. + +### Data Preprocessing + +The data processing stage involves three primary processes. First, the raw data is loaded from an S3 Bucket. Second, the data is preprocessed and converted into the required format. Finally, the preprocessed data is stored in a way that allows it to be utilized by subsequent models. The data processing functionality is implemented within the given `data_preprocessing` function. The `utils` module, imported at the beginning, provides the functionality to access, load, and store data from S3. The data is normalized and transformed into a NumPy array to make it compatible with TensorFlow Keras models. The function returns the names and paths of the preprocessed and uploaded data, making it convenient for selecting them for future model training. Moreover, the data preprocessing stage establishes a connection with MLflow to record the sizes of the datasets. + +#### Importing Required Libraries {.unlisted .unnumbered} +The following code imports the necessary libraries and modules required for the code execution. It includes libraries for handling file operations, data manipulation, machine learning, progress tracking, as well as custom modules. + +\footnotesize +```python +# Imports necessary packages +import os +from datetime import datetime +from typing import Tuple + +import mlflow +import numpy as np +from keras.utils.np_utils import to_categorical +from sklearn.utils import shuffle +from src.utils import AWSSession, timeit +from tqdm import tqdm + +# Import custom modules +from src.utils import AWSSession +``` +\normalsize + +#### Data Preprocessing Function Definition {.unlisted .unnumbered} +At first, the `data_preprocessing` function is defined, which performs the data preprocessing steps. 
The function takes three arguments: `mlflow_experiment_id` (the MLflow experiment ID for logging), aws_bucket (the S3 bucket for reading raw data and storing preprocessed data), and path_preprocessed (the subdirectory for storing preprocessed data, with a default value of "preprocessed"). The function returns a tuple of four strings representing the paths of the preprocessed data. + +\footnotesize +```python +@timeit +def data_preprocessing( + mlflow_experiment_id: str, + aws_bucket: str, + path_preprocessed: str = "preprocessed", +) -> Tuple[str, str, str, str]: + """Preprocesses data for further use within model training. Raw data is read from given S3 Bucket, normalized, and stored ad a NumPy Array within S3 again. Output directory is on "/preprocessed". The shape of the data set is logged to MLflow. + + Args: + mlflow_experiment_id (str): Experiment ID of the MLflow run to log data + aws_bucket (str): S3 Bucket to read raw data from and write preprocessed data + path_preprocessed (str, optional): Subdirectory to store the preprocessed data on the provided S3 Bucket. Defaults to "preprocessed". + + Returns: + Tuple[str, str, str, str]: Four strings denoting the path of the preprocessed data stored as NumPy Arrays: X_train_data_path, y_train_data_path, X_test_data_path, y_test_data_path + """ +``` +\normalsize + +#### Setting MLflow Tracking URI and AWS Session {.unlisted .unnumbered} +Afterward, the MLflow tracking URI is set and an AWS session created using the AWS Access Key obtained from the environment variables and using the custom class `AWSSession()`. + +\footnotesize +```python + mlflow_tracking_uri = os.getenv("MLFLOW_TRACKING_URI") + mlflow.set_tracking_uri(mlflow_tracking_uri) + + # Instantiate aws session based on AWS Access Key + # AWS Access Key is fetched within AWS Session by os.getenv + aws_session = AWSSession() + aws_session.set_sessions() +``` +\normalsize + +#### Setting Paths and Helper Functions {.unlisted .unnumbered} +The paths for storing raw and preprocessed data within the S3 bucket are defined in a next step. As well as the helper functions `_load_and_convert_images`, `_create_label` and `_merge_data`. The `_load_and_convert_images` function loads and converts images from an S3 bucket folder into a NumPy array. The `_create_label` function creates a label array for a given dataset, while the `_merge_data` function merges two datasets into a single dataset. + +\footnotesize +```python + # Set paths within s3 + path_raw_data = f"s3://{aws_bucket}/data/" + + folder_benign_train = f"{path_raw_data}train/benign" + folder_malignant_train = f"{path_raw_data}train/malignant" + + folder_benign_test = f"{path_raw_data}test/benign" + folder_malignant_test = f"{path_raw_data}test/malignant" + + # Inner helper functions to load the data to a NumPy Array, create labels, and merge Array + @timeit + def _load_and_convert_images(folder_path: str) -> np.array: + ims = [ + aws_session.read_image_from_s3(s3_bucket=aws_bucket, imname=filename) + for filename in tqdm(aws_session.list_files_in_bucket(folder_path)) + ] + return np.array(ims, dtype="uint8") + + def _create_label(x_dataset: np.array) -> np.array: + return np.zeros(x_dataset.shape[0]) + + def _merge_data(set_one: np.array, set_two: np.array) -> np.array: + return np.concatenate((set_one, set_two), axis=0) +``` +\normalsize + +#### Preprocessing Steps and MLflow Logging {.unlisted .unnumbered} +This section performs the main preprocessing steps. 
It loads images from the S3 bucket, creates labels, merges data, shuffles the data, performs data normalization, and uploads the preprocessed data as NumPy arrays to the S3 bucket. The MLflow logging is also performed, recording the sizes of the training and testing data. + +\footnotesize +```python + # Start a MLflow run to log the size of the data + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + with mlflow.start_run(experiment_id=mlflow_experiment_id, run_name=f"{timestamp}_Preprocessing") as run: + print("\n> Loading images from S3...") + # Load in training pictures + X_benign = _load_and_convert_images(folder_benign_train) + X_malignant = _load_and_convert_images(folder_malignant_train) + + # Load in testing pictures + X_benign_test = _load_and_convert_images(folder_benign_test) + X_malignant_test = _load_and_convert_images(folder_malignant_test) + + # Log train-test size in MLflow + print("\n> Log data parameters") + mlflow.log_param("train_size_benign", X_benign.shape[0]) + mlflow.log_param("train_size_malignant", X_malignant.shape[0]) + mlflow.log_param("test_size_benign", X_benign_test.shape[0]) + mlflow.log_param("test_size_malignant", X_malignant_test.shape[0]) + + print("\n> Preprocessing...") + # Create labels + y_benign = _create_label(X_benign) + y_malignant = _create_label(X_malignant) + + y_benign_test = _create_label(X_benign_test) + y_malignant_test = _create_label(X_malignant_test) + + # Merge data + y_train = _merge_data(y_benign, y_malignant) + y_test = _merge_data(y_benign_test, y_malignant_test) + + X_train = _merge_data(X_benign, X_malignant) + X_test = _merge_data(X_benign_test, X_malignant_test) + + # Shuffle data + X_train, y_train = shuffle(X_train, y_train) + X_test, y_test = shuffle(X_test, y_test) + + y_train = to_categorical(y_train, num_classes=2) + y_test = to_categorical(y_test, num_classes=2) + + # With data augmentation to prevent overfitting + X_train = X_train / 255.0 + X_test = X_test / 255.0 +``` +\normalsize + +#### Uploading preprocessed data {.unlisted .unnumbered} +The four preprocessed numpy arrays (X_train, y_train, X_test, y_test) are uploaded to an S3 bucket. The arrays are stored as pickle files with specific file keys in the bucket. Finally, the paths of the preprocessed data are create and and returned as a tuple of strings. + +\footnotesize +```python + print("\n> Upload numpy arrays to S3...") + aws_session.upload_npy_to_s3( + data=X_train, + s3_bucket=aws_bucket, + file_key=f"{path_preprocessed}/X_train.pkl", + ) + aws_session.upload_npy_to_s3( + data=y_train, + s3_bucket=aws_bucket, + file_key=f"{path_preprocessed}/y_train.pkl", + ) + aws_session.upload_npy_to_s3( + data=X_test, + s3_bucket=aws_bucket, + file_key=f"{path_preprocessed}/X_test.pkl", + ) + aws_session.upload_npy_to_s3( + data=y_test, + s3_bucket=aws_bucket, + file_key=f"{path_preprocessed}/y_test.pkl", + ) + + X_train_data_path = f"{path_preprocessed}/X_train.pkl" + y_train_data_path = f"{path_preprocessed}/y_train.pkl" + X_test_data_path = f"{path_preprocessed}/X_test.pkl" + y_test_data_path = f"{path_preprocessed}/y_test.pkl" + + # Return directory paths of the data stored in S3 + return X_train_data_path, y_train_data_path, X_test_data_path, y_test_data_path +``` +\normalsize + +### Model Training + +The training step is designed to accommodate different models based on the selected model. The custom `model.utils` package, imported at the beginning, enables the selection and retrieval of models. 
The chosen model can be specified by passing its name to the `get_model` function, which then returns the corresponding model. These models are implemented using TensorFlow Keras and their code is stored in the `/model` directory. The model is trained using the `model_params` parameters provided to the training function, which include all the necessary hyperparameters. The training and evaluation are conducted using the preprocessed data from the previous step, which is downloaded from S3 at the beginning. Depending on the selected model, a KFold cross-validation is performed to improve the model's fit. + +MLflow is utilized to track the model's progress. By invoking `mlflow.start_run()`, a new MLflow run is initiated. The `model_params` are logged using `mlflow.log_params`, and MLflow autolog is enabled for Keras models through `mlflow.keras.autolog()`. After successful training, the models are stored in the model registry. The trained model is logged using `mlflow.keras.register_model`, with the specified `model_name` as the destination. + +The function returns the MLflow run ID and crucial information about the model, such as its name, version, and stage. + +#### Importing Dependencies {.unlisted .unnumbered} +This section imports the necessary dependencies for the code, including libraries for machine learning, data manipulation, and utility functions. + +\footnotesize +```python +# Imports necessary packages +import json +import os +from datetime import datetime +from enum import Enum +from typing import Tuple + +import mlflow +import mlflow.keras +import numpy as np +from keras import backend as K +from keras.callbacks import ReduceLROnPlateau +from sklearn.metrics import accuracy_score +from sklearn.model_selection import KFold + +# Import custom modules +from src.model.utils import Model_Class, get_model +from src.utils import AWSSession +``` +\normalsize + +#### Defining the *train_model* Function {.unlisted .unnumbered} +The actual code starts by defining the train_model function, which takes several parameters for training a machine learning model, logging the results to MLflow, and returning relevant information. The MLflow tracking URI is retrieved from the environment variable and sets it as the tracking URI for MLflow. + +\footnotesize +```python +def train_model( + mlflow_experiment_id: str, + model_class: Enum, + model_params: dict, + aws_bucket: str, + import_dict: dict = {}, +) -> Tuple[str, str, int, str]: + """ + Trains a machine learning model and logs the results to MLflow. + + Args: + mlflow_experiment_id (str): The ID of the MLflow experiment to log the results. + model_class (Enum): The class of the model to train. + model_params (dict): A dictionary containing the parameters for the model. + aws_bucket (str): The AWS S3 bucket name for data storage. + import_dict (dict, optional): A dictionary containing paths for importing data. Defaults to {}. + + Returns: + Tuple[str, str, int, str]: A tuple containing the run ID, model name, model version, and current stage. + + Raises: + None + """ + mlflow_tracking_uri = os.getenv("MLFLOW_TRACKING_URI") + mlflow.set_tracking_uri(mlflow_tracking_uri) +``` +\normalsize + +#### Loading Data {.unlisted .unnumbered} +This section handles the loading of data required for training the model. It retrieves the file paths for the training and testing data from the `import_dict` parameter and loads the corresponding NumPy arrays from an AWS S3 bucket using the `AWSSession` class. 
+ +\footnotesize +```python + print("\n> Loading data...") + X_train_data_path = import_dict.get("X_train_data_path") + y_train_data_path = import_dict.get("y_train_data_path") + X_test_data_path = import_dict.get("X_test_data_path") + y_test_data_path = import_dict.get("y_test_data_path") + + # Instantiate aws session based on AWS Access Key + # AWS Access Key is fetched within AWS Session by os.getenv + aws_session = AWSSession() + aws_session.set_sessions() + + # Read NumPy Arrays from S3 + X_train = aws_session.download_npy_from_s3(s3_bucket=aws_bucket, file_key=X_train_data_path) + y_train = aws_session.download_npy_from_s3(s3_bucket=aws_bucket, file_key=y_train_data_path) + X_test = aws_session.download_npy_from_s3(s3_bucket=aws_bucket, file_key=X_test_data_path) + y_test = aws_session.download_npy_from_s3(s3_bucket=aws_bucket, file_key=y_test_data_path) +``` +\normalsize + +#### Training the Model {.unlisted .unnumbered} +After the data is loaded, the training process for the machine learning model is started. It begins by printing the model class and generating a timestamp for the run name. Then, it starts an MLflow run with the specified experiment ID and run name. The model parameters are logged using MLflow's log_params function. Additionally, a callback for reducing the learning rate during training is configured using the ReduceLROnPlateau class from Keras. + +The model training handles two different scenarios based on the selected `model_class`. If it is set to cross-validation (`Model_Class.CrossVal`), the model is trained using cross-validation. Otherwise, it is trained using the specified model class. + +\footnotesize +```python + print("\n> Training model...") + print(model_class) + timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") + with mlflow.start_run(experiment_id=mlflow_experiment_id, run_name=f"{timestamp}-{model_class}") as run: + mlflow.log_params(model_params) + learning_rate_reduction = ReduceLROnPlateau(monitor="accuracy", patience=5, verbose=1, factor=0.5, min_lr=1e-7) + + # If CrossVal is selected, train BasicNet as Cross-Validated Model + if model_class == Model_Class.CrossVal.value: + kfold = KFold(n_splits=3, shuffle=True, random_state=11) + cvscores = [] + for train, test in kfold.split(X_train, y_train): + model = get_model(Model_Class.Basic.value, model_params) + + # Train Model + model.fit( + X_train[train], + y_train[train], + epochs=model_params.get("epochs"), + batch_size=model_params.get("batch_size"), + verbose=model_params.get("verbose"), + ) + scores = model.evaluate(X_train[test], y_train[test], verbose=0) + print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100)) + cvscores.append(scores[1] * 100) + K.clear_session() + else: + model = get_model(model_class, model_params) + mlflow.keras.autolog() + + # Train Model + model.fit( + X_train, + y_train, + validation_split=model_params.get("validation_split"), + epochs=model_params.get("epochs"), + batch_size=model_params.get("batch_size"), + verbose=model_params.get("verbose"), + callbacks=[learning_rate_reduction], + ) + mlflow.keras.autolog(disable=True) +``` +\normalsize + +#### Testing and Evaluating the Model {.unlisted .unnumbered} +After the model training, the trained model is tested on the test data and its prediction accuracy evaluated. The accuracy score is calculated using the `accuracy_score` function from scikit-learn and logged as a metric using MLflow. Afterward, the trained and evaluated model is registered with MLflow using the `register_model` function. 
The resulting model name, version, and stage are obtained to finally return them in the functions `return` statement. + +\footnotesize +```python + run_id = run.info.run_id + model_uri = f"runs:/{run_id}/{model_class}" + + # Testing model on test data to evaluate + print("\n> Testing model...") + y_pred = model.predict(X_test) + prediction_accuracy = accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_pred, axis=1)) + mlflow.log_metric("prediction_accuracy", prediction_accuracy) + print(f"Prediction Accuracy: {prediction_accuracy}") + + print("\n> Register model...") + mv = mlflow.register_model(model_uri, model_class) + + # Return run ID, model name, model version, and current stage of the model + return run_id, mv.name, mv.version, mv.current_stage +``` +\normalsize \ No newline at end of file diff --git a/manuscript/09.5-Deployment-Usage_Model-Inferencing.Rmd b/manuscript/09.5-Deployment-Usage_Model-Inferencing.Rmd new file mode 100644 index 0000000..2c96283 --- /dev/null +++ b/manuscript/09.5-Deployment-Usage_Model-Inferencing.Rmd @@ -0,0 +1,82 @@ + +## Model Inferencing + +The process of serving and making inferences utilizes Docker containers and runs them within Kubernetes pods. + +The concept involves running a Docker container that serves the pre-trained TensorFlow model using FastAPI. This containerized model is responsible for providing predictions and responses to incoming requests. Additionally, a Streamlit app is used to interact with the served model, enabling users to make inferences by sending input data to the model and receiving the corresponding predictions. + +### Streamlit App + +The Streamlit app offers a simple interface for performing inferences on the served model. The user interface enables users to upload a `jpg` image. Upon clicking the `predict` button, the image is sent to the model serving app, where a prediction is made. The prediction results are then returned as a JSON file, which can be downloaded upon request. + +**Importing Dependencies** +This section imports the necessary dependencies for the code, including libraries for file handling, JSON processing, working with images, making HTTP requests, and creating the Streamlit application. + +```python +# Imports necessary packages +import io +import json +import os + +import pandas as pd +import requests +import streamlit as st +from PIL import Image + +``` + +#### Setting Up the Streamlit Application {.unlisted .unnumbered} +At first, the header and subheader for the Streamlit application are set. Afterward, the FastAPI serving IP and port are retrieved from environment variables. They constructs the FastAPI endpoint URL and are later used to send a POST request to. + +```python +st.header("MLOps Engineering Project") +st.subheader("Skin Cancer Detection") + +# FastAPI endpoint +FASTAPI_SERVING_IP = os.getenv("FASTAPI_SERVING_IP") +FASTAPI_SERVING_PORT = os.getenv("FASTAPI_SERVING_PORT") +FASTAPI_ENDPOINT = f"http://{FASTAPI_SERVING_IP}:{FASTAPI_SERVING_PORT}/predict" + +``` + +#### Uploading test image {.unlisted .unnumbered} +The `st.file_uploader` allows the user to upload a test image in JPG format using the Streamlit file uploader widget. The type of the uploaded file is limited to `.jpg`. If a test image has been uploaded, the image is processed by opening it with PIL and creating a file-like object. 
+ + +```python +test_image = st.file_uploader("", type=["jpg"], accept_multiple_files=False) + +if test_image: + image = Image.open(test_image) + image_file = io.BytesIO(test_image.getvalue()) + files = {"file": image_file} + +``` + +#### Displaying the uploaded image and performing prediction {.unlisted .unnumbered} +A two-column layout in the Streamlit app is created That displays the uploaded image in the first column. In the second columns, a button for the user to start the prediction process is displayed. When the button is clicked, it sends a POST request to the FastAPI endpoint with the uploaded image file. The prediction results are displayed as JSON and can be downloaded as a JSON file. + +```python + col1, col2 = st.columns(2) + + with col1: + # Display the uploaded image in the first column + st.image(test_image, caption="", use_column_width="always") + + with col2: + if st.button("Start Prediction"): + with st.spinner("Prediction in Progress. Please Wait..."): + # Send a POST request to FastAPI for prediction + output = requests.post(FASTAPI_ENDPOINT, files=files, timeout=8000) + st.success("Success! Click the Download button below to retrieve prediction results (JSON format)") + # Display the prediction results in JSON format + st.json(output.json()) + # Add a download button to download the prediction results as a JSON file + st.download_button( + label="Download", + data=json.dumps(output.json()), + file_name="cnn_skin_cancer_prediction_results.json", + ) + +``` + diff --git a/manuscript/99-Glossary.Rmd b/manuscript/99-Glossary.Rmd index 34c3ea2..610585b 100644 --- a/manuscript/99-Glossary.Rmd +++ b/manuscript/99-Glossary.Rmd @@ -62,4 +62,4 @@ VPC : Virtual Private Cloud - A virtual network service that enables users to create isolated and customizable network environments within their cloud infrastructure. VM -: Virtual Machine - An emulation of a computer system that allows multiple operating systems to run on a single physical machine, providing isolation and flexibility for various tasks. \ No newline at end of file +: Virtual Machine - An emulation of a computer system that allows multiple operating systems to run on a single physical machine, providing isolation and flexibility for various tasks.