Integrating MLOps into Your Machine Learning Workflow
Chapter 1: Understanding MLOps
MLOps represents a vital engineering practice that merges the development of machine learning models (like training, hyperparameter tuning, and selection) with operational processes to ensure smooth and standardized deployment of these models in real-world applications. If you're just starting, you might find this concept a bit overwhelming. Here's a straightforward explanation:
MLOps comprises various engineering and operational tasks that make your machine learning models accessible to users and applications throughout your organization. In essence, it’s a framework that enables you to share your machine learning work online, helping to achieve specific business goals. While this is a simplified definition, it serves as a good introduction for newcomers.
This tutorial will walk you through each stage of the machine learning workflow, focusing particularly on experiment logging and model tracking.
Section 1.1: Identifying the Business Challenge
Machine learning provides solutions to specific business challenges. The problem-framing phase can vary significantly in duration, ranging from several days to weeks, depending on the complexity of the issue. During this stage, data scientists collaborate with subject matter experts (SMEs) to fully understand the problem, interview key stakeholders, gather relevant data, and establish project objectives.
For this tutorial, I will reference a case study from the Darden School of Business, published by Harvard Business Publishing. The scenario revolves around Greg and Sarah, who are planning to marry. Greg wants to find the perfect ring for Sarah and, on a friend's suggestion, decides to buy a diamond and let Sarah choose her preferred style. He collects data on 6,000 diamonds, including price and attributes such as cut, color, and shape.
Section 1.2: Data Acquisition
Once the problem has been defined, data is gathered through enterprise databases using the Extract, Transform, Load (ETL) process. ETL involves extracting data from various sources, transforming it into a suitable format, and loading it into a data warehouse or another consolidated repository.
# load the dataset from pycaret
from pycaret.datasets import get_data
data = get_data('diamond')
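For this tutorial the dataset ships with PyCaret, so the single get_data call above is all we need. In a real project, the ETL step described earlier might look more like the following sketch; the connection string, table names, and schema here are entirely hypothetical.
# hypothetical ETL sketch: extract from a source database, clean, and load into a warehouse schema
import pandas as pd
from sqlalchemy import create_engine
# extract: read raw records from a (hypothetical) transactional database
engine = create_engine('postgresql://user:password@host:5432/sales')
raw = pd.read_sql('SELECT * FROM diamond_transactions', engine)
# transform: basic cleaning, e.g. drop duplicates and rows without a price
clean = raw.drop_duplicates().dropna(subset=['Price'])
# load: write the cleaned table to a (hypothetical) analytics schema in the warehouse
clean.to_sql('diamonds_curated', engine, schema='analytics', if_exists='replace', index=False)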
Chapter 2: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is where the initial data investigation takes place. The main goal of EDA is to assess the quality of the data, checking for missing values, outliers, and analyzing statistical characteristics like feature distribution and correlation.
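A few quick checks along these lines might look like the sketch below; I'm assuming Carat Weight and Price are the only numeric columns in this dataset.
# quick data quality checks: dimensions, missing values, summary statistics
print(data.shape)
print(data.isnull().sum())
print(data.describe())
# correlation between the numeric features (assumed to be Carat Weight and Price)
print(data[['Carat Weight', 'Price']].corr())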
# plot scatter plot for carat weight and price
import plotly.express as px
fig = px.scatter(x=data['Carat Weight'], y=data['Price'],
                 facet_col=data['Cut'], opacity=0.25, template='plotly_dark', trendline='ols',
                 trendline_color_override='red', title='SARAH GETS A DIAMOND - A CASE STUDY')
fig.show()
In this plot, you can observe that Price is right-skewed. We can quickly check whether a log transformation would bring the distribution closer to normal, which benefits algorithms that assume normality.
import numpy as np
# create a copy of data
data_copy = data.copy()
# create a new feature Log_Price
data_copy['Log_Price'] = np.log(data['Price'])
# plot histogram
fig = px.histogram(data_copy, x=["Log_Price"], title='Histogram of Log Price', template='plotly_dark')
fig.show()
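To put a number on the improvement, a quick check with pandas' built-in skew method (a minimal sketch) looks like this:
# compare the skewness of the raw and log-transformed target
print('Price skewness:', data_copy['Price'].skew())
print('Log_Price skewness:', data_copy['Log_Price'].skew())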
The results confirm our hypothesis: the transformation mitigates the skewness and brings the target variable closer to a normal distribution. We will therefore apply this transformation to the Price variable before model training; in PyCaret this is handled by the transform_target option in setup, used in the next section.
Section 2.1: Preparing the Data for Modeling
Now, we prepare the data for model training, which includes tasks like splitting the data into training and test sets, handling missing values, one-hot encoding, target encoding, feature engineering, and selection.
# initialize setup
from pycaret.regression import *
s = setup(data, target='Price', transform_target=True, log_experiment=True, experiment_name='diamond')
When you call the setup function in PyCaret, it analyzes the dataset and determines the data types of the input features. If everything is accurate, you can proceed by pressing enter.
I set log_experiment=True and experiment_name='diamond', which instructs PyCaret to automatically log all metrics, hyperparameters, and model artifacts as you move through the modeling process. This integration with MLflow makes it seamless.
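If you want to verify what setup produced before training anything, you can pull the transformed training data back out; this is a minimal sketch assuming PyCaret's get_config helper.
# inspect the transformed training data created by setup (assumes PyCaret's get_config helper)
from pycaret.regression import get_config
X_train = get_config('X_train')
y_train = get_config('y_train')
print(X_train.shape, y_train.shape)
X_train.head()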
Chapter 3: Model Training and Evaluation
This stage involves training numerous machine learning models, tuning hyperparameters, and evaluating performance metrics. The aim is to select the best model based on predetermined business metrics.
# compare all models
best = compare_models()
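compare_models trains and cross-validates the model library and returns the best performer. The hyperparameter tuning mentioned above is a separate, optional step; a minimal sketch with PyCaret's tune_model (by default a randomized search over a predefined grid) would be:
# optionally tune the hyperparameters of the selected model
# (randomized search over a predefined grid by default; may or may not beat the untuned model)
tuned = tune_model(best)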
Next, we can assess the residuals of the trained model:
# check the residuals of trained model
plot_model(best, plot='residuals_interactive')
To gain insights into feature importance, we can visualize it as well:
# check feature importance
plot_model(best, plot='feature')
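To quantify performance beyond these plots, you can also score the hold-out set that setup kept aside; as a sketch, calling predict_model without a data argument evaluates on that hold-out sample.
# evaluate the best model on the hold-out set created by setup
# (predict_model without a data argument scores the hold-out sample)
from pycaret.regression import predict_model
holdout = predict_model(best)
holdout.head()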
Chapter 4: Deployment and Monitoring
The final phase focuses primarily on MLOps. This includes tasks such as packaging the final model, creating a Docker image, writing the scoring script, and ultimately compiling everything into an API for making predictions on new data.
Historically, this process has been cumbersome and required extensive technical expertise, which is challenging to cover in a single tutorial. However, I'll demonstrate how to implement essential MLOps features using PyCaret, an open-source, low-code machine learning library in Python.
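For the packaging step mentioned above, a minimal sketch is to refit the winning pipeline on the full dataset and save it to disk; the file name 'diamond-pipeline' is just an example.
# finalize: retrain the best pipeline on the full dataset, then save it (writes diamond-pipeline.pkl)
final_model = finalize_model(best)
save_model(final_model, 'diamond-pipeline')
Because setup was called with log_experiment=True, every training run was also logged to MLflow, and you can browse those runs with the UI: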
# within notebook (notice the ! sign in front)
!mlflow ui
Now open your browser and navigate to http://localhost:5000 to access a UI with a table summarizing the training runs, including performance metrics and other metadata.
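If you prefer to query the logged runs programmatically instead of through the UI, a minimal sketch with the MLflow client looks like this; it assumes mlflow is installed (PyCaret uses it for logging), and the experiment name is the one passed to setup.
# query the logged runs programmatically
import mlflow
# look up the experiment created by setup(..., experiment_name='diamond')
exp = mlflow.get_experiment_by_name('diamond')
# one row per logged run, with params, metrics, and artifact URIs as columns
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
print(runs.head())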
The paths to your logged models are available for loading them later:
# load model
from pycaret.regression import load_model
pipeline = load_model('C:/Users/moezs/mlruns/1/b8c10d259b294b28a3e233a9d2c209c0/artifacts/model/model')
# print pipeline
print(pipeline)
You can also prepare your data for prediction:
# create a copy of data and drop Price
data2 = data.copy()
data2.drop('Price', axis=1, inplace=True)
# generate predictions
from pycaret.regression import predict_model
predictions = predict_model(pipeline, data=data2)
predictions.head()
Congratulations on reaching this point! Your trained pipeline has generated predictions on the new data. All transformations, including the target transformation and missing value imputation, were handled automatically in the background.
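To close the loop on the scoring script and API mentioned in the deployment section, here is a minimal serving sketch. It assumes FastAPI and uvicorn are installed, that the pipeline was saved earlier as 'diamond-pipeline', and the endpoint name is purely illustrative.
# minimal scoring API sketch (FastAPI and the saved pipeline name are assumptions, not part of the case study)
import pandas as pd
from fastapi import FastAPI
from pycaret.regression import load_model, predict_model
app = FastAPI()
pipeline = load_model('diamond-pipeline')  # pipeline saved earlier with save_model
@app.post('/predict')
def predict(payload: dict):
    # wrap the incoming record in a one-row DataFrame and score it with the full pipeline
    input_df = pd.DataFrame([payload])
    preds = predict_model(pipeline, data=input_df)
    # the prediction is appended as the last column ('Label' in PyCaret 2.x, 'prediction_label' in 3.x)
    return {'predicted_price': float(preds.iloc[0, -1])}
# run from a terminal with: uvicorn app:app --reload
A Docker image wrapping this script together with the saved pipeline file is then what would typically be deployed.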
Thank you for reading!
About the Author
I specialize in writing about data science, machine learning, and PyCaret. If you'd like to stay updated, feel free to follow me on Medium, LinkedIn, and Twitter.
Chapter 5: MLOps in Action
In this section, we'll explore practical applications of MLOps, starting with a video that illustrates operationalizing your ML workflow using pipeline templates.
This video provides a hands-on demonstration of how to implement MLOps strategies effectively.
Chapter 6: Continuous Integration for ML
Next, we'll dive into continuous integration practices specific to machine learning.
This video serves as an introduction to the essential concepts of continuous integration tailored for machine learning projects.