Integrating MLOps into Your Machine Learning Workflow
Chapter 1: Understanding MLOps
MLOps represents a vital engineering practice that merges the development of machine learning models (like training, hyperparameter tuning, and selection) with operational processes to ensure smooth and standardized deployment of these models in real-world applications. If you're just starting, you might find this concept a bit overwhelming. Here's a straightforward explanation:
MLOps comprises various engineering and operational tasks that make your machine learning models accessible to users and applications throughout your organization. In essence, it’s a framework that enables you to share your machine learning work online, helping to achieve specific business goals. While this is a simplified definition, it serves as a good introduction for newcomers.
This tutorial will walk you through each stage of the machine learning workflow, focusing particularly on experiment logging and model tracking.
Section 1.1: Identifying the Business Challenge
Machine learning provides solutions to specific business challenges. The problem-framing phase can vary significantly in duration, ranging from several days to weeks, depending on the complexity of the issue. During this stage, data scientists collaborate with subject matter experts (SMEs) to fully understand the problem, interview key stakeholders, gather relevant data, and establish project objectives.
For this tutorial, I will reference a case study from the Darden School of Business, published by Harvard Business Publishing. The scenario revolves around Greg and Sarah, who are planning to marry. Greg wants to find the perfect ring for Sarah and, on a friend's suggestion, decides to buy a diamond and let Sarah choose her preferred style. He collects data on 6,000 diamonds, including price and attributes such as cut, color, and shape.
Section 1.2: Data Acquisition
Once the problem has been defined, data is gathered through enterprise databases using the Extract, Transform, Load (ETL) process. ETL involves extracting data from various sources, transforming it into a suitable format, and loading it into a data warehouse or another consolidated repository.
# load the dataset from pycaret
from pycaret.datasets import get_data
data = get_data('diamond')
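For this tutorial the dataset ships with PyCaret, so the single get_data call above is all we need. In a real project, the ETL step described earlier might look more like the following sketch; the connection string, table names, and schema here are entirely hypothetical.
# hypothetical ETL sketch: extract from a source database, clean, and load into a warehouse schema
import pandas as pd
from sqlalchemy import create_engine
# extract: read raw records from a (hypothetical) transactional database
engine = create_engine('postgresql://user:password@host:5432/sales')
raw = pd.read_sql('SELECT * FROM diamond_transactions', engine)
# transform: basic cleaning, e.g. drop duplicates and rows without a price
clean = raw.drop_duplicates().dropna(subset=['Price'])
# load: write the cleaned table to a (hypothetical) analytics schema in the warehouse
clean.to_sql('diamonds_curated', engine, schema='analytics', if_exists='replace', index=False)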
Chapter 2: Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is where the initial data investigation takes place. The main goal of EDA is to assess the quality of the data, checking for missing values, outliers, and analyzing statistical characteristics like feature distribution and correlation.
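A few quick checks along these lines might look like the sketch below; I'm assuming Carat Weight and Price are the only numeric columns in this dataset.
# quick data quality checks: dimensions, missing values, summary statistics
print(data.shape)
print(data.isnull().sum())
print(data.describe())
# correlation between the numeric features (assumed to be Carat Weight and Price)
print(data[['Carat Weight', 'Price']].corr())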
# plot scatter plot for carat weight and price
import plotly.express as px
fig = px.scatter(x=data['Carat Weight'], y=data['Price'],
                 facet_col=data['Cut'], opacity=0.25, template='plotly_dark', trendline='ols',
                 trendline_color_override='red', title='SARAH GETS A DIAMOND - A CASE STUDY')
fig.show()
In this plot, you can observe that Price is right-skewed. We can quickly check whether a log transformation would bring the distribution closer to normal, which benefits algorithms that assume normality.
import numpy as np
# create a copy of data
data_copy = data.copy()
# create a new feature Log_Price
data_copy['Log_Price'] = np.log(data['Price'])
# plot histogram
fig = px.histogram(data_copy, x=["Log_Price"], title='Histogram of Log Price', template='plotly_dark')
fig.show()
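To put a number on the improvement, a quick check with pandas' built-in skew method (a minimal sketch) looks like this:
# compare the skewness of the raw and log-transformed target
print('Price skewness:', data_copy['Price'].skew())
print('Log_Price skewness:', data_copy['Log_Price'].skew())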
The results confirm our hypothesis: the transformation mitigates the skewness and brings the target variable closer to a normal distribution. We will therefore apply this transformation to the Price variable before model training; in PyCaret this is handled by the transform_target option in setup, used in the next section.
Section 2.1: Preparing the Data for Modeling
Now, we prepare the data for model training, which includes tasks like splitting the data into training and test sets, handling missing values, one-hot encoding, target encoding, feature engineering, and selection.
# initialize setup
from pycaret.regression import *
s = setup(data, target='Price', transform_target=True, log_experiment=True, experiment_name='diamond')
When you call the setup function in PyCaret, it analyzes the dataset and determines the data types of the input features. If everything is accurate, you can proceed by pressing enter.
I set log_experiment=True and experiment_name='diamond', which instructs PyCaret to automatically log all metrics, hyperparameters, and model artifacts as you move through the modeling process. This integration with MLflow makes it seamless.
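If you want to verify what setup produced before training anything, you can pull the transformed training data back out; this is a minimal sketch assuming PyCaret's get_config helper.
# inspect the transformed training data created by setup (assumes PyCaret's get_config helper)
from pycaret.regression import get_config
X_train = get_config('X_train')
y_train = get_config('y_train')
print(X_train.shape, y_train.shape)
X_train.head()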
Chapter 3: Model Training and Evaluation
This stage involves training numerous machine learning models, tuning hyperparameters, and evaluating performance metrics. The aim is to select the best model based on predetermined business metrics.
# compare all models
best = compare_models()
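compare_models trains and cross-validates the model library and returns the best performer. The hyperparameter tuning mentioned above is a separate, optional step; a minimal sketch with PyCaret's tune_model (by default a randomized search over a predefined grid) would be:
# optionally tune the hyperparameters of the selected model
# (randomized search over a predefined grid by default; may or may not beat the untuned model)
tuned = tune_model(best)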
Next, we can assess the residuals of the trained model:
# check the residuals of trained model
plot_model(best, plot='residuals_interactive')
To gain insights into feature importance, we can visualize it as well:
# check feature importance
plot_model(best, plot='feature')
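To quantify performance beyond these plots, you can also score the hold-out set that setup kept aside; as a sketch, calling predict_model without a data argument evaluates on that hold-out sample.
# evaluate the best model on the hold-out set created by setup
# (predict_model without a data argument scores the hold-out sample)
from pycaret.regression import predict_model
holdout = predict_model(best)
holdout.head()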
Chapter 4: Deployment and Monitoring
The final phase focuses primarily on MLOps. This includes tasks such as packaging the final model, creating a Docker image, writing the scoring script, and ultimately compiling everything into an API for making predictions on new data.
Historically, this process has been cumbersome and required extensive technical expertise, which is challenging to cover in a single tutorial. However, I'll demonstrate how to implement essential MLOps features using PyCaret, an open-source, low-code machine learning library in Python.
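For the packaging step mentioned above, a minimal sketch is to refit the winning pipeline on the full dataset and save it to disk; the file name 'diamond-pipeline' is just an example.
# finalize: retrain the best pipeline on the full dataset, then save it (writes diamond-pipeline.pkl)
final_model = finalize_model(best)
save_model(final_model, 'diamond-pipeline')
Because setup was called with log_experiment=True, every training run was also logged to MLflow, and you can browse those runs with the UI: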
# within notebook (notice the ! sign in front)
!mlflow ui
Now open your browser and navigate to http://localhost:5000 to access a UI with a table summarizing the training runs, including performance metrics and other metadata.
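If you prefer to query the logged runs programmatically instead of through the UI, a minimal sketch with the MLflow client looks like this; it assumes mlflow is installed (PyCaret uses it for logging), and the experiment name is the one passed to setup.
# query the logged runs programmatically
import mlflow
# look up the experiment created by setup(..., experiment_name='diamond')
exp = mlflow.get_experiment_by_name('diamond')
# one row per logged run, with params, metrics, and artifact URIs as columns
runs = mlflow.search_runs(experiment_ids=[exp.experiment_id])
print(runs.head())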
The paths to your logged models are available for loading them later:
# load model
from pycaret.regression import load_model
pipeline = load_model('C:/Users/moezs/mlruns/1/b8c10d259b294b28a3e233a9d2c209c0/artifacts/model/model')
# print pipeline
print(pipeline)
You can also prepare your data for prediction:
# create a copy of data and drop Price
data2 = data.copy()
data2.drop('Price', axis=1, inplace=True)
# generate predictions
from pycaret.regression import predict_model
predictions = predict_model(pipeline, data=data2)
predictions.head()
Congratulations on reaching this point! Your trained pipeline has generated predictions on the new data. All transformations, including the target transformation and missing value imputation, were handled automatically in the background.
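To close the loop on the scoring script and API mentioned in the deployment section, here is a minimal serving sketch. It assumes FastAPI and uvicorn are installed, that the pipeline was saved earlier as 'diamond-pipeline', and the endpoint name is purely illustrative.
# minimal scoring API sketch (FastAPI and the saved pipeline name are assumptions, not part of the case study)
import pandas as pd
from fastapi import FastAPI
from pycaret.regression import load_model, predict_model
app = FastAPI()
pipeline = load_model('diamond-pipeline')  # pipeline saved earlier with save_model
@app.post('/predict')
def predict(payload: dict):
    # wrap the incoming record in a one-row DataFrame and score it with the full pipeline
    input_df = pd.DataFrame([payload])
    preds = predict_model(pipeline, data=input_df)
    # the prediction is appended as the last column ('Label' in PyCaret 2.x, 'prediction_label' in 3.x)
    return {'predicted_price': float(preds.iloc[0, -1])}
# run from a terminal with: uvicorn app:app --reload
A Docker image wrapping this script together with the saved pipeline file is then what would typically be deployed.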
Thank you for reading!
About the Author
I specialize in writing about data science, machine learning, and PyCaret. If you'd like to stay updated, feel free to follow me on Medium, LinkedIn, and Twitter.
Chapter 5: MLOps in Action
In this section, we'll explore practical applications of MLOps, starting with a video that illustrates operationalizing your ML workflow using pipeline templates.
This video provides a hands-on demonstration of how to implement MLOps strategies effectively.
Chapter 6: Continuous Integration for ML
Next, we'll dive into continuous integration practices specific to machine learning.
This video serves as an introduction to the essential concepts of continuous integration tailored for machine learning projects.