Machine Learning Classification: An End-to-End Guide with Scikit-Learn
In this article, we will discuss a complete example of tackling a classification problem with the help of Scikit-Learn, Pandas, NumPy, and Matplotlib. Previous discussions have introduced these libraries, and we will now leverage a real-world dataset from Kaggle.
We'll follow the outlined steps of our machine learning workflow. For more detailed information, please refer to the blog linked below.
Understanding the Machine Learning Project
Let’s delve into the framework for implementing machine learning across various contexts. Typically, a solid understanding of the project scope is essential.
Primary Data Analysis
The first step involves loading the dataset and conducting an analysis to familiarize ourselves with its structure. Below is an example of how to accomplish this.
```python
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Data preprocessing
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay, roc_curve, roc_auc_score, auc

# Load the dataset
heart_risk_load_df = pd.read_csv('data/heart_attack_prediction_dataset.csv')
heart_risk_load_df.shape
heart_risk_load_df.sample(5)

# Class distribution
heart_risk_load_df['Heart Attack Risk'].value_counts()
heart_risk_load_df.info()
```
Next, we will proceed to exploratory data analysis (EDA) to determine the appropriate actions for the dataset.
Exploratory Data Analysis (EDA)
In this phase, we will generate several plots and graphs to better understand the data. This will guide us in deciding which features to transform or exclude. By the end of our EDA, we should have a clear strategy for data and feature management.
```python
# Crosstab plot of Heart Attack Risk by gender
pd.crosstab(heart_risk_load_df['Heart Attack Risk'],
            heart_risk_load_df['Sex']).plot(kind='bar',
                                            figsize=(10, 6),
                                            color=['salmon', 'lightblue'])
plt.title('Heart Disease Risk by Gender')
plt.xlabel('0 = No Heart Disease Risk, 1 = Heart Disease Risk')
plt.ylabel('Number of Patients')
plt.xticks(rotation=0);
```
Continuing with EDA, we will create additional visualizations to analyze the dataset's features and their distributions.
```python
heart_risk_load_df['Age'].plot.hist(bins=20);
```
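Beyond individual histograms, it can help to rank numeric features by how strongly they correlate with the target. The sketch below uses a small toy frame in place of the Kaggle data; the column names (`Age`, `Cholesterol`) are assumptions for illustration:

```python
import pandas as pd

# Toy stand-in for the Kaggle frame; column names are illustrative
toy_df = pd.DataFrame({
    'Age': [25, 60, 45, 70, 33, 52],
    'Cholesterol': [180, 260, 220, 300, 190, 240],
    'Heart Attack Risk': [0, 1, 0, 1, 0, 1],
})

# Correlation of each numeric feature with the target, strongest first
target_corr = (toy_df.corr(numeric_only=True)['Heart Attack Risk']
               .drop('Heart Attack Risk')
               .sort_values(key=abs, ascending=False))
print(target_corr)
```

On the real dataset, the same pattern applied to `heart_risk_load_df` gives a quick shortlist of candidate features before plotting each one.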
Based on our analyses, we can deduce that certain features, such as patient ID, country, and continent, may not significantly affect heart risk. Therefore, we can consider removing these features from our dataset.
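Dropping those identifier-style columns is a one-liner with `DataFrame.drop`. A minimal sketch on a toy frame (the exact column names in the Kaggle file are assumed here):

```python
import pandas as pd

# Toy frame mirroring the identifier-style columns discussed above
toy_df = pd.DataFrame({
    'Patient ID': ['P1', 'P2'],
    'Country': ['US', 'IN'],
    'Continent': ['NA', 'Asia'],
    'Age': [40, 55],
})

# errors='ignore' keeps the call robust if a column is already absent
cleaned_df = toy_df.drop(columns=['Patient ID', 'Country', 'Continent'],
                         errors='ignore')
print(cleaned_df.columns.tolist())  # ['Age']
```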
Data Preparation
With the previous steps complete, we can now focus on cleaning the data.
```python
def group_age(age):
    """
    Groups people into age categories.

    age - age of the person, integer

    Returns groups:
        Baby - 0-2
        Young Adult - 3-39
        Middle-aged Adult - 40-59
        Senior - 60-99
    """
    if age <= 2:
        return 'Baby'
    elif age < 40:
        return 'Young Adult'
    elif age < 60:
        return 'Middle-aged Adult'
    else:
        return 'Senior'

heart_risk_load_analyzed_df['Age Group'] = heart_risk_load_analyzed_df['Age'].apply(group_age)
heart_risk_load_analyzed_df.drop(columns=['Age'], inplace=True)
```
Additionally, we will implement functions to categorize physical activity, income, and BMI.
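Those functions follow the same shape as `group_age`. As one example, a BMI grouping function might use the standard WHO cut-offs (the exact thresholds the article uses are not shown, so treat these as an assumption):

```python
def group_bmi(bmi):
    """Groups BMI into WHO-style categories; thresholds are illustrative."""
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

print(group_bmi(22.0))  # Normal
print(group_bmi(31.4))  # Obese
```

Applying it mirrors the age step: `df['BMI Group'] = df['BMI'].apply(group_bmi)` followed by dropping the raw column.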
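With the categorical groupings in place, the encoding, splitting, and modeling steps that the imports above anticipate follow naturally. Here is a minimal sketch on a toy frame (column names and the choice of `RandomForestClassifier` with one-hot encoding via `pd.get_dummies` are illustrative, not the article's final configuration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy frame standing in for the prepared dataset
toy_df = pd.DataFrame({
    'Age Group': ['Young Adult', 'Senior', 'Middle-aged Adult'] * 10,
    'Cholesterol': [190, 280, 230] * 10,
    'Heart Attack Risk': [0, 1, 1] * 10,
})

# One-hot encode the categorical column, separate features from target
X = pd.get_dummies(toy_df.drop(columns=['Heart Attack Risk']))
y = toy_df['Heart Attack Risk']

# Stratified split keeps the class balance in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print(f'Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}')
```

On the real dataset, the same pattern extends to `LogisticRegression` and `KNeighborsClassifier`, with `cross_val_score` or `GridSearchCV` used to compare and tune them.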