Data Cleaning and PCA Techniques with Python: A Comprehensive Guide
Data cleaning is a critical component of maintaining effective business operations. Often referred to as data cleansing or data wrangling, it represents a vital preliminary phase in the data analytics workflow. This essential process involves preparing and validating data, typically conducted prior to the main analysis.
A significant portion of this task focuses on identifying issues such as incomplete, inaccurate, irrelevant, corrupted, or improperly formatted data. Additionally, deduplication, or the removal of duplicate entries, is an integral part of this process.
High-Level Steps for Data Cleaning
In this section, we will examine practical strategies for effective data cleaning. There are various methods available to achieve these objectives:
- Eliminate unnecessary observations
- Correct structural errors
- Standardize the data
- Remove undesirable outliers
- Resolve conflicting data entries
- Address type conversion and syntax issues
- Handle missing values
- Validate the dataset
Accessing the GitHub Codebase
- Churn Raw Data File
- Data Cleaning Python Code.ipynb
Research Questions or Decisions
- How can we identify customers who are at a high risk of churn in the future?
- How can we identify high-value customers and offer them discounts to encourage retention?
Necessary Variables
- The churn dataset comprises 50 columns/variables; however, not all will be relevant for the questions mentioned above. Selecting the appropriate variables for analysis is crucial.
- "Churn" serves as the dependent variable, indicating whether customers continued using services or discontinued them in the past month.
- To predict high-risk churn customers, several significant independent variables are included in the dataset, influencing customers' decisions to stop using services.
Finding Anomalies
The following section outlines the key steps required to identify anomalies (often referred to as outliers) within the dataset. It is important to note that anomalies and outliers may be defined differently in various contexts.
- Download the churn raw dataset from the specified location to your local machine.
- Install and configure the Anaconda Navigator environment.
- Set up Jupyter Notebook/Lab and run sample code to confirm the environment is functioning correctly.
- Import necessary libraries such as pandas, SciPy, sklearn, and NumPy.
- Load the dataset into Python using pandas' read_csv() function, which makes importing CSV files straightforward.
- Analyze the data structure for a better understanding of the input data.
- There are various methods to detect outliers, including z-score, interquartile range, and standard deviation; in this case, standard deviation will be employed.
- Calculate the mean and standard deviation.
- For outlier detection, the mean and standard deviation of each numeric column are computed. A data point is flagged as an outlier if it deviates from the mean by more than a specified number of standard deviations, with three being the typical threshold.
- Assuming a normal distribution, approximately 68% of data points will fall within one standard deviation of the mean, 95% within two, and 99.7% within three.
- Represent the data statistically using visualizations such as histograms or boxplots.
Justification for the Approach
To improve the quality of the dataset, it is advisable to remove outliers and impute missing data using measures of central tendency (mean, median).
Several significant columns contain “NA” values. We should either impute these numeric values with a central tendency measure or remove rows with limited missing values. However, it’s essential to avoid removing a large number of rows, as this could skew the dataset.
Removing numerous rows may not significantly enhance predictive analysis. We must also eliminate unnecessary features/variables to lessen the burden on the machine learning model during predictive data analysis.
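As a minimal sketch of this approach, assuming the raw file is named churn_raw_data.csv and that the numeric columns listed below exist in the dataset (the dropped column names are placeholders, not the actual features removed), the imputation and feature removal might look like this:

```python
import pandas as pd

# Load the raw churn data (file name is an assumption for illustration)
df = pd.read_csv("churn_raw_data.csv")

# Impute numeric columns containing NA values with a central tendency measure
# (median is used here; mean is an equally valid choice for roughly symmetric columns)
for col in ["Children", "Age", "Income", "Tenure", "MonthlyCharge"]:
    df[col] = df[col].fillna(df[col].median())

# Drop features that are not relevant to the churn questions
# ("UnneededColumn1"/"UnneededColumn2" are placeholders for the columns actually removed)
df = df.drop(columns=["UnneededColumn1", "UnneededColumn2"], errors="ignore")
```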
Justification for Tools
The tools used for this assessment are detailed below.
The pandas library provides a straightforward method for reading data via dataframes, which are two-dimensional data structures. Dataframes can be created using various inputs, including lists, dictionaries, series, and NumPy arrays.
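For example, a dataframe might be built directly from a dictionary or a NumPy array (the values below are purely illustrative):

```python
import numpy as np
import pandas as pd

# DataFrame from a dictionary of lists
df_from_dict = pd.DataFrame({"Age": [34, 51], "Income": [52000.0, 61000.0]})

# DataFrame from a NumPy array with explicit column names
df_from_array = pd.DataFrame(np.array([[34, 52000.0], [51, 61000.0]]),
                             columns=["Age", "Income"])
print(df_from_dict)
print(df_from_array)
```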
Code for Finding Anomalies
The following code can be used to identify anomalies:
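The full implementation lives in the linked notebook; the snippet below is only a minimal sketch of the standard-deviation approach described above, assuming the raw file is named churn_raw_data.csv and using MonthlyCharge as an example column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the raw churn data (file name is an assumption for illustration)
df = pd.read_csv("churn_raw_data.csv")

# Inspect the structure of the input data
print(df.info())
print(df.describe())

# Standard-deviation method: flag values more than 3 standard deviations from the mean
col = "MonthlyCharge"  # example column; adjust to the columns under study
mean, std = df[col].mean(), df[col].std()
outliers = df[(df[col] - mean).abs() > 3 * std]
print(f"{len(outliers)} potential outliers found in {col}")

# Visualize the distribution with a histogram and a boxplot
df[col].plot(kind="hist", bins=30, title=f"Histogram of {col}")
plt.show()
df.boxplot(column=col)
plt.show()
```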
Findings from Data Cleaning
Initially, we evaluated the dataset to identify the count of missing values or “NA” entries. The imputation of these missing values was as crucial as their identification. We discovered thousands of missing records in the columns for children, age, income, phone, tenure, and monthly charges. While some other columns had missing values, they did not significantly affect the analysis or the churn rate problem statement.
Following the imputation, we identified outliers in the aforementioned columns using histograms and boxplot statistics. We also employed seaborn boxplots to examine outlier values for monthly charges.
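A sketch of these two checks, under the same file and column name assumptions as above, might look like:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("churn_raw_data.csv")  # file name assumed for illustration

# Count missing values per column to see which ones need imputation
print(df.isnull().sum().sort_values(ascending=False).head(10))

# Seaborn boxplot to inspect outlier values for monthly charges
sns.boxplot(x=df["MonthlyCharge"])
plt.title("Monthly charge distribution")
plt.show()
```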
Justification of Mitigation Methods
After identifying missing data entries, I have used the central tendency approach to impute necessary columns such as children, age, income, phone, and tenure. Although the monthly charge column has outliers, they are not significant for our analysis. I have also created a new dataset to store a copy of the cleaned data. For a more detailed analysis, I have imported statistics to generate visual representations of the data.
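Storing the cleaned copy separately is a small step; a sketch, with file names assumed, might look like:

```python
import pandas as pd

df = pd.read_csv("churn_raw_data.csv")            # raw data (file name assumed)
# ... imputation and outlier handling as sketched earlier ...
df.to_csv("churn_clean_copy.csv", index=False)    # keep the cleaned data separate from the raw file
```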
Outcomes Summary
- Initially, we eliminated unwanted and less relevant columns from the dataset, resulting in a total of 46 columns remaining.
- Eight survey columns were renamed with more descriptive titles for clarity.
- I identified missing values in the dataset, which revealed thousands of missing entries in columns like children, age, income, tenure, and bandwidth per year.
- To analyze significant columns, I found unique values in key columns such as employment, area, children, education, marital status, gender, contract type, payment method, and age, which helped identify misspellings and invalid data (a sketch of this step follows the list).
- I have imputed all missing values for important columns, which facilitated outlier detection.
- Histograms and boxplots were plotted to visualize the aforementioned columns, aiding in the identification of outliers.
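A sketch of the renaming and unique-value checks from the summary above, assuming the eight survey columns arrive as item1 through item8 and that the categorical column names match those used here (both are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("churn_raw_data.csv")  # file name assumed for illustration

# Rename the eight survey columns to descriptive titles
# (the original item1..item8 names are an assumption)
survey_names = {
    "item1": "Timely_Responses",
    "item2": "Timely_Fixes",
    "item3": "Timely_Replacements",
    "item4": "Reliability",
    "item5": "Options",
    "item6": "Respectful_Response",
    "item7": "Courteous_Exchange",
    "item8": "Active_Listening",
}
df = df.rename(columns=survey_names)

# Inspect unique values in key categorical columns to spot misspellings or invalid entries
for col in ["Area", "Education", "Marital", "Gender", "Contract", "PaymentMethod"]:
    if col in df.columns:  # guard in case a column name differs in the actual file
        print(col, df[col].unique())
```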
Mitigation Code for Anomalies
For the mitigation code, please refer to pages 12 through 16 of the codebase.
To mitigate effectively, consider the following guiding principles:
- Before removing outliers, analyze the dataset both with and without them to understand their impact on the results (a small sketch of this comparison follows the list).
- If outliers are clearly due to incorrect entries or measurements, they can be discarded without issue.
- If outliers affect your assumptions, you may remove them as long as doing so does not significantly change the results.
- If outliers affect both your assumptions and your results, it is advisable to remove them and proceed with the next steps.
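As a small illustration of the first principle, the flagged outliers can be compared against the full column before deciding whether to drop them (file and column names are assumptions):

```python
import pandas as pd

df = pd.read_csv("churn_raw_data.csv")

col = "MonthlyCharge"
mean, std = df[col].mean(), df[col].std()
mask = (df[col] - mean).abs() <= 3 * std  # True for rows within three standard deviations

# Compare summary statistics with and without the flagged outliers
print("With outliers:\n", df[col].describe())
print("Without outliers:\n", df.loc[mask, col].describe())
```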
Principal Components Analysis (PCA)
PCA is a technique utilized to reduce the dimensionality of datasets, helping to mitigate issues of model overfitting. It exclusively works with numeric values.
Basic Steps for Conducting PCA
- Normalize the data for standardization
- Compute the covariance matrix
- Calculate eigenvectors and eigenvalues
- Compute principal components
- Reduce data dimensions
The following numeric variables were considered for the PCA process: CaseOrder, Zip, Lat, Lng, Population, Children, Age, Income, Outage_sec_perweek, Email, Contacts, Yearly_equip_failure, Tenure, MonthlyCharge, Bandwidth_GB_Year, Timely_Responses, Timely_Fixes, Timely_Replacements, Reliability, Options, Respectful_Response, Courteous_Exchange, Active_Listening.
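The five basic steps above can be expressed directly with NumPy; the sketch below runs on a small synthetic matrix standing in for the numeric churn columns, so the data and the 95% variance threshold are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))           # synthetic stand-in for the numeric churn columns

# 1. Normalize (standardize) each column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Compute the covariance matrix
cov = np.cov(X_std, rowvar=False)

# 3. Calculate eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort components by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Compute principal components (project the data onto the eigenvectors)
components = X_std @ eigvecs

# 5. Reduce dimensions by keeping the components that explain most of the variance
explained = eigvals / eigvals.sum()
k = np.searchsorted(np.cumsum(explained), 0.95) + 1
X_reduced = components[:, :k]
print(f"Keeping {k} of {X.shape[1]} components")
```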
Criteria for Principal Components Analysis (PCA)
Refer to section E1 and the PCA code snippets provided above.
For the PCA analysis, I imported sklearn's decomposition module. The sklearn.decomposition.PCA class simplifies PCA tasks through linear dimensionality reduction, and I used its PCA() constructor for this work.
Initially, I normalized the data to ensure all features were on a comparable scale. Normalization is crucial to prevent the largest dispersion columns from dominating the component loadings.
As PCA operates exclusively with numeric values, I established a “df_numeric” dataframe containing 18 variables. I labeled all principal components as PC1, PC2, ..., PC18, and applied dimensionality reduction to the normalized data using fit() and transform() functions.
I used a Scree plot to assess how much of the original dataset's variance each component retains after dimensionality reduction. To understand the relationships between the components and the original variables, I calculated eigenvalues and eigenvectors. The Kaiser criterion (eigenvalues ≥ 1) was applied alongside the Scree plot to select the principal components for the loadings, revealing that 20 components preserved 98% of the variance.
To comprehend variance ratios in percentage terms, I used the pca.explained_variance_ratio_ attribute. To calculate the total variance preserved by all selected components, I utilized the np.cumsum() function. Additionally, I calculated loadings for each component to identify which variables significantly explained the total variance of the original dataset.
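A condensed sketch of this sklearn workflow is shown below. It uses a small synthetic dataframe in place of df_numeric so that it runs on its own; the column names are assumptions taken from the variable list above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# df_numeric would normally be built from the cleaned churn data;
# a synthetic frame is used here so the sketch is self-contained
rng = np.random.default_rng(0)
df_numeric = pd.DataFrame(rng.normal(size=(200, 6)),
                          columns=["Age", "Income", "Tenure", "MonthlyCharge",
                                   "Bandwidth_GB_Year", "Outage_sec_perweek"])

# Normalize so that high-variance columns do not dominate the loadings
X_std = StandardScaler().fit_transform(df_numeric)

# Fit PCA and transform the standardized data
pca = PCA(n_components=df_numeric.shape[1])
scores = pca.fit_transform(X_std)

# Scree plot of eigenvalues to apply the Kaiser criterion (eigenvalues >= 1)
eigenvalues = pca.explained_variance_
plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1, linestyle="--")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.show()

# Variance explained per component and the cumulative total
print(pca.explained_variance_ratio_)
print(np.cumsum(pca.explained_variance_ratio_))

# Loadings: how strongly each original variable contributes to each component
loadings = pd.DataFrame(pca.components_.T,
                        index=df_numeric.columns,
                        columns=[f"PC{i+1}" for i in range(pca.n_components_)])
print(loadings)
```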
Benefits of PCA
- While the original dataset contains 50 features, PCA reveals that 20 components account for approximately 98% of the variance. This indicates that these 20 components capture the information needed for exploratory data analysis and model training. Reducing the dataset's dimensionality and eliminating correlated features speeds up algorithm training and lowers the cost of model training.
- Utilizing the Scree plot during PCA allows for easy visualization of eigenvalues (≥ 1) and aids in identifying significant components for selection without compromising the original dataset's variance.
- PCA not only reduces the number of features but also identifies critical features necessary for decision-making based on the dataset, such as prioritizing customer service and timely responses per survey data.
- Training models with a limited yet significant feature set minimizes the challenges associated with overfitting.
- Insights from PCA and the correlation matrix indicate that increasing monthly service charges may influence customer churn rates.
- The component loadings show which variables contribute most strongly to each principal component.
Author
Milan Dhore, M.S (Data Analytics)
Cloud Strategic Leader | Enterprise Transformation Leader | AI | ML
Certified in TOGAF, AWS, ML, AI, Architecture, Snowflake, Six Sigma, NCFM, Excellence Award in Advanced Data Analytics, Financial Markets... Learn more at www.milanoutlook.com