📋 Prerequisites
- Basic Python knowledge
- Familiarity with pandas and NumPy
🎯 What You'll Learn
- Understand why data cleaning is critical for modeling
- Handle missing values and outliers effectively
- Encode categorical variables appropriately
- Apply scaling and normalization for numeric features
Introduction
As a data scientist, you will spend a significant portion of your workflow cleaning and preprocessing data before modeling. The quality of your preprocessing directly impacts the performance and interpretability of your models.
This tutorial will guide you through practical, industry-standard data cleaning and preprocessing techniques using Python.
Why Data Cleaning Matters
✅ Real-world data is messy, incomplete, and inconsistent.
✅ Proper cleaning prevents biases and incorrect conclusions.
✅ Clean data enables better model performance and generalization.
Workflow Overview
1️⃣ Identify and handle missing values.
2️⃣ Detect and treat outliers.
3️⃣ Encode categorical variables.
4️⃣ Scale and normalize numerical features.
Example Dataset
We will use a small mock customer dataset for churn prediction:
import pandas as pd
import numpy as np
data = {
'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)
print(df)
Handling Missing Values
Check missing data:
print(df.isnull().sum())
Strategies:
- Drop missing values (if the dataset is large enough).
- Impute missing values:
  - Numerical: mean, median, or predictive imputation.
  - Categorical: mode.
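# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())
# mode() returns a Series, so take its first entry as the fill value.
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])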
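The imputation above keeps every row. The other strategy in the list, dropping rows with missing values, can be sketched as follows. This is a minimal illustration, not a recommendation: the `raw` frame is a hypothetical copy of the Age and Income columns, rebuilt so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical copy of two columns from the mock dataset, before any imputation.
raw = pd.DataFrame({
    'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
    'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = raw.dropna()

# Option 2: drop rows only when a specific column is missing.
dropped_age_only = raw.dropna(subset=['Age'])

# Compare row counts to see how much data each option costs.
print(len(raw), len(dropped), len(dropped_age_only))
```

On a dataset this small, dropping removes a third of the rows, which is why imputation is usually the better fit here.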
Outlier Detection and Treatment
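Check for outliers using the interquartile range (IQR). Values more than 1.5 × IQR below the first quartile or above the third quartile are flagged: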
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Income'] < Q1 - 1.5 * IQR) | (df['Income'] > Q3 + 1.5 * IQR)]
print(outliers)
Options:
- Remove outliers.
- Cap values (Winsorization).
- Transform data using log or robust scalers.
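Two of these options, capping and a log transform, can be sketched as below. The `income` series is a hypothetical stand-in for the Income column after imputation (the last value plays the role of the mean-imputed entry), and the 1.5 × IQR bounds are reused as capping limits, which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imputed Income column.
income = pd.Series(
    [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, 57375],
    name='Income',
)

# Same IQR fences as in the detection step.
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Capping: pull values outside the fences back to the nearest fence.
capped = income.clip(lower=lower, upper=upper)

# Log transform: compresses the right tail without removing any rows.
logged = np.log1p(income)

print(capped.max(), upper)
```

Capping keeps the row count unchanged, which matters on small datasets where removing outliers would discard scarce examples.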
Encoding Categorical Variables
Use one-hot encoding for nominal categories:
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)
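In pandas 2.0 and later, `get_dummies` returns boolean columns by default. If 0/1 integers are preferred, the `dtype` argument can be passed, as in this small sketch. The `sample` frame is hypothetical and exists only to make the snippet self-contained.

```python
import pandas as pd

# Hypothetical mini-frame with one nominal column.
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0],
})

# drop_first=True drops the first category (Female), leaving one indicator column.
# dtype=int asks for 0/1 integers instead of True/False.
encoded = pd.get_dummies(sample, columns=['Gender'], drop_first=True, dtype=int)

print(encoded)
```

Dropping the first category avoids a redundant column: with two categories, `Gender_Male` alone carries all the information.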
Feature Scaling
For models sensitive to feature scaling, apply standardization or normalization:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
print(df)
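The example above shows standardization. Normalization to the [0, 1] range (min-max scaling) can be sketched with plain pandas as below; scikit-learn's `MinMaxScaler` applies the same formula. The `features` frame is a hypothetical subset of the mock data, used so the snippet runs independently.

```python
import pandas as pd

# Hypothetical numeric features with no missing values.
features = pd.DataFrame({
    'Age': [25, 47, 35, 52, 23],
    'Income': [50000, 52000, 58000, 62000, 48000],
})

# Min-max normalization: (x - min) / (max - min), applied column by column.
col_min = features.min()
col_max = features.max()
normalized = (features - col_min) / (col_max - col_min)

# Each column now spans exactly 0 to 1.
print(normalized)
```

Min-max scaling is sensitive to extreme values, since a single outlier sets the range, so it is usually applied after outliers have been treated.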
Conclusion
🎯 You now understand how to:
✅ Identify and handle missing values.
✅ Detect and manage outliers.
✅ Encode categorical variables appropriately.
✅ Scale features for model readiness.
Proper data cleaning and preprocessing help you build reliable, interpretable, and performant machine learning models.
What’s Next?
✅ Explore advanced feature engineering techniques.
✅ Learn about pipelines to automate preprocessing steps in your workflow.
✅ Move forward to model building and evaluation tutorials on SuperML.
Join our SuperML Community to share your workflows, ask questions, and learn with other data scientists.
Happy Cleaning! 🧹