Python for Data Science Basics
Essential Python skills for data science and ML
Introduction
As a data scientist, you will spend a significant portion of your time cleaning and preprocessing data before modeling. The quality of your preprocessing directly affects the performance and interpretability of your models.
This tutorial will guide you through practical, industry-standard data cleaning and preprocessing techniques using Python.
Why Data Cleaning Matters
- Real-world data is messy, incomplete, and inconsistent.
- Proper cleaning prevents biases and incorrect conclusions.
- Clean data enables better model performance and generalization.
Workflow Overview
1. Identify and handle missing values.
2. Detect and treat outliers.
3. Encode categorical variables.
4. Scale and normalize numerical features.
Example Dataset
We will use a small mock customer dataset for churn prediction:
import pandas as pd
import numpy as np
data = {
'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0]
}
df = pd.DataFrame(data)
print(df)
Handling Missing Values
Check missing data:
print(df.isnull().sum())
Strategies:
- Drop missing values (if dataset is large enough).
- Impute missing values:
  - Numerical: mean, median, or predictive imputation.
  - Categorical: mode.
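# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection; the chained in-place form is deprecated in recent pandas
# and may leave df unchanged.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])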
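The two strategies listed above can be compared side by side. The following is a minimal sketch, not the tutorial's own method: it rebuilds the mock dataset so it runs on its own, and it assumes scikit-learn's `SimpleImputer` as one possible way to do median imputation. The names `raw`, `dropped`, and `imputed` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same mock data as in the tutorial, rebuilt so this block is self-contained.
raw = pd.DataFrame({
    'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
    'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Option 1: drop every row that has at least one missing value.
dropped = raw.dropna()
print(len(raw), '->', len(dropped))  # 9 -> 5

# Option 2: impute the numeric columns with each column's median.
num_cols = ['Age', 'Income']
imputer = SimpleImputer(strategy='median')
imputed = raw.copy()
imputed[num_cols] = imputer.fit_transform(raw[num_cols])
print(imputed[num_cols].isnull().sum())
```

On this tiny dataset, dropping rows discards four of nine records, which is why imputation is usually the more reasonable choice when data is scarce.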
Outlier Detection and Treatment
Check for outliers using IQR:
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Income'] < Q1 - 1.5 * IQR) | (df['Income'] > Q3 + 1.5 * IQR)]
print(outliers)
Options:
- Remove outliers.
- Cap values (Winsorization).
- Transform data using log or robust scalers.
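The three options above could look roughly like this for the Income column. This is a sketch under stated assumptions: it reuses the tutorial's Income values with the missing entry filled by the mean, uses the same 1.5 × IQR fences as the detection step, and picks `np.log1p` as one possible log transform. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Income values from the mock dataset, with the missing entry mean-imputed.
income = pd.Series(
    [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
    name='Income',
)
income = income.fillna(income.mean())

# IQR fences, as in the detection step.
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove values outside the fences.
removed = income[(income >= lower) & (income <= upper)]

# Option 2: cap values at the fences (a simple form of Winsorization).
capped = income.clip(lower=lower, upper=upper)

# Option 3: compress the right tail with a log transform.
log_income = np.log1p(income)

print(len(income), '->', len(removed))  # 9 -> 8
print(capped.max())                     # 72000.0
```

Capping keeps every row, which matters on small datasets; removal is simpler but shrinks the sample.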
Encoding Categorical Variables
Use one-hot encoding for nominal categories:
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)
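To see what `drop_first=True` does, the sketch below encodes a small Gender column with and without it. The four-row frame is an illustrative subset of the mock data, and `dtype=int` is added here only so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Illustrative subset of the Gender column.
gender = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# One column per category.
full = pd.get_dummies(gender, columns=['Gender'], dtype=int)

# Drop the first category (alphabetically 'Female') to avoid a redundant column.
reduced = pd.get_dummies(gender, columns=['Gender'], drop_first=True, dtype=int)

print(list(full.columns))     # ['Gender_Female', 'Gender_Male']
print(list(reduced.columns))  # ['Gender_Male']
```

With two categories, the single remaining column carries all the information: a 0 in `Gender_Male` implies Female. Dropping the redundant column avoids perfectly collinear features, which matters mainly for linear models.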
Feature Scaling
For models sensitive to feature scaling, apply standardization or normalization:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
print(df)
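The example above shows standardization. For normalization to the [0, 1] range, one option is scikit-learn's `MinMaxScaler`. The sketch below also fits the scaler on one set of values and reuses it on later values, a common practice to avoid leaking information from unseen data. The split into `train_age` and `new_age` is hypothetical and only uses ages from the mock dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical split: ages used for fitting vs. ages seen later.
train_age = np.array([[25.0], [47.0], [35.0], [52.0], [23.0]])
new_age = np.array([[40.0], [30.0]])

scaler = MinMaxScaler()

# Learn min and max from the training values only, then scale them.
train_scaled = scaler.fit_transform(train_age)

# Reuse the learned min and max on new values; do not refit.
new_scaled = scaler.transform(new_age)

print(train_scaled.ravel())
print(new_scaled.ravel())
```

Min-max scaling is sensitive to extreme values because the range is set by the minimum and maximum, so it is usually applied after outliers have been treated.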
Conclusion
You now understand how to:
- Clean and preprocess data effectively.
- Identify and handle missing values.
- Detect and manage outliers.
- Encode categorical variables appropriately.
- Scale features for model readiness.
Proper data cleaning and preprocessing help you build reliable, interpretable, and performant machine learning models.
What's Next?
- Explore advanced feature engineering techniques.
- Learn about pipelines to automate preprocessing steps in your workflow.
- Move forward to model building and evaluation tutorials on SuperML.
Join our SuperML Community to share your workflows, ask questions, and learn with other data scientists.
Happy Cleaning!