Python for Data Science Basics

Introduction

As a data scientist, you will spend a significant portion of your workflow cleaning and preprocessing data before modeling. The quality of your preprocessing directly impacts the performance and interpretability of your models.

This tutorial will guide you through practical, industry-standard data cleaning and preprocessing techniques using Python.

Why Data Cleaning Matters

✅ Real-world data is messy, incomplete, and inconsistent.
✅ Proper cleaning prevents biases and incorrect conclusions.
✅ Clean data enables better model performance and generalization.

Workflow Overview

1️⃣ Identify and handle missing values.
2️⃣ Detect and treat outliers.
3️⃣ Encode categorical variables.
4️⃣ Scale and normalize numerical features.

Example Dataset

We will use a small mock customer dataset for churn prediction:

import pandas as pd
import numpy as np

data = {
    'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
    'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0]
}

df = pd.DataFrame(data)
print(df)

Handling Missing Values

Check missing data:

print(df.isnull().sum())
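
Raw counts can be hard to judge without the column length. As a minimal sketch, assuming the same mock data as above, the share of missing values per column and the affected rows can be inspected like this (the names missing_share and rows_with_gaps are illustrative, not part of the tutorial):

```python
import numpy as np
import pandas as pd

# Same mock data as the tutorial's example dataset.
df = pd.DataFrame({
    'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
    'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# isnull() yields booleans, so the column mean is the fraction of missing values.
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Rows that contain at least one missing value (illustrative name).
rows_with_gaps = df[df.isnull().any(axis=1)]
print(rows_with_gaps)
```

Looking at the affected rows can help decide whether dropping or imputing is the more reasonable choice.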

Strategies:

Drop rows with missing values (if the dataset is large enough).
Impute missing values:
- Numerical: mean, median, or predictive imputation.
- Categorical: mode.

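# Assign the result back instead of using inplace=True on a column selection,
# which may not update df in recent pandas versions.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Income'] = df['Income'].fillna(df['Income'].mean())
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])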
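
The drop strategy listed above is not shown in the tutorial's code. A minimal sketch, assuming a fresh, un-imputed copy of the example data (df_raw, dropped_any, and dropped_income are hypothetical names used only for illustration), might look like this:

```python
import numpy as np
import pandas as pd

# Fresh copy of the mock data, before any imputation (hypothetical name).
df_raw = pd.DataFrame({
    'Age': [25, np.nan, 47, 35, 52, 23, 40, np.nan, 30],
    'Income': [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, np.nan],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan, 'Female', 'Male', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0, 1, 0, 0, 1, 0],
})

# Drop every row that has at least one missing value.
dropped_any = df_raw.dropna()

# Drop rows only when a specific column is missing.
dropped_income = df_raw.dropna(subset=['Income'])

print(len(df_raw), len(dropped_any), len(dropped_income))
```

On a dataset this small, dropping every incomplete row discards almost half of the data, which is why imputation is used in the main example.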

Outlier Detection and Treatment

Check for outliers using the interquartile range (IQR), flagging values more than 1.5 × IQR below Q1 or above Q3:

Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['Income'] < Q1 - 1.5 * IQR) | (df['Income'] > Q3 + 1.5 * IQR)]
print(outliers)

Options:

Remove outliers.
Cap values (Winsorization).
Transform data using log or robust scalers.
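
Two of these options can be sketched briefly. The example below assumes the tutorial's Income values after mean imputation and caps them at the same 1.5 × IQR fences used for detection; capping at the fences is one common choice rather than the only definition of winsorization, and the names income, capped, and log_income are illustrative:

```python
import numpy as np
import pandas as pd

# Income values from the example, with the missing entry already mean-imputed.
income = pd.Series(
    [50000, 60000, 52000, 58000, 62000, 48000, 75000, 54000, 57375],
    name='Income',
)

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option: cap values at the IQR fences instead of removing the rows.
capped = income.clip(lower=lower, upper=upper)

# Option: compress the right tail with a log transform (log1p is safe at zero).
log_income = np.log1p(income)

print(capped.max(), log_income.round(3).tolist())
```

Capping keeps every row, which can matter on small datasets, while the log transform changes the shape of the whole distribution rather than only the extremes.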

Encoding Categorical Variables

Use one-hot encoding for nominal categories:

df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)
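
To make the effect of drop_first visible, here is a small sketch on a hypothetical four-row frame (the names sample, full, and reduced are illustrative). With two categories, keeping both dummy columns is redundant because one is always the complement of the other:

```python
import pandas as pd

# Tiny illustrative frame, not the tutorial's full dataset.
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Churn': [0, 1, 0, 0],
})

# Without drop_first: one column per category.
full = pd.get_dummies(sample, columns=['Gender'], dtype=int)

# With drop_first: the first category (alphabetically) is dropped.
reduced = pd.get_dummies(sample, columns=['Gender'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfectly collinear columns, which matters mostly for linear models; tree-based models are usually unaffected either way.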

Feature Scaling

For models sensitive to feature scale, such as k-nearest neighbors, support vector machines, and models trained with gradient descent, apply standardization or normalization:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
print(df)
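
The code above covers standardization. For normalization to the [0, 1] range, a minimal sketch using scikit-learn's MinMaxScaler on a hypothetical subset of the example values (features and scaled are illustrative names) could look like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Illustrative subset of the example's numeric columns, with no missing values.
features = pd.DataFrame({
    'Age': [25, 47, 35, 52, 23],
    'Income': [50000, 52000, 58000, 62000, 48000],
})

# Rescale each column so its minimum maps to 0 and its maximum to 1.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
print(scaled)
```

In a real modeling workflow, the scaler is typically fitted on the training split only and then applied to the test split with transform, so that test data does not leak into the scaling parameters.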

Conclusion

🎯 You now understand how to:

✅ Clean and preprocess data effectively.
✅ Identify and handle missing values.
✅ Detect and manage outliers.
✅ Encode categorical variables appropriately.
✅ Scale features for model readiness.

Proper data cleaning and preprocessing help you build reliable, interpretable, and performant machine learning models.

What’s Next?

✅ Explore advanced feature engineering techniques.
✅ Learn about pipelines to automate preprocessing steps in your workflow.
✅ Move forward to model building and evaluation tutorials on SuperML.

Join our SuperML Community to share your workflows, ask questions, and learn with other data scientists.

Happy Cleaning! 🧹

