πŸ“– Lesson ⏱️ 120 minutes

Feature Engineering

Advanced feature engineering techniques

Introduction

Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models. It is often the most critical step in the machine learning workflow.


Why is Feature Engineering Important?

βœ… It helps models capture patterns effectively.
βœ… It improves accuracy and reduces bias.
βœ… It allows algorithms to process categorical and numerical data effectively.
βœ… Good features often matter more than complex algorithms.


Common Feature Engineering Techniques

1️⃣ Handling Missing Values

  • Remove missing data if minimal.
  • Impute using mean, median, or mode.
  • Advanced: Use predictive models for imputation.

2️⃣ Encoding Categorical Variables

  • Label Encoding: Assign numerical values to categories.
  • One-Hot Encoding: Create binary columns for each category.

3️⃣ Feature Scaling

  • Standardization (Z-score normalization):
    z = (x - ΞΌ) / Οƒ

  • Min-Max Scaling:
    x_scaled = (x - x_min) / (x_max - x_min)
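The min-max formula above can be applied with scikit-learn's MinMaxScaler, which rescales each column to the [0, 1] range. A minimal sketch on made-up salary values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({'Salary': [50000, 60000, 55000, 65000]})

# Fit learns x_min and x_max per column; transform applies
# x_scaled = (x - x_min) / (x_max - x_min).
scaler = MinMaxScaler()
df['Salary_minmax'] = scaler.fit_transform(df[['Salary']])
print(df)
```

After scaling, the smallest salary maps to 0 and the largest to 1.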

4️⃣ Feature Creation

  • Creating new features from existing data (e.g., extracting date parts, combining features).

Example: Feature Engineering in Python

Import Libraries

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

Sample Data

data = {
    'Age': [25, 30, None, 45, 35],
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'Salary': [50000, 60000, 55000, None, 65000]
}
df = pd.DataFrame(data)
print(df)

Handling Missing Values

df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

Encoding Categorical Variables

# Label Encoding
le = LabelEncoder()
df['Gender_Label'] = le.fit_transform(df['Gender'])

# One-Hot Encoding
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)

Feature Scaling

scaler = StandardScaler()
df[['Age_scaled', 'Salary_scaled']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)

Conclusion

πŸŽ‰ You now understand how to:

βœ… Handle missing data effectively.
βœ… Encode categorical variables using label and one-hot encoding.
βœ… Scale numerical features for effective model training.

Feature engineering is crucial for improving model performance and ensuring that machine learning algorithms can learn from your data efficiently.


What’s Next?

  • Experiment with feature creation by combining or transforming existing columns.
  • Explore advanced techniques like feature selection and dimensionality reduction (PCA).
  • Continue your learning journey with our Ensemble Methods tutorial.

Join our SuperML Community to discuss your feature engineering strategies and get feedback on your projects!