Course Content
Feature Engineering
Creating better features for your models
Introduction
Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models. It is often the most critical step in the machine learning workflow.
Why is Feature Engineering Important?
✅ It helps models capture patterns effectively.
✅ It improves accuracy and reduces bias.
✅ It allows algorithms to process categorical and numerical data effectively.
✅ Good features often matter more than complex algorithms.
Common Feature Engineering Techniques
1️⃣ Handling Missing Values
- Remove missing data if minimal.
- Impute using mean, median, or mode.
- Advanced: Use predictive models for imputation.
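The "advanced" option above can be sketched with scikit-learn's `KNNImputer`, one of several model-based imputers (the tiny DataFrame here is illustrative, not the tutorial's dataset):

```python
# Sketch: predictive imputation with KNNImputer, which fills each missing
# value from the k most similar rows instead of a single global statistic.
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    'Age': [25, 30, None, 45],
    'Salary': [50000, 60000, 55000, None],
})

imputer = KNNImputer(n_neighbors=2)  # average the 2 nearest complete rows
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)  # no NaNs remain
```

Unlike mean imputation, this preserves relationships between columns, at the cost of being sensitive to feature scaling.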
2️⃣ Encoding Categorical Variables
- Label Encoding: Assign numerical values to categories.
- One-Hot Encoding: Create binary columns for each category.
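The difference between the two encodings can be seen side by side on a toy column (the `colors` data is made up for illustration):

```python
# Sketch: label vs. one-hot encoding with scikit-learn classes.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = pd.Series(['red', 'green', 'red', 'blue'])

# Label Encoding: one integer per category (implies an ordering
# the data may not actually have)
labels = LabelEncoder().fit_transform(colors)
print(labels)  # categories sorted alphabetically: blue=0, green=1, red=2

# One-Hot Encoding: one binary column per category, no implied order
onehot = OneHotEncoder().fit_transform(colors.to_frame()).toarray()
print(onehot)  # shape (4, 3): one column each for blue, green, red
```

Label encoding suits ordinal data or tree-based models; one-hot encoding is safer for linear models, which would otherwise treat the integer codes as magnitudes.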
3️⃣ Feature Scaling
Standardization (Z-score normalization):
z = (x - μ) / σ
Min-Max Scaling:
x_scaled = (x - x_min) / (x_max - x_min)
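Both formulas can be checked by hand on a tiny array (NumPy only, values chosen for illustration):

```python
# Worked check of the two scaling formulas above.
import numpy as np

x = np.array([10.0, 20.0, 30.0])

# Standardization: z = (x - μ) / σ, giving mean 0 and std 1
z = (x - x.mean()) / x.std()
print(z)

# Min-Max scaling: maps the range onto [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)  # [0.  0.5 1. ]
```

Standardization is the usual choice when features have outliers or unbounded ranges; min-max scaling is handy when an algorithm expects inputs in a fixed interval.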
4️⃣ Feature Creation
- Creating new features from existing data (e.g., extracting date parts, combining features).
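A minimal sketch of both ideas, extracting date parts and combining columns (the column names here are illustrative, not from the tutorial's dataset):

```python
# Sketch: creating new features from existing columns.
import pandas as pd

df = pd.DataFrame({
    'signup_date': pd.to_datetime(['2023-01-15', '2023-06-01']),
    'price': [100.0, 250.0],
    'quantity': [3, 2],
})

# Date parts: month and day of week often carry seasonal signal
df['signup_month'] = df['signup_date'].dt.month
df['signup_dayofweek'] = df['signup_date'].dt.dayofweek

# Combined feature: a product of two existing columns
df['revenue'] = df['price'] * df['quantity']
print(df)
```

Derived features like these let simple models express interactions they could not learn from the raw columns alone.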
Example: Feature Engineering in Python
Import Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
Sample Data
data = {
'Age': [25, 30, None, 45, 35],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
'Salary': [50000, 60000, 55000, None, 65000]
}
df = pd.DataFrame(data)
print(df)
Handling Missing Values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
Encoding Categorical Variables
# Label Encoding
le = LabelEncoder()
df['Gender_Label'] = le.fit_transform(df['Gender'])
# One-Hot Encoding
df = pd.get_dummies(df, columns=['Gender'], drop_first=True)
print(df)
Feature Scaling
scaler = StandardScaler()
df[['Age_scaled', 'Salary_scaled']] = scaler.fit_transform(df[['Age', 'Salary']])
print(df)
Conclusion
🎉 You now understand how to:
✅ Handle missing data effectively.
✅ Encode categorical variables using label and one-hot encoding.
✅ Scale numerical features for effective model training.
Feature engineering is crucial for improving model performance and ensuring that machine learning algorithms can learn from your data efficiently.
What's Next?
- Experiment with feature creation by combining or transforming existing columns.
- Explore advanced techniques like feature selection and dimensionality reduction (PCA).
- Continue your learning journey with our Ensemble Methods tutorial.
Join our SuperML Community to discuss your feature engineering strategies and get feedback on your projects!