📖 Lesson ⏱️ 90 minutes

Data Loading and Preprocessing

Loading datasets and preparing data with Java APIs

Data Loading and Preprocessing with SuperML Java

Data preprocessing is a critical step in any machine learning pipeline. SuperML Java provides comprehensive tools for loading data from various sources and preparing it for model training. This tutorial covers everything from basic data loading to advanced preprocessing techniques.

Understanding Data in Machine Learning

Before diving into code, let's understand what makes good ML data:

  • Clean: No missing or corrupted values
  • Consistent: Uniform format and scale
  • Relevant: Features that correlate with target variable
  • Sufficient: Enough samples for reliable training

Data Loading with SuperML Java

Loading from CSV Files

The most common data source is CSV files. SuperML Java makes this simple:

import org.superml.datasets.Datasets;
import org.superml.preprocessing.*;
import java.util.Map;  // used below for Map.of(...) column-type mappings

// Basic CSV loading
Dataset data = Datasets.fromCSV("data/sales.csv");

// CSV with custom options
Dataset data = DataLoader.fromCSV("data/sales.csv")
    .withHeader(true)
    .withDelimiter(",")
    .withEncoding("UTF-8")
    .load();

// Specify column types
Dataset data = DataLoader.fromCSV("data/sales.csv")
    .withColumnTypes(Map.of(
        "price", ColumnType.NUMERIC,
        "category", ColumnType.CATEGORICAL,
        "date", ColumnType.DATE
    ))
    .load();

Loading from Databases

For enterprise applications, database integration is essential:

import java.sql.Connection;
import java.sql.DriverManager;

// Database connection
Connection conn = DriverManager.getConnection(
    "jdbc:postgresql://localhost:5432/sales_db",
    "username", "password"
);

// Load from SQL query
Dataset data = DataLoader.fromDatabase(conn, 
    "SELECT price, category, sales_date, quantity FROM sales WHERE sales_date >= '2024-01-01'");

// Parameterized queries
String query = "SELECT * FROM customers WHERE region = ? AND signup_date >= ?";
Dataset data = DataLoader.fromDatabase(conn, query, "North", "2024-01-01");

Loading from JSON

For web applications and APIs:

// Simple JSON loading
Dataset data = DataLoader.fromJSON("data/products.json");

// JSON with nested structure
Dataset data = DataLoader.fromJSON("data/complex_data.json")
    .withPath("$.results[*]")  // JSONPath expression
    .load();

// JSON from URL
Dataset data = DataLoader.fromURL("https://api.example.com/data.json")
    .withAuth("Bearer " + apiToken)
    .load();

Loading from Other Sources

// Excel files
Dataset data = DataLoader.fromExcel("data/report.xlsx")
    .withSheet("Sales Data")
    .withHeaderRow(1)
    .load();

// Parquet files (for big data)
Dataset data = DataLoader.fromParquet("data/large_dataset.parquet");

// In-memory arrays
double[][] features = {{1, 2}, {3, 4}, {5, 6}};
double[] targets = {1, 0, 1};
Dataset data = DataLoader.fromArrays(features, targets);

Data Exploration

Before preprocessing, understand your data:

// Basic information
System.out.println("Dataset shape: " + data.getShape());
System.out.println("Feature names: " + data.getFeatureNames());
System.out.println("Target name: " + data.getTargetName());

// Statistical summary
DataSummary summary = data.describe();
System.out.println(summary);

// Missing values
Map<String, Integer> missingCounts = data.getMissingCounts();
missingCounts.forEach((column, count) -> 
    System.out.println(column + ": " + count + " missing values"));

// Data types
Map<String, ColumnType> types = data.getColumnTypes();

Visualization Support

// Basic plots (if visualization module is included)
data.plot().histogram("price").show();
data.plot().scatter("price", "quantity").show();
data.plot().correlation().show();

Handling Missing Data

Missing data is common in real-world datasets. SuperML Java provides several strategies:

Detection and Analysis

// Check for missing values
boolean hasMissing = data.hasMissingValues();
Map<String, Double> missingPercentages = data.getMissingPercentages();

// Visualize missing patterns
MissingDataAnalyzer analyzer = new MissingDataAnalyzer(data);
analyzer.showMissingPattern();

Removal Strategies

// Drop rows with any missing values
Dataset cleaned = data.dropMissing();

// Drop rows with missing values in specific columns
Dataset cleaned = data.dropMissing("price", "quantity");

// Drop columns with too many missing values (>50%)
Dataset cleaned = data.dropColumns(0.5);

Imputation Strategies

// Fill with constant values
Dataset filled = data.fillMissing(0.0);  // Fill with 0
Dataset filled = data.fillMissing("category", "Unknown");

// Fill with statistical measures
Dataset filled = data.fillMissing(Strategy.MEAN);     // Mean for numeric
Dataset filled = data.fillMissing(Strategy.MEDIAN);   // Median for numeric
Dataset filled = data.fillMissing(Strategy.MODE);     // Mode for categorical

// Column-specific strategies
Dataset filled = data.fillMissing(Map.of(
    "price", Strategy.MEAN,
    "category", Strategy.MODE,
    "description", "Not Available"
));

// Forward/backward fill for time series
Dataset filled = data.fillMissing(Strategy.FORWARD_FILL);
Dataset filled = data.fillMissing(Strategy.BACKWARD_FILL);
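The forward-fill strategy above can be sketched in plain Java. The `ForwardFill` class below is illustrative, not part of SuperML Java; it uses `Double.NaN` as the missing-value marker.

```java
// Plain-Java sketch of forward fill: each NaN is replaced by the last
// non-missing value seen; leading NaNs are left unchanged.
public class ForwardFill {
    public static double[] forwardFill(double[] values) {
        double[] out = values.clone();
        double last = Double.NaN;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                if (!Double.isNaN(last)) out[i] = last;
            } else {
                last = out[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] series = {Double.NaN, 1.0, Double.NaN, Double.NaN, 4.0};
        System.out.println(java.util.Arrays.toString(forwardFill(series)));
        // → [NaN, 1.0, 1.0, 1.0, 4.0]
    }
}
```

Backward fill is the same loop run from the end of the array toward the start.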

Advanced Imputation

// K-Nearest Neighbors imputation
KNNImputer imputer = new KNNImputer(5);  // k = 5 neighbors
Dataset filled = imputer.fitTransform(data);

// Regression-based imputation
RegressionImputer imputer = new RegressionImputer();
Dataset filled = imputer.fitTransform(data);

// Multiple imputation
MultipleImputer imputer = new MultipleImputer(5);  // 5 imputation rounds
List<Dataset> imputedDatasets = imputer.fitTransform(data);

Feature Scaling and Normalization

Machine learning algorithms often require features to be on similar scales:

Standardization (Z-score normalization)

// Standardize all numeric features (mean=0, std=1)
StandardScaler scaler = new StandardScaler();
Dataset scaled = scaler.fitTransform(data);

// Standardize specific columns
Dataset scaled = scaler.fitTransform(data, "price", "quantity", "rating");

// Manual standardization
Dataset scaled = data.standardize();
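Under the hood, standardization applies (x − mean) / std to each value of a column. A minimal plain-Java sketch (the `ZScore` class here is illustrative, not a SuperML Java API):

```java
// Z-score standardization of one numeric column: result has mean 0, std 1.
public class ZScore {
    public static double[] standardize(double[] x) {
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);  // population standard deviation
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = (x[i] - mean) / std;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(standardize(new double[]{2, 4, 6})));
    }
}
```

A fitted scaler remembers the training mean and std so the same shift and scale can be reapplied to test data.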

Min-Max Normalization

// Scale to [0, 1] range
MinMaxScaler scaler = new MinMaxScaler();
Dataset scaled = scaler.fitTransform(data);

// Custom range [min, max]
MinMaxScaler scaler = new MinMaxScaler(-1, 1);
Dataset scaled = scaler.fitTransform(data);

// Quick normalization
Dataset normalized = data.normalize();
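Min-max scaling maps the observed [min, max] of a column linearly onto the target range. A plain-Java sketch (the `MinMax` class is illustrative only):

```java
// Linear rescaling so the column minimum maps to lo and the maximum to hi.
public class MinMax {
    public static double[] scale(double[] x, double lo, double hi) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            // A constant column maps to lo to avoid division by zero
            out[i] = range == 0 ? lo : lo + (x[i] - min) / range * (hi - lo);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(scale(new double[]{10, 20, 30}, 0, 1)));
        // → [0.0, 0.5, 1.0]
    }
}
```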

Robust Scaling

// Scale using median and IQR (robust to outliers)
RobustScaler scaler = new RobustScaler();
Dataset scaled = scaler.fitTransform(data);

Unit Vector Scaling

// Scale each sample to unit norm
Normalizer normalizer = new Normalizer();
Dataset normalized = normalizer.fitTransform(data);
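Unit-norm scaling works per sample rather than per column: each row is divided by its Euclidean length. A plain-Java sketch (illustrative `UnitNorm` class, not a SuperML Java API):

```java
// Scale one sample (row) so its Euclidean (L2) norm is 1.
public class UnitNorm {
    public static double[] l2Normalize(double[] row) {
        double norm = 0.0;
        for (double v : row) norm += v * v;
        norm = Math.sqrt(norm);
        double[] out = new double[row.length];
        for (int i = 0; i < row.length; i++) out[i] = norm == 0 ? 0 : row[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(l2Normalize(new double[]{3, 4})));
        // → [0.6, 0.8]
    }
}
```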

Categorical Data Encoding

Convert categorical variables to numeric format:

One-Hot Encoding

// Basic one-hot encoding
OneHotEncoder encoder = new OneHotEncoder();
Dataset encoded = encoder.fitTransform(data, "category", "region");

// With options
OneHotEncoder encoder = new OneHotEncoder()
    .withDropFirst(true)    // Avoid multicollinearity
    .withSparse(true)       // Memory efficient for high cardinality
    .withHandleUnknown(OneHotEncoder.HandleUnknown.IGNORE);

Dataset encoded = encoder.fitTransform(data, "category");
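One-hot encoding expands a categorical column into one 0/1 indicator column per distinct category. A self-contained plain-Java sketch (the `OneHot` class is illustrative, not part of SuperML Java):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Map each distinct category (in first-seen order) to an indicator column.
public class OneHot {
    public static double[][] encode(List<String> values) {
        List<String> categories = new ArrayList<>(new LinkedHashSet<>(values));
        double[][] out = new double[values.size()][categories.size()];
        for (int i = 0; i < values.size(); i++) {
            out[i][categories.indexOf(values.get(i))] = 1.0;
        }
        return out;
    }

    public static void main(String[] args) {
        for (double[] row : encode(List.of("red", "green", "red"))) {
            System.out.println(java.util.Arrays.toString(row));
        }
        // → [1.0, 0.0] / [0.0, 1.0] / [1.0, 0.0]
    }
}
```

Dropping the first indicator column (as `withDropFirst(true)` does above) removes the redundancy that causes multicollinearity in linear models.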

Label Encoding

// Convert categories to integers
LabelEncoder encoder = new LabelEncoder();
Dataset encoded = encoder.fitTransform(data, "category");

// For ordinal data with custom order
OrdinalEncoder encoder = new OrdinalEncoder()
    .withOrder("category", List.of("Low", "Medium", "High"));
Dataset encoded = encoder.fitTransform(data);

Target Encoding

// Encode categories based on target variable
TargetEncoder encoder = new TargetEncoder();
Dataset encoded = encoder.fitTransform(data, "category", "target");
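Target encoding replaces each category with the mean of the target among rows in that category. A plain-Java sketch (illustrative `TargetEncoding` class); note that in practice the encoding should be fit on training folds only, to avoid target leakage:

```java
import java.util.HashMap;
import java.util.Map;

// Replace each category with the mean target value observed for it.
public class TargetEncoding {
    public static double[] encode(String[] categories, double[] target) {
        Map<String, double[]> stats = new HashMap<>();  // category -> {sum, count}
        for (int i = 0; i < categories.length; i++) {
            double[] s = stats.computeIfAbsent(categories[i], k -> new double[2]);
            s[0] += target[i];
            s[1] += 1;
        }
        double[] out = new double[categories.length];
        for (int i = 0; i < categories.length; i++) {
            double[] s = stats.get(categories[i]);
            out[i] = s[0] / s[1];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] cat = {"a", "a", "b", "b"};
        double[] y = {1, 0, 1, 1};
        System.out.println(java.util.Arrays.toString(encode(cat, y)));
        // → [0.5, 0.5, 1.0, 1.0]
    }
}
```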

Feature Engineering

Create new features from existing ones:

Polynomial Features

// Add polynomial combinations
PolynomialFeatures poly = new PolynomialFeatures(2);  // degree 2
Dataset expanded = poly.fitTransform(data);

// With interaction terms only
PolynomialFeatures poly = new PolynomialFeatures(2)  // degree 2
    .withInteractionOnly(true);

Binning/Discretization

// Equal-width binning
Binner binner = new Binner(BinningStrategy.UNIFORM, 5);  // 5 equal-width bins
Dataset binned = binner.fitTransform(data, "price");

// Equal-frequency binning
Binner binner = new Binner(BinningStrategy.QUANTILE, 4);  // 4 equal-frequency bins
Dataset binned = binner.fitTransform(data, "age");

// Custom bin edges
Binner binner = new Binner(new double[]{0, 25, 50, 75, 100});  // custom bin edges
Dataset binned = binner.fitTransform(data, "score");

Date/Time Features

// Extract date components
DateFeatureExtractor extractor = new DateFeatureExtractor();
Dataset withDateFeatures = extractor.fitTransform(data, "order_date");

// This creates features like: year, month, day, day_of_week, hour, etc.

// Custom date features
DateFeatureExtractor extractor = new DateFeatureExtractor()
    .withFeatures(DateFeature.YEAR, DateFeature.QUARTER, DateFeature.IS_WEEKEND);

Text Features

// TF-IDF vectorization
TfIdfVectorizer vectorizer = new TfIdfVectorizer()
    .withMaxFeatures(1000)
    .withNgrams(1, 2)  // Unigrams and bigrams
    .withStopWords(StopWords.ENGLISH);

Dataset vectorized = vectorizer.fitTransform(data, "description");

// Count vectorization
CountVectorizer vectorizer = new CountVectorizer();
Dataset vectorized = vectorizer.fitTransform(data, "text_column");

Outlier Detection and Handling

Identify and handle outliers that might affect model performance:

Statistical Methods

// Z-score method
OutlierDetector detector = new ZScoreDetector(3.0);  // threshold = 3.0
boolean[] outliers = detector.detect(data, "price");

// IQR method
OutlierDetector detector = new IQRDetector(1.5);  // IQR multiplier = 1.5
boolean[] outliers = detector.detect(data, "quantity");

// Modified Z-score (robust)
OutlierDetector detector = new ModifiedZScoreDetector(3.5);  // threshold = 3.5
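The IQR rule flags values outside [Q1 − k·IQR, Q3 + k·IQR]. A plain-Java sketch (the `IqrOutliers` class is illustrative; it uses linear-interpolation percentiles):

```java
import java.util.Arrays;

// Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers.
public class IqrOutliers {
    public static boolean[] detect(double[] x, double k) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double q1 = percentile(sorted, 0.25);
        double q3 = percentile(sorted, 0.75);
        double iqr = q3 - q1;
        boolean[] out = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = x[i] < q1 - k * iqr || x[i] > q3 + k * iqr;
        }
        return out;
    }

    // Linear-interpolation percentile on an already sorted array.
    static double percentile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        double[] data = {10, 12, 11, 13, 12, 11, 300};
        System.out.println(Arrays.toString(detect(data, 1.5)));
        // → only the 300 is flagged
    }
}
```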

Machine Learning Methods

// Isolation Forest
IsolationForest detector = new IsolationForest(0.1);  // contamination = 0.1
boolean[] outliers = detector.fitDetect(data);

// Local Outlier Factor
LocalOutlierFactor detector = new LocalOutlierFactor(20);  // 20 neighbors
double[] scores = detector.fitDetect(data);

Handling Outliers

// Remove outliers
Dataset cleaned = data.removeOutliers(outliers);

// Cap outliers at percentiles
Dataset capped = data.capOutliers("price", 0.05, 0.95);  // 5th and 95th percentiles

// Transform outliers
Dataset transformed = data.transformOutliers("price", Math::log);
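Capping at percentiles (also called winsorizing) clamps extreme values to the chosen lower and upper percentile values. A plain-Java sketch (illustrative `Winsorize` class, using linear-interpolation percentiles):

```java
import java.util.Arrays;

// Clamp values below the lower percentile / above the upper percentile.
public class Winsorize {
    public static double[] cap(double[] x, double lowerP, double upperP) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double lo = percentile(sorted, lowerP);
        double hi = percentile(sorted, upperP);
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = Math.max(lo, Math.min(hi, x[i]));
        }
        return out;
    }

    static double percentile(double[] sorted, double p) {
        double pos = p * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cap(new double[]{1, 2, 3, 4, 1000}, 0.0, 0.75)));
        // → [1.0, 2.0, 3.0, 4.0, 4.0]
    }
}
```

Unlike removal, capping keeps every row, which matters when outliers carry signal or the dataset is small.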

Data Splitting

Prepare data for training and evaluation:

Basic Train-Test Split

// 80-20 split
DataSplit split = data.split(0.8);
Dataset trainData = split.getTrain();
Dataset testData = split.getTest();

// With stratification (for classification)
DataSplit split = data.split(0.8, true);  // stratified split

// Custom split with validation set
DataSplit split = data.split(0.6, 0.2, 0.2);  // 60% train, 20% val, 20% test
Dataset trainData = split.getTrain();
Dataset valData = split.getValidation();
Dataset testData = split.getTest();
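Mechanically, a train-test split is just a seeded shuffle of the row indices followed by a cut. A plain-Java sketch (the `TrainTestSplit` class is illustrative, not a SuperML Java API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Shuffle row indices with a fixed seed, then cut at trainFraction.
public class TrainTestSplit {
    public static List<int[]> split(int numRows, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));  // seed makes it reproducible
        int cut = (int) Math.round(numRows * trainFraction);
        int[] train = indices.subList(0, cut).stream().mapToInt(Integer::intValue).toArray();
        int[] test = indices.subList(cut, numRows).stream().mapToInt(Integer::intValue).toArray();
        return List.of(train, test);
    }

    public static void main(String[] args) {
        List<int[]> parts = split(10, 0.8, 42L);
        System.out.println("train rows: " + parts.get(0).length
                + ", test rows: " + parts.get(1).length);
        // → train rows: 8, test rows: 2
    }
}
```

Stratification adds one step: shuffle and cut within each class separately, so both splits preserve the class proportions.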

Time Series Splitting

// For time series data
TimeSeriesSplit splitter = new TimeSeriesSplit(0.2);  // 20% test
DataSplit split = splitter.split(data, "date_column");

Cross-Validation Splits

// K-fold cross-validation
KFoldSplitter splitter = new KFoldSplitter(5);  // k = 5
List<DataSplit> folds = splitter.split(data);

// Stratified K-fold
StratifiedKFoldSplitter splitter = new StratifiedKFoldSplitter(5);  // k = 5
List<DataSplit> folds = splitter.split(data);
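K-fold splitting assigns each row to one of k folds; each fold then serves once as the test set while the remaining folds form the training set. A minimal plain-Java sketch of the fold assignment (illustrative `KFold` class):

```java
// Round-robin assignment of row indices to k folds.
public class KFold {
    public static int[] foldAssignments(int numRows, int k) {
        int[] fold = new int[numRows];
        for (int i = 0; i < numRows; i++) fold[i] = i % k;
        return fold;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(foldAssignments(7, 3)));
        // → [0, 1, 2, 0, 1, 2, 0]
    }
}
```

In practice rows are shuffled (or stratified by class) before assignment so folds are not biased by row order.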

Pipeline Creation

Combine multiple preprocessing steps:

Basic Pipeline

// Create preprocessing pipeline
Pipeline pipeline = new Pipeline()
    .add(new MissingValueImputer(Strategy.MEAN))
    .add(new StandardScaler())
    .add(new OneHotEncoder("category", "region"));

// Apply pipeline
Dataset processed = pipeline.fitTransform(data);

// Save pipeline for later use
pipeline.save("preprocessing_pipeline.bin");

// Load and apply saved pipeline
Pipeline loadedPipeline = Pipeline.load("preprocessing_pipeline.bin");
Dataset newProcessed = loadedPipeline.transform(newData);

Advanced Pipeline with Column Selection

// Column-specific transformations
ColumnTransformer transformer = new ColumnTransformer()
    .addNumeric(List.of("price", "quantity"), new StandardScaler())
    .addCategorical(List.of("category", "region"), new OneHotEncoder())
    .addText(List.of("description"), new TfIdfVectorizer())
    .addDate(List.of("order_date"), new DateFeatureExtractor());

Dataset processed = transformer.fitTransform(data);

Real-World Example: E-commerce Dataset

Let's put it all together with a complete preprocessing example:

public class EcommerceDataPreprocessor {
    
    public Dataset preprocessEcommerceData(String dataPath) {
        // Load data
        Dataset data = DataLoader.fromCSV(dataPath)
            .withColumnTypes(Map.of(
                "product_id", ColumnType.CATEGORICAL,
                "price", ColumnType.NUMERIC,
                "category", ColumnType.CATEGORICAL,
                "rating", ColumnType.NUMERIC,
                "review_count", ColumnType.NUMERIC,
                "order_date", ColumnType.DATE,
                "description", ColumnType.TEXT
            ))
            .load();
        
        // Initial data exploration
        System.out.println("Original dataset shape: " + data.getShape());
        System.out.println("Missing values: " + data.getMissingCounts());
        
        // Handle missing values
        data = data.fillMissing(Map.of(
            "rating", Strategy.MEDIAN,
            "review_count", 0.0,
            "description", "No description"
        ));
        
        // Outlier handling: cap extreme prices at the 1st/99th percentiles
        data = data.capOutliers("price", 0.01, 0.99);
        
        // Feature engineering
        DateFeatureExtractor dateExtractor = new DateFeatureExtractor()
            .withFeatures(DateFeature.YEAR, DateFeature.MONTH, 
                         DateFeature.DAY_OF_WEEK, DateFeature.IS_WEEKEND);
        data = dateExtractor.fitTransform(data, "order_date");
        
        // Bin prices into quartiles, stored in a new "price_binned" column
        Binner priceBinner = new Binner(BinningStrategy.QUANTILE, 4);
        data = priceBinner.fitTransform(data, "price_binned", "price");
        
        // Text processing for descriptions
        TfIdfVectorizer vectorizer = new TfIdfVectorizer()
            .withMaxFeatures(100)
            .withNgrams(1, 2)
            .withStopWords(StopWords.ENGLISH);
        data = vectorizer.fitTransform(data, "description");
        
        // Encoding categorical variables
        OneHotEncoder encoder = new OneHotEncoder()
            .withDropFirst(true)
            .withHandleUnknown(OneHotEncoder.HandleUnknown.IGNORE);
        data = encoder.fitTransform(data, "category", "price_binned");
        
        // Scale numerical features
        StandardScaler scaler = new StandardScaler();
        data = scaler.fitTransform(data, "price", "rating", "review_count");
        
        // Final data validation
        System.out.println("Processed dataset shape: " + data.getShape());
        System.out.println("Features: " + data.getFeatureNames());
        
        return data;
    }
}

Best Practices

1. Data Validation

// Always validate your data
public void validateData(Dataset data) {
    assert !data.isEmpty() : "Dataset cannot be empty";
    assert !data.hasMissingValues() : "Missing values should be handled";
    assert data.getNumericColumns().stream()
        .allMatch(col -> data.getColumn(col).isFinite()) : "No infinite values allowed";
}

2. Reproducibility

// Set random seeds for reproducible results
RandomState.setSeed(42);

// Save preprocessing parameters
PreprocessingConfig config = new PreprocessingConfig()
    .setScalerParams(scaler.getParams())
    .setEncoderParams(encoder.getParams());
config.save("preprocessing_config.json");

3. Memory Efficiency

// For large datasets, use streaming
StreamingDataLoader loader = new StreamingDataLoader("large_file.csv")
    .withChunkSize(10000);

while (loader.hasNext()) {
    Dataset chunk = loader.next();
    Dataset processed = pipeline.transform(chunk);
    // Process chunk
}

4. Error Handling

try {
    Dataset processed = pipeline.fitTransform(data);
} catch (DataPreprocessingException e) {
    logger.error("Preprocessing failed: " + e.getMessage());
    // Fallback to simpler preprocessing
    Dataset processed = fallbackPreprocessing(data);
}

Performance Tips

  1. Use appropriate data types: Choose the most memory-efficient data types
  2. Process in chunks: For large datasets, process data in smaller chunks
  3. Cache results: Cache expensive preprocessing operations
  4. Parallel processing: Use parallel streams for independent operations
  5. Profile your code: Use profiling tools to identify bottlenecks

Summary

In this tutorial, we covered:

  • Data Loading: From CSV, databases, JSON, and other sources
  • Missing Data: Detection and various imputation strategies
  • Feature Scaling: Standardization, normalization, and robust scaling
  • Categorical Encoding: One-hot, label, and target encoding
  • Feature Engineering: Polynomial features, binning, and date/time features
  • Outlier Handling: Detection and treatment methods
  • Data Splitting: For training, validation, and testing
  • Pipelines: Combining multiple preprocessing steps
  • Best Practices: For production-ready preprocessing

Data preprocessing is often the most time-consuming part of machine learning projects, but it's also one of the most important. Good preprocessing can significantly improve model performance, while poor preprocessing can make even the best algorithms fail.

In the next tutorial, we’ll use this preprocessed data to build and train our first machine learning model with SuperML Java.