Data Loading and Preprocessing with SuperML Java
Data preprocessing is a critical step in any machine learning pipeline. SuperML Java provides comprehensive tools for loading data from various sources and preparing it for model training. This tutorial covers everything from basic data loading to advanced preprocessing techniques.
Understanding Data in Machine Learning
Before diving into code, let’s understand what makes good ML data:
- Clean: No missing or corrupted values
- Consistent: Uniform format and scale
- Relevant: Features that correlate with the target variable
- Sufficient: Enough samples for reliable training
Data Loading with SuperML Java
Loading from CSV Files
CSV files are the most common data source, and SuperML Java makes loading them straightforward:
import org.superml.datasets.Datasets;
import org.superml.preprocessing.*;
// Basic CSV loading
Dataset data = Datasets.fromCSV("data/sales.csv");
// CSV with custom options
Dataset data = DataLoader.fromCSV("data/sales.csv")
.withHeader(true)
.withDelimiter(",")
.withEncoding("UTF-8")
.load();
// Specify column types
Dataset data = DataLoader.fromCSV("data/sales.csv")
.withColumnTypes(Map.of(
"price", ColumnType.NUMERIC,
"category", ColumnType.CATEGORICAL,
"date", ColumnType.DATE
))
.load();
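SuperML's loader handles parsing for you, but it helps to see what CSV loading boils down to. The sketch below is a minimal plain-Java illustration (the `SimpleCsv` class is hypothetical, not part of SuperML); real CSV needs quoting and escaping handling, so use a proper parser for production data.

```java
import java.io.*;
import java.util.*;

public class SimpleCsv {
    // Read delimiter-separated lines into rows of string fields.
    // Note: does not handle quoted fields containing the delimiter.
    public static List<String[]> parse(Reader source, String delimiter) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.isEmpty()) rows.add(line.split(delimiter, -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Typed loading (as in `withColumnTypes` above) then amounts to converting each string field according to its declared column type.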
Loading from Databases
For enterprise applications, database integration is essential:
import java.sql.Connection;
import java.sql.DriverManager;
// Database connection
Connection conn = DriverManager.getConnection(
"jdbc:postgresql://localhost:5432/sales_db",
"username", "password"
);
// Load from SQL query
Dataset data = DataLoader.fromDatabase(conn,
"SELECT price, category, sales_date, quantity FROM sales WHERE sales_date >= '2024-01-01'");
// Parameterized queries
String query = "SELECT * FROM customers WHERE region = ? AND signup_date >= ?";
Dataset data = DataLoader.fromDatabase(conn, query, "North", "2024-01-01");
Loading from JSON
For web applications and APIs:
// Simple JSON loading
Dataset data = DataLoader.fromJSON("data/products.json");
// JSON with nested structure
Dataset data = DataLoader.fromJSON("data/complex_data.json")
.withPath("$.results[*]") // JSONPath expression
.load();
// JSON from URL
Dataset data = DataLoader.fromURL("https://api.example.com/data.json")
.withAuth("Bearer " + apiToken)
.load();
Loading from Other Sources
// Excel files
Dataset data = DataLoader.fromExcel("data/report.xlsx")
.withSheet("Sales Data")
.withHeaderRow(1)
.load();
// Parquet files (for big data)
Dataset data = DataLoader.fromParquet("data/large_dataset.parquet");
// In-memory arrays
double[][] features = {{1, 2}, {3, 4}, {5, 6}};
double[] targets = {1, 0, 1};
Dataset data = DataLoader.fromArrays(features, targets);
Data Exploration
Before preprocessing, understand your data:
// Basic information
System.out.println("Dataset shape: " + data.getShape());
System.out.println("Feature names: " + data.getFeatureNames());
System.out.println("Target name: " + data.getTargetName());
// Statistical summary
DataSummary summary = data.describe();
System.out.println(summary);
// Missing values
Map<String, Integer> missingCounts = data.getMissingCounts();
missingCounts.forEach((column, count) ->
System.out.println(column + ": " + count + " missing values"));
// Data types
Map<String, ColumnType> types = data.getColumnTypes();
Visualization Support
// Basic plots (if visualization module is included)
data.plot().histogram("price").show();
data.plot().scatter("price", "quantity").show();
data.plot().correlation().show();
Handling Missing Data
Missing data is common in real-world datasets. SuperML Java provides several strategies:
Detection and Analysis
// Check for missing values
boolean hasMissing = data.hasMissingValues();
Map<String, Double> missingPercentages = data.getMissingPercentages();
// Visualize missing patterns
MissingDataAnalyzer analyzer = new MissingDataAnalyzer(data);
analyzer.showMissingPattern();
Removal Strategies
// Drop rows with any missing values
Dataset cleaned = data.dropMissing();
// Drop rows with missing values in specific columns
Dataset cleaned = data.dropMissing("price", "quantity");
// Drop columns with too many missing values (>50%)
Dataset cleaned = data.dropColumns(0.5);
Imputation Strategies
// Fill with constant values
Dataset filled = data.fillMissing(0.0); // Fill with 0
Dataset filled = data.fillMissing("category", "Unknown");
// Fill with statistical measures
Dataset filled = data.fillMissing(Strategy.MEAN); // Mean for numeric
Dataset filled = data.fillMissing(Strategy.MEDIAN); // Median for numeric
Dataset filled = data.fillMissing(Strategy.MODE); // Mode for categorical
// Column-specific strategies
Dataset filled = data.fillMissing(Map.of(
"price", Strategy.MEAN,
"category", Strategy.MODE,
"description", "Not Available"
));
// Forward/backward fill for time series
Dataset filled = data.fillMissing(Strategy.FORWARD_FILL);
Dataset filled = data.fillMissing(Strategy.BACKWARD_FILL);
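To make the mean strategy concrete, here is a minimal plain-Java sketch of what mean imputation computes (the `MeanImputer` class here is illustrative, not SuperML's API); missing entries are represented as `NaN` and replaced by the mean of the observed values.

```java
public class MeanImputer {
    // Replace NaN entries with the mean of the non-missing values.
    public static double[] fillWithMean(double[] column) {
        double sum = 0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) { sum += v; count++; }
        }
        double mean = count > 0 ? sum / count : 0.0;
        double[] filled = column.clone();
        for (int i = 0; i < filled.length; i++) {
            if (Double.isNaN(filled[i])) filled[i] = mean;
        }
        return filled;
    }
}
```

Median and mode imputation follow the same pattern with a different summary statistic.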
Advanced Imputation
// K-Nearest Neighbors imputation
KNNImputer imputer = new KNNImputer(5); // k = 5 neighbors
Dataset filled = imputer.fitTransform(data);
// Regression-based imputation
RegressionImputer imputer = new RegressionImputer();
Dataset filled = imputer.fitTransform(data);
// Multiple imputation
MultipleImputer imputer = new MultipleImputer(5); // 5 imputation iterations
List<Dataset> imputedDatasets = imputer.fitTransform(data);
Feature Scaling and Normalization
Machine learning algorithms often require features to be on similar scales:
Standardization (Z-score normalization)
// Standardize all numeric features (mean=0, std=1)
StandardScaler scaler = new StandardScaler();
Dataset scaled = scaler.fitTransform(data);
// Standardize specific columns
Dataset scaled = scaler.fitTransform(data, "price", "quantity", "rating");
// Manual standardization
Dataset scaled = data.standardize();
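Under the hood, standardization subtracts the column mean and divides by the standard deviation. A minimal illustrative sketch (the `ZScore` class is not part of SuperML):

```java
public class ZScore {
    // Standardize a column to mean 0 and (population) std 1.
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double std = Math.sqrt(var / x.length);
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = std == 0 ? 0 : (x[i] - mean) / std; // constant columns map to 0
        }
        return z;
    }
}
```

A fitted scaler stores the training mean and std so the same shift and scale can be reapplied to test data.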
Min-Max Normalization
// Scale to [0, 1] range
MinMaxScaler scaler = new MinMaxScaler();
Dataset scaled = scaler.fitTransform(data);
// Custom range [min, max]
MinMaxScaler scaler = new MinMaxScaler(-1, 1);
Dataset scaled = scaler.fitTransform(data);
// Quick normalization
Dataset normalized = data.normalize();
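Min-max scaling is a simple linear rescaling. An illustrative sketch of the computation (the `MinMax` class is hypothetical, not SuperML's API):

```java
public class MinMax {
    // Rescale a column linearly into [lo, hi].
    public static double[] scale(double[] x, double lo, double hi) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            out[i] = range == 0 ? lo : lo + (x[i] - min) / range * (hi - lo);
        }
        return out;
    }
}
```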
Robust Scaling
// Scale using median and IQR (robust to outliers)
RobustScaler scaler = new RobustScaler();
Dataset scaled = scaler.fitTransform(data);
Unit Vector Scaling
// Scale each sample to unit norm
Normalizer normalizer = new Normalizer();
Dataset normalized = normalizer.fitTransform(data);
Categorical Data Encoding
Convert categorical variables to numeric format:
One-Hot Encoding
// Basic one-hot encoding
OneHotEncoder encoder = new OneHotEncoder();
Dataset encoded = encoder.fitTransform(data, "category", "region");
// With options
OneHotEncoder encoder = new OneHotEncoder()
.withDropFirst(true) // Avoid multicollinearity
.withSparse(true) // Memory efficient for high cardinality
.withHandleUnknown(OneHotEncoder.HandleUnknown.IGNORE);
Dataset encoded = encoder.fitTransform(data, "category");
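Conceptually, one-hot encoding produces one 0/1 indicator column per distinct category. A minimal plain-Java sketch of the idea (the `OneHotSketch` class is illustrative, not SuperML's API):

```java
import java.util.*;

public class OneHotSketch {
    // One-hot encode a categorical column: one 0/1 column per
    // distinct category, in first-seen order.
    public static int[][] encode(String[] column) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String v : column) index.putIfAbsent(v, index.size());
        int[][] encoded = new int[column.length][index.size()];
        for (int i = 0; i < column.length; i++) {
            encoded[i][index.get(column[i])] = 1;
        }
        return encoded;
    }
}
```

The `withDropFirst(true)` option above corresponds to deleting the first indicator column, since it is fully determined by the others.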
Label Encoding
// Convert categories to integers
LabelEncoder encoder = new LabelEncoder();
Dataset encoded = encoder.fitTransform(data, "category");
// For ordinal data with custom order
OrdinalEncoder encoder = new OrdinalEncoder()
.withOrder("category", List.of("Low", "Medium", "High"));
Dataset encoded = encoder.fitTransform(data);
Target Encoding
// Encode categories based on target variable
TargetEncoder encoder = new TargetEncoder();
Dataset encoded = encoder.fitTransform(data, "category", "target");
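The idea behind target encoding is to replace each category with the mean of the target over rows having that category. A bare-bones illustrative sketch (the `TargetMeanEncoder` class is hypothetical; real implementations add smoothing to avoid overfitting rare categories):

```java
import java.util.*;

public class TargetMeanEncoder {
    // Replace each category with the mean target value of its group.
    public static double[] encode(String[] category, double[] target) {
        Map<String, double[]> agg = new HashMap<>(); // value = {sum, count}
        for (int i = 0; i < category.length; i++) {
            double[] a = agg.computeIfAbsent(category[i], c -> new double[2]);
            a[0] += target[i];
            a[1] += 1;
        }
        double[] out = new double[category.length];
        for (int i = 0; i < category.length; i++) {
            double[] a = agg.get(category[i]);
            out[i] = a[0] / a[1];
        }
        return out;
    }
}
```

Because the encoding uses the target, it must be fit on training data only, or it will leak label information into the features.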
Feature Engineering
Create new features from existing ones:
Polynomial Features
// Add polynomial combinations
PolynomialFeatures poly = new PolynomialFeatures(2); // degree 2
Dataset expanded = poly.fitTransform(data);
// With interaction terms only
PolynomialFeatures poly = new PolynomialFeatures(2) // degree 2
.withInteractionOnly(true);
Binning/Discretization
// Equal-width binning
Binner binner = new Binner(BinningStrategy.UNIFORM, 5); // 5 equal-width bins
Dataset binned = binner.fitTransform(data, "price");
// Equal-frequency binning
Binner binner = new Binner(BinningStrategy.QUANTILE, 4); // 4 equal-frequency bins
Dataset binned = binner.fitTransform(data, "age");
// Custom bin edges
Binner binner = new Binner(new double[]{0, 25, 50, 75, 100}); // custom bin edges
Dataset binned = binner.fitTransform(data, "score");
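Equal-width binning simply cuts the value range into intervals of the same size. An illustrative plain-Java sketch (the `UniformBinner` class here is not SuperML's API):

```java
public class UniformBinner {
    // Assign each value to one of `bins` equal-width intervals over
    // [min, max]; the maximum value is clamped into the last bin.
    public static int[] bin(double[] x, int bins) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : x) { min = Math.min(min, v); max = Math.max(max, v); }
        double width = (max - min) / bins;
        int[] out = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            int b = width == 0 ? 0 : (int) ((x[i] - min) / width);
            out[i] = Math.min(b, bins - 1);
        }
        return out;
    }
}
```

Quantile binning differs only in where it places the edges: at equal-frequency cut points rather than at equal distances.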
Date/Time Features
// Extract date components
DateFeatureExtractor extractor = new DateFeatureExtractor();
Dataset withDateFeatures = extractor.fitTransform(data, "order_date");
// This creates features like: year, month, day, day_of_week, hour, etc.
// Custom date features
DateFeatureExtractor extractor = new DateFeatureExtractor()
.withFeatures(DateFeature.YEAR, DateFeature.QUARTER, DateFeature.IS_WEEKEND);
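The components a date-feature extractor produces map directly onto the standard `java.time` API. A small sketch of extracting the same features per row (the `DateFeatures` class is illustrative):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

public class DateFeatures {
    // Extract {year, month, day_of_week, is_weekend} from a date.
    public static int[] extract(LocalDate d) {
        DayOfWeek dow = d.getDayOfWeek();
        boolean weekend = dow == DayOfWeek.SATURDAY || dow == DayOfWeek.SUNDAY;
        return new int[] {
            d.getYear(),
            d.getMonthValue(),
            dow.getValue(),     // 1 = Monday ... 7 = Sunday
            weekend ? 1 : 0
        };
    }
}
```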
Text Features
// TF-IDF vectorization
TfIdfVectorizer vectorizer = new TfIdfVectorizer()
.withMaxFeatures(1000)
.withNgrams(1, 2) // Unigrams and bigrams
.withStopWords(StopWords.ENGLISH);
Dataset vectorized = vectorizer.fitTransform(data, "description");
// Count vectorization
CountVectorizer vectorizer = new CountVectorizer();
Dataset vectorized = vectorizer.fitTransform(data, "text_column");
Outlier Detection and Handling
Identify and handle outliers that might affect model performance:
Statistical Methods
// Z-score method
OutlierDetector detector = new ZScoreDetector(3.0); // threshold = 3.0
boolean[] outliers = detector.detect(data, "price");
// IQR method
OutlierDetector detector = new IQRDetector(1.5); // IQR multiplier = 1.5
boolean[] outliers = detector.detect(data, "quantity");
// Modified Z-score (robust)
OutlierDetector detector = new ModifiedZScoreDetector(3.5); // threshold = 3.5
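The IQR rule flags values outside [Q1 - k·IQR, Q3 + k·IQR]. An illustrative plain-Java sketch of the computation (the `IqrOutliers` class is hypothetical; it uses a simple index-based quartile estimate, while libraries typically interpolate):

```java
import java.util.Arrays;

public class IqrOutliers {
    // Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    public static boolean[] detect(double[] x, double k) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double q1 = sorted[sorted.length / 4];
        double q3 = sorted[(3 * sorted.length) / 4];
        double iqr = q3 - q1;
        double lo = q1 - k * iqr, hi = q3 + k * iqr;
        boolean[] flags = new boolean[x.length];
        for (int i = 0; i < x.length; i++) {
            flags[i] = x[i] < lo || x[i] > hi;
        }
        return flags;
    }
}
```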
Machine Learning Methods
// Isolation Forest
IsolationForest detector = new IsolationForest(0.1); // contamination = 0.1
boolean[] outliers = detector.fitDetect(data);
// Local Outlier Factor
LocalOutlierFactor detector = new LocalOutlierFactor(20); // 20 neighbors
double[] scores = detector.fitDetect(data);
Handling Outliers
// Remove outliers
Dataset cleaned = data.removeOutliers(outliers);
// Cap outliers at percentiles
Dataset capped = data.capOutliers("price", 0.05, 0.95); // 5th and 95th percentiles
// Transform outliers
Dataset transformed = data.transformOutliers("price", Math::log);
Data Splitting
Prepare data for training and evaluation:
Basic Train-Test Split
// 80-20 split
DataSplit split = data.split(0.8);
Dataset trainData = split.getTrain();
Dataset testData = split.getTest();
// With stratification (for classification)
DataSplit split = data.split(0.8, true); // stratified split
// Custom split with validation set
DataSplit split = data.split(0.6, 0.2, 0.2); // 60% train, 20% val, 20% test
Dataset trainData = split.getTrain();
Dataset valData = split.getValidation();
Dataset testData = split.getTest();
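A random train-test split is just a seeded shuffle of the row indices followed by a cut. An illustrative sketch (the `TrainTestSplit` class is not SuperML's API):

```java
import java.util.*;

public class TrainTestSplit {
    // Shuffle row indices with a fixed seed, then cut at trainRatio.
    // Returns {trainIndices, testIndices}.
    public static int[][] split(int nRows, double trainRatio, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < nRows; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed)); // seed makes the split reproducible
        int cut = (int) Math.round(nRows * trainRatio);
        int[] train = new int[cut], test = new int[nRows - cut];
        for (int i = 0; i < cut; i++) train[i] = idx.get(i);
        for (int i = cut; i < nRows; i++) test[i - cut] = idx.get(i);
        return new int[][] { train, test };
    }
}
```

Stratification additionally performs this shuffle-and-cut per class so that class proportions are preserved in both parts.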
Time Series Splitting
// For time series data
TimeSeriesSplit splitter = new TimeSeriesSplit(0.2); // test fraction = 0.2
DataSplit split = splitter.split(data, "date_column");
Cross-Validation Splits
// K-fold cross-validation
KFoldSplitter splitter = new KFoldSplitter(5); // k = 5 folds
List<DataSplit> folds = splitter.split(data);
// Stratified K-fold
StratifiedKFoldSplitter splitter = new StratifiedKFoldSplitter(5); // k = 5 folds
List<DataSplit> folds = splitter.split(data);
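K-fold splitting partitions the row indices into k disjoint test sets; each fold trains on everything outside its test set. A minimal sketch of the index arithmetic (the `KFold` class is illustrative):

```java
public class KFold {
    // Return the test indices for fold `fold` out of `k` contiguous
    // folds over n rows; the remaining indices form the training set.
    public static int[] testFold(int n, int k, int fold) {
        int start = fold * n / k;
        int end = (fold + 1) * n / k;
        int[] test = new int[end - start];
        for (int i = start; i < end; i++) test[i - start] = i;
        return test;
    }
}
```

In practice rows are shuffled (or stratified by class) before this partition so folds are not order-dependent.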
Pipeline Creation
Combine multiple preprocessing steps:
Basic Pipeline
// Create preprocessing pipeline
Pipeline pipeline = new Pipeline()
.add(new MissingValueImputer(Strategy.MEAN))
.add(new StandardScaler())
.add(new OneHotEncoder("category", "region"));
// Apply pipeline
Dataset processed = pipeline.fitTransform(data);
// Save pipeline for later use
pipeline.save("preprocessing_pipeline.bin");
// Load and apply saved pipeline
Pipeline loadedPipeline = Pipeline.load("preprocessing_pipeline.bin");
Dataset newProcessed = loadedPipeline.transform(newData);
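At its core, a preprocessing pipeline is function composition: each step maps a dataset to a transformed dataset, and the pipeline applies them in order. This can be sketched with the standard `java.util.function.Function` API (the `PipelineSketch` class is illustrative, operating on a single column for simplicity):

```java
import java.util.List;
import java.util.function.Function;

public class PipelineSketch {
    // Compose a list of column transforms into one, applied in order.
    public static Function<double[], double[]> compose(
            List<Function<double[], double[]>> steps) {
        Function<double[], double[]> combined = Function.identity();
        for (Function<double[], double[]> step : steps) {
            combined = combined.andThen(step);
        }
        return combined;
    }
}
```

The fit/transform distinction adds statefulness on top of this: `fitTransform` learns each step's parameters from the data, while `transform` reuses already-learned parameters on new data.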
Advanced Pipeline with Column Selection
// Column-specific transformations
ColumnTransformer transformer = new ColumnTransformer()
.addNumeric(List.of("price", "quantity"), new StandardScaler())
.addCategorical(List.of("category", "region"), new OneHotEncoder())
.addText(List.of("description"), new TfIdfVectorizer())
.addDate(List.of("order_date"), new DateFeatureExtractor());
Dataset processed = transformer.fitTransform(data);
Real-World Example: E-commerce Dataset
Let’s put it all together with a complete preprocessing example:
public class EcommerceDataPreprocessor {
public Dataset preprocessEcommerceData(String dataPath) {
// Load data
Dataset data = DataLoader.fromCSV(dataPath)
.withColumnTypes(Map.of(
"product_id", ColumnType.CATEGORICAL,
"price", ColumnType.NUMERIC,
"category", ColumnType.CATEGORICAL,
"rating", ColumnType.NUMERIC,
"review_count", ColumnType.NUMERIC,
"order_date", ColumnType.DATE,
"description", ColumnType.TEXT
))
.load();
// Initial data exploration
System.out.println("Original dataset shape: " + data.getShape());
System.out.println("Missing values: " + data.getMissingCounts());
// Handle missing values
data = data.fillMissing(Map.of(
"rating", Strategy.MEDIAN,
"review_count", 0.0,
"description", "No description"
));
// Outlier detection and handling
OutlierDetector detector = new IQRDetector();
boolean[] priceOutliers = detector.detect(data, "price");
data = data.capOutliers("price", 0.01, 0.99);
// Feature engineering
DateFeatureExtractor dateExtractor = new DateFeatureExtractor()
.withFeatures(DateFeature.YEAR, DateFeature.MONTH,
DateFeature.DAY_OF_WEEK, DateFeature.IS_WEEKEND);
data = dateExtractor.fitTransform(data, "order_date");
// Create price bins
Binner priceBinner = new Binner(BinningStrategy.QUANTILE, 4); // 4 quantile bins
data = priceBinner.fitTransform(data, "price_binned", "price"); // derive "price_binned" from "price"
// Text processing for descriptions
TfIdfVectorizer vectorizer = new TfIdfVectorizer()
.withMaxFeatures(100)
.withNgrams(1, 2)
.withStopWords(StopWords.ENGLISH);
data = vectorizer.fitTransform(data, "description");
// Encoding categorical variables
OneHotEncoder encoder = new OneHotEncoder()
.withDropFirst(true)
.withHandleUnknown(OneHotEncoder.HandleUnknown.IGNORE);
data = encoder.fitTransform(data, "category", "price_binned");
// Scale numerical features
StandardScaler scaler = new StandardScaler();
data = scaler.fitTransform(data, "price", "rating", "review_count");
// Final data validation
System.out.println("Processed dataset shape: " + data.getShape());
System.out.println("Features: " + data.getFeatureNames());
return data;
}
}
Best Practices
1. Data Validation
// Always validate your data (JVM assertions require the -ea flag)
public void validateData(Dataset data) {
assert !data.isEmpty() : "Dataset cannot be empty";
assert !data.hasMissingValues() : "Missing values should be handled";
assert data.getNumericColumns().stream()
.allMatch(col -> data.getColumn(col).isFinite()) : "No infinite values allowed";
}
2. Reproducibility
// Set random seeds for reproducible results
RandomState.setSeed(42);
// Save preprocessing parameters
PreprocessingConfig config = new PreprocessingConfig()
.setScalerParams(scaler.getParams())
.setEncoderParams(encoder.getParams());
config.save("preprocessing_config.json");
3. Memory Efficiency
// For large datasets, use streaming
StreamingDataLoader loader = new StreamingDataLoader("large_file.csv")
.withChunkSize(10000);
while (loader.hasNext()) {
Dataset chunk = loader.next();
Dataset processed = pipeline.transform(chunk);
// Process chunk
}
4. Error Handling
try {
Dataset processed = pipeline.fitTransform(data);
} catch (DataPreprocessingException e) {
logger.error("Preprocessing failed: " + e.getMessage());
// Fallback to simpler preprocessing
Dataset processed = fallbackPreprocessing(data);
}
Performance Tips
- Use appropriate data types: Choose the most memory-efficient data types
- Process in chunks: For large datasets, process data in smaller chunks
- Cache results: Cache expensive preprocessing operations
- Parallel processing: Use parallel streams for independent operations
- Profile your code: Use profiling tools to identify bottlenecks
Summary
In this tutorial, we covered:
- Data Loading: From CSV, databases, JSON, and other sources
- Missing Data: Detection and various imputation strategies
- Feature Scaling: Standardization, normalization, and robust scaling
- Categorical Encoding: One-hot, label, and target encoding
- Feature Engineering: Polynomial features, binning, and date/time features
- Outlier Handling: Detection and treatment methods
- Data Splitting: For training, validation, and testing
- Pipelines: Combining multiple preprocessing steps
- Best Practices: For production-ready preprocessing
Data preprocessing is often the most time-consuming part of machine learning projects, but it’s also one of the most important. Good preprocessing can significantly improve model performance, while poor preprocessing can make even the best algorithms fail.
In the next tutorial, we’ll use this preprocessed data to build and train our first machine learning model with SuperML Java.