AutoML in Java - Automated Machine Learning
AutoML (Automated Machine Learning) in SuperML Java 2.1.0 provides intelligent algorithm selection, hyperparameter optimization, and model ensemble techniques. This tutorial covers how to leverage AutoML for rapid prototyping, production systems, and competitive machine learning with minimal manual intervention.
What You'll Learn
- AutoML Framework - Automated algorithm selection and optimization
- Hyperparameter Tuning - Grid search, random search, and Bayesian optimization
- Model Ensemble - Combining multiple models for better performance
- Kaggle Integration - One-line training on any Kaggle dataset
- Production AutoML - Enterprise-ready automated ML pipelines
- Performance Monitoring - Automated model evaluation and selection
- Custom Configurations - Advanced AutoML parameter tuning
Prerequisites
- Completion of the "Introduction to SuperML Java" tutorial
- Basic understanding of machine learning concepts
- Java development environment with SuperML Java 2.1.0
- Familiarity with model evaluation metrics
AutoML Overview
SuperML Java 2.1.0's AutoML framework automatically:
- Selects optimal algorithms from 12+ implementations
- Optimizes hyperparameters using advanced search strategies
- Creates model ensembles for improved performance
- Handles data preprocessing and feature engineering
- Provides comprehensive model evaluation and comparison
Basic AutoML Usage
One-Line AutoML
SuperML AutoML can search for a well-performing algorithm and its hyperparameters with a single call. This example walks through the workflow from loading data to making predictions with the selected model.
import org.superml.datasets.Datasets;
import org.superml.autotrainer.AutoTrainer;
public class BasicAutoMLExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - One-Line AutoML ===\n");
try {
// Load dataset
// The Iris dataset is perfect for demonstrating AutoML with 150 samples,
// 4 features (sepal/petal length/width), and 3 classes (setosa, versicolor, virginica)
var dataset = Datasets.loadIris();
System.out.println("Dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features");
// One-line AutoML!
// This single line automatically:
// 1. Tests multiple algorithms (Logistic Regression, Random Forest, SVM, etc.)
// 2. Performs hyperparameter optimization for each algorithm
// 3. Uses cross-validation to evaluate performance
// 4. Selects the best performing model
// 5. Returns a trained model ready for production
System.out.println("Starting AutoML optimization...");
long startTime = System.currentTimeMillis();
var result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
long autoMLTime = System.currentTimeMillis() - startTime;
// Display results
// The AutoML result contains comprehensive information about the optimization process
System.out.println("\n=== AutoML Results ===");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("Best Parameters: " + result.getBestParams());
System.out.println("AutoML Time: " + autoMLTime + " ms");
// Get all tested algorithms
// AutoML tests multiple algorithms and provides a complete performance comparison
// This transparency allows you to understand which algorithms work best for your data
System.out.println("\nAll Algorithm Results:");
var allResults = result.getAllResults();
allResults.forEach((algorithm, score) -> {
System.out.println("- " + algorithm + ": " + String.format("%.4f", score));
});
// Use best model for predictions
// The returned model is already trained, so no further fitting is required
// Note: predicting on the training data here is only a smoke test, not an evaluation
var bestModel = result.getBestModel();
double[] predictions = bestModel.predict(dataset.X);
System.out.println("\nAutoML completed successfully!");
System.out.println("Ready for production with optimized model: " + result.getBestAlgorithm());
} catch (Exception e) {
System.err.println("Error in AutoML: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Automatic Algorithm Selection: AutoML tests multiple algorithms and selects the best performer
- Hyperparameter Optimization: Each algorithm is automatically tuned for optimal performance
- Cross-Validation: Built-in cross-validation ensures reliable performance estimates
- Production Ready: The result is a trained model ready for immediate deployment
- Transparency: Full visibility into the optimization process and algorithm comparison
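The selection loop summarized in these points can be sketched in plain Java. This is a minimal illustration of scoring candidates with k-fold cross-validation and keeping the best one; it is not SuperML's implementation. The `Candidate` interface, the two toy classifiers, and every other name are hypothetical and exist only for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Illustrative sketch only: every name here is hypothetical, not SuperML API.
class CvSelectionSketch {

    /** A candidate algorithm: fits on training data and returns a predictor. */
    interface Candidate {
        DoubleUnaryOperator fit(double[] x, double[] y);
    }

    /** Baseline that always predicts the more frequent training label (ties go to 1). */
    static Candidate majority() {
        return (x, y) -> {
            int ones = 0;
            for (double label : y) if (label == 1.0) ones++;
            double prediction = (2 * ones >= y.length) ? 1.0 : 0.0;
            return v -> prediction;
        };
    }

    /** Cuts halfway between the two class means of a single feature. */
    static Candidate threshold() {
        return (x, y) -> {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (int i = 0; i < x.length; i++) {
                if (y[i] == 1.0) { sum1 += x[i]; n1++; } else { sum0 += x[i]; n0++; }
            }
            double cut = (sum0 / Math.max(n0, 1) + sum1 / Math.max(n1, 1)) / 2.0;
            return v -> v >= cut ? 1.0 : 0.0;
        };
    }

    /** Mean validation accuracy over k folds; row i belongs to fold i % k. */
    static double crossValScore(Candidate c, double[] x, double[] y, int k) {
        double total = 0.0;
        for (int fold = 0; fold < k; fold++) {
            int nVal = 0;
            for (int i = 0; i < x.length; i++) if (i % k == fold) nVal++;
            double[] xTr = new double[x.length - nVal], yTr = new double[x.length - nVal];
            double[] xVal = new double[nVal], yVal = new double[nVal];
            int t = 0, v = 0;
            for (int i = 0; i < x.length; i++) {
                if (i % k == fold) { xVal[v] = x[i]; yVal[v++] = y[i]; }
                else { xTr[t] = x[i]; yTr[t++] = y[i]; }
            }
            DoubleUnaryOperator model = c.fit(xTr, yTr);   // fit on training folds only
            int correct = 0;
            for (int i = 0; i < nVal; i++) if (model.applyAsDouble(xVal[i]) == yVal[i]) correct++;
            total += (double) correct / nVal;
        }
        return total / k;
    }

    /** Returns the name of the candidate with the highest cross-validated score. */
    static String selectBest(Map<String, Candidate> candidates, double[] x, double[] y, int k) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Candidate> e : candidates.entrySet()) {
            double score = crossValScore(e.getValue(), x, y, k);
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        double[] y = {0, 0, 0, 0, 0, 1, 1, 1, 1, 1};
        Map<String, Candidate> candidates = new LinkedHashMap<>();
        candidates.put("majority", majority());
        candidates.put("threshold", threshold());
        System.out.println("Selected: " + selectBest(candidates, x, y, 5));
    }
}
```

The point of the sketch is that each candidate is scored only on rows it was not fitted on, which is why the cross-validated score is a fairer basis for selection than training accuracy.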
AutoML with Train/Test Split
This example evaluates the AutoML result on a held-out test set. A train/test split does not prevent overfitting by itself, but it exposes it and gives a more realistic performance estimate for production deployment.
import org.superml.datasets.Datasets;
import org.superml.autotrainer.AutoTrainer;
import org.superml.model_selection.ModelSelection;
import org.superml.metrics.Metrics;
public class AutoMLWithSplitExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - AutoML with Train/Test Split ===\n");
try {
// Load and split dataset
// Wine dataset contains 178 samples with 13 chemical features
// Perfect for demonstrating multi-class classification with proper evaluation
var dataset = Datasets.loadWine();
// Split data into 80% training and 20% testing
// The random seed (42) ensures reproducible results
// Stratified split maintains class distribution in both sets
var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);
System.out.println("Training samples: " + split.XTrain.length);
System.out.println("Test samples: " + split.XTest.length);
// AutoML on training data
// Training only on training data prevents data leakage
// The model has never seen the test data, ensuring honest evaluation
System.out.println("\nRunning AutoML on training data...");
var result = AutoTrainer.autoML(split.XTrain, split.yTrain, "classification");
// Get best model
// The best model is selected based on cross-validation performance
// This model represents the optimal algorithm and hyperparameters
var bestModel = result.getBestModel();
// Evaluate on test set
// Test set evaluation provides unbiased performance estimate
// This simulates real-world deployment performance
double[] predictions = bestModel.predict(split.XTest);
// Calculate metrics
// Comprehensive evaluation using multiple metrics
// Each metric provides different insights into model performance
double accuracy = Metrics.accuracy(split.yTest, predictions); // Overall correctness
double precision = Metrics.precision(split.yTest, predictions); // Positive prediction accuracy
double recall = Metrics.recall(split.yTest, predictions); // True positive detection rate
double f1 = Metrics.f1Score(split.yTest, predictions); // Harmonic mean of precision/recall
System.out.println("\n=== AutoML Results ===");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("CV Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("Test Accuracy: " + String.format("%.4f", accuracy));
System.out.println("Test Precision: " + String.format("%.4f", precision));
System.out.println("Test Recall: " + String.format("%.4f", recall));
System.out.println("Test F1 Score: " + String.format("%.4f", f1));
// Display confusion matrix
// Confusion matrix shows detailed classification performance
// Diagonal elements represent correct predictions
// Off-diagonal elements show misclassification patterns
int[][] confMatrix = Metrics.confusionMatrix(split.yTest, predictions);
System.out.println("\nConfusion Matrix:");
for (int i = 0; i < confMatrix.length; i++) {
System.out.println(java.util.Arrays.toString(confMatrix[i]));
}
System.out.println("\nAutoML evaluation completed!");
} catch (Exception e) {
System.err.println("Error in AutoML: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Data Splitting: A proper train/test split exposes overfitting and provides an honest evaluation
- Stratified Sampling: Maintains class distribution across training and test sets
- Cross-Validation: AutoML uses CV on training data to select the best model
- Multiple Metrics: Comprehensive evaluation using accuracy, precision, recall, and F1-score
- Confusion Matrix: Detailed view of classification performance across all classes
- Production Readiness: Test set performance estimates real-world deployment accuracy
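To make the metrics in these points concrete, here is a plain-Java sketch that derives accuracy, precision, and recall from a confusion matrix. It is not the SuperML `Metrics` class, and the macro averaging shown is an assumption: the tutorial does not say how `Metrics.precision` and `Metrics.recall` average over classes in the multi-class case. All names are hypothetical.

```java
// Illustrative sketch only: plain-Java metrics, not the SuperML Metrics class.
class MetricsSketch {

    /** Rows index the true class, columns the predicted class. */
    static int[][] confusionMatrix(int[] yTrue, int[] yPred, int nClasses) {
        int[][] cm = new int[nClasses][nClasses];
        for (int i = 0; i < yTrue.length; i++) cm[yTrue[i]][yPred[i]]++;
        return cm;
    }

    /** Fraction of samples that fall on the diagonal. */
    static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int r = 0; r < cm.length; r++) {
            for (int c = 0; c < cm.length; c++) {
                total += cm[r][c];
                if (r == c) correct += cm[r][c];
            }
        }
        return (double) correct / total;
    }

    /** Macro precision: mean over classes of TP / (column sum). */
    static double macroPrecision(int[][] cm) {
        double sum = 0.0;
        for (int c = 0; c < cm.length; c++) {
            int predicted = 0;
            for (int r = 0; r < cm.length; r++) predicted += cm[r][c];
            sum += predicted == 0 ? 0.0 : (double) cm[c][c] / predicted;
        }
        return sum / cm.length;
    }

    /** Macro recall: mean over classes of TP / (row sum). */
    static double macroRecall(int[][] cm) {
        double sum = 0.0;
        for (int r = 0; r < cm.length; r++) {
            int actual = 0;
            for (int c = 0; c < cm.length; c++) actual += cm[r][c];
            sum += actual == 0 ? 0.0 : (double) cm[r][r] / actual;
        }
        return sum / cm.length;
    }

    /** Harmonic mean of a precision and a recall value. */
    static double f1(double precision, double recall) {
        return precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        int[] yTrue = {0, 0, 1, 1, 2, 2};
        int[] yPred = {0, 1, 1, 1, 2, 0};
        int[][] cm = confusionMatrix(yTrue, yPred, 3);
        for (int[] row : cm) System.out.println(java.util.Arrays.toString(row));
        double p = macroPrecision(cm), r = macroRecall(cm);
        System.out.printf("accuracy=%.4f precision=%.4f recall=%.4f f1=%.4f%n",
                accuracy(cm), p, r, f1(p, r));
    }
}
```

Reading the matrix row by row shows where a class is being confused with another, which a single accuracy figure hides.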
Advanced AutoML Configuration
Custom AutoML Parameters
This advanced example demonstrates how to customize AutoML behavior for specific requirements. Custom configuration allows fine-tuning of the optimization process for better performance and control.
import java.util.Map;
import org.superml.autotrainer.AutoTrainer;
import org.superml.datasets.Datasets;
public class AdvancedAutoMLExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Advanced AutoML Configuration ===\n");
try {
// Load dataset
// Generate a synthetic dataset for demonstration
// 2000 samples, 20 features, 5 classes, random seed 42
var dataset = Datasets.makeClassification(2000, 20, 5, 42);
// Count distinct labels; the maximum label would under-report by one for 0-based labels
System.out.println("Generated dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features, " +
java.util.Arrays.stream(dataset.y).distinct().count() + " classes");
// Advanced AutoML configuration
// Each setting controls a specific aspect of the optimization process
var config = new AutoTrainer.Config()
.setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost") // Select specific algorithms
.setSearchStrategy("bayesian") // Bayesian optimization for intelligent search
.setCrossValidationFolds(5) // 5-fold cross-validation for robust evaluation
.setMaxEvaluationTime(600) // 10 minutes maximum training time
.setEarlyStoppingRounds(10) // Stop early if no improvement for 10 rounds
.setEnsembleMethods(true) // Enable ensemble methods for better performance
.setFeatureSelection(true) // Automatic feature selection to reduce overfitting
.setPreprocessing(true) // Automatic data preprocessing and scaling
.setVerbose(true); // Detailed logging for transparency
System.out.println("Advanced AutoML Configuration:");
System.out.println("- Algorithms: Logistic, RandomForest, GradientBoosting, XGBoost");
System.out.println("- Search Strategy: Bayesian Optimization");
System.out.println("- Cross-Validation: 5-fold");
System.out.println("- Max Time: 10 minutes");
System.out.println("- Ensemble: Enabled");
System.out.println("- Feature Selection: Automatic");
// Run advanced AutoML
// The custom configuration provides more control over the optimization process
// Bayesian optimization intelligently explores the hyperparameter space
System.out.println("\nStarting advanced AutoML...");
long startTime = System.currentTimeMillis();
var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
long autoMLTime = System.currentTimeMillis() - startTime;
// Display detailed results
// Advanced AutoML provides comprehensive insights into the optimization process
System.out.println("\n=== Advanced AutoML Results ===");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best CV Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("Best Parameters: " + result.getBestParams());
System.out.println("Total Time: " + autoMLTime + " ms");
// Feature importance (if available)
// Shows which features contribute most to the model's predictions
// Higher values indicate more important features
if (result.getFeatureImportance() != null) {
System.out.println("\nFeature Importance (first 5 features):");
var importance = result.getFeatureImportance();
for (int i = 0; i < Math.min(5, importance.length); i++) {
System.out.println("- Feature " + i + ": " + String.format("%.4f", importance[i]));
}
}
// Ensemble information
// Ensemble methods combine multiple models for improved performance
// Typically more robust and accurate than single models
if (result.isEnsembleUsed()) {
System.out.println("\nEnsemble Information:");
System.out.println("- Ensemble Type: " + result.getEnsembleType());
System.out.println("- Number of Models: " + result.getEnsembleSize());
System.out.println("- Ensemble Score: " + String.format("%.4f", result.getEnsembleScore()));
}
// Model comparison
// Shows performance of all tested algorithms
// Helps understand which algorithms work best for this data
System.out.println("\nModel Comparison:");
var allResults = result.getAllResults();
allResults.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.forEach(entry -> {
System.out.println("- " + entry.getKey() + ": " + String.format("%.4f", entry.getValue()));
});
System.out.println("\nAdvanced AutoML completed successfully!");
} catch (Exception e) {
System.err.println("Error in advanced AutoML: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Algorithm Selection: Choose specific algorithms based on problem requirements and computational constraints
- Bayesian Optimization: Intelligent hyperparameter search that learns from previous evaluations
- Feature Selection: Automatic removal of irrelevant features to improve performance and reduce overfitting
- Ensemble Methods: Combine multiple models for improved accuracy and robustness
- Time Management: Set time limits to balance performance with computational resources
- Feature Importance: Understand which features drive model predictions for better insights
- Model Comparison: Comprehensive analysis of all tested algorithms for informed decision making
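The time limit and early-stopping settings above trade search quality for compute. The following plain-Java sketch shows one way such a stopping rule can work, assuming "early stopping rounds" means a number of consecutive evaluations without improvement; whether SuperML interprets `setEarlyStoppingRounds` this way is not confirmed by this tutorial. The class, method, and parameter names are hypothetical.

```java
// Illustrative sketch only: names are hypothetical, not SuperML API.
class BudgetedSearchSketch {

    /** Best score found and how many candidates were actually evaluated. */
    static final class Outcome {
        final double bestScore;
        final int evaluations;

        Outcome(double bestScore, int evaluations) {
            this.bestScore = bestScore;
            this.evaluations = evaluations;
        }
    }

    /**
     * Walks through candidate scores in order and stops when the time budget
     * is spent or when `patience` consecutive candidates fail to improve.
     * The scores array stands in for "train and cross-validate a candidate".
     */
    static Outcome search(double[] candidateScores, int patience, long budgetMillis) {
        long start = System.nanoTime();
        long budgetNanos = budgetMillis * 1_000_000L;
        double best = Double.NEGATIVE_INFINITY;
        int sinceImprovement = 0;
        int evaluations = 0;
        for (double score : candidateScores) {
            if (System.nanoTime() - start >= budgetNanos) break;  // time budget exhausted
            evaluations++;
            if (score > best) {
                best = score;
                sinceImprovement = 0;
            } else if (++sinceImprovement >= patience) {
                break;                                            // early stopping
            }
        }
        return new Outcome(best, evaluations);
    }

    public static void main(String[] args) {
        double[] scores = {0.70, 0.75, 0.74, 0.73, 0.72, 0.90};
        Outcome impatient = search(scores, 3, 60_000);
        Outcome patient = search(scores, 10, 60_000);
        System.out.println("patience=3:  best=" + impatient.bestScore + " after " + impatient.evaluations);
        System.out.println("patience=10: best=" + patient.bestScore + " after " + patient.evaluations);
    }
}
```

With a patience of 3 the loop stops before it reaches the strongest candidate, which illustrates why an aggressive early-stopping value can save time at the cost of the final score.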
Hyperparameter Optimization
Grid Search with AutoML
Grid search systematically evaluates all combinations of hyperparameters to find the optimal configuration. This example demonstrates comprehensive hyperparameter optimization with detailed parameter space definition.
import java.util.HashMap;
import org.superml.autotrainer.AutoTrainer;
import org.superml.autotrainer.HyperparameterOptimizer;
import org.superml.datasets.Datasets;
public class HyperparameterOptimizationExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Hyperparameter Optimization ===\n");
try {
// Load dataset
// Iris dataset provides a perfect baseline for hyperparameter optimization
var dataset = Datasets.loadIris();
// Define hyperparameter search spaces
// Each parameter space defines the range of values to test
// Careful selection of parameter ranges is crucial for effective optimization
var parameterSpaces = new HashMap<String, Object>();
// Logistic Regression parameters
// maxIter: Maximum iterations for convergence
// regularization: Type of regularization (L1, L2, or none)
// C: Regularization strength (smaller values = stronger regularization)
parameterSpaces.put("logistic_maxIter", new int[]{100, 200, 500, 1000});
parameterSpaces.put("logistic_regularization", new String[]{"l1", "l2", "none"});
parameterSpaces.put("logistic_C", new double[]{0.1, 1.0, 10.0, 100.0});
// Random Forest parameters
// nEstimators: Number of trees in the forest
// maxDepth: Maximum depth of each tree (-1 for unlimited)
// minSamplesSplit: Minimum samples required to split a node
// minSamplesLeaf: Minimum samples required at each leaf node
parameterSpaces.put("randomforest_nEstimators", new int[]{50, 100, 200});
parameterSpaces.put("randomforest_maxDepth", new int[]{5, 10, 20, -1});
parameterSpaces.put("randomforest_minSamplesSplit", new int[]{2, 5, 10});
parameterSpaces.put("randomforest_minSamplesLeaf", new int[]{1, 2, 4});
// Gradient Boosting parameters
// nEstimators: Number of boosting stages
// learningRate: Learning rate shrinks contribution of each tree
// maxDepth: Maximum depth of individual trees
parameterSpaces.put("gradientboosting_nEstimators", new int[]{100, 200, 300});
parameterSpaces.put("gradientboosting_learningRate", new double[]{0.01, 0.1, 0.2});
parameterSpaces.put("gradientboosting_maxDepth", new int[]{3, 5, 7});
System.out.println("Hyperparameter Search Configuration:");
System.out.println("- Logistic Regression: 4 x 3 x 4 = 48 combinations");
System.out.println("- Random Forest: 3 x 4 x 3 x 3 = 108 combinations");
System.out.println("- Gradient Boosting: 3 x 3 x 3 = 27 combinations");
System.out.println("- Total combinations: 183");
// Create hyperparameter optimizer
// Grid search evaluates all parameter combinations systematically
// Cross-validation ensures robust performance estimation
var optimizer = new HyperparameterOptimizer()
.setParameterSpaces(parameterSpaces)
.setSearchStrategy("grid") // Grid search strategy
.setCrossValidationFolds(5) // 5-fold cross-validation
.setMaxEvaluations(50) // Cap total evaluations; 50 is below the 183 grid points, so the full grid is not explored
.setParallelEvaluation(true) // Enable parallel processing
.setVerbose(true); // Detailed progress logging
// Run hyperparameter optimization
// The optimizer systematically evaluates parameter combinations
// Each combination is evaluated using cross-validation
System.out.println("\nStarting hyperparameter optimization...");
long startTime = System.currentTimeMillis();
var result = optimizer.optimize(dataset.X, dataset.y, "classification");
long optimizationTime = System.currentTimeMillis() - startTime;
// Display results
// Results provide comprehensive insights into the optimization process
System.out.println("\n=== Hyperparameter Optimization Results ===");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best CV Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("Best Parameters: " + result.getBestParams());
System.out.println("Optimization Time: " + optimizationTime + " ms");
System.out.println("Evaluations: " + result.getNumEvaluations());
// Top 5 configurations
// Shows the best performing parameter combinations
// Helps understand which parameter settings work best
System.out.println("\nTop 5 Configurations:");
var topConfigs = result.getTopConfigurations(5);
for (int i = 0; i < topConfigs.size(); i++) {
var config = topConfigs.get(i);
System.out.println((i + 1) + ". " + config.getAlgorithm() +
" - Score: " + String.format("%.4f", config.getScore()) +
" - Params: " + config.getParameters());
}
System.out.println("\nHyperparameter optimization completed!");
} catch (Exception e) {
System.err.println("Error in hyperparameter optimization: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Parameter Space Definition: Carefully define the range of hyperparameters to explore
- Grid Search Strategy: Systematic evaluation of all parameter combinations
- Cross-Validation: Robust performance estimation using multiple data splits
- Parallel Processing: Speed up optimization using parallel evaluation
- Evaluation Limits: Control computational cost by limiting total evaluations
- Performance Analysis: Compare top configurations to understand parameter sensitivity
- Algorithm Comparison: Understand which algorithms and parameters work best for your data
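The combination counts in the example (48, 108, and 27, for 183 in total) come from taking the Cartesian product of each algorithm's value lists. A minimal plain-Java sketch of that expansion for the logistic regression space is shown below; it illustrates the idea and does not show how `HyperparameterOptimizer` enumerates its grid internally. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names are hypothetical, not SuperML API.
class GridSketch {

    /** Cartesian product of the value lists, one map per combination. */
    static List<Map<String, Object>> expand(Map<String, List<Object>> space) {
        List<Map<String, Object>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>());                 // start with one empty combination
        for (Map.Entry<String, List<Object>> e : space.entrySet()) {
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> partial : combos) {
                for (Object value : e.getValue()) {
                    Map<String, Object> extended = new LinkedHashMap<>(partial);
                    extended.put(e.getKey(), value);       // extend by one parameter value
                    next.add(extended);
                }
            }
            combos = next;
        }
        return combos;
    }

    /** The logistic regression value lists used in the tutorial example. */
    static Map<String, List<Object>> logisticSpace() {
        Map<String, List<Object>> space = new LinkedHashMap<>();
        space.put("maxIter", Arrays.<Object>asList(100, 200, 500, 1000));
        space.put("regularization", Arrays.<Object>asList("l1", "l2", "none"));
        space.put("C", Arrays.<Object>asList(0.1, 1.0, 10.0, 100.0));
        return space;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> grid = expand(logisticSpace());
        System.out.println("Combinations: " + grid.size());
        System.out.println("First: " + grid.get(0));
    }
}
```

Because the grid size is the product of the list lengths, adding one more value to a single parameter multiplies the cost, which is why an evaluation cap or a non-exhaustive strategy becomes attractive as the space grows.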
Bayesian Optimization
Bayesian optimization uses a probabilistic model to intelligently search the hyperparameter space. This example demonstrates how Bayesian methods can find optimal parameters more efficiently than grid search.
import java.util.HashMap;
import org.superml.autotrainer.BayesianOptimizer;
import org.superml.datasets.Datasets;
public class BayesianOptimizationExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Bayesian Optimization ===\n");
try {
// Load dataset
// Generate a complex dataset to demonstrate Bayesian optimization effectiveness
var dataset = Datasets.makeClassification(1500, 15, 3, 42);
System.out.println("Dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features");
// Configure Bayesian optimization
// Bayesian optimization uses probabilistic models to guide the search
// It balances exploration (trying new areas) with exploitation (refining good areas)
var optimizer = new BayesianOptimizer()
.setAcquisitionFunction("expected_improvement") // EI guides search strategy
.setInitialRandomSamples(10) // Random initialization
.setMaxIterations(50) // Maximum optimization iterations
.setKappa(2.576) // Exploration weight, typically used by confidence-bound acquisition (higher = more exploration)
.setXi(0.01) // Improvement margin for Expected Improvement (higher = more exploration)
.setVerbose(true); // Progress logging
// Define continuous parameter spaces
// Bayesian optimization works well with continuous parameters
// The optimizer learns the relationship between parameters and performance
var parameterSpaces = new HashMap<String, Object>();
// Random Forest with continuous parameters
// Each parameter range defines the search space
parameterSpaces.put("randomforest_maxDepth", new double[]{3.0, 20.0});
parameterSpaces.put("randomforest_minSamplesSplit", new double[]{2.0, 20.0});
parameterSpaces.put("randomforest_minSamplesLeaf", new double[]{1.0, 10.0});
// Gradient Boosting with continuous parameters
// Learning rate and subsample ratio are crucial for gradient boosting
parameterSpaces.put("gradientboosting_learningRate", new double[]{0.01, 0.3});
parameterSpaces.put("gradientboosting_maxDepth", new double[]{3.0, 10.0});
parameterSpaces.put("gradientboosting_subsample", new double[]{0.5, 1.0});
System.out.println("Bayesian Optimization Configuration:");
System.out.println("- Acquisition Function: Expected Improvement");
System.out.println("- Initial Random Samples: 10");
System.out.println("- Max Iterations: 50");
System.out.println("- Exploration Parameter (κ): 2.576");
// Run Bayesian optimization
// The optimizer intelligently explores the parameter space
// Each iteration uses information from previous evaluations
System.out.println("\nStarting Bayesian optimization...");
long startTime = System.currentTimeMillis();
var result = optimizer.optimize(dataset.X, dataset.y, parameterSpaces);
long optimizationTime = System.currentTimeMillis() - startTime;
// Display results
// Bayesian optimization provides insights into the optimization process
System.out.println("\n=== Bayesian Optimization Results ===");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("Best Parameters: " + result.getBestParams());
System.out.println("Optimization Time: " + optimizationTime + " ms");
System.out.println("Total Evaluations: " + result.getNumEvaluations());
// Convergence information
// Shows how the best score improves over iterations
// Demonstrates the efficiency of Bayesian optimization
System.out.println("\nConvergence Information:");
var convergenceHistory = result.getConvergenceHistory();
System.out.println("- Best score after 10 iterations: " + String.format("%.4f", convergenceHistory.get(9)));
System.out.println("- Best score after 25 iterations: " + String.format("%.4f", convergenceHistory.get(24)));
System.out.println("- Final best score: " + String.format("%.4f", convergenceHistory.get(convergenceHistory.size() - 1)));
// Expected improvement over iterations
// Shows how the acquisition function guides the search
// Higher values indicate more promising parameter regions
System.out.println("\nExpected Improvement History:");
var eiHistory = result.getExpectedImprovementHistory();
for (int i = 0; i < Math.min(5, eiHistory.size()); i++) {
System.out.println("- Iteration " + (i + 1) + ": " + String.format("%.6f", eiHistory.get(i)));
}
System.out.println("\nBayesian optimization completed!");
} catch (Exception e) {
System.err.println("Error in Bayesian optimization: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Probabilistic Model: Bayesian optimization uses Gaussian processes to model the objective function
- Acquisition Function: Expected Improvement balances exploration and exploitation
- Continuous Parameters: Works well with continuous parameter spaces
- Intelligent Search: Learns from previous evaluations to guide future searches
- Convergence Analysis: Tracks improvement over iterations to understand optimization efficiency
- Parameter Sensitivity: Helps reveal which parameters have the most impact on performance
- Efficiency: Often finds good solutions with fewer evaluations than grid search
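For reference, the two acquisition functions that the `xi` and `kappa` settings usually belong to can be written as follows. These are the standard textbook forms for maximization, given here as an assumption about what the settings mean; this tutorial does not state the exact formulas SuperML uses. Here $\mu(x)$ and $\sigma(x)$ are the surrogate model's predicted mean and standard deviation at $x$, $f(x^{+})$ is the best score observed so far, and $\Phi$ and $\varphi$ are the standard normal CDF and PDF.

```latex
% Expected Improvement with margin \xi (set via setXi in the example)
Z = \frac{\mu(x) - f(x^{+}) - \xi}{\sigma(x)}, \qquad
\mathrm{EI}(x) =
\begin{cases}
\bigl(\mu(x) - f(x^{+}) - \xi\bigr)\,\Phi(Z) + \sigma(x)\,\varphi(Z), & \sigma(x) > 0 \\
0, & \sigma(x) = 0
\end{cases}

% Upper Confidence Bound with weight \kappa (set via setKappa in the example)
\mathrm{UCB}(x) = \mu(x) + \kappa\,\sigma(x)
```

In both forms a larger $\xi$ or $\kappa$ favors points where the surrogate is uncertain, which is the exploration side of the trade-off described above.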
Model Ensemble with AutoML
Ensemble Methods
Ensemble methods combine multiple models to create a more robust and accurate predictor. This example demonstrates how AutoML can automatically create and optimize ensemble models.
import org.superml.autotrainer.AutoTrainer;
import org.superml.ensemble.EnsembleBuilder;
import org.superml.datasets.Datasets;
import org.superml.model_selection.ModelSelection;
import org.superml.metrics.Metrics;
public class AutoMLEnsembleExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - AutoML Ensemble Methods ===\n");
try {
// Load dataset
// Wine dataset provides a good example for ensemble methods
// Multiple features and classes benefit from ensemble diversity
var dataset = Datasets.loadWine();
// Configure AutoML with ensemble
// Ensemble configuration controls how multiple models are combined
var config = new AutoTrainer.Config()
.setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost")
.setSearchStrategy("random") // Random search for efficiency
.setCrossValidationFolds(5) // 5-fold CV for robust evaluation
.setMaxEvaluationTime(300) // 5-minute time limit
.setEnsembleMethods(true) // Enable ensemble creation
.setEnsembleSize(5) // Use top 5 models
.setEnsembleStrategy("voting") // Voting ensemble strategy
.setEnsembleWeights("performance") // Weight models by performance
.setVerbose(true); // Detailed logging
System.out.println("AutoML Ensemble Configuration:");
System.out.println("- Algorithms: 4 different algorithms");
System.out.println("- Ensemble Size: Top 5 models");
System.out.println("- Ensemble Strategy: Voting");
System.out.println("- Weights: Performance-based");
// Run AutoML with ensemble
// AutoML first finds the best individual models
// Then creates an ensemble from the top performers
System.out.println("\nStarting AutoML with ensemble...");
long startTime = System.currentTimeMillis();
var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
long autoMLTime = System.currentTimeMillis() - startTime;
// Display results
// Compare individual model performance with ensemble performance
System.out.println("\n=== AutoML Ensemble Results ===");
System.out.println("Best Single Algorithm: " + result.getBestAlgorithm());
System.out.println("Best Single Score: " + String.format("%.4f", result.getBestScore()));
// Ensemble information
// Ensembles typically outperform individual models
// They provide better generalization and robustness
if (result.isEnsembleUsed()) {
System.out.println("Ensemble Used: Yes");
System.out.println("Ensemble Score: " + String.format("%.4f", result.getEnsembleScore()));
System.out.println("Ensemble Improvement: " +
String.format("%.4f", result.getEnsembleScore() - result.getBestScore()));
System.out.println("Ensemble Size: " + result.getEnsembleSize());
// Ensemble composition
// Shows which models are in the ensemble and their weights
// Higher weights indicate more reliable models
System.out.println("\nEnsemble Composition:");
var ensembleModels = result.getEnsembleModels();
var ensembleWeights = result.getEnsembleWeights();
for (int i = 0; i < ensembleModels.size(); i++) {
System.out.println("- Model " + (i + 1) + ": " + ensembleModels.get(i) +
" (Weight: " + String.format("%.3f", ensembleWeights.get(i)) + ")");
}
}
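System.out.println("\nTotal AutoML Time: " + autoMLTime + " ms");
// Test ensemble vs single model
// Caution: both models above were trained on the full dataset, so this split
// is not unseen data and the accuracies below are optimistic
// For an honest comparison, split first and run AutoML on split.XTrain only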
var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);
var bestSingleModel = result.getBestModel();
var ensembleModel = result.getEnsembleModel();
double[] singlePredictions = bestSingleModel.predict(split.XTest);
double[] ensemblePredictions = ensembleModel.predict(split.XTest);
double singleAccuracy = Metrics.accuracy(split.yTest, singlePredictions);
double ensembleAccuracy = Metrics.accuracy(split.yTest, ensemblePredictions);
System.out.println("\nTest Set Comparison:");
System.out.println("- Single Model Accuracy: " + String.format("%.4f", singleAccuracy));
System.out.println("- Ensemble Accuracy: " + String.format("%.4f", ensembleAccuracy));
System.out.println("- Improvement: " + String.format("%.4f", ensembleAccuracy - singleAccuracy));
System.out.println("\nAutoML ensemble completed successfully!");
} catch (Exception e) {
System.err.println("Error in AutoML ensemble: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Ensemble Diversity: Combining different algorithms creates diverse predictors
- Voting Strategy: Models vote on predictions, with majority or weighted voting
- Performance Weighting: Better models get higher weights in the ensemble
- Generalization: Ensembles typically generalize better than single models
- Robustness: Ensembles are less sensitive to individual model failures
- Improvement Measurement: Quantify the benefit of ensemble over single models
- Composition Analysis: Understand which models contribute to the ensemble
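Performance-weighted voting, as described in these points, can be sketched in a few lines of plain Java. This is an illustration under the assumption that weights are cross-validation scores normalized to sum to 1; the tutorial does not specify how the `"performance"` weighting is computed. The class and method names are hypothetical.

```java
// Illustrative sketch only: names are hypothetical, not SuperML API.
class WeightedVoteSketch {

    /** Normalizes model scores so that the resulting weights sum to 1. */
    static double[] performanceWeights(double[] scores) {
        double total = 0.0;
        for (double s : scores) total += s;
        double[] weights = new double[scores.length];
        for (int m = 0; m < scores.length; m++) weights[m] = scores[m] / total;
        return weights;
    }

    /**
     * predictions[m][i] is the class predicted by model m for sample i.
     * Each model adds its weight to the class it predicts; the class with the
     * largest tally wins (ties go to the lower class index).
     */
    static int[] vote(int[][] predictions, double[] weights, int nClasses) {
        int nSamples = predictions[0].length;
        int[] result = new int[nSamples];
        for (int i = 0; i < nSamples; i++) {
            double[] tally = new double[nClasses];
            for (int m = 0; m < predictions.length; m++) {
                tally[predictions[m][i]] += weights[m];
            }
            int best = 0;
            for (int c = 1; c < nClasses; c++) if (tally[c] > tally[best]) best = c;
            result[i] = best;
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] predictions = {
            {0, 1, 2},   // model 0
            {1, 1, 0},   // model 1
            {1, 1, 0}    // model 2
        };
        // Similar scores: the two agreeing models outvote the best one.
        int[] close = vote(predictions, performanceWeights(new double[]{0.9, 0.6, 0.5}), 3);
        // One dominant score: the strongest model carries the vote on its own.
        int[] dominant = vote(predictions, performanceWeights(new double[]{0.9, 0.3, 0.3}), 3);
        System.out.println(java.util.Arrays.toString(close));
        System.out.println(java.util.Arrays.toString(dominant));
    }
}
```

The two calls in `main` use the same predictions with different scores, which shows how the weighting, and not only the majority, decides the ensemble's output.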
Kaggle Integration
One-Line Kaggle Training
Kaggle integration applies AutoML to competitive machine learning. This example shows how to train on popular Kaggle datasets with minimal code.
import org.superml.kaggle.KaggleTrainingManager;
import org.superml.kaggle.KaggleIntegration.KaggleCredentials;
public class KaggleAutoMLExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Kaggle AutoML Integration ===\n");
try {
// Setup Kaggle credentials
// Kaggle API credentials are required for dataset access
// Download kaggle.json from Kaggle account settings
var credentials = KaggleCredentials.fromDefaultLocation();
var manager = new KaggleTrainingManager(credentials);
System.out.println("Kaggle AutoML Configuration:");
System.out.println("- Credentials: Loaded from ~/.kaggle/kaggle.json");
System.out.println("- AutoML: Enabled with all algorithms");
System.out.println("- Optimization: Bayesian search");
// Train on Titanic dataset with AutoML
// Titanic is a classic binary classification problem
// Features include passenger class, age, sex, fare, etc.
// Goal: Predict passenger survival
System.out.println("\nTraining on Titanic dataset...");
var titanicResults = manager.trainOnDataset(
"titanic", // competition name
"titanic", // dataset name
"Survived" // target column
);
// Display Titanic results
// AutoML automatically handles feature engineering and model selection
System.out.println("\n=== Titanic Results ===");
var bestTitanic = titanicResults.get(0);
System.out.println("Best Algorithm: " + bestTitanic.algorithm);
System.out.println("CV Score: " + String.format("%.4f", bestTitanic.cvScore));
System.out.println("Validation Score: " + String.format("%.4f", bestTitanic.validationScore));
System.out.println("Parameters: " + bestTitanic.parameters);
// Train on House Prices dataset with AutoML
// House Prices is a regression problem with complex features
// Features include square footage, location, age, quality ratings
// Goal: Predict house sale price
System.out.println("\nTraining on House Prices dataset...");
var houseResults = manager.trainOnDataset(
"house-prices-advanced-regression-techniques",
"house-prices-advanced-regression-techniques",
"SalePrice"
);
// Display House Prices results
// Regression problems use different metrics (RMSE, MAE)
System.out.println("\n=== House Prices Results ===");
var bestHouse = houseResults.get(0);
System.out.println("Best Algorithm: " + bestHouse.algorithm);
System.out.println("CV Score: " + String.format("%.4f", bestHouse.cvScore));
System.out.println("Validation Score: " + String.format("%.4f", bestHouse.validationScore));
System.out.println("Parameters: " + bestHouse.parameters);
// Advanced Kaggle AutoML with custom configuration
// Custom configuration provides more control over the AutoML process
System.out.println("\nAdvanced Kaggle AutoML...");
var advancedConfig = new KaggleTrainingManager.Config()
.setAutoMLEnabled(true)
.setAlgorithms("randomforest", "xgboost", "gradientboosting")
.setSearchStrategy("bayesian") // Bayesian optimization
.setMaxEvaluationTime(1800) // 30 minutes maximum
.setCrossValidationFolds(5) // 5-fold cross-validation
.setEnsembleEnabled(true) // Enable ensemble methods
.setFeatureEngineering(true); // Automatic feature engineering
// Train on Digit Recognizer (image classification)
// Digit Recognizer is a computer vision problem
// Features are pixel values of handwritten digits
// Goal: Classify digits 0-9
var advancedResults = manager.trainOnDatasetWithConfig(
"digit-recognizer",
"digit-recognizer",
"label",
advancedConfig
);
// Display advanced results
// A custom configuration gives more control, but better scores are not guaranteed
System.out.println("\n=== Advanced Results (Digit Recognizer) ===");
var bestAdvanced = advancedResults.get(0);
System.out.println("Best Algorithm: " + bestAdvanced.algorithm);
System.out.println("CV Score: " + String.format("%.4f", bestAdvanced.cvScore));
System.out.println("Feature Engineering: " + bestAdvanced.featureEngineering);
System.out.println("Ensemble Used: " + bestAdvanced.ensembleUsed);
System.out.println("\nKaggle AutoML integration completed successfully!");
} catch (Exception e) {
System.err.println("Error in Kaggle AutoML: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Kaggle Integration: Seamless access to popular machine learning datasets
- Competition Diversity: Different problem types (classification, regression, computer vision)
- Automatic Preprocessing: AutoML handles data cleaning and feature engineering
- Performance Benchmarking: Compare results against Kaggle leaderboards
- Advanced Configuration: Custom settings for competitive performance
- Feature Engineering: Automatic creation of relevant features
- Ensemble Methods: Combine multiple models for better competition scores
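The Kaggle example expects a credentials file at ~/.kaggle/kaggle.json. A small pre-flight check can produce a clearer message than a failure deep inside a training run. The sketch below uses only the JDK; the class `KaggleCredentialCheck` and its methods are hypothetical helpers for illustration and are not part of SuperML.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a SuperML class): checks for the Kaggle
// credentials file before any Kaggle-backed training is started.
class KaggleCredentialCheck {

    // Builds <userHome>/.kaggle/kaggle.json, the location the example reads from.
    static Path defaultCredentialsPath(String userHome) {
        return Paths.get(userHome, ".kaggle", "kaggle.json");
    }

    // True only if the path is an existing, readable regular file.
    static boolean credentialsAvailable(Path path) {
        return Files.isRegularFile(path) && Files.isReadable(path);
    }

    public static void main(String[] args) {
        Path path = defaultCredentialsPath(System.getProperty("user.home"));
        if (credentialsAvailable(path)) {
            System.out.println("Found Kaggle credentials at " + path);
        } else {
            System.out.println("Missing " + path
                    + "; download kaggle.json from your Kaggle account settings");
        }
    }
}
```

Running the check once at startup keeps credential problems separate from training errors.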
Production AutoML Pipeline
Enterprise AutoML System
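import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import org.superml.autotrainer.AutoTrainer;
import org.superml.persistence.ModelPersistence;
import org.superml.drift.DriftDetector;
import org.superml.monitoring.ModelMonitor;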
@Service
public class ProductionAutoMLPipeline {
private final ModelMonitor monitor;
private final DriftDetector driftDetector;
public ProductionAutoMLPipeline() {
this.monitor = new ModelMonitor();
this.driftDetector = new DriftDetector();
}
@Scheduled(fixedRate = 86400000) // Daily retraining
public void autoRetraining() {
System.out.println("=== Production AutoML Pipeline ===\n");
try {
// Load latest production data
var dataset = loadLatestProductionData();
System.out.println("Production Data: " + dataset.X.length + " samples");
System.out.println("Data Quality Score: " + assessDataQuality(dataset));
// Check for data drift
boolean driftDetected = driftDetector.detectDrift(dataset.X);
if (driftDetected) {
System.out.println("Data drift detected - triggering full retraining");
// Advanced AutoML configuration for production
var config = new AutoTrainer.Config()
.setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost")
.setSearchStrategy("bayesian")
.setCrossValidationFolds(5)
.setMaxEvaluationTime(3600) // 1 hour max
.setEarlyStoppingRounds(20)
.setEnsembleMethods(true)
.setEnsembleSize(3)
.setFeatureSelection(true)
.setPreprocessing(true)
.setRobustScaling(true)
.setOutlierDetection(true)
.setVerbose(true);
// Run production AutoML
System.out.println("Starting production AutoML...");
var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
// Validate model performance
double currentScore = getCurrentModelScore();
double newScore = result.getBestScore();
System.out.println("Current Model Score: " + String.format("%.4f", currentScore));
System.out.println("New Model Score: " + String.format("%.4f", newScore));
if (newScore > currentScore + 0.01) { // require an absolute gain of at least 0.01
// Deploy new model
deployNewModel(result);
System.out.println("New model deployed successfully!");
} else {
System.out.println("New model not significantly better - keeping current model");
}
} else {
System.out.println("No data drift detected - model remains current");
}
} catch (Exception e) {
System.err.println("Error in production AutoML: " + e.getMessage());
notifyOperationsTeam(e);
}
}
private void deployNewModel(AutoTrainer.Result result) {
try {
// Save model with comprehensive metadata
Map<String, Object> metadata = new HashMap<>();
metadata.put("deployment_date", new Date().toString());
metadata.put("algorithm", result.getBestAlgorithm());
metadata.put("cv_score", result.getBestScore());
metadata.put("ensemble_used", result.isEnsembleUsed());
metadata.put("feature_count", result.getFeatureCount());
metadata.put("training_samples", result.getTrainingSamples());
metadata.put("hyperparameters", result.getBestParams());
String modelPath = "models/production_model_" + System.currentTimeMillis() + ".superml";
ModelPersistence.save(result.getBestModel(), modelPath, "Production AutoML Model", metadata);
// Update model registry
updateModelRegistry(modelPath, metadata);
// Update monitoring
monitor.updateModel(result.getBestModel());
} catch (Exception e) {
System.err.println("Error deploying model: " + e.getMessage());
throw new RuntimeException("Model deployment failed", e);
}
}
private double assessDataQuality(Dataset dataset) {
// Implement data quality assessment
return 0.95; // Placeholder
}
private double getCurrentModelScore() {
// Get current production model score
return 0.87; // Placeholder
}
private Dataset loadLatestProductionData() {
// Load latest production data
return new Dataset(); // Placeholder
}
private void updateModelRegistry(String modelPath, Map<String, Object> metadata) {
// Update model registry with new model
}
private void notifyOperationsTeam(Exception e) {
// Send alert to operations team
}
}
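The pipeline above delegates the drift decision to `driftDetector.detectDrift`, and the tutorial does not show how that check works internally. To make the idea concrete, here is a minimal sketch of one simple approach: flag drift when any feature mean in the new data moves too far from a reference sample, measured in reference standard deviations. The class `MeanShiftDrift`, its method names, and the threshold value are assumptions for illustration and do not describe SuperML's `DriftDetector`.

```java
// Illustrative only: a simple mean-shift drift check, not SuperML's DriftDetector.
class MeanShiftDrift {

    // Column-wise mean of a row-major feature matrix.
    static double[] columnMeans(double[][] x) {
        int cols = x[0].length;
        double[] means = new double[cols];
        for (double[] row : x) {
            for (int j = 0; j < cols; j++) means[j] += row[j];
        }
        for (int j = 0; j < cols; j++) means[j] /= x.length;
        return means;
    }

    // Column-wise population standard deviation around the given means.
    static double[] columnStds(double[][] x, double[] means) {
        double[] stds = new double[means.length];
        for (double[] row : x) {
            for (int j = 0; j < means.length; j++) {
                double d = row[j] - means[j];
                stds[j] += d * d;
            }
        }
        for (int j = 0; j < means.length; j++) stds[j] = Math.sqrt(stds[j] / x.length);
        return stds;
    }

    // Reports drift when any feature mean shifts by more than `threshold`
    // reference standard deviations.
    static boolean detectDrift(double[][] reference, double[][] current, double threshold) {
        double[] refMeans = columnMeans(reference);
        double[] refStds = columnStds(reference, refMeans);
        double[] curMeans = columnMeans(current);
        for (int j = 0; j < refMeans.length; j++) {
            double scale = Math.max(refStds[j], 1e-12); // guard against zero variance
            if (Math.abs(curMeans[j] - refMeans[j]) / scale > threshold) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        double[][] reference = {{1.0, 10.0}, {2.0, 12.0}, {3.0, 14.0}};
        double[][] shifted = {{6.0, 10.0}, {7.0, 12.0}, {8.0, 14.0}};
        System.out.println("Same data drifted: " + detectDrift(reference, reference, 0.5));
        System.out.println("Shifted data drifted: " + detectDrift(reference, shifted, 0.5));
    }
}
```

A mean-shift test is cheap but misses changes in spread or shape, so production systems often combine it with distribution-level tests.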
AutoML Performance Monitoring
import org.superml.autotrainer.AutoTrainer;
import org.superml.datasets.Datasets;
import org.superml.metrics.ModelMetrics;
import org.superml.monitoring.AutoMLMonitor;
public class AutoMLPerformanceMonitoring {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - AutoML Performance Monitoring ===\n");
try {
// Create AutoML monitor
var monitor = new AutoMLMonitor()
.setMetricsCollection(true)
.setPerformanceThresholds(0.05) // 5% degradation threshold
.setDriftDetection(true)
.setAlertingEnabled(true)
.setLoggingLevel("INFO");
// Monitor multiple AutoML runs
System.out.println("Monitoring AutoML Performance...");
for (int run = 1; run <= 5; run++) {
System.out.println("\nAutoML Run " + run + ":");
// Generate synthetic data for each run
var dataset = Datasets.makeClassification(1000, 10, 3, run * 42);
// Run AutoML with monitoring
var startTime = System.currentTimeMillis();
var result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
var endTime = System.currentTimeMillis();
// Collect metrics
var metrics = new ModelMetrics()
.setScore(result.getBestScore())
.setAlgorithm(result.getBestAlgorithm())
.setTrainingTime(endTime - startTime)
.setDataSize(dataset.X.length)
.setFeatureCount(dataset.X[0].length);
// Update monitor
monitor.recordRun(run, metrics);
System.out.println("- Algorithm: " + result.getBestAlgorithm());
System.out.println("- Score: " + String.format("%.4f", result.getBestScore()));
System.out.println("- Time: " + (endTime - startTime) + " ms");
// Check for performance degradation
if (run > 1) {
double degradation = monitor.checkPerformanceDegradation(run);
if (degradation > 0.05) {
System.out.println("Performance degradation detected: " +
String.format("%.2f%%", degradation * 100));
}
}
}
// Display monitoring summary
System.out.println("\n=== AutoML Monitoring Summary ===");
var summary = monitor.getSummary();
System.out.println("Average Score: " + String.format("%.4f", summary.getAverageScore()));
System.out.println("Best Score: " + String.format("%.4f", summary.getBestScore()));
System.out.println("Worst Score: " + String.format("%.4f", summary.getWorstScore()));
System.out.println("Score Variance: " + String.format("%.4f", summary.getScoreVariance()));
System.out.println("Average Time: " + summary.getAverageTime() + " ms");
System.out.println("Most Successful Algorithm: " + summary.getMostSuccessfulAlgorithm());
// Algorithm performance breakdown
System.out.println("\nAlgorithm Performance Breakdown:");
var algorithmStats = summary.getAlgorithmStats();
algorithmStats.forEach((algorithm, stats) -> {
System.out.println("- " + algorithm + ": " +
stats.getWinRate() + "% win rate, " +
"avg score: " + String.format("%.4f", stats.getAverageScore()));
});
System.out.println("\nAutoML performance monitoring completed!");
} catch (Exception e) {
System.err.println("Error in AutoML monitoring: " + e.getMessage());
e.printStackTrace();
}
}
}
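The monitoring example compares the value returned by `monitor.checkPerformanceDegradation(run)` against 0.05, but the tutorial does not say how that value is computed. One plausible definition, sketched here as an assumption, is the relative drop of the latest score against the best earlier score. The class `DegradationCheck` is hypothetical and independent of `AutoMLMonitor`.

```java
import java.util.List;

// Hypothetical sketch: one way to define "performance degradation".
// This is not the implementation behind AutoMLMonitor.
class DegradationCheck {

    // Relative drop of the latest score against the best earlier score.
    // Returns 0.0 when there is no history or the latest score is no worse.
    static double relativeDegradation(List<Double> scores) {
        if (scores.size() < 2) return 0.0;
        double latest = scores.get(scores.size() - 1);
        double bestEarlier = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.size() - 1; i++) {
            bestEarlier = Math.max(bestEarlier, scores.get(i));
        }
        if (bestEarlier <= 0.0) return 0.0; // relative change is undefined here
        return Math.max(0.0, (bestEarlier - latest) / bestEarlier);
    }

    public static void main(String[] args) {
        List<Double> scores = List.of(0.90, 0.92, 0.85);
        double degradation = relativeDegradation(scores);
        System.out.println(String.format("Degradation: %.2f%%", degradation * 100));
    }
}
```

Comparing against the best earlier run rather than the previous run avoids masking a slow decline that happens in small steps.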
Best Practices
1. Data Preparation for AutoML
- Quality Check: Ensure data quality before AutoML
- Feature Engineering: Let AutoML handle basic preprocessing
- Data Size: Aim for at least 100 samples per class; smaller classes make cross-validation scores unreliable
- Validation: Use proper train/validation/test splits
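The data-size guideline above can be checked before a run starts. The following sketch counts samples per class and compares the smallest class with a minimum. It assumes integer class labels in an `int[]`; the class `ClassBalanceCheck` is a hypothetical helper, and the label type used by SuperML datasets may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical pre-flight check, assuming integer class labels.
class ClassBalanceCheck {

    // Number of samples for each distinct label, ordered by label.
    static Map<Integer, Integer> countPerClass(int[] labels) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int label : labels) counts.merge(label, 1, Integer::sum);
        return counts;
    }

    // True when at least one class exists and every class has minPerClass samples.
    static boolean hasMinimumPerClass(int[] labels, int minPerClass) {
        Map<Integer, Integer> counts = countPerClass(labels);
        if (counts.isEmpty()) return false;
        for (int count : counts.values()) {
            if (count < minPerClass) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 0, 1, 1, 2};
        System.out.println("Counts: " + countPerClass(labels));
        System.out.println("At least 2 per class: " + hasMinimumPerClass(labels, 2));
    }
}
```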
2. AutoML Configuration
- Time Limits: Set reasonable time limits for production systems
- Algorithm Selection: Choose algorithms appropriate for your problem
- Cross-Validation: Use sufficient folds for reliable estimates
- Early Stopping: Enable early stopping to end the search once scores stop improving, which saves time and limits overfitting to the validation folds
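Time limits, evaluation caps, and early stopping all bound the same loop. To show how they interact, here is a minimal sketch of a search that stops at whichever limit is hit first. The class `BudgetedSearch`, its `Outcome` type, and the `patience` parameter are assumptions for illustration; they are not SuperML APIs, and the candidate evaluation is passed in as a plain function.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical sketch of a budgeted search loop; not a SuperML class.
class BudgetedSearch {

    static final class Outcome {
        final int bestIndex;
        final double bestScore;
        final int evaluations;

        Outcome(int bestIndex, double bestScore, int evaluations) {
            this.bestIndex = bestIndex;
            this.bestScore = bestScore;
            this.evaluations = evaluations;
        }
    }

    // Evaluates candidates 0..maxEvaluations-1 and stops early when the time
    // budget is exhausted or `patience` candidates in a row fail to improve.
    static Outcome search(IntToDoubleFunction evaluate, int maxEvaluations,
                          int patience, long maxMillis) {
        long deadline = System.nanoTime() + maxMillis * 1_000_000L;
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        int sinceImprovement = 0;
        int evaluations = 0;
        for (int i = 0; i < maxEvaluations; i++) {
            if (System.nanoTime() - deadline >= 0) break; // time budget used up
            double score = evaluate.applyAsDouble(i);
            evaluations++;
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
                sinceImprovement = 0;
            } else if (++sinceImprovement >= patience) {
                break; // no improvement for `patience` evaluations
            }
        }
        return new Outcome(bestIndex, bestScore, evaluations);
    }

    public static void main(String[] args) {
        double[] scores = {0.50, 0.70, 0.60, 0.65, 0.90};
        Outcome outcome = search(i -> scores[i], scores.length, 2, 60_000);
        System.out.println("Best candidate: " + outcome.bestIndex
                + " after " + outcome.evaluations + " evaluations");
    }
}
```

In the `main` example the search stops before reaching the last candidate, which illustrates the trade-off: a small patience saves time but can miss a late improvement.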
3. Model Selection
- Ensemble Methods: Use ensemble for improved robustness
- Performance Metrics: Choose appropriate metrics for your problem
- Validation Strategy: Reserve a holdout set that the AutoML search never sees and use it for the final evaluation
- Model Complexity: Balance complexity with interpretability
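A holdout set only gives an honest estimate if it is split off before the search begins. The sketch below produces reproducible train and test index arrays with a seeded shuffle. The class `HoldoutSplit` is a hypothetical helper written with the JDK only; SuperML may ship its own splitting utilities.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper: seeded shuffle split into train and test indices.
class HoldoutSplit {

    // Returns {trainIndices, testIndices}; the same seed yields the same split.
    static int[][] split(int nSamples, double testFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < nSamples; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        int nTest = (int) Math.round(nSamples * testFraction);
        int[] test = new int[nTest];
        int[] train = new int[nSamples - nTest];
        for (int i = 0; i < nSamples; i++) {
            if (i < nTest) test[i] = indices.get(i);
            else train[i - nTest] = indices.get(i);
        }
        return new int[][] {train, test};
    }

    public static void main(String[] args) {
        int[][] parts = split(10, 0.2, 42L);
        System.out.println("Train size: " + parts[0].length + ", test size: " + parts[1].length);
    }
}
```

The train indices go to the AutoML search; the test indices are used once, on the final model.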
4. Production Deployment
- Model Monitoring: Implement continuous monitoring
- Drift Detection: Monitor for data and model drift
- Automated Retraining: Set up automated retraining pipelines
- Fallback Models: Maintain fallback models for reliability
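A fallback model can be wired in as a thin wrapper around the primary predictor. The sketch below tries the primary model and falls back to a simpler one if the primary throws. `FallbackPredictor` is a hypothetical class, and models are represented as plain functions from a feature vector to a score, which is an assumption rather than SuperML's model interface.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical wrapper: models are plain functions here, not SuperML types.
class FallbackPredictor {

    private final ToDoubleFunction<double[]> primary;
    private final ToDoubleFunction<double[]> fallback;

    FallbackPredictor(ToDoubleFunction<double[]> primary,
                      ToDoubleFunction<double[]> fallback) {
        this.primary = primary;
        this.fallback = fallback;
    }

    // Uses the primary model; if it fails at runtime, answers with the fallback.
    double predict(double[] features) {
        try {
            return primary.applyAsDouble(features);
        } catch (RuntimeException e) {
            return fallback.applyAsDouble(features);
        }
    }

    public static void main(String[] args) {
        FallbackPredictor predictor = new FallbackPredictor(
                f -> { throw new IllegalStateException("primary model unavailable"); },
                f -> 0.5);
        System.out.println("Prediction: " + predictor.predict(new double[] {1.0, 2.0}));
    }
}
```

In practice the fallback path should also log or raise an alert, so that a silently failing primary model does not go unnoticed.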
Troubleshooting
Common AutoML Issues
// Problem: Long training times
// Solution: Set time limits and use early stopping
var timeBoundedConfig = new AutoTrainer.Config()
.setMaxEvaluationTime(600) // 10 minutes max
.setEarlyStoppingRounds(10) // Stop early if no improvement
.setMaxEvaluations(50); // Limit total evaluations
// Problem: Poor performance
// Solution: Enable feature selection and preprocessing
var preprocessingConfig = new AutoTrainer.Config()
.setFeatureSelection(true) // Automatic feature selection
.setPreprocessing(true) // Automatic preprocessing
.setOutlierDetection(true) // Handle outliers
.setRobustScaling(true); // Robust scaling
// Problem: Overfitting
// Solution: Use proper cross-validation and regularization
var regularizedConfig = new AutoTrainer.Config()
.setCrossValidationFolds(5) // 5-fold CV
.setValidationStrategy("stratified") // Stratified CV
.setRegularization(true); // Enable regularization
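Stratified cross-validation keeps the class proportions roughly equal in every fold, which matters most when classes are imbalanced. As a minimal sketch of the idea, the code below assigns each sample a fold by dealing the samples of each class round-robin across folds. The class `StratifiedFolds` is hypothetical, assumes integer labels, and omits the shuffling a real implementation would normally add.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of stratified fold assignment; no shuffling is done.
class StratifiedFolds {

    // Returns a fold id in [0, k) for every sample. Samples of the same class
    // are dealt round-robin, so each fold gets a near-equal share of each class.
    static int[] assign(int[] labels, int k) {
        int[] folds = new int[labels.length];
        Map<Integer, Integer> seenPerClass = new HashMap<>();
        for (int i = 0; i < labels.length; i++) {
            // merge returns the updated count; subtract 1 for the count before this sample
            int earlier = seenPerClass.merge(labels[i], 1, Integer::sum) - 1;
            folds[i] = earlier % k;
        }
        return folds;
    }

    public static void main(String[] args) {
        int[] labels = {0, 0, 0, 0, 1, 1};
        System.out.println(Arrays.toString(assign(labels, 2)));
    }
}
```

With four samples of class 0 and two of class 1, each of the two folds receives two samples of class 0 and one of class 1.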
Summary
In this tutorial, you learned:
- AutoML Framework: Automated algorithm selection and optimization
- Hyperparameter Tuning: Grid search, random search, and Bayesian optimization
- Model Ensemble: Combining multiple models for better performance
- Kaggle Integration: One-line training on Kaggle datasets
- Production AutoML: Enterprise-ready automated ML pipelines
- Performance Monitoring: Continuous monitoring and drift detection
- Best Practices: Guidelines for effective AutoML deployment
AutoML in SuperML Java 2.1.0 provides intelligent automation while maintaining the flexibility and performance needed for enterprise applications. The framework handles the complexity of algorithm selection and hyperparameter optimization while providing transparent results and professional-grade deployment capabilities.
Next Steps
- Explore XGBoost: Learn advanced gradient boosting techniques
- Neural Networks: Implement deep learning with MLP, CNN, and RNN
- Model Deployment: Production deployment with inference engine
- Advanced Ensemble: Custom ensemble methods and voting strategies
- MLOps Integration: CI/CD pipelines for machine learning
You're now ready to leverage AutoML for rapid prototyping, production systems, and competitive machine learning with SuperML Java 2.1.0!