📖 Lesson ⏱️ 90 minutes

AutoML Framework

Automated algorithm selection and hyperparameter optimization

AutoML in Java - Automated Machine Learning

AutoML (Automated Machine Learning) in SuperML Java 2.1.0 provides intelligent algorithm selection, hyperparameter optimization, and model ensemble techniques. This tutorial covers how to leverage AutoML for rapid prototyping, production systems, and competitive machine learning with minimal manual intervention.

What You’ll Learn

  • AutoML Framework - Automated algorithm selection and optimization
  • Hyperparameter Tuning - Grid search, random search, and Bayesian optimization
  • Model Ensemble - Combining multiple models for better performance
  • Kaggle Integration - One-line training on any Kaggle dataset
  • Production AutoML - Enterprise-ready automated ML pipelines
  • Performance Monitoring - Automated model evaluation and selection
  • Custom Configurations - Advanced AutoML parameter tuning

Prerequisites

  • Completion of “Introduction to SuperML Java” tutorial
  • Basic understanding of machine learning concepts
  • Java development environment with SuperML Java 2.1.0
  • Familiarity with model evaluation metrics

AutoML Overview

SuperML Java 2.1.0’s AutoML framework automatically:

  • Selects optimal algorithms from 12+ implementations
  • Optimizes hyperparameters using advanced search strategies
  • Creates model ensembles for improved performance
  • Handles data preprocessing and feature engineering
  • Provides comprehensive model evaluation and comparison

Basic AutoML Usage

One-Line AutoML

The most powerful feature of SuperML AutoML is its ability to automatically find the best machine learning algorithm and hyperparameters with a single line of code. This example walks through the complete AutoML workflow, from loading the data to obtaining a trained model and using it for predictions.

import org.superml.datasets.Datasets;
import org.superml.autotrainer.AutoTrainer;

public class BasicAutoMLExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - One-Line AutoML ===\n");
        
        try {
            // Load dataset
            // The Iris dataset is perfect for demonstrating AutoML with 150 samples,
            // 4 features (sepal/petal length/width), and 3 classes (setosa, versicolor, virginica)
            var dataset = Datasets.loadIris();
            
            System.out.println("📊 Dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features");
            
            // One-line AutoML!
            // This single line automatically:
            // 1. Tests multiple algorithms (Logistic Regression, Random Forest, SVM, etc.)
            // 2. Performs hyperparameter optimization for each algorithm
            // 3. Uses cross-validation to evaluate performance
            // 4. Selects the best performing model
            // 5. Returns a trained model ready for production
            System.out.println("🤖 Starting AutoML optimization...");
            long startTime = System.currentTimeMillis();
            
            var result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
            
            long autoMLTime = System.currentTimeMillis() - startTime;
            
            // Display results
            // The AutoML result contains comprehensive information about the optimization process
            System.out.println("\n=== AutoML Results ===");
            System.out.println("🏆 Best Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 Best Score: " + String.format("%.4f", result.getBestScore()));
            System.out.println("⚙️ Best Parameters: " + result.getBestParams());
            System.out.println("⏱️ AutoML Time: " + autoMLTime + " ms");
            
            // Get all tested algorithms
            // AutoML tests multiple algorithms and provides a complete performance comparison
            // This transparency allows you to understand which algorithms work best for your data
            System.out.println("\n📈 All Algorithm Results:");
            var allResults = result.getAllResults();
            allResults.forEach((algorithm, score) -> {
                System.out.println("- " + algorithm + ": " + String.format("%.4f", score));
            });
            
            // Use the best model for predictions
            // The best model is returned already trained, so no further fitting is needed
            // Predicting on the training data here only demonstrates the API;
            // it is not an unbiased estimate of accuracy on new data
            var bestModel = result.getBestModel();
            double[] predictions = bestModel.predict(dataset.X);
            
            System.out.println("\n✅ AutoML completed successfully!");
            System.out.println("🎯 Ready for production with optimized model: " + result.getBestAlgorithm());
            
        } catch (Exception e) {
            System.err.println("❌ Error in AutoML: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Automatic Algorithm Selection: AutoML tests multiple algorithms and selects the best performer
  • Hyperparameter Optimization: Each algorithm is automatically tuned for optimal performance
  • Cross-Validation: Built-in cross-validation ensures reliable performance estimates
  • Production Ready: The result is a trained model ready for immediate deployment
  • Transparency: Full visibility into the optimization process and algorithm comparison
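
To make the selection step concrete, here is a minimal plain-Java sketch of the idea behind "test several algorithms with cross-validation and keep the best". It does not use the SuperML API. The `Learner` interface, the two toy learners, and the tiny dataset are assumptions made up for illustration, and the real AutoTrainer may split folds and score models differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Sketch: score each candidate with k-fold cross-validation and keep the best. */
class CvSelectionSketch {

    /** Hypothetical learner contract: fit on training rows, return a predictor. */
    interface Learner {
        ToIntFunction<double[]> fit(double[][] x, int[] y);
    }

    static int maxLabel(int[] y) {
        int max = 0;
        for (int label : y) max = Math.max(max, label);
        return max;
    }

    /** Baseline: always predicts the most frequent training label. */
    static final Learner MAJORITY = (x, y) -> {
        int[] counts = new int[maxLabel(y) + 1];
        for (int label : y) counts[label]++;
        int best = 0;
        for (int c = 1; c < counts.length; c++) if (counts[c] > counts[best]) best = c;
        final int majority = best;
        return row -> majority;
    };

    /** Nearest centroid: predicts the class whose mean feature vector is closest. */
    static final Learner NEAREST_CENTROID = (x, y) -> {
        int classes = maxLabel(y) + 1, dims = x[0].length;
        double[][] mean = new double[classes][dims];
        int[] n = new int[classes];
        for (int i = 0; i < x.length; i++) {
            n[y[i]]++;
            for (int d = 0; d < dims; d++) mean[y[i]][d] += x[i][d];
        }
        for (int c = 0; c < classes; c++)
            for (int d = 0; d < dims; d++) mean[c][d] /= Math.max(1, n[c]);
        return row -> {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < classes; c++) {
                if (n[c] == 0) continue; // class absent from this training fold
                double dist = 0;
                for (int d = 0; d < dims; d++) dist += (row[d] - mean[c][d]) * (row[d] - mean[c][d]);
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            return best;
        };
    };

    /** Mean accuracy over k folds; row i is held out in fold (i % k). */
    static double crossValidate(Learner learner, double[][] x, int[] y, int k) {
        double total = 0;
        for (int fold = 0; fold < k; fold++) {
            int testSize = 0;
            for (int i = 0; i < x.length; i++) if (i % k == fold) testSize++;
            double[][] trainX = new double[x.length - testSize][];
            int[] trainY = new int[x.length - testSize];
            for (int i = 0, t = 0; i < x.length; i++) {
                if (i % k != fold) { trainX[t] = x[i]; trainY[t++] = y[i]; }
            }
            ToIntFunction<double[]> model = learner.fit(trainX, trainY);
            int correct = 0;
            for (int i = 0; i < x.length; i++) {
                if (i % k == fold && model.applyAsInt(x[i]) == y[i]) correct++;
            }
            total += (double) correct / testSize;
        }
        return total / k;
    }

    /** Returns the name of the candidate with the highest cross-validation score. */
    static String selectBest(Map<String, Learner> candidates, double[][] x, int[] y, int k) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Learner> e : candidates.entrySet()) {
            double score = crossValidate(e.getValue(), x, y, k);
            System.out.println("- " + e.getKey() + ": " + String.format("%.4f", score));
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        return best;
    }

    // Toy data: class 0 near (0,0), class 1 near (5,5), interleaved so folds stay balanced.
    static final double[][] X = {
        {0, 0}, {5, 5}, {0, 1}, {5, 6}, {1, 0}, {6, 5},
        {1, 1}, {6, 6}, {0.5, 0.5}, {5.5, 5.5}, {0.2, 0.8}, {5.2, 5.8}};
    static final int[] Y = {0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1};

    static Map<String, Learner> candidates() {
        Map<String, Learner> c = new LinkedHashMap<>();
        c.put("majority", MAJORITY);
        c.put("nearestCentroid", NEAREST_CENTROID);
        return c;
    }

    public static void main(String[] args) {
        System.out.println("Best: " + selectBest(candidates(), X, Y, 3));
    }
}
```

The point of the sketch is that every candidate is scored only on rows it did not see during fitting, so the comparison is not biased towards models that merely memorize the training data.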

AutoML with Train/Test Split

This example demonstrates proper machine learning evaluation using separate training and testing datasets. Holding out a test set does not prevent overfitting by itself, but it exposes overfitting and provides a realistic performance estimate before production deployment.

import org.superml.datasets.Datasets;
import org.superml.autotrainer.AutoTrainer;
import org.superml.model_selection.ModelSelection;
import org.superml.metrics.Metrics;

public class AutoMLWithSplitExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - AutoML with Train/Test Split ===\n");
        
        try {
            // Load and split dataset
            // Wine dataset contains 178 samples with 13 chemical features
            // Perfect for demonstrating multi-class classification with proper evaluation
            var dataset = Datasets.loadWine();
            
            // Split data into 80% training and 20% testing
            // The random seed (42) ensures reproducible results
            // Stratified split maintains class distribution in both sets
            var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);
            
            System.out.println("📊 Training samples: " + split.XTrain.length);
            System.out.println("📊 Test samples: " + split.XTest.length);
            
            // AutoML on training data
            // Training only on training data prevents data leakage
            // The model has never seen the test data, ensuring honest evaluation
            System.out.println("\n🤖 Running AutoML on training data...");
            var result = AutoTrainer.autoML(split.XTrain, split.yTrain, "classification");
            
            // Get best model
            // The best model is selected based on cross-validation performance
            // This model represents the optimal algorithm and hyperparameters
            var bestModel = result.getBestModel();
            
            // Evaluate on test set
            // Test set evaluation provides unbiased performance estimate
            // This simulates real-world deployment performance
            double[] predictions = bestModel.predict(split.XTest);
            
            // Calculate metrics
            // Comprehensive evaluation using multiple metrics
            // Each metric provides different insights into model performance
            double accuracy = Metrics.accuracy(split.yTest, predictions);      // Overall correctness
            double precision = Metrics.precision(split.yTest, predictions);    // Positive prediction accuracy
            double recall = Metrics.recall(split.yTest, predictions);          // True positive detection rate
            double f1 = Metrics.f1Score(split.yTest, predictions);            // Harmonic mean of precision/recall
            
            System.out.println("\n=== AutoML Results ===");
            System.out.println("🏆 Best Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 CV Score: " + String.format("%.4f", result.getBestScore()));
            System.out.println("📊 Test Accuracy: " + String.format("%.4f", accuracy));
            System.out.println("📊 Test Precision: " + String.format("%.4f", precision));
            System.out.println("📊 Test Recall: " + String.format("%.4f", recall));
            System.out.println("📊 Test F1 Score: " + String.format("%.4f", f1));
            
            // Display confusion matrix
            // Confusion matrix shows detailed classification performance
            // Diagonal elements represent correct predictions
            // Off-diagonal elements show misclassification patterns
            int[][] confMatrix = Metrics.confusionMatrix(split.yTest, predictions);
            System.out.println("\n📊 Confusion Matrix:");
            for (int i = 0; i < confMatrix.length; i++) {
                System.out.println(java.util.Arrays.toString(confMatrix[i]));
            }
            
            System.out.println("\n✅ AutoML evaluation completed!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in AutoML: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Data Splitting: A proper train/test split exposes overfitting and provides honest evaluation
  • Stratified Sampling: Maintains class distribution across training and test sets
  • Cross-Validation: AutoML uses CV on training data to select the best model
  • Multiple Metrics: Comprehensive evaluation using accuracy, precision, recall, and F1-score
  • Confusion Matrix: Detailed view of classification performance across all classes
  • Production Readiness: Test set performance estimates real-world deployment accuracy
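
For a multi-class problem such as Wine, a single precision or recall number has to be an average over classes. How `Metrics.precision` and `Metrics.recall` average is not stated here, so the following plain-Java sketch shows one common convention, macro averaging, computed directly from a confusion matrix. The class and method names are hypothetical and the labels are made-up toy data.

```java
import java.util.function.IntToDoubleFunction;

/** Sketch: accuracy and macro-averaged precision/recall/F1 from a confusion matrix. */
class MacroMetricsSketch {

    /** Rows are true classes, columns are predicted classes. */
    static int[][] confusionMatrix(int[] yTrue, int[] yPred, int numClasses) {
        int[][] cm = new int[numClasses][numClasses];
        for (int i = 0; i < yTrue.length; i++) cm[yTrue[i]][yPred[i]]++;
        return cm;
    }

    /** Diagonal (correct predictions) divided by all predictions. */
    static double accuracy(int[][] cm) {
        int correct = 0, total = 0;
        for (int r = 0; r < cm.length; r++) {
            for (int c = 0; c < cm.length; c++) {
                total += cm[r][c];
                if (r == c) correct += cm[r][c];
            }
        }
        return (double) correct / total;
    }

    /** Precision of class k: correct predictions of k / all predictions of k (column sum). */
    static double precision(int[][] cm, int k) {
        int predicted = 0;
        for (int[] row : cm) predicted += row[k];
        return predicted == 0 ? 0.0 : (double) cm[k][k] / predicted;
    }

    /** Recall of class k: correct predictions of k / all true members of k (row sum). */
    static double recall(int[][] cm, int k) {
        int actual = 0;
        for (int v : cm[k]) actual += v;
        return actual == 0 ? 0.0 : (double) cm[k][k] / actual;
    }

    /** F1 of class k: harmonic mean of its precision and recall. */
    static double f1(int[][] cm, int k) {
        double p = precision(cm, k), r = recall(cm, k);
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    /** Macro average: unweighted mean of a per-class metric. */
    static double macro(int[][] cm, IntToDoubleFunction perClass) {
        double sum = 0;
        for (int k = 0; k < cm.length; k++) sum += perClass.applyAsDouble(k);
        return sum / cm.length;
    }

    public static void main(String[] args) {
        int[] yTrue = {0, 0, 0, 1, 1, 2, 2, 2};
        int[] yPred = {0, 0, 1, 1, 1, 2, 2, 0};
        int[][] cm = confusionMatrix(yTrue, yPred, 3);
        for (int[] row : cm) System.out.println(java.util.Arrays.toString(row));
        System.out.println("Accuracy:        " + String.format("%.4f", accuracy(cm)));
        System.out.println("Macro precision: " + String.format("%.4f", macro(cm, k -> precision(cm, k))));
        System.out.println("Macro recall:    " + String.format("%.4f", macro(cm, k -> recall(cm, k))));
        System.out.println("Macro F1:        " + String.format("%.4f", macro(cm, k -> f1(cm, k))));
    }
}
```

Macro averaging treats every class equally regardless of size. If a library reports weighted or micro averages instead, its numbers will differ from these on imbalanced data.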

Advanced AutoML Configuration

Custom AutoML Parameters

This advanced example demonstrates how to customize AutoML behavior for specific requirements. Custom configuration allows fine-tuning of the optimization process for better performance and control.

import java.util.Map;

import org.superml.autotrainer.AutoTrainer;
import org.superml.datasets.Datasets;

public class AdvancedAutoMLExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Advanced AutoML Configuration ===\n");
        
        try {
            // Load dataset
            // Generate a synthetic dataset for demonstration
            // Arguments as read here: 2000 samples, 20 features, 5 classes, random seed 42
            var dataset = Datasets.makeClassification(2000, 20, 5, 42);
            
            // Count distinct labels; the largest label value is not the number of classes
            System.out.println("📊 Generated dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features, " + 
                java.util.Arrays.stream(dataset.y).distinct().count() + " classes");
            
            // Advanced AutoML configuration
            // Each setting controls a specific aspect of the optimization process
            var config = new AutoTrainer.Config()
                .setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost")  // Select specific algorithms
                .setSearchStrategy("bayesian")        // Bayesian optimization for intelligent search
                .setCrossValidationFolds(5)           // 5-fold cross-validation for robust evaluation
                .setMaxEvaluationTime(600)            // 10 minutes maximum training time
                .setEarlyStoppingRounds(10)           // Stop early if no improvement for 10 rounds
                .setEnsembleMethods(true)             // Enable ensemble methods for better performance
                .setFeatureSelection(true)            // Automatic feature selection to reduce overfitting
                .setPreprocessing(true)               // Automatic data preprocessing and scaling
                .setVerbose(true);                    // Detailed logging for transparency
            
            System.out.println("🚀 Advanced AutoML Configuration:");
            System.out.println("- Algorithms: Logistic, RandomForest, GradientBoosting, XGBoost");
            System.out.println("- Search Strategy: Bayesian Optimization");
            System.out.println("- Cross-Validation: 5-fold");
            System.out.println("- Max Time: 10 minutes");
            System.out.println("- Ensemble: Enabled");
            System.out.println("- Feature Selection: Automatic");
            
            // Run advanced AutoML
            // The custom configuration provides more control over the optimization process
            // Bayesian optimization intelligently explores the hyperparameter space
            System.out.println("\n🤖 Starting advanced AutoML...");
            long startTime = System.currentTimeMillis();
            
            var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
            
            long autoMLTime = System.currentTimeMillis() - startTime;
            
            // Display detailed results
            // Advanced AutoML provides comprehensive insights into the optimization process
            System.out.println("\n=== Advanced AutoML Results ===");
            System.out.println("🏆 Best Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 Best CV Score: " + String.format("%.4f", result.getBestScore()));
            System.out.println("📊 Best Parameters: " + result.getBestParams());
            System.out.println("⏱️ Total Time: " + autoMLTime + " ms");
            
            // Feature importance (if available)
            // Shows which features contribute most to the model's predictions
            // Higher values indicate more important features
            if (result.getFeatureImportance() != null) {
                System.out.println("\n🔍 Feature Importance (Top 5):");
                var importance = result.getFeatureImportance();
                // Rank feature indices by descending importance so the five printed are the largest
                java.util.stream.IntStream.range(0, importance.length).boxed()
                    .sorted((a, b) -> Double.compare(importance[b], importance[a]))
                    .limit(5)
                    .forEach(i -> System.out.println("- Feature " + i + ": " + String.format("%.4f", importance[i])));
            }
            
            // Ensemble information
            // Ensemble methods combine multiple models for improved performance
            // Typically more robust and accurate than single models
            if (result.isEnsembleUsed()) {
                System.out.println("\n🔗 Ensemble Information:");
                System.out.println("- Ensemble Type: " + result.getEnsembleType());
                System.out.println("- Number of Models: " + result.getEnsembleSize());
                System.out.println("- Ensemble Score: " + String.format("%.4f", result.getEnsembleScore()));
            }
            
            // Model comparison
            // Shows performance of all tested algorithms
            // Helps understand which algorithms work best for this data
            System.out.println("\n📊 Model Comparison:");
            var allResults = result.getAllResults();
            allResults.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .forEach(entry -> {
                    System.out.println("- " + entry.getKey() + ": " + String.format("%.4f", entry.getValue()));
                });
            
            System.out.println("\n✅ Advanced AutoML completed successfully!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in advanced AutoML: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Algorithm Selection: Choose specific algorithms based on problem requirements and computational constraints
  • Bayesian Optimization: Intelligent hyperparameter search that learns from previous evaluations
  • Feature Selection: Automatic removal of irrelevant features to improve performance and reduce overfitting
  • Ensemble Methods: Combine multiple models for improved accuracy and robustness
  • Time Management: Set time limits to balance performance with computational resources
  • Feature Importance: Understand which features drive model predictions for better insights
  • Model Comparison: Comprehensive analysis of all tested algorithms for informed decision making
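
The early-stopping setting above can be read as a "patience" rule: stop once a fixed number of consecutive rounds fail to beat the best score seen so far. The plain-Java sketch below shows that rule on a made-up sequence of scores. Whether `setEarlyStoppingRounds` counts rounds exactly this way is an assumption, and `EarlyStoppingSketch` is a hypothetical name.

```java
/** Sketch: patience-based early stopping over a sequence of evaluation scores. */
class EarlyStoppingSketch {

    /**
     * Returns how many rounds are evaluated before the search stops.
     * The search stops after `patience` consecutive rounds without a strictly better score.
     */
    static int roundsEvaluated(double[] scores, int patience) {
        double best = Double.NEGATIVE_INFINITY;
        int sinceImprovement = 0;
        for (int round = 0; round < scores.length; round++) {
            if (scores[round] > best) {
                best = scores[round];      // new best: reset the counter
                sinceImprovement = 0;
            } else if (++sinceImprovement >= patience) {
                return round + 1;          // patience exhausted: stop here
            }
        }
        return scores.length;              // never triggered: all rounds were evaluated
    }

    public static void main(String[] args) {
        double[] scores = {0.70, 0.75, 0.74, 0.75, 0.73, 0.80};
        System.out.println("patience 2  -> rounds evaluated: " + roundsEvaluated(scores, 2));
        System.out.println("patience 10 -> rounds evaluated: " + roundsEvaluated(scores, 10));
    }
}
```

With a patience of 2 the search in this toy sequence stops before reaching the final, better score of 0.80, which illustrates the trade-off: a small patience saves time but can end the search too early.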

Hyperparameter Optimization

Grid Search with AutoML

Grid search systematically evaluates all combinations of hyperparameters to find the optimal configuration. This example demonstrates comprehensive hyperparameter optimization with detailed parameter space definition.

import java.util.HashMap;

import org.superml.autotrainer.AutoTrainer;
import org.superml.autotrainer.HyperparameterOptimizer;
import org.superml.datasets.Datasets;

public class HyperparameterOptimizationExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Hyperparameter Optimization ===\n");
        
        try {
            // Load dataset
            // Iris dataset provides a perfect baseline for hyperparameter optimization
            var dataset = Datasets.loadIris();
            
            // Define hyperparameter search spaces
            // Each parameter space defines the range of values to test
            // Careful selection of parameter ranges is crucial for effective optimization
            var parameterSpaces = new HashMap<String, Object>();
            
            // Logistic Regression parameters
            // maxIter: Maximum iterations for convergence
            // regularization: Type of regularization (L1, L2, or none)
            // C: Regularization strength (smaller values = stronger regularization)
            parameterSpaces.put("logistic_maxIter", new int[]{100, 200, 500, 1000});
            parameterSpaces.put("logistic_regularization", new String[]{"l1", "l2", "none"});
            parameterSpaces.put("logistic_C", new double[]{0.1, 1.0, 10.0, 100.0});
            
            // Random Forest parameters
            // nEstimators: Number of trees in the forest
            // maxDepth: Maximum depth of each tree (-1 for unlimited)
            // minSamplesSplit: Minimum samples required to split a node
            // minSamplesLeaf: Minimum samples required at each leaf node
            parameterSpaces.put("randomforest_nEstimators", new int[]{50, 100, 200});
            parameterSpaces.put("randomforest_maxDepth", new int[]{5, 10, 20, -1});
            parameterSpaces.put("randomforest_minSamplesSplit", new int[]{2, 5, 10});
            parameterSpaces.put("randomforest_minSamplesLeaf", new int[]{1, 2, 4});
            
            // Gradient Boosting parameters
            // nEstimators: Number of boosting stages
            // learningRate: Learning rate shrinks contribution of each tree
            // maxDepth: Maximum depth of individual trees
            parameterSpaces.put("gradientboosting_nEstimators", new int[]{100, 200, 300});
            parameterSpaces.put("gradientboosting_learningRate", new double[]{0.01, 0.1, 0.2});
            parameterSpaces.put("gradientboosting_maxDepth", new int[]{3, 5, 7});
            
            System.out.println("🔧 Hyperparameter Search Configuration:");
            System.out.println("- Logistic Regression: 4 x 3 x 4 = 48 combinations");
            System.out.println("- Random Forest: 3 x 4 x 3 x 3 = 108 combinations");
            System.out.println("- Gradient Boosting: 3 x 3 x 3 = 27 combinations");
            System.out.println("- Total combinations: 183");
            
            // Create hyperparameter optimizer
            // Grid search steps through the parameter combinations systematically
            // Cross-validation ensures robust performance estimation
            var optimizer = new HyperparameterOptimizer()
                .setParameterSpaces(parameterSpaces)
                .setSearchStrategy("grid")           // Grid search strategy
                .setCrossValidationFolds(5)          // 5-fold cross-validation
                .setMaxEvaluations(50)               // Cap of 50 evaluations: fewer than the 183 combinations, so raise it to cover the full grid
                .setParallelEvaluation(true)         // Enable parallel processing
                .setVerbose(true);                   // Detailed progress logging
            
            // Run hyperparameter optimization
            // The optimizer systematically evaluates parameter combinations
            // Each combination is evaluated using cross-validation
            System.out.println("\n🔍 Starting hyperparameter optimization...");
            long startTime = System.currentTimeMillis();
            
            var result = optimizer.optimize(dataset.X, dataset.y, "classification");
            
            long optimizationTime = System.currentTimeMillis() - startTime;
            
            // Display results
            // Results provide comprehensive insights into the optimization process
            System.out.println("\n=== Hyperparameter Optimization Results ===");
            System.out.println("🏆 Best Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 Best CV Score: " + String.format("%.4f", result.getBestScore()));
            System.out.println("📊 Best Parameters: " + result.getBestParams());
            System.out.println("⏱️ Optimization Time: " + optimizationTime + " ms");
            System.out.println("🔢 Evaluations: " + result.getNumEvaluations());
            
            // Top 5 configurations
            // Shows the best performing parameter combinations
            // Helps understand which parameter settings work best
            System.out.println("\n🏅 Top 5 Configurations:");
            var topConfigs = result.getTopConfigurations(5);
            for (int i = 0; i < topConfigs.size(); i++) {
                var config = topConfigs.get(i);
                System.out.println((i + 1) + ". " + config.getAlgorithm() + 
                    " - Score: " + String.format("%.4f", config.getScore()) +
                    " - Params: " + config.getParameters());
            }
            
            System.out.println("\n✅ Hyperparameter optimization completed!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in hyperparameter optimization: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Parameter Space Definition: Carefully define the range of hyperparameters to explore
  • Grid Search Strategy: Systematic evaluation of all parameter combinations
  • Cross-Validation: Robust performance estimation using multiple data splits
  • Parallel Processing: Speed up optimization using parallel evaluation
  • Evaluation Limits: Control computational cost by limiting total evaluations
  • Performance Analysis: Compare top configurations to understand parameter sensitivity
  • Algorithm Comparison: Understand which algorithms and parameters work best for your data

Bayesian Optimization

Bayesian optimization uses a probabilistic model to intelligently search the hyperparameter space. This example demonstrates how Bayesian methods can find optimal parameters more efficiently than grid search.

import java.util.HashMap;

import org.superml.autotrainer.BayesianOptimizer;
import org.superml.datasets.Datasets;

public class BayesianOptimizationExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Bayesian Optimization ===\n");
        
        try {
            // Load dataset
            // Generate a complex dataset to demonstrate Bayesian optimization effectiveness
            var dataset = Datasets.makeClassification(1500, 15, 3, 42);
            
            System.out.println("📊 Dataset: " + dataset.X.length + " samples, " + dataset.X[0].length + " features");
            
            // Configure Bayesian optimization
            // Bayesian optimization uses probabilistic models to guide the search
            // It balances exploration (trying new areas) with exploitation (refining good areas)
            var optimizer = new BayesianOptimizer()
                .setAcquisitionFunction("expected_improvement")  // EI guides search strategy
                .setInitialRandomSamples(10)                    // Random initialization
                .setMaxIterations(50)                           // Maximum optimization iterations
                .setKappa(2.576)                                // Exploration weight, conventionally used by confidence-bound acquisition (higher = more exploration)
                .setXi(0.01)                                    // EI margin: improvement required over the current best (higher = more exploration)
                .setVerbose(true);                              // Progress logging
            
            // Define continuous parameter spaces
            // Bayesian optimization works well with continuous parameters
            // The optimizer learns the relationship between parameters and performance
            var parameterSpaces = new HashMap<String, Object>();
            
            // Random Forest with continuous parameters
            // Each parameter range defines the search space
            parameterSpaces.put("randomforest_maxDepth", new double[]{3.0, 20.0});
            parameterSpaces.put("randomforest_minSamplesSplit", new double[]{2.0, 20.0});
            parameterSpaces.put("randomforest_minSamplesLeaf", new double[]{1.0, 10.0});
            
            // Gradient Boosting with continuous parameters
            // Learning rate and subsample ratio are crucial for gradient boosting
            parameterSpaces.put("gradientboosting_learningRate", new double[]{0.01, 0.3});
            parameterSpaces.put("gradientboosting_maxDepth", new double[]{3.0, 10.0});
            parameterSpaces.put("gradientboosting_subsample", new double[]{0.5, 1.0});
            
            System.out.println("🧠 Bayesian Optimization Configuration:");
            System.out.println("- Acquisition Function: Expected Improvement");
            System.out.println("- Initial Random Samples: 10");
            System.out.println("- Max Iterations: 50");
            System.out.println("- Exploration Parameter (κ): 2.576");
            
            // Run Bayesian optimization
            // The optimizer intelligently explores the parameter space
            // Each iteration uses information from previous evaluations
            System.out.println("\n🔍 Starting Bayesian optimization...");
            long startTime = System.currentTimeMillis();
            
            var result = optimizer.optimize(dataset.X, dataset.y, parameterSpaces);
            
            long optimizationTime = System.currentTimeMillis() - startTime;
            
            // Display results
            // Bayesian optimization provides insights into the optimization process
            System.out.println("\n=== Bayesian Optimization Results ===");
            System.out.println("🏆 Best Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 Best Score: " + String.format("%.4f", result.getBestScore()));
            System.out.println("📊 Best Parameters: " + result.getBestParams());
            System.out.println("⏱️ Optimization Time: " + optimizationTime + " ms");
            System.out.println("🔢 Total Evaluations: " + result.getNumEvaluations());
            
            // Convergence information
            // Shows how the best score improves over iterations
            // Demonstrates the efficiency of Bayesian optimization
            System.out.println("\n📈 Convergence Information:");
            var convergenceHistory = result.getConvergenceHistory();
            System.out.println("- Best score after 10 iterations: " + String.format("%.4f", convergenceHistory.get(9)));
            System.out.println("- Best score after 25 iterations: " + String.format("%.4f", convergenceHistory.get(24)));
            System.out.println("- Final best score: " + String.format("%.4f", convergenceHistory.get(convergenceHistory.size() - 1)));
            
            // Expected improvement over iterations
            // Shows how the acquisition function guides the search
            // Higher values indicate more promising parameter regions
            System.out.println("\n🎯 Expected Improvement History:");
            var eiHistory = result.getExpectedImprovementHistory();
            for (int i = 0; i < Math.min(5, eiHistory.size()); i++) {
                System.out.println("- Iteration " + (i + 1) + ": " + String.format("%.6f", eiHistory.get(i)));
            }
            
            System.out.println("\n✅ Bayesian optimization completed!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in Bayesian optimization: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Probabilistic Model: Bayesian optimization uses Gaussian processes to model the objective function
  • Acquisition Function: Expected Improvement balances exploration and exploitation
  • Continuous Parameters: Works well with continuous parameter spaces
  • Intelligent Search: Learns from previous evaluations to guide future searches
  • Convergence Analysis: Tracks improvement over iterations to understand optimization efficiency
  • Parameter Sensitivity: The fitted probabilistic model can indicate which parameters have the most impact on performance
  • Efficiency: Often finds good solutions with fewer evaluations than grid search

Model Ensemble with AutoML

Ensemble Methods

Ensemble methods combine multiple models to create a more robust and accurate predictor. This example demonstrates how AutoML can automatically create and optimize ensemble models.
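
Ahead of the SuperML listing, the combination step itself can be sketched in a few lines of plain Java: each model's cross-validation score becomes a normalized weight, and the class with the largest total weight wins the vote. `WeightedVotingSketch`, its methods, and the scores are hypothetical, and SuperML's "voting" strategy with "performance" weights may compute weights differently.

```java
/** Sketch: performance-weighted voting over the class labels predicted by several models. */
class WeightedVotingSketch {

    /** Turns cross-validation scores into weights that sum to 1. */
    static double[] performanceWeights(double[] cvScores) {
        double total = 0;
        for (double s : cvScores) total += s;
        double[] weights = new double[cvScores.length];
        for (int i = 0; i < weights.length; i++) weights[i] = cvScores[i] / total;
        return weights;
    }

    /** Each model adds its weight to the class it predicts; the heaviest class wins. */
    static int vote(int[] predictedLabels, double[] weights, int numClasses) {
        double[] tally = new double[numClasses];
        for (int m = 0; m < predictedLabels.length; m++) {
            tally[predictedLabels[m]] += weights[m];
        }
        int winner = 0;
        for (int c = 1; c < numClasses; c++) {
            if (tally[c] > tally[winner]) winner = c;
        }
        return winner;
    }

    public static void main(String[] args) {
        double[] weights = performanceWeights(new double[]{0.9, 0.6, 0.5});
        // Two weaker models agree on class 0 and outvote the strongest model
        System.out.println("Vote A: " + vote(new int[]{1, 0, 0}, weights, 3));
        // All three disagree, so the strongest model's prediction decides
        System.out.println("Vote B: " + vote(new int[]{1, 0, 2}, weights, 3));
    }
}
```

The second call shows what the weights add over a plain majority vote: when the models split evenly, the prediction of the best-scoring model breaks the tie.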

import org.superml.autotrainer.AutoTrainer;
import org.superml.ensemble.EnsembleBuilder;
import org.superml.datasets.Datasets;

public class AutoMLEnsembleExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - AutoML Ensemble Methods ===\n");
        
        try {
            // Load dataset
            // Wine dataset provides a good example for ensemble methods
            // Multiple features and classes benefit from ensemble diversity
            var dataset = Datasets.loadWine();
            
            // Configure AutoML with ensemble
            // Ensemble configuration controls how multiple models are combined
            var config = new AutoTrainer.Config()
                .setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost")
                .setSearchStrategy("random")         // Random search for efficiency
                .setCrossValidationFolds(5)          // 5-fold CV for robust evaluation
                .setMaxEvaluationTime(300)           // 5-minute time limit
                .setEnsembleMethods(true)            // Enable ensemble creation
                .setEnsembleSize(5)                  // Use top 5 models
                .setEnsembleStrategy("voting")       // Voting ensemble strategy
                .setEnsembleWeights("performance")   // Weight models by performance
                .setVerbose(true);                   // Detailed logging
            
            System.out.println("🔗 AutoML Ensemble Configuration:");
            System.out.println("- Algorithms: 4 different algorithms");
            System.out.println("- Ensemble Size: Top 5 models");
            System.out.println("- Ensemble Strategy: Voting");
            System.out.println("- Weights: Performance-based");
            
            // Run AutoML with ensemble
            // AutoML first finds the best individual models
            // Then creates an ensemble from the top performers
            System.out.println("\n🤖 Starting AutoML with ensemble...");
            long startTime = System.currentTimeMillis();
            
            var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
            
            long autoMLTime = System.currentTimeMillis() - startTime;
            
            // Display results
            // Compare individual model performance with ensemble performance
            System.out.println("\n=== AutoML Ensemble Results ===");
            System.out.println("🏆 Best Single Algorithm: " + result.getBestAlgorithm());
            System.out.println("📊 Best Single Score: " + String.format("%.4f", result.getBestScore()));
            
            // Ensemble information
            // Ensembles typically outperform individual models
            // They provide better generalization and robustness
            if (result.isEnsembleUsed()) {
                System.out.println("🔗 Ensemble Used: Yes");
                System.out.println("📊 Ensemble Score: " + String.format("%.4f", result.getEnsembleScore()));
                System.out.println("📊 Ensemble Improvement: " +
                    String.format("%.4f", result.getEnsembleScore() - result.getBestScore()));
                System.out.println("🔢 Ensemble Size: " + result.getEnsembleSize());
                
                // Ensemble composition
                // Shows which models are in the ensemble and their weights
                // Higher weights indicate more reliable models
                System.out.println("\n🏗️ Ensemble Composition:");
                var ensembleModels = result.getEnsembleModels();
                var ensembleWeights = result.getEnsembleWeights();
                
                for (int i = 0; i < ensembleModels.size(); i++) {
                    System.out.println("- Model " + (i + 1) + ": " + ensembleModels.get(i) + 
                        " (Weight: " + String.format("%.3f", ensembleWeights.get(i)) + ")");
                }
            }
            
            System.out.println("\n⏱️ Total AutoML Time: " + autoMLTime + " ms");
            
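            // Test ensemble vs single model
            // Caveat: AutoML above was fitted on the full dataset, so this split overlaps
            // the training data and both accuracies will be optimistic. For a true held-out
            // comparison, split first and pass only the training portion to autoMLWithConfig.
            var split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);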
            
            var bestSingleModel = result.getBestModel();
            var ensembleModel = result.getEnsembleModel();
            
            double[] singlePredictions = bestSingleModel.predict(split.XTest);
            double[] ensemblePredictions = ensembleModel.predict(split.XTest);
            
            double singleAccuracy = Metrics.accuracy(split.yTest, singlePredictions);
            double ensembleAccuracy = Metrics.accuracy(split.yTest, ensemblePredictions);
            
            System.out.println("\n📊 Test Set Comparison:");
            System.out.println("- Single Model Accuracy: " + String.format("%.4f", singleAccuracy));
            System.out.println("- Ensemble Accuracy: " + String.format("%.4f", ensembleAccuracy));
            System.out.println("- Improvement: " + String.format("%.4f", ensembleAccuracy - singleAccuracy));
            
            System.out.println("\n✅ AutoML ensemble completed successfully!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in AutoML ensemble: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Ensemble Diversity: Combining different algorithms creates diverse predictors
  • Voting Strategy: Models vote on predictions, with majority or weighted voting
  • Performance Weighting: Better models get higher weights in the ensemble
  • Generalization: Ensembles typically generalize better than single models
  • Robustness: Ensembles are less sensitive to individual model failures
  • Improvement Measurement: Quantify the benefit of ensemble over single models
  • Composition Analysis: Understand which models contribute to the ensemble
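The voting and performance-weighting points above can be illustrated without the library. The following is a minimal sketch in plain Java, not SuperML's actual ensemble implementation: the class name `WeightedVotingSketch` and both helper methods are hypothetical, and the weighting rule (each model's score divided by the sum of all scores) is just one simple choice among many.

```java
import java.util.HashMap;
import java.util.Map;

final class WeightedVotingSketch {

    /** Normalizes validation scores so they sum to 1 and can be used as vote weights. */
    static double[] performanceWeights(double[] scores) {
        double total = 0.0;
        for (double s : scores) {
            total += s;
        }
        double[] weights = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            weights[i] = scores[i] / total;
        }
        return weights;
    }

    /** Returns the class label whose supporting models carry the largest total weight. */
    static int weightedVote(int[] predictedLabels, double[] weights) {
        Map<Integer, Double> tally = new HashMap<>();
        for (int i = 0; i < predictedLabels.length; i++) {
            tally.merge(predictedLabels[i], weights[i], Double::sum);
        }
        int best = predictedLabels[0];
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (Map.Entry<Integer, Double> entry : tally.entrySet()) {
            if (entry.getValue() > bestWeight) {
                best = entry.getKey();
                bestWeight = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One strong model predicts class 1; two weak models predict class 0.
        double[] weights = performanceWeights(new double[] {0.95, 0.30, 0.30});
        int[] votes = {1, 0, 0};
        System.out.println("Weighted vote: " + weightedVote(votes, weights));
        System.out.println("Unweighted vote: " + weightedVote(votes, new double[] {1, 1, 1}));
    }
}
```

With these illustrative scores the single strong model outweighs the two weak ones, so the weighted vote differs from the simple majority. That is the behaviour performance weighting is meant to provide.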

Kaggle Integration

One-Line Kaggle Training

Kaggle integration lets AutoML train directly on competition datasets. This example trains on three popular Kaggle datasets with minimal code and reports the best model found for each.

import org.superml.kaggle.KaggleTrainingManager;
import org.superml.kaggle.KaggleIntegration.KaggleCredentials;

public class KaggleAutoMLExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Kaggle AutoML Integration ===\n");
        
        try {
            // Setup Kaggle credentials
            // Kaggle API credentials are required for dataset access
            // Download kaggle.json from Kaggle account settings
            var credentials = KaggleCredentials.fromDefaultLocation();
            var manager = new KaggleTrainingManager(credentials);
            
            System.out.println("🏆 Kaggle AutoML Configuration:");
            System.out.println("- Credentials: Loaded from ~/.kaggle/kaggle.json");
            System.out.println("- AutoML: Enabled with all algorithms");
            System.out.println("- Optimization: Bayesian search");
            
            // Train on Titanic dataset with AutoML
            // Titanic is a classic binary classification problem
            // Features include passenger class, age, sex, fare, etc.
            // Goal: Predict passenger survival
            System.out.println("\n🚢 Training on Titanic dataset...");
            var titanicResults = manager.trainOnDataset(
                "titanic",              // competition name
                "titanic",              // dataset name
                "Survived"              // target column
            );
            
            // Display Titanic results
            // AutoML automatically handles feature engineering and model selection
            System.out.println("\n=== Titanic Results ===");
            var bestTitanic = titanicResults.get(0);
            System.out.println("🏆 Best Algorithm: " + bestTitanic.algorithm);
            System.out.println("📊 CV Score: " + String.format("%.4f", bestTitanic.cvScore));
            System.out.println("📊 Validation Score: " + String.format("%.4f", bestTitanic.validationScore));
            System.out.println("⚙️ Parameters: " + bestTitanic.parameters);
            
            // Train on House Prices dataset with AutoML
            // House Prices is a regression problem with complex features
            // Features include square footage, location, age, quality ratings
            // Goal: Predict house sale price
            System.out.println("\n🏠 Training on House Prices dataset...");
            var houseResults = manager.trainOnDataset(
                "house-prices-advanced-regression-techniques",
                "house-prices-advanced-regression-techniques",
                "SalePrice"
            );
            
            // Display House Prices results
            // Regression problems use different metrics (RMSE, MAE)
            System.out.println("\n=== House Prices Results ===");
            var bestHouse = houseResults.get(0);
            System.out.println("🏆 Best Algorithm: " + bestHouse.algorithm);
            System.out.println("📊 CV Score: " + String.format("%.4f", bestHouse.cvScore));
            System.out.println("📊 Validation Score: " + String.format("%.4f", bestHouse.validationScore));
            System.out.println("⚙️ Parameters: " + bestHouse.parameters);
            
            // Advanced Kaggle AutoML with custom configuration
            // Custom configuration provides more control over the AutoML process
            System.out.println("\n🔧 Advanced Kaggle AutoML...");
            var advancedConfig = new KaggleTrainingManager.Config()
                .setAutoMLEnabled(true)
                .setAlgorithms("randomforest", "xgboost", "gradientboosting")
                .setSearchStrategy("bayesian")       // Bayesian optimization
                .setMaxEvaluationTime(1800)          // 30 minutes maximum
                .setCrossValidationFolds(5)          // 5-fold cross-validation
                .setEnsembleEnabled(true)            // Enable ensemble methods
                .setFeatureEngineering(true);        // Automatic feature engineering
            
            // Train on Digit Recognizer (image classification)
            // Digit Recognizer is a computer vision problem
            // Features are pixel values of handwritten digits
            // Goal: Classify digits 0-9
            var advancedResults = manager.trainOnDatasetWithConfig(
                "digit-recognizer",
                "digit-recognizer",
                "label",
                advancedConfig
            );
            
            // Display advanced results
            // Advanced configuration often yields better results
            System.out.println("\n=== Advanced Results (Digit Recognizer) ===");
            var bestAdvanced = advancedResults.get(0);
            System.out.println("🏆 Best Algorithm: " + bestAdvanced.algorithm);
            System.out.println("📊 CV Score: " + String.format("%.4f", bestAdvanced.cvScore));
            System.out.println("📊 Feature Engineering: " + bestAdvanced.featureEngineering);
            System.out.println("🔗 Ensemble Used: " + bestAdvanced.ensembleUsed);
            
            System.out.println("\n✅ Kaggle AutoML integration completed successfully!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in Kaggle AutoML: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Kaggle Integration: Seamless access to popular machine learning datasets
  • Competition Diversity: Different problem types (classification, regression, computer vision)
  • Automatic Preprocessing: AutoML handles data cleaning and feature engineering
  • Performance Benchmarking: Compare results against Kaggle leaderboards
  • Advanced Configuration: Custom settings for competitive performance
  • Feature Engineering: Automatic creation of relevant features
  • Ensemble Methods: Combine multiple models for better competition scores
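The House Prices example notes that regression problems are scored with metrics such as RMSE and MAE rather than accuracy. As a rough illustration of what those two numbers measure, here is a small standalone sketch. The class and method names are hypothetical and do not come from SuperML, and the sample prices are made up for demonstration.

```java
final class RegressionMetricsSketch {

    /** Root mean squared error: penalizes large errors more heavily than small ones. */
    static double rmse(double[] actual, double[] predicted) {
        double sumSquared = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared / actual.length);
    }

    /** Mean absolute error: the average size of the error, in the target's own units. */
    static double mae(double[] actual, double[] predicted) {
        double sumAbsolute = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sumAbsolute += Math.abs(actual[i] - predicted[i]);
        }
        return sumAbsolute / actual.length;
    }

    public static void main(String[] args) {
        // Made-up sale prices, for illustration only
        double[] actual = {100_000, 200_000, 300_000};
        double[] predicted = {110_000, 190_000, 330_000};
        System.out.println("RMSE: " + String.format("%.2f", rmse(actual, predicted)));
        System.out.println("MAE:  " + String.format("%.2f", mae(actual, predicted)));
    }
}
```

Because RMSE squares each error before averaging, a single badly mispredicted house raises it more than it raises MAE. Unlike accuracy, lower is better for both.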

Production AutoML Pipeline

Enterprise AutoML System

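import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// @Service and @Scheduled below are Spring annotations
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

import org.superml.autotrainer.AutoTrainer;
import org.superml.persistence.ModelPersistence;
import org.superml.drift.DriftDetector;
import org.superml.monitoring.ModelMonitor;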

@Service
public class ProductionAutoMLPipeline {
    private final ModelMonitor monitor;
    private final DriftDetector driftDetector;
    
    public ProductionAutoMLPipeline() {
        this.monitor = new ModelMonitor();
        this.driftDetector = new DriftDetector();
    }
    
    @Scheduled(fixedRate = 86400000) // Daily retraining
    public void autoRetraining() {
        System.out.println("=== Production AutoML Pipeline ===\n");
        
        try {
            // Load latest production data
            var dataset = loadLatestProductionData();
            
            System.out.println("📊 Production Data: " + dataset.X.length + " samples");
            System.out.println("📊 Data Quality Score: " + assessDataQuality(dataset));
            
            // Check for data drift
            boolean driftDetected = driftDetector.detectDrift(dataset.X);
            
            if (driftDetected) {
                System.out.println("🚨 Data drift detected - triggering full retraining");
                
                // Advanced AutoML configuration for production
                var config = new AutoTrainer.Config()
                    .setAlgorithms("logistic", "randomforest", "gradientboosting", "xgboost")
                    .setSearchStrategy("bayesian")
                    .setCrossValidationFolds(5)
                    .setMaxEvaluationTime(3600)        // 1 hour max
                    .setEarlyStoppingRounds(20)
                    .setEnsembleMethods(true)
                    .setEnsembleSize(3)
                    .setFeatureSelection(true)
                    .setPreprocessing(true)
                    .setRobustScaling(true)
                    .setOutlierDetection(true)
                    .setVerbose(true);
                
                // Run production AutoML
                System.out.println("🤖 Starting production AutoML...");
                var result = AutoTrainer.autoMLWithConfig(dataset.X, dataset.y, config);
                
                // Validate model performance
                double currentScore = getCurrentModelScore();
                double newScore = result.getBestScore();
                
                System.out.println("📊 Current Model Score: " + String.format("%.4f", currentScore));
                System.out.println("📊 New Model Score: " + String.format("%.4f", newScore));
                
                if (newScore > currentScore + 0.01) { // Require an absolute gain of 0.01 (1 percentage point)
                    // Deploy new model
                    deployNewModel(result);
                    System.out.println("✅ New model deployed successfully!");
                } else {
                    System.out.println("⚠️ New model not significantly better - keeping current model");
                }
            } else {
                System.out.println("✅ No data drift detected - model remains current");
            }
            
        } catch (Exception e) {
            System.err.println("❌ Error in production AutoML: " + e.getMessage());
            notifyOperationsTeam(e);
        }
    }
    
    private void deployNewModel(AutoTrainer.Result result) {
        try {
            // Save model with comprehensive metadata
            Map<String, Object> metadata = new HashMap<>();
            metadata.put("deployment_date", new Date().toString());
            metadata.put("algorithm", result.getBestAlgorithm());
            metadata.put("cv_score", result.getBestScore());
            metadata.put("ensemble_used", result.isEnsembleUsed());
            metadata.put("feature_count", result.getFeatureCount());
            metadata.put("training_samples", result.getTrainingSamples());
            metadata.put("hyperparameters", result.getBestParams());
            
            String modelPath = "models/production_model_" + System.currentTimeMillis() + ".superml";
            ModelPersistence.save(result.getBestModel(), modelPath, "Production AutoML Model", metadata);
            
            // Update model registry
            updateModelRegistry(modelPath, metadata);
            
            // Update monitoring
            monitor.updateModel(result.getBestModel());
            
        } catch (Exception e) {
            System.err.println("❌ Error deploying model: " + e.getMessage());
            throw new RuntimeException("Model deployment failed", e);
        }
    }
    
    private double assessDataQuality(Dataset dataset) {
        // Implement data quality assessment
        return 0.95; // Placeholder
    }
    
    private double getCurrentModelScore() {
        // Get current production model score
        return 0.87; // Placeholder
    }
    
    private Dataset loadLatestProductionData() {
        // Load latest production data
        return new Dataset(); // Placeholder
    }
    
    private void updateModelRegistry(String modelPath, Map<String, Object> metadata) {
        // Update model registry with new model
    }
    
    private void notifyOperationsTeam(Exception e) {
        // Send alert to operations team
    }
}
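The pipeline above retrains only when `driftDetector.detectDrift(...)` reports drift, but the tutorial does not show how that decision is made. As a purely illustrative stand-in, the sketch below flags a feature when the mean of the current batch moves more than a chosen number of reference standard deviations away from the reference mean. The class name, method names, and threshold are assumptions and do not describe SuperML's `DriftDetector`. Real detectors often use distribution tests instead of a mean comparison.

```java
final class MeanShiftDriftSketch {

    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation around a precomputed mean. */
    static double stdDev(double[] values, double mean) {
        double sumSquared = 0.0;
        for (double v : values) {
            sumSquared += (v - mean) * (v - mean);
        }
        return Math.sqrt(sumSquared / values.length);
    }

    /**
     * Flags drift when the current mean is more than `threshold` reference
     * standard deviations away from the reference mean.
     */
    static boolean featureDrifted(double[] reference, double[] current, double threshold) {
        double referenceMean = mean(reference);
        double referenceStd = stdDev(reference, referenceMean);
        double shift = Math.abs(mean(current) - referenceMean);
        if (referenceStd == 0.0) {
            return shift > 0.0; // constant reference feature: any movement counts as drift
        }
        return shift / referenceStd > threshold;
    }

    public static void main(String[] args) {
        double[] trainingValues = {1, 2, 3, 4, 5};
        double[] stableBatch = {2, 3, 4};
        double[] shiftedBatch = {8, 9, 10};
        System.out.println("Stable batch drifted:  " + featureDrifted(trainingValues, stableBatch, 3.0));
        System.out.println("Shifted batch drifted: " + featureDrifted(trainingValues, shiftedBatch, 3.0));
    }
}
```

A check like this would be run per feature column. How many features must drift before a retraining is triggered is a policy decision left to the pipeline.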

AutoML Performance Monitoring

import org.superml.autotrainer.AutoTrainer;
import org.superml.datasets.Datasets;
import org.superml.monitoring.AutoMLMonitor;
import org.superml.metrics.ModelMetrics;

public class AutoMLPerformanceMonitoring {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - AutoML Performance Monitoring ===\n");
        
        try {
            // Create AutoML monitor
            var monitor = new AutoMLMonitor()
                .setMetricsCollection(true)
                .setPerformanceThresholds(0.05)     // 5% degradation threshold
                .setDriftDetection(true)
                .setAlertingEnabled(true)
                .setLoggingLevel("INFO");
            
            // Monitor multiple AutoML runs
            System.out.println("📊 Monitoring AutoML Performance...");
            
            for (int run = 1; run <= 5; run++) {
                System.out.println("\n🔄 AutoML Run " + run + ":");
                
                // Generate synthetic data for each run
                var dataset = Datasets.makeClassification(1000, 10, 3, run * 42);
                
                // Run AutoML with monitoring
                var startTime = System.currentTimeMillis();
                var result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
                var endTime = System.currentTimeMillis();
                
                // Collect metrics
                var metrics = new ModelMetrics()
                    .setScore(result.getBestScore())
                    .setAlgorithm(result.getBestAlgorithm())
                    .setTrainingTime(endTime - startTime)
                    .setDataSize(dataset.X.length)
                    .setFeatureCount(dataset.X[0].length);
                
                // Update monitor
                monitor.recordRun(run, metrics);
                
                System.out.println("- Algorithm: " + result.getBestAlgorithm());
                System.out.println("- Score: " + String.format("%.4f", result.getBestScore()));
                System.out.println("- Time: " + (endTime - startTime) + " ms");
                
                // Check for performance degradation
                if (run > 1) {
                    double degradation = monitor.checkPerformanceDegradation(run);
                    if (degradation > 0.05) {
                        System.out.println("⚠️ Performance degradation detected: " + 
                            String.format("%.2f%%", degradation * 100));
                    }
                }
            }
            
            // Display monitoring summary
            System.out.println("\n=== AutoML Monitoring Summary ===");
            var summary = monitor.getSummary();
            
            System.out.println("📊 Average Score: " + String.format("%.4f", summary.getAverageScore()));
            System.out.println("📊 Best Score: " + String.format("%.4f", summary.getBestScore()));
            System.out.println("📊 Worst Score: " + String.format("%.4f", summary.getWorstScore()));
            System.out.println("📊 Score Variance: " + String.format("%.4f", summary.getScoreVariance()));
            System.out.println("⏱️ Average Time: " + summary.getAverageTime() + " ms");
            System.out.println("🏆 Most Successful Algorithm: " + summary.getMostSuccessfulAlgorithm());
            
            // Algorithm performance breakdown
            System.out.println("\n🔍 Algorithm Performance Breakdown:");
            var algorithmStats = summary.getAlgorithmStats();
            algorithmStats.forEach((algorithm, stats) -> {
                System.out.println("- " + algorithm + ": " + 
                    stats.getWinRate() + "% win rate, " +
                    "avg score: " + String.format("%.4f", stats.getAverageScore()));
            });
            
            System.out.println("\n✅ AutoML performance monitoring completed!");
            
        } catch (Exception e) {
            System.err.println("❌ Error in AutoML monitoring: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
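The monitoring loop compares `checkPerformanceDegradation(run)` against a 5% threshold, but the tutorial does not say how that value is computed. One plausible definition, shown here only as an assumption, is the relative drop of the latest score below the average of all earlier runs. The class and method names are hypothetical.

```java
final class DegradationSketch {

    /**
     * Relative drop of the most recent score below the mean of all earlier scores.
     * Returns 0.0 when there is no earlier run or the latest score did not drop.
     */
    static double relativeDegradation(double[] scores) {
        if (scores.length < 2) {
            return 0.0;
        }
        double baselineSum = 0.0;
        for (int i = 0; i < scores.length - 1; i++) {
            baselineSum += scores[i];
        }
        double baseline = baselineSum / (scores.length - 1);
        if (baseline <= 0.0) {
            return 0.0; // no meaningful baseline to compare against
        }
        double latest = scores[scores.length - 1];
        return Math.max(0.0, (baseline - latest) / baseline);
    }

    public static void main(String[] args) {
        double[] history = {0.90, 0.90, 0.81};
        double degradation = relativeDegradation(history);
        if (degradation > 0.05) {
            System.out.println("Degradation above threshold: "
                + String.format("%.2f%%", degradation * 100));
        }
    }
}
```

Using the mean of earlier runs as the baseline smooths out one-off lucky runs. Comparing against the best earlier score instead would make the check stricter.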

Best Practices

1. Data Preparation for AutoML

  • Quality Check: Ensure data quality before AutoML
  • Feature Engineering: Let AutoML handle basic preprocessing
  • Data Size: Minimum 100 samples per class for reliable results
  • Validation: Use proper train/validation/test splits
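The samples-per-class guideline above is easy to check before starting a run. The sketch below counts labels and reports whether every class meets a minimum. It assumes labels arrive as a `double[]` holding whole-number class ids, as the examples in this tutorial appear to use. The class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class ClassBalanceCheck {

    /** Counts how many samples carry each class label. */
    static Map<Integer, Integer> classCounts(double[] labels) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (double label : labels) {
            counts.merge((int) label, 1, Integer::sum);
        }
        return counts;
    }

    /** True only if there is at least one sample and every class has minPerClass or more. */
    static boolean hasEnoughSamples(double[] labels, int minPerClass) {
        if (labels.length == 0) {
            return false;
        }
        for (int count : classCounts(labels).values()) {
            if (count < minPerClass) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] labels = {0, 0, 0, 1, 1, 2};
        System.out.println("Class counts: " + classCounts(labels));
        System.out.println("At least 2 per class: " + hasEnoughSamples(labels, 2));
    }
}
```

A check like this only catches under-represented classes that actually appear in the data. A class that is missing entirely has to be caught by comparing against the expected label set.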

2. AutoML Configuration

  • Time Limits: Set reasonable time limits for production systems
  • Algorithm Selection: Choose algorithms appropriate for your problem
  • Cross-Validation: Use sufficient folds for reliable estimates
  • Early Stopping: Enable early stopping to prevent overfitting
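To make the cross-validation setting more concrete, the following sketch shows one common way to assign samples to k folds: shuffle the indices with a fixed seed, then deal them out round-robin. It illustrates the general k-fold idea only. SuperML's own splitting logic is not shown in this tutorial, and the names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class KFoldSketch {

    /** Returns an array where element i is the fold (0..k-1) that sample i belongs to. */
    static int[] assignFolds(int sampleCount, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < sampleCount; i++) {
            indices.add(i);
        }
        // Seeded shuffle keeps the assignment reproducible between runs
        Collections.shuffle(indices, new Random(seed));

        int[] foldOf = new int[sampleCount];
        for (int position = 0; position < sampleCount; position++) {
            foldOf[indices.get(position)] = position % k;
        }
        return foldOf;
    }

    public static void main(String[] args) {
        int[] foldOf = assignFolds(10, 5, 42L);
        for (int fold = 0; fold < 5; fold++) {
            StringBuilder held = new StringBuilder();
            for (int i = 0; i < foldOf.length; i++) {
                if (foldOf[i] == fold) {
                    held.append(i).append(' ');
                }
            }
            System.out.println("Fold " + fold + " validates on samples: " + held.toString().trim());
        }
    }
}
```

Each sample is held out exactly once, and fold sizes differ by at most one. More folds give each model more training data per round at the cost of more training runs.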

3. Model Selection

  • Ensemble Methods: Use ensemble for improved robustness
  • Performance Metrics: Choose appropriate metrics for your problem
  • Validation Strategy: Use holdout validation for final evaluation
  • Model Complexity: Balance complexity with interpretability

4. Production Deployment

  • Model Monitoring: Implement continuous monitoring
  • Drift Detection: Monitor for data and model drift
  • Automated Retraining: Set up automated retraining pipelines
  • Fallback Models: Maintain fallback models for reliability
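The fallback-model recommendation can be sketched as a thin wrapper around two predictors: try the primary model, and use the fallback if it throws or returns a non-finite value. This sketch models predictors as `ToDoubleFunction<double[]>` for the sake of a self-contained example. That is an assumption and not SuperML's model interface, and the class and method names are hypothetical.

```java
import java.util.function.ToDoubleFunction;

final class FallbackPredictorSketch {

    /** Uses the primary predictor unless it fails or yields NaN or infinity. */
    static double predictWithFallback(ToDoubleFunction<double[]> primary,
                                      ToDoubleFunction<double[]> fallback,
                                      double[] features) {
        try {
            double prediction = primary.applyAsDouble(features);
            if (Double.isNaN(prediction) || Double.isInfinite(prediction)) {
                return fallback.applyAsDouble(features);
            }
            return prediction;
        } catch (RuntimeException e) {
            // Primary model is unavailable or rejected the input; serve the fallback
            return fallback.applyAsDouble(features);
        }
    }

    public static void main(String[] args) {
        ToDoubleFunction<double[]> brokenModel = x -> {
            throw new IllegalStateException("model not loaded");
        };
        ToDoubleFunction<double[]> baseline = x -> 0.0; // e.g. always predict the majority class
        double prediction = predictWithFallback(brokenModel, baseline, new double[] {1.0, 2.0});
        System.out.println("Served prediction: " + prediction);
    }
}
```

In practice the fallback is usually the previously deployed model or a simple baseline, and each fallback event should be logged so the monitoring described above can pick it up.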

Troubleshooting

Common AutoML Issues

// Problem: Long training times
// Solution: Set time limits and use early stopping
var timeBoundedConfig = new AutoTrainer.Config()
    .setMaxEvaluationTime(600)        // 10 minutes max
    .setEarlyStoppingRounds(10)       // Stop early if no improvement
    .setMaxEvaluations(50);           // Limit total evaluations

// Problem: Poor performance
// Solution: Enable feature selection and preprocessing
var preprocessingConfig = new AutoTrainer.Config()
    .setFeatureSelection(true)        // Automatic feature selection
    .setPreprocessing(true)           // Automatic preprocessing
    .setOutlierDetection(true)        // Handle outliers
    .setRobustScaling(true);          // Robust scaling

// Problem: Overfitting
// Solution: Use proper cross-validation and regularization
var regularizedConfig = new AutoTrainer.Config()
    .setCrossValidationFolds(5)       // 5-fold CV
    .setValidationStrategy("stratified") // Stratified CV
    .setRegularization(true);         // Enable regularization

Summary

In this tutorial, you learned:

  • AutoML Framework: Automated algorithm selection and optimization
  • Hyperparameter Tuning: Grid search, random search, and Bayesian optimization
  • Model Ensemble: Combining multiple models for better performance
  • Kaggle Integration: One-line training on Kaggle datasets
  • Production AutoML: Enterprise-ready automated ML pipelines
  • Performance Monitoring: Continuous monitoring and drift detection
  • Best Practices: Guidelines for effective AutoML deployment

AutoML in SuperML Java 2.1.0 provides intelligent automation while maintaining the flexibility and performance needed for enterprise applications. The framework handles the complexity of algorithm selection and hyperparameter optimization while providing transparent results and professional-grade deployment capabilities.

Next Steps

  • Explore XGBoost: Learn advanced gradient boosting techniques
  • Neural Networks: Implement deep learning with MLP, CNN, and RNN
  • Model Deployment: Production deployment with inference engine
  • Advanced Ensemble: Custom ensemble methods and voting strategies
  • MLOps Integration: CI/CD pipelines for machine learning

You're now ready to leverage AutoML for rapid prototyping, production systems, and competitive machine learning with SuperML Java 2.1.0!