Press ESC to exit fullscreen
πŸ“– Lesson ⏱️ 75 minutes

High-Performance Inference Engine

Deploying models with microsecond predictions and intelligent caching

Java Inference Engine - High-Performance Model Serving

The SuperML Java 2.1.0 Inference Engine provides enterprise-grade model serving capabilities with millisecond response times, automatic scaling, and comprehensive monitoring. This tutorial covers real-time inference, batch processing, model optimization, and production deployment patterns.

What You’ll Learn

  • Real-Time Inference - Millisecond response times for production APIs
  • Batch Processing - High-throughput processing of large datasets
  • Model Optimization - Quantization, pruning, and performance tuning
  • Load Balancing - Distributed inference across multiple instances
  • Monitoring & Logging - Comprehensive observability and debugging
  • Resource Management - Memory optimization and GPU acceleration
  • Enterprise Patterns - Microservices, caching, and fault tolerance

Prerequisites

  • Completion of β€œAutoML in Java” tutorial
  • Understanding of model training and evaluation
  • Java development environment with SuperML Java 2.1.0
  • Basic knowledge of REST APIs and microservices

Inference Engine Overview

The SuperML Inference Engine provides:

  • Ultra-Low Latency: Sub-millisecond inference times
  • High Throughput: Process millions of requests per second
  • Auto-Scaling: Dynamic resource allocation based on load
  • Multi-Model Support: Serve multiple models simultaneously
  • Hardware Acceleration: GPU, TPU, and specialized hardware support
  • Enterprise Features: Monitoring, logging, and fault tolerance

Real-Time Inference

Basic Inference Server

This example demonstrates how to create a high-performance inference server for real-time predictions. The server handles individual requests with minimal latency.

import org.superml.inference.InferenceEngine;
import org.superml.inference.ModelRegistry;
import org.superml.inference.server.InferenceServer;
import org.superml.persistence.ModelPersistence;

public class BasicInferenceServer {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Real-Time Inference Server ===\n");
        
        try {
            // Create inference engine
            // The inference engine is optimized for low-latency predictions
            // It uses memory pooling and thread-safe operations
            var engine = new InferenceEngine()
                .setMaxConcurrentRequests(1000)     // Handle 1000 concurrent requests
                .setResponseTimeout(100)             // 100ms timeout per request
                .setMemoryPoolSize(512)              // 512MB memory pool
                .setThreadPoolSize(32)               // 32 worker threads
                .setOptimizationLevel("HIGH")        // High optimization for speed
                .setWarmupEnabled(true);             // Warm up models on startup
            
            // Load pre-trained models
            // Models are loaded into memory for fast access
            // Multiple models can be served simultaneously
            System.out.println("πŸ“₯ Loading pre-trained models...");
            
            var classificationModel = ModelPersistence.load("models/iris_classifier.superml");
            var regressionModel = ModelPersistence.load("models/house_price_predictor.superml");
            var neuralNetworkModel = ModelPersistence.load("models/image_classifier.superml");
            
            // Register models with inference engine
            // Each model gets a unique ID for routing requests
            // Model metadata is stored for monitoring and debugging
            var registry = new ModelRegistry()
                .registerModel("iris-classifier", classificationModel)
                .registerModel("house-predictor", regressionModel)
                .registerModel("image-classifier", neuralNetworkModel);
            
            engine.setModelRegistry(registry);
            
            System.out.println("βœ… Models loaded successfully:");
            System.out.println("- iris-classifier: Classification model");
            System.out.println("- house-predictor: Regression model");
            System.out.println("- image-classifier: Neural network model");
            
            // Configure inference server
            // The server handles HTTP requests and routes them to appropriate models
            // It provides automatic serialization/deserialization
            var server = new InferenceServer()
                .setPort(8080)                       // Listen on port 8080
                .setInferenceEngine(engine)          // Use our configured engine
                .setHealthCheckEnabled(true)         // Enable health checks
                .setMetricsEnabled(true)             // Enable metrics collection
                .setCorsEnabled(true)                // Enable CORS for web clients
                .setRateLimitEnabled(true)           // Enable rate limiting
                .setMaxRequestsPerSecond(10000);     // 10K requests per second limit
            
            // Start the server
            // The server starts in a separate thread and handles requests asynchronously
            System.out.println("\nπŸš€ Starting inference server...");
            server.start();
            
            System.out.println("βœ… Inference server started successfully!");
            System.out.println("πŸ“‘ Server URL: http://localhost:8080");
            System.out.println("πŸ” Health Check: http://localhost:8080/health");
            System.out.println("πŸ“Š Metrics: http://localhost:8080/metrics");
            
            // Example API endpoints
            System.out.println("\nπŸ“‹ Available API Endpoints:");
            System.out.println("POST /predict/iris-classifier - Iris classification");
            System.out.println("POST /predict/house-predictor - House price prediction");
            System.out.println("POST /predict/image-classifier - Image classification");
            System.out.println("GET /models - List all available models");
            System.out.println("GET /models/{modelId}/info - Get model information");
            
            // Simulate some inference requests
            System.out.println("\nπŸ§ͺ Testing inference performance...");
            testInferencePerformance(engine);
            
            // Keep server running
            System.out.println("\n⏳ Server is running. Press Ctrl+C to stop...");
            Thread.currentThread().join();
            
        } catch (Exception e) {
            System.err.println("❌ Error starting inference server: " + e.getMessage());
            e.printStackTrace();
        }
    }
    
    private static void testInferencePerformance(InferenceEngine engine) {
        try {
            // Test classification model
            double[][] irisData = {{5.1, 3.5, 1.4, 0.2}};
            long startTime = System.nanoTime();
            
            var result = engine.predict("iris-classifier", irisData);
            
            long inferenceTime = System.nanoTime() - startTime;
            double latencyMs = inferenceTime / 1_000_000.0;
            
            System.out.println("⚑ Iris Classification:");
            System.out.println("- Input: " + java.util.Arrays.toString(irisData[0]));
            System.out.println("- Prediction: " + result.getPrediction());
            System.out.println("- Confidence: " + String.format("%.4f", result.getConfidence()));
            System.out.println("- Latency: " + String.format("%.2f ms", latencyMs));
            
            // Test regression model
            double[][] houseData = {{1500, 3, 2, 10}};
            startTime = System.nanoTime();
            
            result = engine.predict("house-predictor", houseData);
            
            inferenceTime = System.nanoTime() - startTime;
            latencyMs = inferenceTime / 1_000_000.0;
            
            System.out.println("\n🏠 House Price Prediction:");
            System.out.println("- Input: " + java.util.Arrays.toString(houseData[0]));
            System.out.println("- Prediction: $" + String.format("%.0f", result.getPrediction()));
            System.out.println("- Latency: " + String.format("%.2f ms", latencyMs));
            
        } catch (Exception e) {
            System.err.println("❌ Error testing inference: " + e.getMessage());
        }
    }
}

Key Learning Points:

  • Memory Management: Efficient memory pooling for high-performance inference
  • Thread Safety: Concurrent request handling with thread-safe operations
  • Model Registry: Centralized model management and routing
  • Performance Optimization: Low-latency inference with sub-millisecond response times
  • Health Monitoring: Built-in health checks and metrics collection
  • Scalability: Handle thousands of concurrent requests efficiently

Advanced Inference Configuration

This example shows advanced configuration options for production environments, including GPU acceleration, model optimization, and custom preprocessing.

import org.superml.inference.InferenceEngine;
import org.superml.inference.optimization.ModelOptimizer;
import org.superml.inference.preprocessing.PreprocessingPipeline;
import org.superml.inference.hardware.GPUAccelerator;

public class AdvancedInferenceConfiguration {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Advanced Inference Configuration ===\n");
        
        try {
            // Configure GPU acceleration
            // GPU acceleration provides significant speedup for large models
            // Automatic fallback to CPU if GPU is unavailable
            var gpuAccelerator = new GPUAccelerator()
                .setDeviceId(0)                      // Use first GPU device
                .setMemoryFraction(0.8)              // Use 80% of GPU memory
                .setBatchSize(32)                    // Optimal batch size for GPU
                .setMixedPrecision(true)             // Use mixed precision for speed
                .setTensorCores(true);               // Enable Tensor Cores if available
            
            System.out.println("πŸš€ GPU Acceleration Configuration:");
            System.out.println("- Device: " + gpuAccelerator.getDeviceName());
            System.out.println("- Memory: " + gpuAccelerator.getAvailableMemory() + " GB");
            System.out.println("- Compute Capability: " + gpuAccelerator.getComputeCapability());
            
            // Configure model optimization
            // Model optimization reduces memory usage and improves inference speed
            // Techniques include quantization, pruning, and graph optimization
            var optimizer = new ModelOptimizer()
                .setQuantization(true)               // Enable 8-bit quantization
                .setQuantizationMode("DYNAMIC")      // Dynamic quantization
                .setPruning(true)                    // Enable weight pruning
                .setPruningSparsity(0.1)            // 10% sparsity
                .setGraphOptimization(true)          // Optimize computation graph
                .setConstantFolding(true)            // Fold constants
                .setDeadCodeElimination(true);       // Remove unused operations
            
            System.out.println("\nπŸ”§ Model Optimization Configuration:");
            System.out.println("- Quantization: 8-bit dynamic");
            System.out.println("- Pruning: 10% sparsity");
            System.out.println("- Graph Optimization: Enabled");
            
            // Configure preprocessing pipeline
            // Preprocessing pipeline handles data transformation before inference
            // Operations are optimized and can run in parallel
            var preprocessor = new PreprocessingPipeline()
                .addScaler("StandardScaler")         // Standardize features
                .addEncoder("LabelEncoder")          // Encode categorical features
                .addFeatureSelector("VarianceThreshold") // Remove low-variance features
                .addDimensionalityReduction("PCA")   // Principal component analysis
                .setParallelProcessing(true)         // Enable parallel processing
                .setCacheEnabled(true);              // Cache preprocessing results
            
            System.out.println("\nπŸ”„ Preprocessing Pipeline:");
            System.out.println("- Scaling: StandardScaler");
            System.out.println("- Encoding: LabelEncoder");
            System.out.println("- Feature Selection: VarianceThreshold");
            System.out.println("- Dimensionality Reduction: PCA");
            
            // Create advanced inference engine
            // The engine combines all optimizations for maximum performance
            var engine = new InferenceEngine()
                .setGPUAccelerator(gpuAccelerator)   // Use GPU acceleration
                .setModelOptimizer(optimizer)        // Apply model optimizations
                .setPreprocessor(preprocessor)       // Use preprocessing pipeline
                .setMaxConcurrentRequests(5000)      // Handle 5000 concurrent requests
                .setResponseTimeout(50)              // 50ms timeout per request
                .setMemoryPoolSize(2048)             // 2GB memory pool
                .setThreadPoolSize(64)               // 64 worker threads
                .setOptimizationLevel("AGGRESSIVE")  // Aggressive optimization
                .setWarmupEnabled(true)              // Warm up models on startup
                .setJITCompilation(true)             // Just-in-time compilation
                .setVectorization(true);             // Enable SIMD vectorization
            
            // Load and optimize models
            System.out.println("\nπŸ“₯ Loading and optimizing models...");
            
            var originalModel = ModelPersistence.load("models/large_neural_network.superml");
            
            // Optimize the model for inference
            // This process may take several minutes but significantly improves performance
            long optimizationStart = System.currentTimeMillis();
            var optimizedModel = optimizer.optimize(originalModel);
            long optimizationTime = System.currentTimeMillis() - optimizationStart;
            
            System.out.println("βœ… Model optimization completed:");
            System.out.println("- Optimization Time: " + optimizationTime + " ms");
            System.out.println("- Model Size Reduction: " + 
                String.format("%.1f%%", (1.0 - (double)optimizedModel.getSize() / originalModel.getSize()) * 100));
            System.out.println("- Memory Usage: " + optimizedModel.getMemoryUsage() + " MB");
            
            // Register optimized model
            var registry = new ModelRegistry()
                .registerModel("optimized-model", optimizedModel);
            
            engine.setModelRegistry(registry);
            
            // Performance benchmarking
            System.out.println("\n🏁 Performance Benchmarking:");
            performanceBenchmark(engine, "optimized-model");
            
            // Monitor resource usage
            System.out.println("\nπŸ“Š Resource Usage Monitoring:");
            monitorResourceUsage(engine);
            
        } catch (Exception e) {
            System.err.println("❌ Error in advanced inference configuration: " + e.getMessage());
            e.printStackTrace();
        }
    }
    
    private static void performanceBenchmark(InferenceEngine engine, String modelId) {
        try {
            // Generate test data
            int batchSize = 100;
            int numBatches = 10;
            double[][] testData = generateTestData(batchSize, 784); // 28x28 images
            
            // Warm up the engine
            System.out.println("πŸ”₯ Warming up inference engine...");
            for (int i = 0; i < 10; i++) {
                engine.predict(modelId, testData);
            }
            
            // Benchmark inference performance
            System.out.println("⚑ Running performance benchmark...");
            long totalTime = 0;
            long totalSamples = 0;
            
            for (int batch = 0; batch < numBatches; batch++) {
                long batchStart = System.nanoTime();
                
                var results = engine.predict(modelId, testData);
                
                long batchTime = System.nanoTime() - batchStart;
                totalTime += batchTime;
                totalSamples += batchSize;
                
                double batchLatency = batchTime / 1_000_000.0;
                double throughput = (batchSize * 1000.0) / batchLatency;
                
                System.out.println("- Batch " + (batch + 1) + ": " + 
                    String.format("%.2f ms", batchLatency) + " (" + 
                    String.format("%.0f samples/sec", throughput) + ")");
            }
            
            // Calculate overall performance metrics
            double avgLatency = (totalTime / numBatches) / 1_000_000.0;
            double avgThroughput = (totalSamples * 1000.0) / (totalTime / 1_000_000.0);
            
            System.out.println("\nπŸ“ˆ Performance Summary:");
            System.out.println("- Average Latency: " + String.format("%.2f ms", avgLatency));
            System.out.println("- Average Throughput: " + String.format("%.0f samples/sec", avgThroughput));
            System.out.println("- Total Samples: " + totalSamples);
            System.out.println("- Total Time: " + String.format("%.2f ms", totalTime / 1_000_000.0));
            
        } catch (Exception e) {
            System.err.println("❌ Error in performance benchmark: " + e.getMessage());
        }
    }
    
    private static void monitorResourceUsage(InferenceEngine engine) {
        try {
            var monitor = engine.getResourceMonitor();
            
            System.out.println("πŸ’Ύ Memory Usage:");
            System.out.println("- Heap Memory: " + monitor.getHeapMemoryUsage() + " MB");
            System.out.println("- Direct Memory: " + monitor.getDirectMemoryUsage() + " MB");
            System.out.println("- GPU Memory: " + monitor.getGPUMemoryUsage() + " MB");
            
            System.out.println("\nπŸ–₯️ CPU Usage:");
            System.out.println("- CPU Usage: " + String.format("%.1f%%", monitor.getCPUUsage()));
            System.out.println("- Thread Pool Usage: " + monitor.getThreadPoolUsage() + "%");
            System.out.println("- Active Threads: " + monitor.getActiveThreads());
            
            System.out.println("\nπŸš€ GPU Usage:");
            System.out.println("- GPU Utilization: " + String.format("%.1f%%", monitor.getGPUUtilization()));
            System.out.println("- GPU Temperature: " + monitor.getGPUTemperature() + "Β°C");
            System.out.println("- GPU Power: " + monitor.getGPUPower() + "W");
            
        } catch (Exception e) {
            System.err.println("❌ Error monitoring resources: " + e.getMessage());
        }
    }
    
    private static double[][] generateTestData(int batchSize, int features) {
        double[][] data = new double[batchSize][features];
        java.util.Random random = new java.util.Random(42);
        
        for (int i = 0; i < batchSize; i++) {
            for (int j = 0; j < features; j++) {
                data[i][j] = random.nextGaussian();
            }
        }
        
        return data;
    }
}

Key Learning Points:

  • GPU Acceleration: Leverage GPU computing for significant performance improvements
  • Model Optimization: Use quantization and pruning to reduce model size and improve speed
  • Preprocessing Pipeline: Efficient data transformation with parallel processing
  • Performance Benchmarking: Measure and optimize inference performance
  • Resource Monitoring: Track CPU, memory, and GPU usage for optimal performance
  • JIT Compilation: Use just-in-time compilation for additional speed improvements

Batch Processing

High-Throughput Batch Inference

This example demonstrates batch processing for high-throughput scenarios where you need to process large datasets efficiently.

import org.superml.inference.BatchInferenceEngine;
import org.superml.inference.batch.BatchProcessor;
import org.superml.inference.batch.BatchConfiguration;
import org.superml.datasets.DataLoader;

public class BatchInferenceExample {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - High-Throughput Batch Inference ===\n");
        
        try {
            // Configure batch processing
            // Batch processing optimizes throughput by processing multiple samples together
            // It's ideal for offline processing and large-scale data analysis
            var batchConfig = new BatchConfiguration()
                .setBatchSize(1000)                  // Process 1000 samples per batch
                .setMaxConcurrentBatches(8)          // Process 8 batches in parallel
                .setQueueSize(10000)                 // Queue up to 10K samples
                .setWorkerThreads(16)                // 16 worker threads
                .setResultBufferSize(100000)         // Buffer 100K results
                .setCheckpointInterval(50000)        // Checkpoint every 50K samples
                .setErrorHandling("CONTINUE")        // Continue processing on errors
                .setProgressReporting(true);         // Enable progress reporting
            
            System.out.println("πŸ”§ Batch Processing Configuration:");
            System.out.println("- Batch Size: " + batchConfig.getBatchSize());
            System.out.println("- Concurrent Batches: " + batchConfig.getMaxConcurrentBatches());
            System.out.println("- Worker Threads: " + batchConfig.getWorkerThreads());
            System.out.println("- Queue Size: " + batchConfig.getQueueSize());
            
            // Create batch inference engine
            // The batch engine is optimized for high throughput rather than low latency
            var batchEngine = new BatchInferenceEngine()
                .setConfiguration(batchConfig)
                .setMemoryOptimization(true)         // Optimize memory usage
                .setCompressionEnabled(true)         // Compress intermediate results
                .setResultCaching(false)             // Disable caching for batch processing
                .setVerbose(true);                   // Enable verbose logging
            
            // Load model for batch processing
            System.out.println("\nπŸ“₯ Loading model for batch processing...");
            var model = ModelPersistence.load("models/image_classifier.superml");
            batchEngine.loadModel("batch-classifier", model);
            
            // Load large dataset
            // In real scenarios, this would be from a database or file system
            System.out.println("πŸ“Š Loading large dataset...");
            var dataLoader = new DataLoader()
                .setDataSource("datasets/large_image_dataset.csv")
                .setBatchSize(batchConfig.getBatchSize())
                .setPreprocessing(true)
                .setShuffling(false);               // Don't shuffle for batch processing
            
            int totalSamples = dataLoader.getTotalSamples();
            System.out.println("βœ… Dataset loaded: " + totalSamples + " samples");
            
            // Create batch processor
            // The processor handles the entire batch inference pipeline
            var processor = new BatchProcessor(batchEngine, dataLoader)
                .setOutputPath("results/batch_predictions.csv")
                .setOutputFormat("CSV")
                .setIncludeConfidence(true)
                .setIncludeTimestamp(true)
                .setCompressionEnabled(true);
            
            // Start batch processing
            System.out.println("\nπŸš€ Starting batch processing...");
            long batchStart = System.currentTimeMillis();
            
            var processingResult = processor.processBatches("batch-classifier");
            
            long batchTime = System.currentTimeMillis() - batchStart;
            
            // Display batch processing results
            System.out.println("\n=== Batch Processing Results ===");
            System.out.println("βœ… Processing completed successfully!");
            System.out.println("πŸ“Š Total Samples: " + processingResult.getTotalSamples());
            System.out.println("πŸ“Š Successful Predictions: " + processingResult.getSuccessfulPredictions());
            System.out.println("πŸ“Š Failed Predictions: " + processingResult.getFailedPredictions());
            System.out.println("⏱️ Total Time: " + batchTime + " ms");
            System.out.println("⚑ Throughput: " + 
                String.format("%.0f samples/sec", processingResult.getTotalSamples() * 1000.0 / batchTime));
            
            // Performance breakdown
            System.out.println("\nπŸ“ˆ Performance Breakdown:");
            System.out.println("- Loading Time: " + processingResult.getLoadingTime() + " ms");
            System.out.println("- Preprocessing Time: " + processingResult.getPreprocessingTime() + " ms");
            System.out.println("- Inference Time: " + processingResult.getInferenceTime() + " ms");
            System.out.println("- Postprocessing Time: " + processingResult.getPostprocessingTime() + " ms");
            System.out.println("- I/O Time: " + processingResult.getIOTime() + " ms");
            
            // Resource utilization
            System.out.println("\nπŸ’Ύ Resource Utilization:");
            System.out.println("- Peak Memory Usage: " + processingResult.getPeakMemoryUsage() + " MB");
            System.out.println("- Average CPU Usage: " + String.format("%.1f%%", processingResult.getAverageCPUUsage()));
            System.out.println("- GPU Utilization: " + String.format("%.1f%%", processingResult.getGPUUtilization()));
            
            // Error analysis
            if (processingResult.getFailedPredictions() > 0) {
                System.out.println("\n⚠️ Error Analysis:");
                var errorSummary = processingResult.getErrorSummary();
                errorSummary.forEach((errorType, count) -> {
                    System.out.println("- " + errorType + ": " + count + " occurrences");
                });
            }
            
            // Save processing report
            processingResult.saveReport("reports/batch_processing_report.json");
            System.out.println("\nπŸ“„ Processing report saved to: reports/batch_processing_report.json");
            
        } catch (Exception e) {
            System.err.println("❌ Error in batch processing: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Key Learning Points:

  • Batch Optimization: Process multiple samples together for maximum throughput
  • Parallel Processing: Use multiple threads and concurrent batches for scalability
  • Memory Management: Optimize memory usage for large-scale processing
  • Error Handling: Robust error handling to continue processing despite failures
  • Progress Monitoring: Track processing progress and performance metrics
  • Resource Utilization: Monitor CPU, memory, and GPU usage during batch processing

Load Balancing and Distributed Inference

Distributed Inference Cluster

This example shows how to set up a distributed inference cluster for handling massive scale and providing high availability.

import org.superml.inference.cluster.InferenceCluster;
import org.superml.inference.cluster.LoadBalancer;
import org.superml.inference.cluster.NodeManager;
import org.superml.inference.cluster.HealthChecker;

public class DistributedInferenceCluster {
    public static void main(String[] args) {
        System.out.println("=== SuperML 2.1.0 - Distributed Inference Cluster ===\n");
        
        try {
            // Configure cluster nodes
            // Each node runs an inference engine instance
            // Nodes can be on different machines for true distributed computing
            var nodeConfig = new NodeManager.NodeConfiguration()
                .setMaxConcurrentRequests(2000)     // 2000 requests per node
                .setResponseTimeout(200)             // 200ms timeout per node
                .setMemoryLimit(4096)                // 4GB memory limit per node
                .setHealthCheckInterval(30)          // Health check every 30 seconds
                .setFailureThreshold(3)              // Mark unhealthy after 3 failures
                .setRecoveryTimeout(300);            // 5 minute recovery timeout
            
            System.out.println("πŸ–₯️ Node Configuration:");
            System.out.println("- Max Concurrent Requests: " + nodeConfig.getMaxConcurrentRequests());
            System.out.println("- Response Timeout: " + nodeConfig.getResponseTimeout() + " ms");
            System.out.println("- Memory Limit: " + nodeConfig.getMemoryLimit() + " MB");
            
            // Create cluster nodes
            // In production, these would be separate machines or containers
            var nodeManager = new NodeManager();
            
            // Add inference nodes
            System.out.println("\nπŸ”§ Setting up inference nodes...");
            nodeManager.addNode("node-1", "localhost:8081", nodeConfig);
            nodeManager.addNode("node-2", "localhost:8082", nodeConfig);
            nodeManager.addNode("node-3", "localhost:8083", nodeConfig);
            nodeManager.addNode("node-4", "localhost:8084", nodeConfig);
            
            // Configure load balancer
            // The load balancer distributes requests across healthy nodes
            // Multiple algorithms available: round-robin, least-connections, weighted
            var loadBalancer = new LoadBalancer()
                .setAlgorithm("WEIGHTED_ROUND_ROBIN")    // Weight by node performance
                .setHealthCheckEnabled(true)             // Only route to healthy nodes
                .setStickySessions(false)                // Don't use sticky sessions
                .setFailoverEnabled(true)                // Enable automatic failover
                .setMaxRetries(3)                        // Retry failed requests 3 times
                .setRetryDelay(100)                      // 100ms between retries
                .setCircuitBreakerEnabled(true);         // Enable circuit breaker
            
            System.out.println("βš–οΈ Load Balancer Configuration:");
            System.out.println("- Algorithm: " + loadBalancer.getAlgorithm());
            System.out.println("- Health Check: " + loadBalancer.isHealthCheckEnabled());
            System.out.println("- Failover: " + loadBalancer.isFailoverEnabled());
            System.out.println("- Max Retries: " + loadBalancer.getMaxRetries());
            
            // Configure health checker
            // Monitors node health and removes unhealthy nodes from load balancing
            var healthChecker = new HealthChecker()
                .setCheckInterval(15)                    // Check every 15 seconds
                .setTimeoutPerCheck(5)                   // 5 second timeout per check
                .setUnhealthyThreshold(2)                // 2 failures = unhealthy
                .setHealthyThreshold(3)                  // 3 successes = healthy
                .setMetricsEnabled(true)                 // Collect health metrics
                .setAlertingEnabled(true);               // Enable health alerts
            
            // Create inference cluster
            // The cluster coordinates all nodes, load balancing, and health checking
            var cluster = new InferenceCluster()
                .setNodeManager(nodeManager)
                .setLoadBalancer(loadBalancer)
                .setHealthChecker(healthChecker)
                .setModelReplicationEnabled(true)       // Replicate models across nodes
                .setAutoScalingEnabled(true)             // Enable auto-scaling
                .setMinNodes(2)                          // Minimum 2 nodes
                .setMaxNodes(10)                         // Maximum 10 nodes
                .setScaleUpThreshold(0.8)               // Scale up at 80% capacity
                .setScaleDownThreshold(0.3);            // Scale down at 30% capacity
            
            System.out.println("\n🌐 Cluster Configuration:");
            System.out.println("- Auto-scaling: " + cluster.isAutoScalingEnabled());
            System.out.println("- Min Nodes: " + cluster.getMinNodes());
            System.out.println("- Max Nodes: " + cluster.getMaxNodes());
            System.out.println("- Scale Up Threshold: " + (cluster.getScaleUpThreshold() * 100) + "%");
            
            // Start cluster
            System.out.println("\nπŸš€ Starting inference cluster...");
            cluster.start();
            
            // Wait for all nodes to be ready
            System.out.println("⏳ Waiting for nodes to be ready...");
            cluster.waitForNodesReady(60); // Wait up to 60 seconds
            
            // Load and distribute models
            System.out.println("\nπŸ“₯ Loading and distributing models...");
            var model = ModelPersistence.load("models/production_model.superml");
            cluster.deployModel("production-model", model);
            
            System.out.println("βœ… Cluster started successfully!");
            System.out.println("🌐 Cluster Endpoint: http://localhost:8080/predict");
            System.out.println("πŸ“Š Cluster Status: " + cluster.getClusterStatus());
            
            // Simulate load testing
            System.out.println("\nπŸ§ͺ Load Testing Cluster...");
            loadTestCluster(cluster);
            
            // Monitor cluster performance
            System.out.println("\nπŸ“Š Monitoring cluster performance...");
            monitorClusterPerformance(cluster);
            
            // Test failover scenarios
            System.out.println("\nπŸ”„ Testing failover scenarios...");
            testFailoverScenarios(cluster);
            
        } catch (Exception e) {
            System.err.println("❌ Error in distributed inference cluster: " + e.getMessage());
            e.printStackTrace();
        }
    }
    
    private static void loadTestCluster(InferenceCluster cluster) {
        try {
            // Simulate high load
            int numRequests = 10000;
            int concurrentThreads = 50;
            
            System.out.println("πŸ”₯ Load Test Configuration:");
            System.out.println("- Total Requests: " + numRequests);
            System.out.println("- Concurrent Threads: " + concurrentThreads);
            
            var executor = java.util.concurrent.Executors.newFixedThreadPool(concurrentThreads);
            var startTime = System.currentTimeMillis();
            var completedRequests = new java.util.concurrent.atomic.AtomicInteger(0);
            var failedRequests = new java.util.concurrent.atomic.AtomicInteger(0);
            
            // Submit requests
            for (int i = 0; i < numRequests; i++) {
                final int requestId = i;
                executor.submit(() -> {
                    try {
                        double[][] testData = {{Math.random(), Math.random(), Math.random()}};
                        var result = cluster.predict("production-model", testData);
                        
                        if (result != null) {
                            completedRequests.incrementAndGet();
                        } else {
                            failedRequests.incrementAndGet();
                        }
                    } catch (Exception e) {
                        failedRequests.incrementAndGet();
                    }
                });
            }
            
            // Wait for completion
            executor.shutdown();
            executor.awaitTermination(5, java.util.concurrent.TimeUnit.MINUTES);
            
            long totalTime = System.currentTimeMillis() - startTime;
            
            System.out.println("\nπŸ“ˆ Load Test Results:");
            System.out.println("- Completed Requests: " + completedRequests.get());
            System.out.println("- Failed Requests: " + failedRequests.get());
            System.out.println("- Success Rate: " + 
                String.format("%.2f%%", (completedRequests.get() * 100.0) / numRequests));
            System.out.println("- Total Time: " + totalTime + " ms");
            System.out.println("- Throughput: " + 
                String.format("%.0f requests/sec", numRequests * 1000.0 / totalTime));
            
        } catch (Exception e) {
            System.err.println("❌ Error in load testing: " + e.getMessage());
        }
    }
    
    private static void monitorClusterPerformance(InferenceCluster cluster) {
        try {
            var monitor = cluster.getClusterMonitor();
            
            System.out.println("πŸ–₯️ Cluster Performance Metrics:");
            System.out.println("- Active Nodes: " + monitor.getActiveNodes());
            System.out.println("- Total Requests: " + monitor.getTotalRequests());
            System.out.println("- Requests Per Second: " + monitor.getRequestsPerSecond());
            System.out.println("- Average Response Time: " + monitor.getAverageResponseTime() + " ms");
            System.out.println("- P95 Response Time: " + monitor.getP95ResponseTime() + " ms");
            System.out.println("- P99 Response Time: " + monitor.getP99ResponseTime() + " ms");
            System.out.println("- Error Rate: " + String.format("%.2f%%", monitor.getErrorRate() * 100));
            
            System.out.println("\nπŸ’Ύ Resource Utilization:");
            System.out.println("- CPU Usage: " + String.format("%.1f%%", monitor.getAverageCPUUsage()));
            System.out.println("- Memory Usage: " + String.format("%.1f%%", monitor.getAverageMemoryUsage()));
            System.out.println("- Network I/O: " + monitor.getNetworkIO() + " MB/s");
            
        } catch (Exception e) {
            System.err.println("❌ Error monitoring cluster: " + e.getMessage());
        }
    }
    
    private static void testFailoverScenarios(InferenceCluster cluster) {
        try {
            System.out.println("πŸ”„ Testing Node Failover...");
            
            // Simulate node failure
            cluster.simulateNodeFailure("node-1");
            
            // Continue making requests
            for (int i = 0; i < 100; i++) {
                try {
                    double[][] testData = {{Math.random(), Math.random(), Math.random()}};
                    var result = cluster.predict("production-model", testData);
                    
                    if (i % 20 == 0) {
                        System.out.println("- Request " + i + ": " + 
                            (result != null ? "Success" : "Failed"));
                    }
                } catch (Exception e) {
                    System.out.println("- Request " + i + ": Failed - " + e.getMessage());
                }
            }
            
            // Recover node
            cluster.recoverNode("node-1");
            
            System.out.println("βœ… Failover test completed");
            System.out.println("πŸ“Š Active Nodes: " + cluster.getActiveNodeCount());
            
        } catch (Exception e) {
            System.err.println("❌ Error in failover testing: " + e.getMessage());
        }
    }
}

Key Learning Points:

  • Distributed Architecture: Scale inference across multiple nodes for high availability
  • Load Balancing: Distribute requests efficiently across healthy nodes
  • Health Monitoring: Continuously monitor node health and remove unhealthy nodes
  • Auto-scaling: Automatically scale the cluster based on demand
  • Failover Handling: Gracefully handle node failures without service interruption
  • Performance Monitoring: Track cluster performance and resource utilization

Best Practices

1. Performance Optimization

  • Model Optimization: Use quantization and pruning to reduce model size
  • Batch Processing: Process multiple samples together for higher throughput
  • GPU Acceleration: Leverage GPU computing for significant speedups
  • Memory Management: Use memory pooling and efficient data structures

2. Production Deployment

  • Health Checks: Implement comprehensive health monitoring
  • Load Balancing: Distribute traffic across multiple instances
  • Auto-scaling: Scale based on demand and resource utilization
  • Monitoring: Track performance metrics and resource usage

3. Error Handling

  • Graceful Degradation: Continue service even with partial failures
  • Circuit Breakers: Prevent cascading failures
  • Retry Logic: Implement intelligent retry mechanisms
  • Fallback Models: Maintain backup models for critical services

4. Security and Compliance

  • Authentication: Secure API endpoints with proper authentication
  • Rate Limiting: Prevent abuse with rate limiting
  • Data Privacy: Ensure data privacy and compliance requirements
  • Audit Logging: Log all inference requests for audit purposes

Summary

In this tutorial, you learned:

  • Real-Time Inference: Build high-performance inference servers with sub-millisecond latency
  • Batch Processing: Process large datasets efficiently with parallel processing
  • Model Optimization: Optimize models for production deployment
  • Distributed Systems: Scale inference across multiple nodes with load balancing
  • Performance Monitoring: Monitor and optimize inference performance
  • Production Patterns: Enterprise-grade deployment patterns and best practices

The SuperML Inference Engine provides the foundation for building scalable, high-performance machine learning services that can handle production workloads with reliability and efficiency.

Next Steps

  • Model Deployment: Learn production deployment strategies
  • Enterprise Patterns: Explore advanced enterprise integration patterns
  • ML Optimization: Optimize models for specific hardware platforms
  • MLOps Integration: Integrate with CI/CD and monitoring systems
  • Real-time Streaming: Build real-time streaming inference pipelines

You’re now ready to deploy machine learning models at scale with the SuperML Inference Engine!