Course Content
High-Performance Inference Engine
Deploying models with microsecond predictions and intelligent caching
Java Inference Engine - High-Performance Model Serving
The SuperML Java 2.1.0 Inference Engine provides enterprise-grade model serving capabilities with millisecond response times, automatic scaling, and comprehensive monitoring. This tutorial covers real-time inference, batch processing, model optimization, and production deployment patterns.
What Youβll Learn
- Real-Time Inference - Millisecond response times for production APIs
- Batch Processing - High-throughput processing of large datasets
- Model Optimization - Quantization, pruning, and performance tuning
- Load Balancing - Distributed inference across multiple instances
- Monitoring & Logging - Comprehensive observability and debugging
- Resource Management - Memory optimization and GPU acceleration
- Enterprise Patterns - Microservices, caching, and fault tolerance
Prerequisites
- Completion of βAutoML in Javaβ tutorial
- Understanding of model training and evaluation
- Java development environment with SuperML Java 2.1.0
- Basic knowledge of REST APIs and microservices
Inference Engine Overview
The SuperML Inference Engine provides:
- Ultra-Low Latency: Sub-millisecond inference times
- High Throughput: Process millions of requests per second
- Auto-Scaling: Dynamic resource allocation based on load
- Multi-Model Support: Serve multiple models simultaneously
- Hardware Acceleration: GPU, TPU, and specialized hardware support
- Enterprise Features: Monitoring, logging, and fault tolerance
Real-Time Inference
Basic Inference Server
This example demonstrates how to create a high-performance inference server for real-time predictions. The server handles individual requests with minimal latency.
import org.superml.inference.InferenceEngine;
import org.superml.inference.ModelRegistry;
import org.superml.inference.server.InferenceServer;
import org.superml.persistence.ModelPersistence;
public class BasicInferenceServer {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Real-Time Inference Server ===\n");
try {
// Create inference engine
// The inference engine is optimized for low-latency predictions
// It uses memory pooling and thread-safe operations
var engine = new InferenceEngine()
.setMaxConcurrentRequests(1000) // Handle 1000 concurrent requests
.setResponseTimeout(100) // 100ms timeout per request
.setMemoryPoolSize(512) // 512MB memory pool
.setThreadPoolSize(32) // 32 worker threads
.setOptimizationLevel("HIGH") // High optimization for speed
.setWarmupEnabled(true); // Warm up models on startup
// Load pre-trained models
// Models are loaded into memory for fast access
// Multiple models can be served simultaneously
System.out.println("π₯ Loading pre-trained models...");
var classificationModel = ModelPersistence.load("models/iris_classifier.superml");
var regressionModel = ModelPersistence.load("models/house_price_predictor.superml");
var neuralNetworkModel = ModelPersistence.load("models/image_classifier.superml");
// Register models with inference engine
// Each model gets a unique ID for routing requests
// Model metadata is stored for monitoring and debugging
var registry = new ModelRegistry()
.registerModel("iris-classifier", classificationModel)
.registerModel("house-predictor", regressionModel)
.registerModel("image-classifier", neuralNetworkModel);
engine.setModelRegistry(registry);
System.out.println("β
Models loaded successfully:");
System.out.println("- iris-classifier: Classification model");
System.out.println("- house-predictor: Regression model");
System.out.println("- image-classifier: Neural network model");
// Configure inference server
// The server handles HTTP requests and routes them to appropriate models
// It provides automatic serialization/deserialization
var server = new InferenceServer()
.setPort(8080) // Listen on port 8080
.setInferenceEngine(engine) // Use our configured engine
.setHealthCheckEnabled(true) // Enable health checks
.setMetricsEnabled(true) // Enable metrics collection
.setCorsEnabled(true) // Enable CORS for web clients
.setRateLimitEnabled(true) // Enable rate limiting
.setMaxRequestsPerSecond(10000); // 10K requests per second limit
// Start the server
// The server starts in a separate thread and handles requests asynchronously
System.out.println("\nπ Starting inference server...");
server.start();
System.out.println("β
Inference server started successfully!");
System.out.println("π‘ Server URL: http://localhost:8080");
System.out.println("π Health Check: http://localhost:8080/health");
System.out.println("π Metrics: http://localhost:8080/metrics");
// Example API endpoints
System.out.println("\nπ Available API Endpoints:");
System.out.println("POST /predict/iris-classifier - Iris classification");
System.out.println("POST /predict/house-predictor - House price prediction");
System.out.println("POST /predict/image-classifier - Image classification");
System.out.println("GET /models - List all available models");
System.out.println("GET /models/{modelId}/info - Get model information");
// Simulate some inference requests
System.out.println("\nπ§ͺ Testing inference performance...");
testInferencePerformance(engine);
// Keep server running
System.out.println("\nβ³ Server is running. Press Ctrl+C to stop...");
Thread.currentThread().join();
} catch (Exception e) {
System.err.println("β Error starting inference server: " + e.getMessage());
e.printStackTrace();
}
}
private static void testInferencePerformance(InferenceEngine engine) {
try {
// Test classification model
double[][] irisData = {{5.1, 3.5, 1.4, 0.2}};
long startTime = System.nanoTime();
var result = engine.predict("iris-classifier", irisData);
long inferenceTime = System.nanoTime() - startTime;
double latencyMs = inferenceTime / 1_000_000.0;
System.out.println("β‘ Iris Classification:");
System.out.println("- Input: " + java.util.Arrays.toString(irisData[0]));
System.out.println("- Prediction: " + result.getPrediction());
System.out.println("- Confidence: " + String.format("%.4f", result.getConfidence()));
System.out.println("- Latency: " + String.format("%.2f ms", latencyMs));
// Test regression model
double[][] houseData = {{1500, 3, 2, 10}};
startTime = System.nanoTime();
result = engine.predict("house-predictor", houseData);
inferenceTime = System.nanoTime() - startTime;
latencyMs = inferenceTime / 1_000_000.0;
System.out.println("\nπ House Price Prediction:");
System.out.println("- Input: " + java.util.Arrays.toString(houseData[0]));
System.out.println("- Prediction: $" + String.format("%.0f", result.getPrediction()));
System.out.println("- Latency: " + String.format("%.2f ms", latencyMs));
} catch (Exception e) {
System.err.println("β Error testing inference: " + e.getMessage());
}
}
}
Key Learning Points:
- Memory Management: Efficient memory pooling for high-performance inference
- Thread Safety: Concurrent request handling with thread-safe operations
- Model Registry: Centralized model management and routing
- Performance Optimization: Low-latency inference with sub-millisecond response times
- Health Monitoring: Built-in health checks and metrics collection
- Scalability: Handle thousands of concurrent requests efficiently
Advanced Inference Configuration
This example shows advanced configuration options for production environments, including GPU acceleration, model optimization, and custom preprocessing.
import org.superml.inference.InferenceEngine;
import org.superml.inference.optimization.ModelOptimizer;
import org.superml.inference.preprocessing.PreprocessingPipeline;
import org.superml.inference.hardware.GPUAccelerator;
public class AdvancedInferenceConfiguration {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Advanced Inference Configuration ===\n");
try {
// Configure GPU acceleration
// GPU acceleration provides significant speedup for large models
// Automatic fallback to CPU if GPU is unavailable
var gpuAccelerator = new GPUAccelerator()
.setDeviceId(0) // Use first GPU device
.setMemoryFraction(0.8) // Use 80% of GPU memory
.setBatchSize(32) // Optimal batch size for GPU
.setMixedPrecision(true) // Use mixed precision for speed
.setTensorCores(true); // Enable Tensor Cores if available
System.out.println("π GPU Acceleration Configuration:");
System.out.println("- Device: " + gpuAccelerator.getDeviceName());
System.out.println("- Memory: " + gpuAccelerator.getAvailableMemory() + " GB");
System.out.println("- Compute Capability: " + gpuAccelerator.getComputeCapability());
// Configure model optimization
// Model optimization reduces memory usage and improves inference speed
// Techniques include quantization, pruning, and graph optimization
var optimizer = new ModelOptimizer()
.setQuantization(true) // Enable 8-bit quantization
.setQuantizationMode("DYNAMIC") // Dynamic quantization
.setPruning(true) // Enable weight pruning
.setPruningSparsity(0.1) // 10% sparsity
.setGraphOptimization(true) // Optimize computation graph
.setConstantFolding(true) // Fold constants
.setDeadCodeElimination(true); // Remove unused operations
System.out.println("\nπ§ Model Optimization Configuration:");
System.out.println("- Quantization: 8-bit dynamic");
System.out.println("- Pruning: 10% sparsity");
System.out.println("- Graph Optimization: Enabled");
// Configure preprocessing pipeline
// Preprocessing pipeline handles data transformation before inference
// Operations are optimized and can run in parallel
var preprocessor = new PreprocessingPipeline()
.addScaler("StandardScaler") // Standardize features
.addEncoder("LabelEncoder") // Encode categorical features
.addFeatureSelector("VarianceThreshold") // Remove low-variance features
.addDimensionalityReduction("PCA") // Principal component analysis
.setParallelProcessing(true) // Enable parallel processing
.setCacheEnabled(true); // Cache preprocessing results
System.out.println("\nπ Preprocessing Pipeline:");
System.out.println("- Scaling: StandardScaler");
System.out.println("- Encoding: LabelEncoder");
System.out.println("- Feature Selection: VarianceThreshold");
System.out.println("- Dimensionality Reduction: PCA");
// Create advanced inference engine
// The engine combines all optimizations for maximum performance
var engine = new InferenceEngine()
.setGPUAccelerator(gpuAccelerator) // Use GPU acceleration
.setModelOptimizer(optimizer) // Apply model optimizations
.setPreprocessor(preprocessor) // Use preprocessing pipeline
.setMaxConcurrentRequests(5000) // Handle 5000 concurrent requests
.setResponseTimeout(50) // 50ms timeout per request
.setMemoryPoolSize(2048) // 2GB memory pool
.setThreadPoolSize(64) // 64 worker threads
.setOptimizationLevel("AGGRESSIVE") // Aggressive optimization
.setWarmupEnabled(true) // Warm up models on startup
.setJITCompilation(true) // Just-in-time compilation
.setVectorization(true); // Enable SIMD vectorization
// Load and optimize models
System.out.println("\nπ₯ Loading and optimizing models...");
var originalModel = ModelPersistence.load("models/large_neural_network.superml");
// Optimize the model for inference
// This process may take several minutes but significantly improves performance
long optimizationStart = System.currentTimeMillis();
var optimizedModel = optimizer.optimize(originalModel);
long optimizationTime = System.currentTimeMillis() - optimizationStart;
System.out.println("β
Model optimization completed:");
System.out.println("- Optimization Time: " + optimizationTime + " ms");
System.out.println("- Model Size Reduction: " +
String.format("%.1f%%", (1.0 - (double)optimizedModel.getSize() / originalModel.getSize()) * 100));
System.out.println("- Memory Usage: " + optimizedModel.getMemoryUsage() + " MB");
// Register optimized model
var registry = new ModelRegistry()
.registerModel("optimized-model", optimizedModel);
engine.setModelRegistry(registry);
// Performance benchmarking
System.out.println("\nπ Performance Benchmarking:");
performanceBenchmark(engine, "optimized-model");
// Monitor resource usage
System.out.println("\nπ Resource Usage Monitoring:");
monitorResourceUsage(engine);
} catch (Exception e) {
System.err.println("β Error in advanced inference configuration: " + e.getMessage());
e.printStackTrace();
}
}
private static void performanceBenchmark(InferenceEngine engine, String modelId) {
try {
// Generate test data
int batchSize = 100;
int numBatches = 10;
double[][] testData = generateTestData(batchSize, 784); // 28x28 images
// Warm up the engine
System.out.println("π₯ Warming up inference engine...");
for (int i = 0; i < 10; i++) {
engine.predict(modelId, testData);
}
// Benchmark inference performance
System.out.println("β‘ Running performance benchmark...");
long totalTime = 0;
long totalSamples = 0;
for (int batch = 0; batch < numBatches; batch++) {
long batchStart = System.nanoTime();
var results = engine.predict(modelId, testData);
long batchTime = System.nanoTime() - batchStart;
totalTime += batchTime;
totalSamples += batchSize;
double batchLatency = batchTime / 1_000_000.0;
double throughput = (batchSize * 1000.0) / batchLatency;
System.out.println("- Batch " + (batch + 1) + ": " +
String.format("%.2f ms", batchLatency) + " (" +
String.format("%.0f samples/sec", throughput) + ")");
}
// Calculate overall performance metrics
double avgLatency = (totalTime / numBatches) / 1_000_000.0;
double avgThroughput = (totalSamples * 1000.0) / (totalTime / 1_000_000.0);
System.out.println("\nπ Performance Summary:");
System.out.println("- Average Latency: " + String.format("%.2f ms", avgLatency));
System.out.println("- Average Throughput: " + String.format("%.0f samples/sec", avgThroughput));
System.out.println("- Total Samples: " + totalSamples);
System.out.println("- Total Time: " + String.format("%.2f ms", totalTime / 1_000_000.0));
} catch (Exception e) {
System.err.println("β Error in performance benchmark: " + e.getMessage());
}
}
private static void monitorResourceUsage(InferenceEngine engine) {
try {
var monitor = engine.getResourceMonitor();
System.out.println("πΎ Memory Usage:");
System.out.println("- Heap Memory: " + monitor.getHeapMemoryUsage() + " MB");
System.out.println("- Direct Memory: " + monitor.getDirectMemoryUsage() + " MB");
System.out.println("- GPU Memory: " + monitor.getGPUMemoryUsage() + " MB");
System.out.println("\nπ₯οΈ CPU Usage:");
System.out.println("- CPU Usage: " + String.format("%.1f%%", monitor.getCPUUsage()));
System.out.println("- Thread Pool Usage: " + monitor.getThreadPoolUsage() + "%");
System.out.println("- Active Threads: " + monitor.getActiveThreads());
System.out.println("\nπ GPU Usage:");
System.out.println("- GPU Utilization: " + String.format("%.1f%%", monitor.getGPUUtilization()));
System.out.println("- GPU Temperature: " + monitor.getGPUTemperature() + "Β°C");
System.out.println("- GPU Power: " + monitor.getGPUPower() + "W");
} catch (Exception e) {
System.err.println("β Error monitoring resources: " + e.getMessage());
}
}
private static double[][] generateTestData(int batchSize, int features) {
double[][] data = new double[batchSize][features];
java.util.Random random = new java.util.Random(42);
for (int i = 0; i < batchSize; i++) {
for (int j = 0; j < features; j++) {
data[i][j] = random.nextGaussian();
}
}
return data;
}
}
Key Learning Points:
- GPU Acceleration: Leverage GPU computing for significant performance improvements
- Model Optimization: Use quantization and pruning to reduce model size and improve speed
- Preprocessing Pipeline: Efficient data transformation with parallel processing
- Performance Benchmarking: Measure and optimize inference performance
- Resource Monitoring: Track CPU, memory, and GPU usage for optimal performance
- JIT Compilation: Use just-in-time compilation for additional speed improvements
Batch Processing
High-Throughput Batch Inference
This example demonstrates batch processing for high-throughput scenarios where you need to process large datasets efficiently.
import org.superml.inference.BatchInferenceEngine;
import org.superml.inference.batch.BatchProcessor;
import org.superml.inference.batch.BatchConfiguration;
import org.superml.datasets.DataLoader;
public class BatchInferenceExample {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - High-Throughput Batch Inference ===\n");
try {
// Configure batch processing
// Batch processing optimizes throughput by processing multiple samples together
// It's ideal for offline processing and large-scale data analysis
var batchConfig = new BatchConfiguration()
.setBatchSize(1000) // Process 1000 samples per batch
.setMaxConcurrentBatches(8) // Process 8 batches in parallel
.setQueueSize(10000) // Queue up to 10K samples
.setWorkerThreads(16) // 16 worker threads
.setResultBufferSize(100000) // Buffer 100K results
.setCheckpointInterval(50000) // Checkpoint every 50K samples
.setErrorHandling("CONTINUE") // Continue processing on errors
.setProgressReporting(true); // Enable progress reporting
System.out.println("π§ Batch Processing Configuration:");
System.out.println("- Batch Size: " + batchConfig.getBatchSize());
System.out.println("- Concurrent Batches: " + batchConfig.getMaxConcurrentBatches());
System.out.println("- Worker Threads: " + batchConfig.getWorkerThreads());
System.out.println("- Queue Size: " + batchConfig.getQueueSize());
// Create batch inference engine
// The batch engine is optimized for high throughput rather than low latency
var batchEngine = new BatchInferenceEngine()
.setConfiguration(batchConfig)
.setMemoryOptimization(true) // Optimize memory usage
.setCompressionEnabled(true) // Compress intermediate results
.setResultCaching(false) // Disable caching for batch processing
.setVerbose(true); // Enable verbose logging
// Load model for batch processing
System.out.println("\nπ₯ Loading model for batch processing...");
var model = ModelPersistence.load("models/image_classifier.superml");
batchEngine.loadModel("batch-classifier", model);
// Load large dataset
// In real scenarios, this would be from a database or file system
System.out.println("π Loading large dataset...");
var dataLoader = new DataLoader()
.setDataSource("datasets/large_image_dataset.csv")
.setBatchSize(batchConfig.getBatchSize())
.setPreprocessing(true)
.setShuffling(false); // Don't shuffle for batch processing
int totalSamples = dataLoader.getTotalSamples();
System.out.println("β
Dataset loaded: " + totalSamples + " samples");
// Create batch processor
// The processor handles the entire batch inference pipeline
var processor = new BatchProcessor(batchEngine, dataLoader)
.setOutputPath("results/batch_predictions.csv")
.setOutputFormat("CSV")
.setIncludeConfidence(true)
.setIncludeTimestamp(true)
.setCompressionEnabled(true);
// Start batch processing
System.out.println("\nπ Starting batch processing...");
long batchStart = System.currentTimeMillis();
var processingResult = processor.processBatches("batch-classifier");
long batchTime = System.currentTimeMillis() - batchStart;
// Display batch processing results
System.out.println("\n=== Batch Processing Results ===");
System.out.println("β
Processing completed successfully!");
System.out.println("π Total Samples: " + processingResult.getTotalSamples());
System.out.println("π Successful Predictions: " + processingResult.getSuccessfulPredictions());
System.out.println("π Failed Predictions: " + processingResult.getFailedPredictions());
System.out.println("β±οΈ Total Time: " + batchTime + " ms");
System.out.println("β‘ Throughput: " +
String.format("%.0f samples/sec", processingResult.getTotalSamples() * 1000.0 / batchTime));
// Performance breakdown
System.out.println("\nπ Performance Breakdown:");
System.out.println("- Loading Time: " + processingResult.getLoadingTime() + " ms");
System.out.println("- Preprocessing Time: " + processingResult.getPreprocessingTime() + " ms");
System.out.println("- Inference Time: " + processingResult.getInferenceTime() + " ms");
System.out.println("- Postprocessing Time: " + processingResult.getPostprocessingTime() + " ms");
System.out.println("- I/O Time: " + processingResult.getIOTime() + " ms");
// Resource utilization
System.out.println("\nπΎ Resource Utilization:");
System.out.println("- Peak Memory Usage: " + processingResult.getPeakMemoryUsage() + " MB");
System.out.println("- Average CPU Usage: " + String.format("%.1f%%", processingResult.getAverageCPUUsage()));
System.out.println("- GPU Utilization: " + String.format("%.1f%%", processingResult.getGPUUtilization()));
// Error analysis
if (processingResult.getFailedPredictions() > 0) {
System.out.println("\nβ οΈ Error Analysis:");
var errorSummary = processingResult.getErrorSummary();
errorSummary.forEach((errorType, count) -> {
System.out.println("- " + errorType + ": " + count + " occurrences");
});
}
// Save processing report
processingResult.saveReport("reports/batch_processing_report.json");
System.out.println("\nπ Processing report saved to: reports/batch_processing_report.json");
} catch (Exception e) {
System.err.println("β Error in batch processing: " + e.getMessage());
e.printStackTrace();
}
}
}
Key Learning Points:
- Batch Optimization: Process multiple samples together for maximum throughput
- Parallel Processing: Use multiple threads and concurrent batches for scalability
- Memory Management: Optimize memory usage for large-scale processing
- Error Handling: Robust error handling to continue processing despite failures
- Progress Monitoring: Track processing progress and performance metrics
- Resource Utilization: Monitor CPU, memory, and GPU usage during batch processing
Load Balancing and Distributed Inference
Distributed Inference Cluster
This example shows how to set up a distributed inference cluster for handling massive scale and providing high availability.
import org.superml.inference.cluster.InferenceCluster;
import org.superml.inference.cluster.LoadBalancer;
import org.superml.inference.cluster.NodeManager;
import org.superml.inference.cluster.HealthChecker;
public class DistributedInferenceCluster {
public static void main(String[] args) {
System.out.println("=== SuperML 2.1.0 - Distributed Inference Cluster ===\n");
try {
// Configure cluster nodes
// Each node runs an inference engine instance
// Nodes can be on different machines for true distributed computing
var nodeConfig = new NodeManager.NodeConfiguration()
.setMaxConcurrentRequests(2000) // 2000 requests per node
.setResponseTimeout(200) // 200ms timeout per node
.setMemoryLimit(4096) // 4GB memory limit per node
.setHealthCheckInterval(30) // Health check every 30 seconds
.setFailureThreshold(3) // Mark unhealthy after 3 failures
.setRecoveryTimeout(300); // 5 minute recovery timeout
System.out.println("π₯οΈ Node Configuration:");
System.out.println("- Max Concurrent Requests: " + nodeConfig.getMaxConcurrentRequests());
System.out.println("- Response Timeout: " + nodeConfig.getResponseTimeout() + " ms");
System.out.println("- Memory Limit: " + nodeConfig.getMemoryLimit() + " MB");
// Create cluster nodes
// In production, these would be separate machines or containers
var nodeManager = new NodeManager();
// Add inference nodes
System.out.println("\nπ§ Setting up inference nodes...");
nodeManager.addNode("node-1", "localhost:8081", nodeConfig);
nodeManager.addNode("node-2", "localhost:8082", nodeConfig);
nodeManager.addNode("node-3", "localhost:8083", nodeConfig);
nodeManager.addNode("node-4", "localhost:8084", nodeConfig);
// Configure load balancer
// The load balancer distributes requests across healthy nodes
// Multiple algorithms available: round-robin, least-connections, weighted
var loadBalancer = new LoadBalancer()
.setAlgorithm("WEIGHTED_ROUND_ROBIN") // Weight by node performance
.setHealthCheckEnabled(true) // Only route to healthy nodes
.setStickySessions(false) // Don't use sticky sessions
.setFailoverEnabled(true) // Enable automatic failover
.setMaxRetries(3) // Retry failed requests 3 times
.setRetryDelay(100) // 100ms between retries
.setCircuitBreakerEnabled(true); // Enable circuit breaker
System.out.println("βοΈ Load Balancer Configuration:");
System.out.println("- Algorithm: " + loadBalancer.getAlgorithm());
System.out.println("- Health Check: " + loadBalancer.isHealthCheckEnabled());
System.out.println("- Failover: " + loadBalancer.isFailoverEnabled());
System.out.println("- Max Retries: " + loadBalancer.getMaxRetries());
// Configure health checker
// Monitors node health and removes unhealthy nodes from load balancing
var healthChecker = new HealthChecker()
.setCheckInterval(15) // Check every 15 seconds
.setTimeoutPerCheck(5) // 5 second timeout per check
.setUnhealthyThreshold(2) // 2 failures = unhealthy
.setHealthyThreshold(3) // 3 successes = healthy
.setMetricsEnabled(true) // Collect health metrics
.setAlertingEnabled(true); // Enable health alerts
// Create inference cluster
// The cluster coordinates all nodes, load balancing, and health checking
var cluster = new InferenceCluster()
.setNodeManager(nodeManager)
.setLoadBalancer(loadBalancer)
.setHealthChecker(healthChecker)
.setModelReplicationEnabled(true) // Replicate models across nodes
.setAutoScalingEnabled(true) // Enable auto-scaling
.setMinNodes(2) // Minimum 2 nodes
.setMaxNodes(10) // Maximum 10 nodes
.setScaleUpThreshold(0.8) // Scale up at 80% capacity
.setScaleDownThreshold(0.3); // Scale down at 30% capacity
System.out.println("\nπ Cluster Configuration:");
System.out.println("- Auto-scaling: " + cluster.isAutoScalingEnabled());
System.out.println("- Min Nodes: " + cluster.getMinNodes());
System.out.println("- Max Nodes: " + cluster.getMaxNodes());
System.out.println("- Scale Up Threshold: " + (cluster.getScaleUpThreshold() * 100) + "%");
// Start cluster
System.out.println("\nπ Starting inference cluster...");
cluster.start();
// Wait for all nodes to be ready
System.out.println("β³ Waiting for nodes to be ready...");
cluster.waitForNodesReady(60); // Wait up to 60 seconds
// Load and distribute models
System.out.println("\nπ₯ Loading and distributing models...");
var model = ModelPersistence.load("models/production_model.superml");
cluster.deployModel("production-model", model);
System.out.println("β
Cluster started successfully!");
System.out.println("π Cluster Endpoint: http://localhost:8080/predict");
System.out.println("π Cluster Status: " + cluster.getClusterStatus());
// Simulate load testing
System.out.println("\nπ§ͺ Load Testing Cluster...");
loadTestCluster(cluster);
// Monitor cluster performance
System.out.println("\nπ Monitoring cluster performance...");
monitorClusterPerformance(cluster);
// Test failover scenarios
System.out.println("\nπ Testing failover scenarios...");
testFailoverScenarios(cluster);
} catch (Exception e) {
System.err.println("β Error in distributed inference cluster: " + e.getMessage());
e.printStackTrace();
}
}
private static void loadTestCluster(InferenceCluster cluster) {
try {
// Simulate high load
int numRequests = 10000;
int concurrentThreads = 50;
System.out.println("π₯ Load Test Configuration:");
System.out.println("- Total Requests: " + numRequests);
System.out.println("- Concurrent Threads: " + concurrentThreads);
var executor = java.util.concurrent.Executors.newFixedThreadPool(concurrentThreads);
var startTime = System.currentTimeMillis();
var completedRequests = new java.util.concurrent.atomic.AtomicInteger(0);
var failedRequests = new java.util.concurrent.atomic.AtomicInteger(0);
// Submit requests
for (int i = 0; i < numRequests; i++) {
final int requestId = i;
executor.submit(() -> {
try {
double[][] testData = {{Math.random(), Math.random(), Math.random()}};
var result = cluster.predict("production-model", testData);
if (result != null) {
completedRequests.incrementAndGet();
} else {
failedRequests.incrementAndGet();
}
} catch (Exception e) {
failedRequests.incrementAndGet();
}
});
}
// Wait for completion
executor.shutdown();
executor.awaitTermination(5, java.util.concurrent.TimeUnit.MINUTES);
long totalTime = System.currentTimeMillis() - startTime;
System.out.println("\nπ Load Test Results:");
System.out.println("- Completed Requests: " + completedRequests.get());
System.out.println("- Failed Requests: " + failedRequests.get());
System.out.println("- Success Rate: " +
String.format("%.2f%%", (completedRequests.get() * 100.0) / numRequests));
System.out.println("- Total Time: " + totalTime + " ms");
System.out.println("- Throughput: " +
String.format("%.0f requests/sec", numRequests * 1000.0 / totalTime));
} catch (Exception e) {
System.err.println("β Error in load testing: " + e.getMessage());
}
}
private static void monitorClusterPerformance(InferenceCluster cluster) {
try {
var monitor = cluster.getClusterMonitor();
System.out.println("π₯οΈ Cluster Performance Metrics:");
System.out.println("- Active Nodes: " + monitor.getActiveNodes());
System.out.println("- Total Requests: " + monitor.getTotalRequests());
System.out.println("- Requests Per Second: " + monitor.getRequestsPerSecond());
System.out.println("- Average Response Time: " + monitor.getAverageResponseTime() + " ms");
System.out.println("- P95 Response Time: " + monitor.getP95ResponseTime() + " ms");
System.out.println("- P99 Response Time: " + monitor.getP99ResponseTime() + " ms");
System.out.println("- Error Rate: " + String.format("%.2f%%", monitor.getErrorRate() * 100));
System.out.println("\nπΎ Resource Utilization:");
System.out.println("- CPU Usage: " + String.format("%.1f%%", monitor.getAverageCPUUsage()));
System.out.println("- Memory Usage: " + String.format("%.1f%%", monitor.getAverageMemoryUsage()));
System.out.println("- Network I/O: " + monitor.getNetworkIO() + " MB/s");
} catch (Exception e) {
System.err.println("β Error monitoring cluster: " + e.getMessage());
}
}
private static void testFailoverScenarios(InferenceCluster cluster) {
try {
System.out.println("π Testing Node Failover...");
// Simulate node failure
cluster.simulateNodeFailure("node-1");
// Continue making requests
for (int i = 0; i < 100; i++) {
try {
double[][] testData = {{Math.random(), Math.random(), Math.random()}};
var result = cluster.predict("production-model", testData);
if (i % 20 == 0) {
System.out.println("- Request " + i + ": " +
(result != null ? "Success" : "Failed"));
}
} catch (Exception e) {
System.out.println("- Request " + i + ": Failed - " + e.getMessage());
}
}
// Recover node
cluster.recoverNode("node-1");
System.out.println("β
Failover test completed");
System.out.println("π Active Nodes: " + cluster.getActiveNodeCount());
} catch (Exception e) {
System.err.println("β Error in failover testing: " + e.getMessage());
}
}
}
Key Learning Points:
- Distributed Architecture: Scale inference across multiple nodes for high availability
- Load Balancing: Distribute requests efficiently across healthy nodes
- Health Monitoring: Continuously monitor node health and remove unhealthy nodes
- Auto-scaling: Automatically scale the cluster based on demand
- Failover Handling: Gracefully handle node failures without service interruption
- Performance Monitoring: Track cluster performance and resource utilization
Best Practices
1. Performance Optimization
- Model Optimization: Use quantization and pruning to reduce model size
- Batch Processing: Process multiple samples together for higher throughput
- GPU Acceleration: Leverage GPU computing for significant speedups
- Memory Management: Use memory pooling and efficient data structures
2. Production Deployment
- Health Checks: Implement comprehensive health monitoring
- Load Balancing: Distribute traffic across multiple instances
- Auto-scaling: Scale based on demand and resource utilization
- Monitoring: Track performance metrics and resource usage
3. Error Handling
- Graceful Degradation: Continue service even with partial failures
- Circuit Breakers: Prevent cascading failures
- Retry Logic: Implement intelligent retry mechanisms
- Fallback Models: Maintain backup models for critical services
4. Security and Compliance
- Authentication: Secure API endpoints with proper authentication
- Rate Limiting: Prevent abuse with rate limiting
- Data Privacy: Ensure data privacy and compliance requirements
- Audit Logging: Log all inference requests for audit purposes
Summary
In this tutorial, you learned:
- Real-Time Inference: Build high-performance inference servers with sub-millisecond latency
- Batch Processing: Process large datasets efficiently with parallel processing
- Model Optimization: Optimize models for production deployment
- Distributed Systems: Scale inference across multiple nodes with load balancing
- Performance Monitoring: Monitor and optimize inference performance
- Production Patterns: Enterprise-grade deployment patterns and best practices
The SuperML Inference Engine provides the foundation for building scalable, high-performance machine learning services that can handle production workloads with reliability and efficiency.
Next Steps
- Model Deployment: Learn production deployment strategies
- Enterprise Patterns: Explore advanced enterprise integration patterns
- ML Optimization: Optimize models for specific hardware platforms
- MLOps Integration: Integrate with CI/CD and monitoring systems
- Real-time Streaming: Build real-time streaming inference pipelines
Youβre now ready to deploy machine learning models at scale with the SuperML Inference Engine!