Alibaba Cloud launches Eigen+ to cut costs and boost reliability for enterprise databases

Thursday, July 3, 2025, 01:33 PM, from InfoWorld

Alibaba Cloud has developed a new cluster management system called Eigen+ that achieved a 36% improvement in memory allocation efficiency while eliminating Out of Memory (OOM) errors in production database environments, according to research presented at the recent SIGMOD conference.

The system addresses a fundamental challenge facing cloud providers: how to maximize memory utilization to reduce costs while avoiding catastrophic OOM errors that can crash critical applications and violate Service Level Objectives (SLOs).

The development, detailed in a research paper titled “Eigen+: Memory Over-Subscription for Alibaba Cloud Databases,” represents a significant departure from traditional memory over-subscription approaches used by major cloud providers, including AWS, Microsoft Azure, and Google Cloud Platform.

The system has been deployed in Alibaba Cloud’s production environment. The research paper claimed that in online MySQL clusters, Eigen+ “improves the memory allocation ratio of an online MySQL cluster by 36.21% (from 75.67% to 111.88%) on average, while maintaining SLO compliance with no OOM occurrences.”

For enterprise IT leaders, these numbers can translate into significant cost savings and improved reliability. The gain of roughly 36 percentage points in memory allocation ratio (from 75.67% to 111.88%) means organizations can run more database instances on the same hardware while actually reducing the risk of outages.

Eigen+ takes a classification-based approach to memory management, whereas peers such as AWS, Microsoft Azure, and Google Cloud rely primarily on prediction-based strategies that, while effective, may not fully prevent OOM errors, explained Kaustubh K, practice director at Everest Group. “This difference in approach can position Alibaba Cloud’s Eigen+ with a greater technical differentiation in the cloud database market, potentially influencing future strategies of other hyperscalers.”

The technology is currently deployed across thousands of database instances in Alibaba Cloud’s production environment, supporting both online transaction processing (OLTP) workloads using MySQL and online analytical processing (OLAP) workloads using AnalyticDB for PostgreSQL, according to Alibaba researchers.

The memory over-subscription risk

Memory over-subscription — allocating more memory to virtual machines than physically exists — has become standard practice among cloud providers because VMs rarely use their full allocated memory simultaneously. However, this practice creates a dangerous balancing act for enterprises running mission-critical databases.
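As a rough illustration of the idea, a host's allocation ratio can be computed as the memory promised to its instances divided by the physical memory installed; a ratio above 100% means the host is over-subscribed. The function name and the figures below are hypothetical and are not taken from the paper.

```python
def allocation_ratio(allocated_gb, physical_gb):
    """Return memory promised to instances as a fraction of physical memory."""
    if physical_gb <= 0:
        raise ValueError("physical_gb must be positive")
    return sum(allocated_gb) / physical_gb


# Hypothetical host: 256 GB of RAM, nine instances sold 32 GB each.
ratio = allocation_ratio([32] * 9, 256)
print(f"{ratio:.1%}")  # 112.5%, i.e. over-subscribed
```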

“[While] memory over-subscription enhances resource utilization by allowing more instances per machine, it increases the risk of Out of Memory (OOM) errors, potentially compromising service availability and violating Service Level Objectives (SLOs),” the researchers noted in their paper.

The stakes are particularly high for enterprise databases. “The figure clearly demonstrates that service availability declines significantly, often falling below the SLO threshold as the number of OOM events increases,” the researchers wrote, referring to a chart in their paper.

Traditional approaches attempt to predict future memory usage based on historical data, then use complex algorithms to pack database instances onto servers. But these prediction-based methods often fail catastrophically when workloads spike unexpectedly.

“Eliminating Out of Memory (OOM) errors is critical for enterprise IT leaders, as such errors can lead to service disruptions and data loss,” Everest Group’s Kaustubh said. “While improvements in memory allocation efficiency are beneficial, ensuring system stability and reliability remains paramount. Enterprises should assess their cloud providers’ real-time monitoring capabilities, isolation mechanisms to prevent cross-tenant interference, and proactive mitigation techniques such as live migration and memory ballooning to handle overloads without service disruption. Additionally, clear visibility into oversubscription policies and strict adherence to Service Level Agreements (SLAs) are essential to maintain consistent performance and reliability.”

The Pareto Principle solution

Rather than trying to predict the unpredictable, Alibaba Cloud’s research team discovered that database OOM errors follow the Pareto Principle—also known as the 80/20 rule. “Database instances with memory utilization changes exceeding 5% within a week constitute no more than 5% of all instances, yet these instances lead to more than 90% of OOM errors,” the team said in the paper.
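The 5% rule quoted above can be sketched in a few lines. This is a simplified reading, not the paper's definition: it assumes "change" means the gap between the highest and lowest utilization sample in a week, measured in percentage points. The function name and sample data are hypothetical.

```python
def is_transient(weekly_utilization, threshold=5.0):
    """Flag an instance whose utilization (in percent) swings by more than
    `threshold` points within the week.

    Treating "change" as max minus min is an assumption made for this
    sketch; the paper may define it differently.
    """
    if not weekly_utilization:
        raise ValueError("need at least one utilization sample")
    return max(weekly_utilization) - min(weekly_utilization) > threshold


samples = {
    "db-a": [40.0, 41.5, 40.8, 42.0],  # swing of 2.0 points
    "db-b": [35.0, 52.0, 38.0, 61.0],  # swing of 26.0 points
}
transient = [name for name, series in samples.items() if is_transient(series)]
print(transient)  # ['db-b']
```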

Instead of trying to forecast memory usage patterns, Eigen+ simply identifies which database instances are “transient” (prone to unpredictable memory spikes) and excludes them from over-subscription policies.

“By identifying transient instances, we can convert the complex problem of prediction into a more straightforward binary classification task,” the researchers said in the paper.
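One way to picture the effect of that binary label on placement: a host reserves the full allocation for any instance flagged as transient, and only the estimated usage for steady ones. The sketch below is an illustration under that assumption, not Eigen+'s actual placement logic; the dictionary keys and numbers are hypothetical.

```python
def reserved_memory(instances):
    """Sum the memory a host must set aside for a list of instances.

    Each instance is a dict with hypothetical keys:
      allocated_gb  - memory sold to the customer
      estimated_gb  - expected peak usage (used only for steady instances)
      transient     - True if the classifier flagged the instance
    """
    total = 0.0
    for inst in instances:
        if inst["transient"]:
            # Transient instances are excluded from over-subscription.
            total += inst["allocated_gb"]
        else:
            # Steady instances reserve only their estimated usage.
            total += min(inst["estimated_gb"], inst["allocated_gb"])
    return total


host = [
    {"allocated_gb": 32, "estimated_gb": 12, "transient": False},
    {"allocated_gb": 32, "estimated_gb": 20, "transient": False},
    {"allocated_gb": 16, "estimated_gb": 6, "transient": True},
]
print(reserved_memory(host))  # 48.0
```

In this example the host has sold 80 GB but reserves only 48 GB, and the one risky instance keeps its full 16 GB.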

Eigen+ employs machine learning classifiers trained on both runtime metrics (memory utilization, queries per second, CPU usage) and operational metadata (instance specifications, customer tier, application types) to identify potentially problematic database instances.

The system also uses Markov chain state transition models to account for temporal dependencies in database behavior. “This allows it to achieve high accuracy in identifying transient instances that could cause OOM errors,” the paper added.
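The article does not say how the transition model feeds into the classifier. As a minimal sketch of the underlying idea only, the code below estimates first-order Markov transition probabilities from a sequence of discretized utilization states. The state labels and the history are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate first-order Markov transition probabilities from a sequence."""
    counts = defaultdict(Counter)
    # Count each observed (current state -> next state) pair.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts into probabilities.
    return {
        state: {target: n / sum(row.values()) for target, n in row.items()}
        for state, row in counts.items()
    }


history = ["low", "low", "high", "low", "low", "low", "high", "high"]
probs = transition_probabilities(history)
print(probs["low"])  # {'low': 0.6, 'high': 0.4}
```

A high probability of jumping from "low" to "high" could then serve as one input feature when labeling an instance as transient.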

For steady instances deemed safe for over-subscription, the system employs multiple estimation methods, including percentile analysis, stochastic bin packing, and time series forecasting, depending on each instance’s specific usage patterns.
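Of the three methods named, percentile analysis is the simplest to show. The sketch below uses the nearest-rank definition of a percentile to pick a memory reservation that covers most observed samples. The choice of the 90th percentile and the usage figures are assumptions for illustration, not values from the paper.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p percent
    of all samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical memory usage samples for one steady instance, in GB.
usage_gb = [10.2, 10.5, 11.0, 10.8, 10.4, 12.1, 10.9, 10.7, 10.6, 11.3]
print(percentile(usage_gb, 90))  # 11.3
```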

Quantitative SLO modeling

Perhaps most importantly for enterprise environments, Eigen+ includes a quantitative model for understanding how memory over-subscription affects service availability. Using quadratic logistic regression, the system can determine precise memory utilization thresholds that maintain target SLO compliance levels.
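To make the threshold idea concrete, suppose the probability of an SLO violation at machine-level utilization u is modeled as sigmoid(b0 + b1*u + b2*u^2). Setting that probability equal to a target and taking the logit leaves an ordinary quadratic in u. The sketch below solves it; the model form, the coefficients, and the target are all hypothetical and are not the fitted values from the paper.

```python
import math


def utilization_threshold(b0, b1, b2, target_prob):
    """Return the utilization u in [0, 1] at which
    sigmoid(b0 + b1*u + b2*u**2) equals target_prob.

    Assumes b1 >= 0 and b2 > 0, so the violation probability rises with u
    for u >= 0 and the larger root of the quadratic is the crossing point.
    """
    if b2 <= 0 or b1 < 0:
        raise ValueError("sketch assumes b1 >= 0 and b2 > 0")
    if not 0 < target_prob < 1:
        raise ValueError("target_prob must be in (0, 1)")
    logit = math.log(target_prob / (1 - target_prob))
    c = b0 - logit  # solve b2*u^2 + b1*u + c = 0
    disc = b1 * b1 - 4 * b2 * c
    if disc < 0:
        raise ValueError("target cannot be met at any utilization")
    root = (-b1 + math.sqrt(disc)) / (2 * b2)
    if root < 0:
        raise ValueError("target is already exceeded at zero utilization")
    return min(root, 1.0)


# Hypothetical coefficients and a 0.1% violation target.
u_max = utilization_threshold(-10.0, 2.0, 6.0, 0.001)
print(f"keep machine-level utilization below {u_max:.2%}")
```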

“Using the quadratic logistic regression model, we solve for the machine-level memory utilization (