Why One Database Language Isn’t Enough Anymore
In the world of data, variety isn't just the spice of life: it's the whole recipe. Modern analytic workloads juggle a mix of data types: neat tables of rows and columns, messy JSON documents, and sprawling multi-dimensional arrays like those used in machine learning. Traditionally, databases have been monolingual, fluent in just one data model. But today's complex workloads demand a polyglot approach.
Researchers at Seoul National University, led by Kyoseung Koo, Bogyeong Kim, and Bongki Moon, have tackled this challenge head-on with M2, a new analytic system designed to handle multiple data models simultaneously—without the usual performance headaches.
The Multi-Model Puzzle: Polyglot Persistence vs. Single-Engine Systems
There are two main ways to handle multi-model data. The first is polyglot persistence, where different specialized databases each handle their own data type, coordinated by a central program. It’s like having a team of translators, each fluent in one language, passing messages through a middleman. The downside? The middleman becomes a bottleneck, and the back-and-forth communication slows everything down.
The second approach is a single-engine multi-model database, which tries to cram all data types into one storage engine. Imagine a translator who knows a bit of everything but isn’t an expert in any language. This leads to inefficient processing because the engine isn’t optimized for any particular data model.
M2 aims to combine the best of both worlds by tightly integrating multiple specialized storage engines into one system. It treats each data model as a first-class citizen, assigning queries to the engine best suited for the job, all while minimizing the costly communication overhead.
How M2 Speaks Multiple Data Languages at Once
M2 currently supports three main data models: relational (think SQL tables), document-oriented (like JSON), and array (multi-dimensional data used in scientific computing and AI). It uses two underlying engines: an enhanced version of DuckDB for relational and document data, and PreVision for array data.
When a query involves multiple data models, M2 breaks it down into partitions, each handled by the appropriate engine. A special “bridge” module facilitates communication between these engines, converting data formats and coordinating operations that span models.
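To make the partitioning idea concrete, here is a minimal sketch in Python of how a query plan might be split into per-engine runs, with cross-model operators routed to a bridge. The operator names, engine labels, and the linear-plan representation are all illustrative assumptions, not M2's actual API.

```python
# Hypothetical sketch: splitting a multi-model query plan into partitions,
# each assigned to the engine best suited for it. Operator and engine names
# are invented for illustration.

RELATIONAL_OPS = {"scan_table", "filter", "project"}
ARRAY_OPS = {"scan_array", "window", "aggregate_cells"}

def assign_engine(op):
    if op in RELATIONAL_OPS:
        return "duckdb"       # relational/document partition
    if op in ARRAY_OPS:
        return "prevision"    # array partition
    return "bridge"           # cross-model operator spanning both engines

def partition(plan):
    """Group a linear list of operators into consecutive per-engine runs."""
    partitions = []
    for op in plan:
        engine = assign_engine(op)
        if partitions and partitions[-1][0] == engine:
            partitions[-1][1].append(op)   # extend the current partition
        else:
            partitions.append((engine, [op]))
    return partitions

plan = ["scan_table", "filter", "cross_model_join", "scan_array", "window"]
print(partition(plan))
# → [('duckdb', ['scan_table', 'filter']),
#    ('bridge', ['cross_model_join']),
#    ('prevision', ['scan_array', 'window'])]
```

The point of the sketch is the shape of the design: each partition runs on its native engine, and only the bridge operators pay any cross-engine coordination cost.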
One of the standout innovations is M2's multi-stage hash join (MSHJ), a clever algorithm designed to efficiently join data across different models, especially between arrays and tables, without expensive data copying or format conversions. This join method leverages the spatial organization of array data to minimize disk reads, speeding up the process dramatically.
Why Multi-Stage Hash Join Feels Like Magic
Joining data from different models is like trying to match puzzle pieces from different sets. Arrays are stored in tiled blocks optimized for spatial locality, while relational data is unordered. Traditional join methods either convert arrays into tables (costly and slow) or scan arrays repeatedly (inefficient).
MSHJ sidesteps these issues by sorting and bucketing relational data to align with the array’s tiled structure. It then sequentially probes only the necessary tiles, ensuring each tile is accessed just once. This approach drastically reduces disk I/O and avoids unnecessary data copying.
In practical terms, MSHJ can speed up multi-model joins by orders of magnitude compared to conventional methods, especially when dealing with large, multi-dimensional datasets.
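The two stages described above can be sketched in a few lines of Python. This is a toy illustration in the spirit of MSHJ, not the authors' implementation: it assumes a 2-D array stored in square tiles and a table whose join keys are array coordinates, then buckets the rows by tile and probes each needed tile exactly once.

```python
# Toy sketch of a multi-stage hash join over a tiled array (illustrative,
# not M2's code). Stage 1 buckets table rows by the tile their coordinate
# falls in; stage 2 sweeps the needed tiles once, in order.

from collections import defaultdict

TILE = 2  # tile side length (hypothetical)

def tile_id(coord):
    """Map an (i, j) array coordinate to the tile that stores it."""
    i, j = coord
    return (i // TILE, j // TILE)

def mshj(rows, read_tile):
    buckets = defaultdict(list)
    for row in rows:                      # stage 1: partition rows by tile
        buckets[tile_id(row["coord"])].append(row)

    results = []
    for tid in sorted(buckets):           # stage 2: one sequential tile pass
        tile = read_tile(tid)             # each tile is read exactly once
        for row in buckets[tid]:
            cell = tile.get(row["coord"])
            if cell is not None:
                results.append({**row, "value": cell})
    return results

# Toy array: each tile maps {coord: value} for the cells it holds.
tiles = {
    (0, 0): {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0},
    (1, 1): {(2, 2): 9.0},
}
rows = [{"id": "a", "coord": (0, 1)}, {"id": "b", "coord": (2, 2)}]
print(mshj(rows, lambda t: tiles.get(t, {})))
```

The key property is in stage 2: because rows are grouped to match the array's tiling before any probing starts, no tile is ever fetched twice, which is where the disk I/O savings come from.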
Putting M2 to the Test: Performance That Turns Heads
The Seoul National University team evaluated M2 using M2Bench, a benchmark suite designed for multi-model analytic workloads. They compared M2 against popular multi-model databases and polyglot persistence setups involving multiple independent systems.
The results were striking. M2 outperformed all competitors by up to 188 times in execution speed on complex multi-model queries. This leap came from its integrated architecture, specialized engines, and the efficiency of MSHJ.
Moreover, M2’s unified buffer pool—a shared memory space for both storage engines—improved memory utilization and reduced costly disk spills, especially during iterative machine learning tasks involving large matrices.
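A unified buffer pool of this kind can be illustrated with a small Python sketch. This is an assumed design, not M2's code: a single LRU pool caches both relational pages and array tiles, so memory naturally shifts toward whichever engine is hot, instead of being fenced off per engine.

```python
# Toy unified buffer pool shared by two engines (illustrative assumption,
# not M2's implementation): one LRU cache holds pages from both engines,
# and evictions pick the coldest page regardless of which engine owns it.

from collections import OrderedDict

class UnifiedBufferPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()       # (engine, page_id) -> data
        self.evictions = 0

    def get(self, engine, page_id, load):
        key = (engine, page_id)
        if key in self.pages:
            self.pages.move_to_end(key)  # LRU hit: mark as most recent
            return self.pages[key]
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)   # evict coldest, any engine
            self.evictions += 1
        self.pages[key] = load(page_id)
        return self.pages[key]

pool = UnifiedBufferPool(capacity=3)
pool.get("array", 1, lambda p: f"tile{p}")
pool.get("relational", 1, lambda p: f"page{p}")
pool.get("array", 2, lambda p: f"tile{p}")
pool.get("array", 3, lambda p: f"tile{p}")   # evicts the coldest entry
print(list(pool.pages))
```

With separate fixed-size pools, an iterative matrix workload would spill to disk as soon as its own pool filled, even while the other engine's memory sat idle; a shared pool avoids exactly that.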
What This Means for the Future of Data Analytics
M2 represents a significant step toward truly versatile database systems that can handle the messy, multi-faceted data of today’s world without sacrificing performance. By treating each data model as a peer and leveraging specialized engines in harmony, M2 breaks down the silos that have long hampered multi-model analytics.
For data scientists, engineers, and businesses, this means faster insights, more seamless workflows, and the ability to tackle complex problems that span structured tables, nested documents, and high-dimensional arrays—all within a single, coherent system.
While M2 is still a prototype, its architecture opens exciting avenues for future research: smarter query optimizations, support for even more data models like graphs, and richer multi-model query languages.
In a World of Data Babel, M2 Offers a Common Tongue
As data grows more diverse and interconnected, the ability to process multiple data models efficiently is no longer a luxury—it’s a necessity. The work from Seoul National University shows that with thoughtful integration and innovative algorithms like multi-stage hash join, we can build systems that not only speak many data languages but do so fluently and fast.
In the end, M2 isn’t just a database system; it’s a bridge across the data divide, promising a future where the complexity of data models no longer slows down discovery but fuels it.