I’ve been giving a bunch of talks at various conferences on the merging of the data warehouse and data lake into a single unified analytics platform. I inevitably get one question: “How is this different from a lakehouse?” There are two answers: a short one that’s glib and easy, and a longer one that really dives into things. The short answer: “They’re extremely similar architectural concepts.” The rest of this post is the long answer.
Race to the middle
Essentially, these two new architecture concepts came from a huge race to the middle. Every data lake vendor is racing to add data warehouse-like capabilities, and every data warehouse vendor is racing to add data lake-like capabilities.
Unified Analytics Platform
I did a whole article on the evolution of the modern data warehouse, so I won’t repeat myself here. Essentially, over the last decade since the data lake started grabbing headlines, analytical databases have been adding data lake-like capabilities to counter their previous weaknesses. The first thing they did was fix the affordable scalability problem by becoming distributed with massively parallel processing (MPP) engines, just like the data lakes. They added support for streaming data, schema-on-read for semi-structured data, and the ability to query data in formats other than their own. They also separated out data storage to utilize object storage as well as distributed file systems. I can use a database to query Parquet files in an S3 bucket on AWS now, with database-level resource efficiency and outstanding performance.
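To make that concrete, here’s a minimal sketch of what external-format querying can look like from Python, assuming a Vertica-style external table over Parquet in S3 (and S3 credentials already configured on the database side); the connection details, table, columns, and bucket path are all hypothetical:

```python
import vertica_python

# Hypothetical connection details for an analytical database.
conn_info = {
    "host": "analytics.example.com",
    "port": 5433,
    "user": "dbadmin",
    "password": "***",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # Define an external table over Parquet files sitting in S3; the data
    # stays in the bucket, and the database queries it in place.
    cur.execute("""
        CREATE EXTERNAL TABLE sales_ext (
            sale_date DATE,
            region    VARCHAR(32),
            amount    FLOAT
        ) AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET
    """)
    # From here on, it's an ordinary SQL table to the optimizer.
    cur.execute("SELECT region, SUM(amount) FROM sales_ext GROUP BY region")
    print(cur.fetchall())
```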
Analytical databases (like Vertica, where I work) also embedded advanced analytics capabilities like geospatial analysis, time series, and machine learning – not just the algorithms, but the whole shebang from statistical analysis to model evaluation. They also added clients for frameworks like R and Python, so data scientists could do their work on whole datasets without having to sample them down to in-memory, single-node size or get someone else to productionize their models when they were done. Much of that work has been contributed by the open source community.
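As an illustration, here’s a sketch of what in-database machine learning can look like, assuming Vertica-style ML functions (LINEAR_REG for training, PREDICT_LINEAR_REG for scoring); the model, tables, and columns are hypothetical:

```python
import vertica_python

with vertica_python.connect(host="analytics.example.com", port=5433,
                            user="dbadmin", password="***",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Train a regression model inside the database, on the full table --
    # no sampling down to what fits in one machine's memory.
    cur.execute("""
        SELECT LINEAR_REG('price_model', 'sales_history',
                          'price', 'sq_footage, num_rooms')
    """)
    # Score new rows with the stored model, still inside the database,
    # so there's no separate productionization step.
    cur.execute("""
        SELECT PREDICT_LINEAR_REG(sq_footage, num_rooms
                                  USING PARAMETERS model_name='price_model')
        FROM new_listings
    """)
    print(cur.fetchall())
```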
Data warehouse vendors also took their big concurrency advantage one step further and isolated workload resources, so that streaming data ELT and data onboarding could run inside the database all the time without slowing down ad hoc BI queries or machine learning model training. That was essential to support streaming use cases, including Internet of Things (IoT) environments.
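Workload isolation is typically configured declaratively. A minimal sketch, assuming Vertica-style resource pools; the pool name, sizing, and user are hypothetical:

```python
import vertica_python

with vertica_python.connect(host="analytics.example.com", port=5433,
                            user="dbadmin", password="***",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Carve out a dedicated slice of memory for streaming ELT, so loading
    # can run continuously without starving interactive BI queries.
    cur.execute("CREATE RESOURCE POOL etl_pool MEMORYSIZE '8G' PRIORITY 10")
    # Pin the loading user to that pool; ad hoc users keep the default pool.
    cur.execute("ALTER USER etl_user RESOURCE POOL etl_pool")
```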
That’s a unified analytics platform – a distributed data warehouse plus support for many kinds of data (structured, semi-structured, and streaming), Python and R as well as SQL clients, and in-database machine learning, geospatial, and other advanced analytics built-in.
Data Lakehouse
Okay, so what’s a data lakehouse, then?
Well, as soon as data lakes got out into the world, they improved, like any software stack that gets adoption. Since they were an open source stack, data lakes had even more developers working on them. The first thing they did was replace MapReduce, which was incredibly slow and limited, with Apache Spark as the main data processing engine. Data lakes were still not as highly optimized as data warehouse databases in terms of response latency, but the gap got smaller.
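For a sense of why Spark won out, here’s a minimal PySpark sketch of the kind of job that once required hand-written MapReduce; the bucket path and column name are hypothetical:

```python
from pyspark.sql import SparkSession

# One in-memory, DAG-scheduled engine replaces chains of MapReduce jobs.
spark = SparkSession.builder.appName("lake-aggregation").getOrCreate()

# Read Parquet straight off object storage (hypothetical bucket).
events = spark.read.parquet("s3a://my-bucket/events/")

# A one-liner that would have been a full map + shuffle + reduce cycle.
events.groupBy("event_type").count().show()

spark.stop()
```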
Data lake vendors also rapidly saw that every single company that needed sophisticated machine learning also needed plain old business intelligence. An analytical platform that couldn’t be queried with SQL was at a huge disadvantage, so they added SQL query engines. Over time, those went from kind of terrible to pretty impressive. Presto, in my opinion, is the best of the lot. To accelerate SQL querying, it became clear that a structured format was needed. Data formats like Parquet and ORC are just as strictly structured as any database table. A metadata store is also necessary, both for ACID compliance and just to know where to find your data.
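That “strictly structured” claim is easy to check: every Parquet file carries a full embedded schema, with declared column names and types, much like a table’s DDL. A quick sketch using pyarrow; the file and the printed output are illustrative:

```python
import pyarrow.parquet as pq

# Read just the embedded schema, without loading any data.
schema = pq.read_schema("events.parquet")
print(schema)
# Illustrative output -- each column has a declared type:
# event_type: string
# user_id: int64
# ts: timestamp[ms]
```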
Over time, security and governance have been added to the data lakes as they’ve worked hard to catch up to the databases. And, as is the tradition with open source stacks, they added more components to do it. Some of these helped make it somewhat less arduous to get machine learning into production. This has made the stacks bigger, with even more components, but as a whole, they now have what a company needs to get the job done – both BI and advanced analytics like predictive modeling and machine learning.
That’s a data lakehouse – a Spark-based data lake plus support for SQL BI analysis, structured data, and ACID compliance, with added security and metadata management.
How do you choose?
Someone in a TDWI roundtable recently asked me when an organization would want one over the other, and I think that question gets to the heart of things.
It’s fundamentally a race to the middle, to a blended architecture with all the strengths of both. But each side tends to be stronger in certain areas because of where it started the race. Generally, I think that if you need more of the capabilities of a data lake, choose the lakehouse architecture, and if you need more of the capabilities of the data warehouse, choose the unified analytics platform.
That’s the short answer. There’s an obvious follow-up question: What capabilities?
Well, since the vendors selling lakehouse stacks are often historically data lake vendors, the lakehouse tends to be able to support more types of data. Unstructured data like sound, video, and image files are a good example. The ability to analyze that type of data is crucial for certain use cases like computer vision, or predictive maintenance based on engine sounds. If your company wants to tackle one of those use cases, then a lakehouse architecture is sensible.
On the other hand, response latency and concurrency are still data lake stack weaknesses, so if you need sub-second response on large data sets, the lakehouse is likely not the right choice. If you need more than ten people using your data simultaneously, or more than ten different workloads, the lakehouse is entirely the wrong choice. In my opinion, poor concurrency is the biggest weakness of lakehouse stacks, even more of an issue than stack complexity.
For unified stacks, the biggest strength is what you get from proven software – mainly reliability and security. These things have been considered basic table stakes for analytical databases for decades. Data lake stacks have been adding them as fast as they can, but data warehouses have had at least a ten-year head start, which makes feature parity difficult to achieve. Low-latency response is also more of a unified stack strength, for much the same reason. Databases have bragged for ages about the speed and efficiency of their query optimizers, which squeeze out more speed every year. If you need high performance, a unified analytics platform is more likely to be the right choice.
The biggest advantage of the unified stack, though, is the same as the biggest disadvantage of the lakehouse stack: concurrency. If you need to make analytics available to a large percentage of your organization, or you need to do multiple things with analytics – reporting, machine learning, dashboards, applications, ad hoc queries – then the unified analytics platform is the way to go.
As with most architecture decisions, think about what you need to accomplish first, then decide.