Today at its Data Universe event, Starburst launched Icehouse, a new managed lakehouse offering built on the Apache Iceberg table format. Starburst says the combination of the Trino query engine and Iceberg tables will empower Icehouse customers to achieve new efficiencies in data storage and retrieval.
Apache Iceberg is gaining momentum as the standard table format for a new generation of data lakehouses, thanks to its support for ACID transactions and other features that bolster data correctness and usability in busy data analytics environments. While Iceberg can simplify life for data engineers and analysts, actually setting up and running Iceberg in production is not necessarily easy.
“People struggle with Iceberg because it’s hard to manage, it’s hard to set up, it’s hard to get data into, and it’s hard to optimize that data for performance,” Starburst vice president of product marketing Jay Chen tells Datanami. “What this [Icehouse] announcement does is help people get there faster, more easily, without having the headaches of trying to set it all up themselves.”
Just setting up Iceberg can be a challenge, he says. Customers must make decisions regarding table structures, partitioning, compaction, and cleanup. With Icehouse, Starburst takes those decisions out of the customers’ hands and implements a basic Iceberg service that will fit the needs of most customers.
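To give a sense of the housekeeping involved, here is a minimal sketch of the kind of maintenance statements a team running Iceberg on open source Trino might schedule for itself. The `ALTER TABLE ... EXECUTE` commands come from the open source Trino Iceberg connector; the helper function, the table name, and the threshold values are hypothetical, and nothing here is confirmed to be what Icehouse runs under the hood.

```python
from typing import List


def maintenance_statements(
    table: str,
    file_size_threshold: str = "128MB",  # assumed value, tune per workload
    retention: str = "7d",               # assumed value, tune per workload
) -> List[str]:
    """Build Trino SQL for routine Iceberg table upkeep (illustrative only).

    The returned strings would still need to be submitted to Trino by a
    client or scheduler, which is outside the scope of this sketch.
    """
    return [
        # Compaction: rewrite files smaller than the threshold into larger ones.
        f"ALTER TABLE {table} EXECUTE optimize(file_size_threshold => '{file_size_threshold}')",
        # Cleanup: drop snapshots older than the retention window.
        f"ALTER TABLE {table} EXECUTE expire_snapshots(retention_threshold => '{retention}')",
        # Cleanup: delete files no longer referenced by any snapshot.
        f"ALTER TABLE {table} EXECUTE remove_orphan_files(retention_threshold => '{retention}')",
    ]


if __name__ == "__main__":
    # "lake.sales.orders" is a made-up catalog.schema.table name.
    for statement in maintenance_statements("lake.sales.orders"):
        print(statement)
```

Someone still has to decide how often to run each statement, and with what thresholds, for every table. That is the sort of decision a managed service takes on.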
That complexity is not to take anything away from Iceberg itself. The co-creator of Iceberg, Ryan Blue, who developed Iceberg at Netflix in part to improve access to HDFS-based data from Presto (from which Trino was forked), has built a similar commercial offering to manage Iceberg and store data on behalf of customers via his startup Tabular. Starburst, like Tabular and other companies, is betting that the advantages that Iceberg brings to developers in terms of data consistency and integrity are worth the slight bit of pain that comes from setting up and managing an Iceberg environment.
“The people I talk to, they love Iceberg,” says Tobias Ternstrom, Starburst’s chief product officer. “It’s a very, very well-thought-through table format. But fundamentally, it’s a set of files, so there are things that you need to do outside of just having the files there. And I don’t think people are surprised.”
And then there are features that customers would like to have in their Iceberg-based lakehouses that frankly are outside of the table format’s spec. For instance, many customers want role-based access at the table level or at the column level. “That’s not something that Iceberg, per se, gives you,” Ternstrom says. “Something needs to sit on top to provide that.”
The Starburst Icehouse is based on Galaxy, the managed, cloud-based data lakehouse platform that Starburst has been selling for a number of years. Living on all the major clouds, Galaxy gives customers the capability to query data sitting in object storage (or other file systems or databases) using Trino, the open source query engine that emerged from Presto and which Starburst helps to develop.
In addition to handling access control and file management issues (compaction, clean-up, etc.), the Starburst Icehouse also offers data management and ingest capabilities. By connecting to Kafka topics or using change data capture (CDC) techniques, Starburst Icehouse can stream data into Iceberg tables, where it can be readily queried with Trino.
“Those are all things that you would have to stitch together into a solution before. Somehow you do data management. Somehow you get the data streamed in,” Ternstrom explains. “But I think that this is table stakes.”
Where Starburst is seeing a lot of excitement, he says, is integrating the whole data pipeline, from data ingest and data prep to materializing the data in Iceberg tables. When you factor in Iceberg’s built-in ACID support, this gives customers the capability to wind back data transactions (including data transformation steps) if something doesn’t look right downstream.
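The rollback capability rests on the fact that every Iceberg commit produces a new table snapshot, while earlier snapshots remain available until they are expired. The following is a toy Python model of that idea, not Iceberg's actual metadata layout or API; the class names, method names, and file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an ID plus the data files it contains."""
    snapshot_id: int
    files: FrozenSet[str]


@dataclass
class ToyTable:
    """Toy stand-in for a table whose state is a pointer into a snapshot log."""
    snapshots: List[Snapshot] = field(
        default_factory=lambda: [Snapshot(0, frozenset())]
    )
    current_id: int = 0

    def current(self) -> Snapshot:
        return next(s for s in self.snapshots if s.snapshot_id == self.current_id)

    def commit_append(self, new_files: Iterable[str]) -> int:
        """Record a new snapshot containing the current files plus new ones."""
        new_id = max(s.snapshot_id for s in self.snapshots) + 1
        merged = self.current().files | frozenset(new_files)
        self.snapshots.append(Snapshot(new_id, merged))
        self.current_id = new_id
        return new_id

    def rollback_to(self, snapshot_id: int) -> None:
        """Move the table pointer back; no data files are rewritten."""
        if not any(s.snapshot_id == snapshot_id for s in self.snapshots):
            raise ValueError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


if __name__ == "__main__":
    table = ToyTable()
    good = table.commit_append(["ingest-001.parquet"])
    table.commit_append(["bad-transform-002.parquet"])
    table.rollback_to(good)  # undo the transformation step
    print(sorted(table.current().files))
```

In this model, undoing a bad transformation is a pointer move rather than a data rewrite, which is why a downstream problem can be reversed cheaply.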
“It boils down to productivity,” Ternstrom says. “Where do you want to spend your time? Do you want to spend your time digging around in the weeds, or do you want to spend it on your business?”
Starburst is going into preview with Icehouse running on AWS and S3. Customers that are interested in participating in the preview should contact the vendor. When it becomes generally available, Icehouse will be supported as part of Galaxy on all the public clouds.
Icehouse won’t be a separate offering, but will become a part of Galaxy that’s activated whenever customers choose to store data in Iceberg tables. Of course, customers don’t have to choose Iceberg at all, which is part of Starburst’s mantra around being flexible and giving customers options.
Eventually, Starburst will likely adopt other table formats too, such as Apache Hudi and Databricks’ Delta Lake, Ternstrom says. But Starburst senses that the market is consolidating around Iceberg, he says, and so the company is moving to deliver an end-to-end Iceberg solution that gives customers the best experience.
“Our customers have been saying, ‘Hey, we love your service, we love Trino, we love Iceberg,’” he says. “‘But now I have to do all of these other things around Iceberg. Could you help us with that so we get a more integrated experience?’”
Asked and delivered.