Materialization and Acceleration in the Iceberg Lakehouse Era
Comparing Dremio, Trino, Starburst, Doris, StarRocks, ClickHouse, Snowflake & Databricks
As organizations increasingly embrace the lakehouse architecture, the stakes have never been higher for delivering fast, scalable, and cost-effective analytics directly on cloud data lakes. Whether you’re running business dashboards, powering ML pipelines, or exploring ad hoc queries, performance is no longer a nice-to-have; it’s an expectation.
That’s where materialization comes into play.
At a high level, materialization is the idea of pre-computing and storing data (aggregations, filtered subsets, joins) so that future queries don't have to start from scratch. Think of it as meal prepping for your data workloads: instead of chopping veggies every time you cook, you prep them once and reuse them as needed. It's all about speeding up the experience without compromising freshness or flexibility.
But not all materialization is created equal.
In this post, we’ll explore how a range of modern query engines, including Dremio, Trino, Starburst, Doris, StarRocks, ClickHouse, Snowflake, and Databricks, are tackling the challenge of query acceleration in the lakehouse era. From open-source federated engines to OLAP-native databases and enterprise cloud platforms, each system brings a unique perspective on:
Where the materialized data lives
Who manages its lifecycle
How queries are rewritten to use it
And how automated (or manual) the process is
What makes this moment especially exciting is that major vendors like Snowflake and Databricks are now leaning into open table formats like Apache Iceberg, blurring the lines between traditional warehouses and modern lakehouses. At the same time, technologies like Dremio’s autonomous Reflections, Starburst’s Warp Speed, and ClickHouse’s ingest-time views are pushing the boundaries of what transparent acceleration can look like.
Let’s unpack how these systems stack up and what their materialization strategies tell us about the future of the lakehouse.
The Lakehouse Revolution: Beyond Just Storage
The term “Lakehouse” has become one of the most talked-about ideas in data architecture, and for good reason. It promises the flexibility of data lakes (schema evolution, open formats, low-cost object storage) with the performance and usability of data warehouses (fast queries, governance, semantics).
At its core, the Lakehouse vision revolves around two pillars:
Open Table Formats like Apache Iceberg, Delta Lake, and Hudi, which bring structure, versioning, and transactional guarantees to files in cloud storage.
Decoupled Compute Engines that can query that data directly, without needing to move or transform it into another system first.
This is a big deal.
It means you can run analytics, data science, and machine learning directly on cloud storage (S3, ADLS, GCS) without duplicating data across systems. But with that freedom comes a challenge: how do we make these queries fast enough for real-world workloads?
That’s where materialization and acceleration become critical.
In the Lakehouse world, materialization isn’t just about traditional materialized views anymore. It’s a broader toolbox of techniques designed to speed things up while preserving the openness and scalability of the architecture:
Precomputed tables and aggregates for reuse (see the sketch after this list)
Automatic query rewrites that select the best optimized path
Caching and indexing layers that reduce I/O costs
Incremental updates that keep materializations fresh without reprocessing everything
Semantic layers that maintain consistency across teams
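To make the first of these concrete, here's a minimal, engine-agnostic sketch of precomputation, assuming a hypothetical sales fact table with order_date, region, and amount columns:

-- Precompute a daily rollup once, so dashboards scan far fewer rows.
CREATE TABLE sales_daily_agg AS
SELECT
  order_date,
  region,
  SUM(amount) AS total_amount,
  COUNT(*)    AS order_count
FROM sales
GROUP BY order_date, region;

-- Dashboard queries now hit the small rollup instead of the raw fact table.
SELECT region, SUM(total_amount) AS total_amount
FROM sales_daily_agg
WHERE order_date >= DATE '2025-01-01'
GROUP BY region;

Everything that follows, from Reflections to Warp Speed, is essentially a more automated, more integrated version of this idea.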
Some engines bake this into their core. Others treat it as a bolt-on. And increasingly, we’re seeing a spectrum of philosophies — ranging from user-managed views to fully autonomous systems that adapt based on query patterns.
And as more engines adopt or support Iceberg, including Snowflake and Databricks, the question isn't just whether you can accelerate lakehouse queries, but how you choose to do it:
Do you prioritize transparency or control?
Do you want the engine to figure it out for you, or do you want to tune it yourself?
Are you tied to a specific stack, or investing in open formats and multi-engine flexibility?
With that in mind, let’s dive into how today’s leading platforms approach materialization and acceleration, starting with Dremio.
Dremio: Autonomous Acceleration for the Open Lakehouse
Among the platforms aiming to deliver lightning-fast analytics on open data lakes, Dremio stands out for taking a deeply Lakehouse-native approach. Rather than bolting acceleration features onto a warehouse-centric engine, Dremio builds performance into the fabric of the Lakehouse, specifically through a combination of Reflections, smart caching, and autonomous Iceberg table optimization.
Reflections: More Than Materialized Views
Dremio’s flagship acceleration mechanism is called a Reflection, and while it shares DNA with traditional materialized views, it works at a higher level of abstraction. Reflections can represent:
Pre-aggregated data
Sorted or partitioned data subsets
Columnar projections of wide tables
But what makes them unique is how transparently and autonomously they work.
Users don’t query Reflections directly. Instead, Dremio’s query optimizer automatically rewrites queries to use the most performant Reflection available. It’s like having an intelligent assistant that figures out the fastest path to your data without you needing to touch any knobs.
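For illustration, here's roughly what defining an aggregation Reflection looks like in Dremio SQL; the table and columns are hypothetical, the exact syntax varies across Dremio versions, and Reflections can equally be created through the UI:

-- Hypothetical table and columns; verify the syntax against your Dremio version.
ALTER TABLE sales
CREATE AGGREGATE REFLECTION sales_by_region_day
USING
  DIMENSIONS (region, order_date)
  MEASURES (amount (SUM, COUNT));

-- Users keep querying the logical table as usual; if the optimizer finds a
-- matching Reflection, it substitutes it automatically.
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;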
This is especially powerful when working with Apache Iceberg.
Iceberg-Optimized, Autonomous by Default
When your data lives in Iceberg tables, Dremio kicks into high gear:
Live, Incremental Reflections: Unlike other sources, where Reflections may require full refreshes, on Iceberg Dremio supports incremental refreshes based on table metadata. Only changed data is reprocessed, making refreshes faster and more efficient.
Autonomous Refresh: Reflections on Iceberg tables can now be managed automatically. Dremio watches query patterns and source table changes to decide which Reflections to create, refresh, and remove, and when. It's like having a self-driving car for data optimization.
Integrated Table Optimization: For Iceberg tables managed via Dremio’s Enterprise catalog, Dremio can automatically optimize data to improve file organization, pruning, and scan efficiency, without user intervention.
Even when Iceberg tables live in external catalogs (like AWS Glue or Nessie), they still benefit from Dremio’s results cache and query plan cache, speeding up repeated queries by skipping redundant work.
Caching at Every Layer
Dremio doesn’t stop at query rewriting and materialized data. It also boosts performance through a multi-layered caching strategy:
Columnar Cloud Cache (C3): Frequently accessed files, or even parts of files, are cached on local Dremio nodes, reducing read latency from cloud storage (like S3 or ADLS).
Results Cache: Stores previous query results at the logical level, returning answers instantly when the same query runs again and no data has changed.
Query Plan Cache: Caches optimized query plans to reduce planning overhead on repeat or similar queries.
The result? Queries on the Lakehouse feel snappy, even when scanning terabytes of Iceberg data stored in object storage.
When Reflections Aren’t on Iceberg
While Dremio supports Reflections on other data sources like Hive or relational databases, those come with more manual effort:
Full Refreshes: Non-Iceberg Reflections generally require full re-computation.
Scheduled or Manual Triggers: You’ll need to define a refresh cadence or explicitly kick off refreshes.
Still useful, but not as hands-off as the Iceberg-native experience.
Dremio’s approach embraces the Lakehouse vision fully: open formats, autonomous optimization, and performance without complexity. For teams building on Apache Iceberg, especially those managing tables through Dremio’s catalog, the platform offers warehouse-like speed with data lake flexibility, and without locking you into a proprietary format or workflow.
Trino & Starburst: Federated Performance with Manual and Proprietary Boosters
Trino started as a federated SQL query engine, built to query across heterogeneous data sources without moving or duplicating data. Its power lies in its extensibility and flexibility; however, its approach to acceleration has traditionally been manual and connector-driven. Starburst, which builds on Trino, enhances this with materialization features and a proprietary performance layer called Warp Speed.
Together, they represent a spectrum, ranging from community-driven federation to enterprise-grade acceleration.
Trino (Open Source): Federation First, Materialization as an Add-On
At its core, Trino is designed to run SQL across diverse data sources: object storage, relational databases, NoSQL systems, and more. Materialization in Trino is explicit and user-managed.
Materialized Views in Trino
Defined using CREATE MATERIALIZED VIEW, Trino materialized views are precomputed results of SQL queries. These are persisted as physical tables in the underlying storage system connected via Trino connectors. For example:
If the connector is Hive, materialized views are stored in Hive-managed tables.
If the connector supports Iceberg, views can be persisted as Iceberg tables in object storage (e.g., S3, ADLS).
Trino itself does not store the materialized data — it delegates both storage and metadata to the connector backend.
Refresh Behavior
Refreshing materialized views is explicit. Users must run REFRESH MATERIALIZED VIEW, or schedule it via external orchestration tools like Apache Airflow, dbt, or cron. Most refreshes are full recomputations of the view's query logic.
Incremental refresh is only possible if the connector and table format support it (e.g., Apache Iceberg with proper metadata tracking), but this requires careful tuning.
Query Acceleration
Trino’s optimizer has evolving support for automatically rewriting queries to use materialized views. In many cases, users must query the materialized view directly to benefit from precomputation.
A “grace period” mechanism allows stale views to be used if recent refreshes have not yet occurred, balancing performance with freshness.
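Putting those mechanics together, here's a minimal sketch against the Iceberg connector (catalog, schema, and table names are hypothetical; the GRACE PERIOD clause requires a reasonably recent Trino release):

-- The Iceberg connector persists this view's data as an Iceberg table.
CREATE MATERIALIZED VIEW iceberg.analytics.daily_sales
GRACE PERIOD INTERVAL '1' HOUR  -- stale results may be served for up to an hour
AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM iceberg.raw.orders
GROUP BY order_date, region;

-- Refresh is explicit: run it manually or from an external scheduler.
REFRESH MATERIALIZED VIEW iceberg.analytics.daily_sales;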
Trino gives users complete control, but also full responsibility. It’s a good fit for teams who want flexibility across sources, but are comfortable managing refresh logic and view usage themselves.
Starburst (Enterprise/Galaxy): Acceleration Layers for Trino
Starburst builds on Trino by adding enterprise capabilities — including governance, query management, and crucially, acceleration mechanisms for performance-critical workloads.
Enhanced Materialized Views
Like open-source Trino, Starburst allows users to define materialized views, typically persisted as Iceberg tables in object storage.
Starburst adds scheduling controls (e.g., cron-style refresh intervals) and incremental refresh support when working with table formats like Iceberg.
These views are persisted and managed within the object store, just like other Iceberg-managed tables, allowing external interoperability and visibility.
Warp Speed: Proprietary, Autonomous Acceleration
Warp Speed is Starburst’s proprietary caching and indexing layer, purpose-built to improve performance without user intervention.
Storage Layer:
Raw data blocks are cached on local SSD disks of Trino worker nodes, playing a similar role to Dremio's C3.
Indexing metadata is stored in a shared object storage bucket, separate from the primary data source.
Behavior:
Operates transparently; users do not need to configure or manage it.
Indexes and cache are populated automatically based on query access patterns.
Warp Speed uses these indexes to skip irrelevant data during scans and speed up joins and filter operations.
Refresh and Consistency:
Cache and index maintenance is handled internally. Warp Speed is designed to keep itself in sync with the source, warming up new data as it is accessed.
Warp Speed offers performance benefits, particularly for data lake workloads. However, it is proprietary, and cached/indexed data is not externally accessible or reusable outside Starburst. In contrast, Dremio's Reflections are persisted as Iceberg tables, making them potentially accessible to external tools.
Comparing Philosophies: Explicit vs. Autonomous
Where Trino and Starburst require users to define and manage materialized views, Dremio’s Reflections are autonomous and tightly coupled with the query engine’s optimizer. On Iceberg tables, Dremio uses the format’s rich metadata as a native indexing layer, reducing the need for external caching or proprietary indexing systems.
While Warp Speed provides acceleration for Starburst users, it introduces a closed layer of storage and logic. Dremio’s stack, in contrast, keeps acceleration fully within the open Iceberg ecosystem, with materializations (Reflections) stored as physical Iceberg tables, retaining portability and visibility across tools and platforms.
Apache Arrow: A Columnar Edge in Query Performance
When evaluating the performance characteristics of query engines like Dremio, Trino, and Starburst, it’s easy to focus on high-level features like materialized views, caching layers, or indexing strategies. But underneath those features lies something just as critical: how data is represented and moved in memory. That’s where Apache Arrow plays a defining role, especially in how Dremio delivers consistently high performance on lakehouse workloads.
Columnar Memory Format as a First-Class Citizen
Apache Arrow is an in-memory, columnar data format designed for high-speed analytics. Its strength lies in its ability to represent tabular data in a way that’s:
Highly cache-efficient (ideal for modern CPU architectures)
Language-independent
Zero-copy compatible across processes and systems
In essence, Arrow allows systems to move and process data in memory with minimal serialization overhead, which is often a hidden cost in distributed query engines.
Dremio’s Arrow-Native Execution Engine
Dremio was built from the ground up to be Arrow-native. Every stage of query execution (scans, filters, joins, and aggregations) is performed directly on Arrow buffers in memory. This provides several performance advantages:
Zero-copy data movement between execution operators, reducing CPU overhead
Reduced serialization costs during data shuffles
Tight integration with Gandiva, an LLVM-based expression compiler that JIT-compiles filter and projection expressions on Arrow vectors
Because Arrow is the native execution format, there’s no impedance mismatch between how data is stored (e.g., Parquet files in object storage) and how it’s processed in memory. It’s an end-to-end columnar pipeline.
Trino and Starburst: Row-Based Execution Overhead
Trino and Starburst, in contrast, use a format-based execution model internally (row-based for row-based formats, columnar for columnar formats). Even though they can read columnar formats like Parquet and ORC from object storage, once the data is loaded into memory it is typically transformed into pages and blocks in a proprietary format, introducing more potential deserialization points than an open standard like Arrow would.
This transformation adds overhead, most notably CPU-intensive deserialization from the on-disk format into Trino's in-memory representation.
While Starburst’s Warp Speed caching and SSD indexing layers help mitigate some of these costs, they don’t eliminate the overhead of row-based execution within the JVM.
What This Means for Query Acceleration
For many lakehouse workloads, especially those involving wide tables, column pruning, or large scans, Arrow-native execution can deliver significantly faster performance simply by being more efficient with CPU, memory, and data layout. This becomes even more pronounced in multi-tenant or cost-sensitive cloud environments, where every millisecond of compute counts.
When combined with Dremio’s Reflections, Columnar Cloud Cache, and query plan optimizations, Arrow gives Dremio a foundational performance advantage: it’s not just what Dremio does, it’s how efficiently it does it.
Materializations and caching get a lot of attention, but the execution model matters too. Dremio’s use of Apache Arrow enables vectorized, zero-copy processing that aligns with the columnar nature of data lakes, allowing it to achieve benchmarks like 22-second TPC-DS completion. In contrast, Trino and Starburst, while powerful and flexible, still carry the overhead of non-Arrow-based processing, which can be a limiting factor in high-throughput lakehouse workloads.
OLAP Engines and Iceberg: Materialization Meets Engine-Centric Design
While engines like Dremio and Trino are built to operate over the data lake, OLAP systems such as Apache Doris, StarRocks, and ClickHouse take a different approach. These systems prioritize performance by owning the storage and execution lifecycle, delivering ultra-fast analytics via tight integration with internal columnar storage formats.
But as Apache Iceberg gains traction as a standardized, open table format for lakehouse workloads, these engines are evolving. Each is beginning to interact with Iceberg in its own way, either by reading from it directly or by materializing Iceberg data into its native storage format for acceleration.
Let’s take a closer look.
Apache Doris: Fast Analytics via Native Materialized Views
Doris is a distributed MPP OLAP engine, purpose-built for real-time analytics. Its materialized views are deeply integrated into the core engine, functioning more like lightweight indexes than standalone tables.
Materialization for Iceberg Workflows
Doris now supports reading from external Apache Iceberg tables, allowing users to query Iceberg data directly.
For acceleration, Doris materialized views must be defined within its internal storage engine. This typically involves copying Iceberg data into Doris-managed tables and then building materializations on top.
This means materializations are persisted in Doris’s proprietary columnar format, separate from the Iceberg table and inaccessible to other engines.
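As a rough sketch, an asynchronous materialized view in Doris over an external Iceberg catalog might look like this (catalog, database, and table names are hypothetical, and the exact clauses vary by Doris version):

-- The view's data lands in Doris's internal columnar storage, not in Iceberg.
CREATE MATERIALIZED VIEW daily_sales_mv
BUILD IMMEDIATE
REFRESH AUTO ON SCHEDULE EVERY 1 HOUR
DISTRIBUTED BY HASH(order_date) BUCKETS 10
AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM iceberg_catalog.sales_db.orders  -- external Iceberg source
GROUP BY order_date, region;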
Pros
Materialized views support automatic query rewrite, including filter pushdown and partition pruning.
Incremental refresh and partition-level updates help keep materializations efficient and relatively fresh.
Tight coupling with the execution engine enables very high performance.
Cons
Materializations are not stored as Iceberg tables, so other engines or tools can’t reuse them.
Data movement is required to bring Iceberg data into Doris’s environment before it can be accelerated.
Requires explicit refresh logic or scheduling.
In contrast to Dremio, which accelerates queries directly on Iceberg data using Reflections stored as Iceberg tables in your existing data lake, Doris delivers speed by re-ingesting data into its environment, which trades openness for performance.
StarRocks: OLAP with External Catalog Awareness
StarRocks shares many characteristics with Doris, but it has gone further in supporting external data catalogs, including Hive Metastore and Apache Iceberg.
Materialization for Iceberg Workflows
StarRocks can query Iceberg tables from external catalogs.
To accelerate queries, it materializes views into its native storage engine, not back into Iceberg tables.
In other words, the materialized view is a copy of the Iceberg data, persisted in StarRocks’s internal columnar format, not as a logical layer or as an Iceberg-native asset.
Materialized View Behavior
Views support automatic query rewriting, partition pruning, and incremental refreshes.
Users can define refresh intervals, or trigger refreshes manually or on data change.
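A hedged sketch of what that looks like in StarRocks, assuming an external Iceberg catalog named iceberg_catalog and hypothetical table names:

-- The result set is stored in StarRocks's internal format, not written back to Iceberg.
CREATE MATERIALIZED VIEW daily_sales_mv
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)  -- asynchronous refresh on a fixed interval
AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM iceberg_catalog.sales_db.orders   -- queried from the external catalog
GROUP BY order_date, region;

Queries against the base table can then be rewritten to use the view automatically by the optimizer, which is the main ergonomic advantage over plain Trino-style materialized views.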
Pros
High performance and low latency once the data is materialized.
Better support than Doris for external data lake sources like Iceberg and Hive.
Flexible refresh mechanisms (sync, async, auto-triggered).
Cons
Materializations are engine-specific and opaque to external tools.
There’s a disconnect between the source Iceberg table and the StarRocks materialized view; any changes in the source must be tracked and refreshed manually or through dependency management.
Like Doris, this approach duplicates data, increasing storage and complexity.
Compared to Dremio and Trino:
Dremio keeps materializations in the Iceberg ecosystem, stored as Iceberg tables, transparently rewritten into queries, and accessible to any compatible engine.
Trino, though more manual, also persists materializations in Iceberg format if used with the right connector.
StarRocks, on the other hand, optimizes for its engine but sacrifices interoperability and openness.
ClickHouse: Ingest-Time Materialization at Its Core
ClickHouse approaches materialization differently. It treats materialized views as real-time transformations applied during data ingestion. That design is highly performant, but presents challenges for Iceberg integration.
Materialization for Iceberg Workflows
ClickHouse now includes basic support for reading from Iceberg tables, but does not write materializations back to Iceberg.
Instead, materialized views create new internal tables, populated by streaming inserts or periodic refresh jobs.
These target tables are stored in ClickHouse’s MergeTree format, not Iceberg.
How It Works
Incremental Materialized Views: Data transformations are applied as new rows are ingested — ideal for append-only use cases.
Refreshable Materialized Views: Full recomputation views that run on a schedule or on demand.
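In ClickHouse terms, the incremental flavor is a view that fans inserts into a pre-aggregated target table. A minimal sketch with hypothetical names, assuming an events source table with order_date, region, and amount columns:

-- Target table in ClickHouse's native MergeTree family (not Iceberg).
CREATE TABLE daily_sales_agg
(
    order_date   Date,
    region       String,
    total_amount Float64
)
ENGINE = SummingMergeTree
ORDER BY (order_date, region);

-- The view intercepts each block inserted into events, pre-aggregates it, and
-- writes the result to the target; background merges then sum matching keys.
CREATE MATERIALIZED VIEW daily_sales_mv TO daily_sales_agg AS
SELECT order_date, region, sum(amount) AS total_amount
FROM events
GROUP BY order_date, region;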
Pros
Exceptional performance for real-time and high-throughput ingestion scenarios.
Materializations act as pre-aggregated, transformation-ready tables, speeding up complex queries.
Cons
Materializations are isolated from the Iceberg table: you're working with derived data in a separate storage layer.
No support for writing back to Iceberg or making materializations visible to other engines.
Lacks tight semantic integration with Iceberg metadata (e.g., snapshots, partition evolution).
In contrast to Dremio, which allows fully interoperable acceleration within the Iceberg ecosystem, ClickHouse treats Iceberg as just another data source and materializes into its own proprietary tables.
Doris, StarRocks, and ClickHouse deliver impressive performance by owning their storage layer, but when used in Iceberg workflows, their materializations are copied and siloed, trading openness for raw speed. These systems work best when the engine is the center of your stack.
In contrast:
Dremio’s Reflections are persisted as Iceberg tables and optimized using Iceberg’s metadata.
Materializations remain within the open data lake, visible to other tools and engines.
No need to duplicate or ingest data into a proprietary engine, which reduces overhead and boosts agility.
For organizations prioritizing open formats, interoperability, and low-friction acceleration, this distinction is more than an architectural detail — it’s a strategic advantage.
Snowflake and Databricks: Shifting Toward the Open Lakehouse
The rapid adoption of Apache Iceberg has made one thing clear: open table formats are winning mindshare. Both Snowflake and Databricks, historically known for tightly coupled platforms, are now adapting to this shift by embracing read and write support for Iceberg, enabling new forms of interoperability and lakehouse-style analytics.
But their approaches to materialization and acceleration still reflect their warehouse-centric origins, especially when compared to systems like Dremio, which were designed from day one to accelerate open data lake workloads natively.
Let’s unpack how Snowflake and Databricks are evolving and where their current materialization models land in the Lakehouse spectrum.
Snowflake: Extending the Warehouse into the Lakehouse
Snowflake has built its reputation on delivering a managed, performant, and user-friendly cloud data warehouse experience. Until recently, this experience was limited to data stored inside Snowflake-managed tables. But with the introduction of Iceberg Table support, Snowflake is extending its reach into open data lake architectures.
Iceberg Support
Snowflake now allows reading and writing to Apache Iceberg tables in external object storage (e.g., S3).
These tables can be managed either through Snowflake’s catalog or external catalogs like AWS Glue and Apache Polaris.
This marks a significant architectural shift, but not yet a fully open acceleration strategy.
Materialization in Snowflake
Snowflake supports traditional Materialized Views, which are persisted as Snowflake-managed internal tables.
These views are created explicitly and tied to data stored inside Snowflake (i.e., Snowflake-native tables).
Additionally, all materialized views are opaque to external systems, locked inside the Snowflake environment.
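A minimal sketch (the table is hypothetical; note that materialized views require Snowflake's Enterprise Edition or higher):

-- Single source table only; joins are not allowed in Snowflake materialized views.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM sales
GROUP BY order_date, region;

-- Snowflake maintains the view automatically in the background; there is no
-- refresh statement to run, but the data lives in Snowflake-managed storage.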
Acceleration Tools
Snowflake also includes features like automatic clustering, result caching, and query plan caching to accelerate workloads.
These tools work well for Snowflake-native data, but they do not all extend to Iceberg tables in external storage.
Pros
Seamless experience within the Snowflake ecosystem.
Low operational burden — fully managed materialized views for Snowflake-native data.
Early but promising support for Iceberg as a step toward interoperability.
Cons
All materialized data is locked into Snowflake’s internal storage, inaccessible to external engines.
Performance optimizations like clustering or caching are Snowflake-centric, not format-centric.
Compared to Dremio, which materializes directly on Iceberg tables and leverages Iceberg metadata for optimization, Snowflake’s current model reflects a transition phase, where openness is supported in terms of access but not yet in terms of optimization.
Databricks: Delta-First, Iceberg-Aware
Databricks has always positioned itself as a Lakehouse pioneer, thanks to its Delta Lake table format and unified platform for data engineering, ML, and analytics. However, as Apache Iceberg has gained momentum as an open standard, Databricks has responded by adding support for reading and writing Iceberg tables, too.
Iceberg Support
Databricks supports read and write operations on Iceberg tables, alongside its native Delta Lake format.
The platform offers options for managing Iceberg tables through Unity Catalog, further aligning with open Lakehouse principles.
However, much of Databricks’ ecosystem, especially its performance tooling, is still oriented around Delta Lake.
Materialization with Delta and Iceberg
Materialized Views in Databricks are implemented through Delta Live Tables (DLT) and manual view definitions.
DLT can orchestrate incremental pipeline updates and produce query-optimized, materialized outputs, but this is a developer-driven workflow, not an autonomous engine feature.
Materialized outputs can be persisted as Delta or Iceberg tables, depending on how the pipeline is configured.
That said, Databricks does not yet offer autonomous, transparent acceleration of Iceberg queries in the way that Dremio does with Reflections.
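For example, a Databricks SQL materialized view (backed by a managed DLT pipeline under the hood; names are hypothetical) looks like standard SQL, but queries must reference the view explicitly:

-- Registered via Unity Catalog; refreshes run as managed pipeline updates.
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT order_date, region, SUM(amount) AS total_amount
FROM main.sales_db.orders
GROUP BY order_date, region;

-- Refresh is user-initiated or scheduled; there is no automatic query rewrite
-- that substitutes the view into queries against the base table.
REFRESH MATERIALIZED VIEW daily_sales_mv;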
Performance Features
Databricks uses Photon, a vectorized query engine, and Auto Optimize, which compacts small files and manages data layout.
These features improve performance for both Delta and (increasingly) Iceberg workloads, but they require manual setup and tuning.
There’s no automatic query rewrite mechanism that swaps in materialized data in the background based on cost.
Pros
Strong data engineering capabilities and streaming pipelines (DLT).
Performance optimizations like Photon and Auto Optimize.
Support for persisting materialized outputs as Iceberg tables, maintaining format openness.
Cons
Materialization is manual and developer-driven, not transparent or autonomous.
No automatic query rewriting to use materializations.
Iceberg support is still emerging, and acceleration features lag behind Delta.
Compared to Dremio:
Both platforms can persist materialized data as Iceberg tables.
But only Dremio offers a semantic-aware, autonomous optimization layer (Reflections) that rewrites queries and manages refreshes without user input.
Databricks requires you to build and manage the logic yourself, making acceleration a dev-centric task.
Snowflake and Databricks are moving toward the Lakehouse, but their acceleration strategies remain tied to their stacks. Materialized views are either not applicable to Iceberg (Snowflake) or require manual pipelines (Databricks). In both cases, materializations are either proprietary or developer-managed, not autonomously maintained or format-native.
Dremio stands apart by delivering:
Transparent, automatic query acceleration
Materializations stored as Iceberg tables, visible to other tools
Integration with Iceberg’s metadata for incremental refreshes and clustering
As the Lakehouse paradigm matures, the ability to accelerate open formats without giving up control or interoperability may prove to be the ultimate differentiator.
Comparing Materialization Approaches Across the Lakehouse Landscape
With so many engines converging on the Lakehouse model, it’s essential to understand not just whether materialization is supported, but how it’s implemented, and what tradeoffs each approach entails in terms of performance, openness, and operational complexity.
Here’s a breakdown of the core differences across the major players:
Dremio
Materialization Mechanism: Reflections, not traditional MVs — live, incremental, and semantic-aware.
Persistence: Stored as Apache Iceberg tables directly in your object store (e.g., S3, ADLS), maintaining open format interoperability.
Query Rewrite: Fully automatic and transparent. Users query logical tables; Dremio selects optimal Reflections behind the scenes.
Refresh Behavior:
Autonomous on Iceberg tables (incremental + continuous refresh).
Manual or scheduled on non-Iceberg sources (databases, data warehouses, data lakes).
Additional Optimizations:
Arrow-native engine for zero-copy, vectorized execution.
Columnar Cloud Cache on nodes for frequently accessed files.
Query plan and result caching enhance repeat performance.
Tradeoff: Requires Iceberg to unlock full autonomy, but doesn't lock users into proprietary layers. Reflections are available in both free and paid versions of Dremio (autonomous Reflections are an enterprise feature).
Trino (Open Source)
Materialization Mechanism: Materialized Views are defined manually via SQL.
Persistence: Stored as physical tables in underlying connectors (e.g., Hive, Iceberg).
Query Rewrite: Limited. Optimizer support is evolving; often requires users to query the MV directly.
Refresh Behavior: Manual refresh (REFRESH MATERIALIZED VIEW) or orchestrated via external schedulers (e.g., Airflow).
Tradeoff: Flexible and federated, but requires manual management and lacks deep integration with acceleration features.
Starburst (Enterprise/Galaxy)
Materialization Mechanism: Same MV system as Trino, with enterprise enhancements.
Persistence: Typically stored as Iceberg tables in object storage.
Query Rewrite: Improved optimizer support with automatic query rewrites and cost-based selection.
Refresh Behavior: Manual, scheduled, or incremental refresh with cron-like controls.
Additional Acceleration:
Warp Speed: Proprietary caching/indexing layer.
Stores raw data blocks and indexes on SSDs and in object storage.
Operates transparently, automatically tuning based on usage.
Tradeoff: Stronger performance with Warp Speed for enterprise clusters (no free version of Warp Speed), but it introduces proprietary layers that reduce visibility and interoperability.
Apache Doris
Materialization Mechanism: Integrated Materialized Views within the OLAP engine.
Persistence: Stored in Doris’s internal columnar storage, not in open formats.
Query Rewrite: Fully automatic within Doris queries.
Refresh Behavior: Supports synchronous (real-time) and asynchronous (scheduled) refresh modes, including incremental.
Iceberg Integration: Can read from Iceberg, but accelerates by ingesting data into Doris.
Tradeoff: High performance inside Doris, but sacrifices format openness and requires data duplication for acceleration.
StarRocks
Materialization Mechanism: Similar to Doris — internal Materialized Views tied to StarRocks storage.
Persistence: Stored in StarRocks’s internal format, not as Iceberg tables.
Query Rewrite: Automatic with robust optimizer support.
Refresh Behavior: Real-time (sync), auto-triggered, scheduled, or manual — with incremental support.
Iceberg Integration: Supports querying external Iceberg tables, but materializations remain isolated within StarRocks.
Tradeoff: Powerful performance for real-time analytics, but with proprietary storage and limited interoperability.
ClickHouse
Materialization Mechanism: Materialized Views applied at ingestion time or via periodic refresh.
Persistence: Views populate internal ClickHouse tables, typically using MergeTree engines.
Query Rewrite: Optimizer leverages target tables when queried directly; some rewrite support for ingest-triggered views.
Refresh Behavior:
Incremental via insert-triggered views.
Refreshable views support full recomputation on schedule.
Iceberg Integration: Can read Iceberg tables, but writes materializations to its internal format.
Tradeoff: Extremely fast for ingest-heavy use cases, but not designed for shared, open-table workflows.
Snowflake
Materialization Mechanism: Materialized Views on Snowflake-managed tables only.
Persistence: Stored as internal Snowflake-managed tables.
Query Rewrite: Automatic within the Snowflake engine.
Refresh Behavior: Fully managed by Snowflake; refresh cadence is abstracted away.
Iceberg Integration:
Can read and write Iceberg tables.
Tradeoff: Strong performance and simplicity for Snowflake-native data, but limited acceleration for Iceberg and no external materialization visibility.
Databricks
Materialization Mechanism: Delta Live Tables and manually defined views.
Persistence: Materialized data can be written to Delta or Iceberg tables, depending on the pipeline config.
Query Rewrite: No automatic rewrite. Materializations must be queried explicitly.
Refresh Behavior: Managed via orchestration logic in DLT or user-defined jobs.
Iceberg Integration: Can read/write Iceberg, but materialization workflows are still Delta-first and require manual configuration.
Tradeoff: Strong engineering capabilities and flexibility, but manual acceleration and limited automation for Iceberg workflows.
Key Observations
Dremio stands out as the only engine offering fully transparent, automatic acceleration on Iceberg-native materializations, with zero-copy execution and open format persistence.
Trino and Starburst offer flexible materialization with Iceberg persistence, but require manual refresh and query management, unless Warp Speed is used (which introduces proprietary layers).
OLAP engines like Doris, StarRocks, and ClickHouse deliver high-speed analytics by copying Iceberg data into internal formats, which limits reusability and openness.
Snowflake and Databricks are still catching up in their ability to accelerate open data lake workloads, with most materialization workflows confined to their managed formats or requiring manual configuration.
Conclusion: Materialization as a Strategic Advantage in the Lakehouse Era
As the Lakehouse architecture becomes the new default for modern analytics, materialization and query acceleration are no longer optional — they’re essential. But as we’ve seen, the way each engine approaches materialization reveals more profound truths about its design philosophy, performance model, and alignment with open data principles.
Some engines, like Doris, StarRocks, and ClickHouse, deliver speed by tightly coupling storage and execution. Still, that performance comes at the cost of openness and interoperability, especially when working with formats like Apache Iceberg.
Others, like Trino and Starburst, offer flexible federation and support for open formats, but often require users to manage materialization explicitly, with limited transparency and automation — unless proprietary extensions like Warp Speed are added.
Snowflake and Databricks are adapting quickly, introducing support for Iceberg and signaling a move toward open standards. Yet today, their acceleration strategies remain rooted in their original architectures, where materializations are either unavailable on open tables or manually orchestrated by users.
In contrast, Dremio takes a fundamentally different approach. By combining:
Apache Arrow-based execution
Autonomous Reflections stored as Iceberg tables
Incremental, metadata-aware refreshes
Built-in caching at multiple levels
And automatic query rewrites via the semantic layer
…it delivers warehouse-level performance directly on the data lake, without requiring users to choose between performance, simplicity, and openness.
In a landscape where speed often comes at the cost of lock-in, and flexibility can mean more operational burden, Dremio demonstrates that it’s possible to have both: an open Lakehouse experience with the acceleration built in.
As you evaluate Lakehouse engines, don’t just ask “can it query my Iceberg data?” Ask “how is it accelerating that query?”, “where is that materialized data stored?”, and “what effort is required to manage it?”
Because in the Lakehouse era, how you accelerate is just as important as what you accelerate.