Alex Merced


Apache Data Lakehouse Weekly: June 17 to June 24, 2026

This week the lakehouse community spent most of its energy on two questions. The first is how the catalog grows from a table directory into a control plane that serves agents, BI tools, and query engines from one place. The second is how the file and table formats hold up when correctness gets tested at the bit level. Apache Polaris pushed hard on the first question with debates about non-Iceberg REST endpoints, runtime-activated datasources, and a semantic layer that lives in the catalog. Apache Iceberg and Apache Parquet pushed on the second with proposals about global snapshot consistency, primary keys, NaN ordering, and INT96 timestamps. Underneath all of it ran a quieter thread that touched several projects at once: the cost of shared continuous integration runners, the maturing of the C++ and Rust implementations, and a fresh round of release votes. Here is what the four core projects and DataFusion worked through over the past seven days.

Apache Iceberg

The most consequential design discussion of the week centered on data correctness across more than one table at a time. Xiening Dai's Global Snapshot Consistency proposal kept drawing replies from Russell Spitzer, Andrei Tserakhau, and Maninder Parmar. The core problem is real and getting harder. Today the Iceberg spec defines isolation at the level of a single write operation through table properties like write.delete.isolation-level. It says nothing about reading several tables as of one consistent point in time. Xiening framed two approaches that the thread treated as complements rather than rivals. One adds a batch LoadTables API that returns the metadata for many tables atomically as of a shared instant. The other introduces a commit sequence number, a monotonic value that lets a reader pin a consistent cut across tables. Maninder pointed out that the batch load path solves the read side, and the sequence number path gives the deeper guarantee. The reason this matters is the shape of modern workloads. An AI agent or a BI dashboard that joins five tables wants those five tables to reflect the same logical moment, not five independent and slightly skewed snapshots. Iceberg was built around single-table atomicity. This thread is the community starting to reason about multi-table atomicity as a first-class concern.

Xiening also addressed the worry that the two ideas conflict. He argued they coexist, since a batch load API can guarantee a snapshot read while a commit sequence number gives the stronger cross-table ordering for cases that need it. Andrei Tserakhau and Russell Spitzer kept the discussion grounded in concrete scenarios, the kind where a pipeline reads a fact table and several dimension tables and must not see a half-applied set of writes. The reason this thread carries weight beyond its immediate scope is that it forces the community to decide how much consistency machinery belongs in the catalog versus the table format versus the engine. That boundary question repeats across the lakehouse, and how Iceberg answers it here sets a precedent for the semantic layer and scan-planning debates happening one list over in Polaris.
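
To make the two ideas concrete, here is a minimal sketch of how a commit sequence number and a batch load could fit together. It is a toy in-memory model, not the Iceberg or Polaris API: the `ToyCatalog` class and its `commit`, `snapshot_as_of`, and `load_tables` methods are hypothetical names invented for this illustration, and the proposal itself may land on a different shape.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog; not the Iceberg or Polaris API."""

    commit_seq: int = 0
    # table name -> list of (commit sequence number, snapshot id), in commit order
    history: dict = field(default_factory=dict)

    def commit(self, updates):
        """Record new snapshots for several tables under one sequence number."""
        self.commit_seq += 1
        for table, snapshot_id in updates.items():
            self.history.setdefault(table, []).append((self.commit_seq, snapshot_id))
        return self.commit_seq

    def snapshot_as_of(self, table, seq):
        """Latest snapshot of `table` whose commit sequence number is <= seq."""
        visible = [snap for s, snap in self.history.get(table, []) if s <= seq]
        return visible[-1] if visible else None

    def load_tables(self, tables, seq=None):
        """Batch load: resolve every table against one pinned sequence number."""
        pinned = self.commit_seq if seq is None else seq
        return pinned, {t: self.snapshot_as_of(t, pinned) for t in tables}


catalog = ToyCatalog()
catalog.commit({"orders": "o-1", "customers": "c-1"})

# A reader pins the current sequence number when it first loads its tables.
pinned, view = catalog.load_tables(["orders", "customers"])

# A later writer commits to both tables while the reader is still working.
catalog.commit({"orders": "o-2", "customers": "c-2"})

# Re-reading at the pinned number returns the same cut; an unpinned read moves on.
stable = catalog.load_tables(["orders", "customers"], seq=pinned)[1]
latest = catalog.load_tables(["orders", "customers"])[1]
```

The batch call gives the reader one coherent view at load time, and the pinned number lets it come back to that same view later, which is the sense in which the thread treated the two approaches as complements.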

A second proposal asked Iceberg to take on something it has long left to the engine. Chandra Sekhar opened First-Class Primary Key Tables in Apache Iceberg, arguing that native primary-key support belongs in the table format rather than being reconstructed by each compute layer. Iceberg has grown into a widely adopted format for analytics, and the request reflects a push toward operational and change-data patterns where a declared key carries real semantics. The discussion is early, and the hard parts are exactly where you expect them. Enforcement cost, merge-on-read behavior, and the interaction with equality deletes all need answers before this becomes a spec change. The thread is worth watching because a native key concept reshapes how engines plan upserts and how downstream consumers reason about row identity.

Read-path cost drew a focused conversation as well. Varun Lakhyani and Steve Loughran worked through combining three GET calls for Parquet reads, covering the root manifest, the data files, and small-file compaction. The target is the small-file workload, where the count of separate S3 GET requests dominates latency and cost. Varun noted that the community voted earlier to pursue a related fix, and this thread continues that line of work. Steve, who knows the object-store read path as well as anyone, joined to push on the details. The payoff is direct. Fewer round trips to object storage means lower latency and a smaller bill for workloads built from many small files, which is the common case for streaming ingest and frequent commits.

Statistics work picked up where earlier table-stats efforts left off. Tamás Máté proposed table-level quantile sketches in Puffin for Spark cost-based planning. Iceberg's Spark integration already computes table-level number-of-distinct-values statistics. Tamás wants to add quantile sketches stored in Puffin so the Spark planner gets a better picture of value distributions. Better distribution estimates lead to better join and filter planning, which leads to faster queries without any change to the data itself. The scope is deliberately narrow. This is a table-level, Spark-focused follow-up rather than a sweeping statistics overhaul, and that narrowness is what gives it a clear path forward.
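
A rough sketch shows why distribution data helps a planner. The code below uses exact quantile cut points from Python's standard library as a stand-in for a real sketch, and the function names `build_quantile_summary` and `estimate_selectivity_le` are made up for this example. It is not how Spark or Iceberg computes or stores the statistics.

```python
from bisect import bisect_right
from statistics import quantiles


def build_quantile_summary(values, buckets=10):
    """Return cut points that split `values` into `buckets` equal-count ranges."""
    return quantiles(values, n=buckets, method="inclusive")


def estimate_selectivity_le(cuts, threshold):
    """Estimate the fraction of rows with value <= threshold from the cut points."""
    # Each cut point at or below the threshold accounts for one more bucket of rows.
    return bisect_right(cuts, threshold) / (len(cuts) + 1)


row_values = list(range(1, 1001))  # stand-in column: 1..1000, evenly spread
cuts = build_quantile_summary(row_values)
estimate = estimate_selectivity_le(cuts, 250)  # the true fraction is 0.25
```

With ten buckets the estimate can be off by up to one bucket, so a planner trades summary size against accuracy. A number-of-distinct-values statistic alone cannot answer this kind of range question at all, which is the gap the proposal targets.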

Operational tooling got an interesting proof of concept. Sarthak Singh shared Iceberg Doctor, a forensics and visualization toolchain for diagnosing table health, built from experience managing Iceberg tables at Confluent. The motivation will be familiar to anyone who has run Iceberg at scale. Tables drift into bad states through too many small files, orphaned data, bloated metadata, and snapshot histories that grow without bound. Iceberg Doctor aims to surface those conditions and make them legible. The community response treated it as a useful direction, and tooling like this matters because the format has reached the point where running it well, not just writing to it, is the daily challenge for platform teams.

Several correctness and API threads rounded out the Iceberg week. Kevin Liu and Sung Yun worked through clarifying schema JSON type string serialization, the kind of specification detail that prevents subtle cross-engine mismatches later. Szehon Ho opened a discussion on adding a UDF-specific name. Cheng Pan and Imran Rashid started talking through repartitioning old partition-spec data files, a real pain point for tables that have evolved their partitioning over time. Grant Nicholas flagged a possible position-delete spec violation, and Yuya Ebihara raised a compatibility concern about HadoopConfigurable extending Iceberg's Configurable interface. Gábor Kaszab and Péter Váry both opened threads on column update metadata and file representation, pointing at column-level update tracking as an active area.

The multi-language story kept advancing. Kevin Liu posted the announcement for Apache Iceberg C++ v0.3.0, the native C++ implementation that gives non-JVM systems a first-party path to read and write Iceberg. On the Rust side, Sreeram Garlapati started a thread to enumerate production usage of iceberg-rust, asking teams to share where and how they run the Rust implementation in real deployments. That request is more strategic than it looks. Cataloging production users builds the case for stability commitments and helps the maintainers prioritize the gaps that matter to people who depend on the library.

Engine compatibility surfaced through Spark's faster release cadence. Anton Okolnychyi and Anurag Mantripragada discussed Iceberg's Spark versioning strategy in light of Spark dropping 3.4 after the Iceberg 1.11 release and the Spark community moving to quicker releases. The question is how many Spark versions Iceberg supports at once and for how long. Support breadth costs maintainer time. Dropping versions too fast strands users. The thread is the project trying to find a sustainable line as upstream Spark speeds up.

Infrastructure cost became its own conversation. Ajantha Bhat, Kevin Liu, and Manu Zhang worked through Iceberg's consumption of ASF shared GitHub-hosted runners. The Apache Software Foundation provides shared CI capacity, and busy projects burn through it fast. The discussion covered how to keep Iceberg's pipelines green without crowding out other ASF projects. This is unglamorous work, and it is the kind of thing that keeps a large project healthy.

Storage-layer work continued in parallel. A contributor opened a thread on refactoring the aliyun OSSFileIO implementation to improve performance and fix bugs, and Manu Zhang joined the discussion. FileIO implementations are the layer where Iceberg meets each object store, and cleaning up the Aliyun OSS path matters for the large base of users running on Alibaba Cloud. Vladislav Sidorovich also opened a pull-request review thread for updating the Delta Lake migration path to use Delta Kernel. Migration tooling is how teams move off Delta Lake and onto Iceberg without rewriting their data, and moving that path onto Delta Kernel keeps it aligned with how Delta itself now exposes its tables. Both threads are the kind of connective work that decides whether a team's move to Iceberg goes smoothly or stalls on an edge case.

Taken together, the Iceberg list this week read like a project balancing two jobs at once. One job is reaching for new capabilities that were once the engine's problem, with primary keys, global snapshot consistency, and column-level updates all in play. The other job is tightening the parts that already carry production weight, with read-path cost, storage IO, statistics, and migration tooling all getting attention. For a data team running Iceberg today, the practical signal is that the format is moving toward richer semantics without abandoning the operational concerns that decide the monthly bill. The small-file GET-reduction work and the Iceberg Doctor proposal in particular speak directly to teams whose tables have grown faster than their maintenance routines.

The community also looked outward. Anatolii Popov opened planning for Iceberg Summit 2027, and Sung Yun shared that Lakehouse Day EU registration is live, co-located with Community Over Code in Glasgow this October. Walaa Eldin Moustafa and Huaxin Gao each proposed dedicated syncs for materialized views and index support, a sign that both areas have enough momentum to warrant focused meetings. Alex Stephen noted an upcoming Iceberg Terraform provider release, pulling Iceberg deeper into infrastructure-as-code workflows.

Apache Polaris

Polaris was the busiest list of the week by a wide margin, and the volume reflects a project working out what a modern Iceberg REST catalog should be. The single largest thread asked a foundational question. Dmitri Bourlatchkov opened Non-IRC endpoints in IRC config responses, which ran past 25 messages and pulled in Yufei Gu, Robert Stupp, Russell Spitzer, Adnan Hemani, Adam Christian, and yun zou. During a code review, Dmitri noticed that Polaris returns endpoints in its Iceberg REST Catalog config responses that are not part of the IRC specification. That raises a design question with real stakes. Yufei argued that a Polaris client is also an IRC client with extra capabilities, which makes Polaris a superset of IRC, and that advertising those extra endpoints through the config endpoint fits capability discovery. Others worried about clients that expect strict IRC compliance and get surprised by unfamiliar entries. The debate is about how Polaris extends a standard without breaking the contract that makes the standard useful. Getting this right keeps Polaris interoperable with the broad Iceberg ecosystem while still letting it offer more.
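
One way to picture the capability-discovery side of the argument is a client that keeps the endpoints it recognizes and sets aside the rest instead of failing. This is a sketch under assumptions: the config response is hand-written, the extension path is a made-up placeholder rather than a real Polaris endpoint, and `split_endpoints` is a hypothetical helper, not part of any client library.

```python
# A small subset of standard endpoints, written as "<HTTP verb> <path>" strings.
KNOWN_IRC_ENDPOINTS = {
    "GET /v1/{prefix}/namespaces",
    "GET /v1/{prefix}/namespaces/{namespace}/tables",
    "POST /v1/{prefix}/namespaces/{namespace}/tables",
}


def split_endpoints(config_response):
    """Separate advertised endpoints into recognized ones and unknown extensions."""
    advertised = config_response.get("endpoints", [])
    supported = [e for e in advertised if e in KNOWN_IRC_ENDPOINTS]
    extensions = [e for e in advertised if e not in KNOWN_IRC_ENDPOINTS]
    return supported, extensions


# Hand-written response; the last entry is a hypothetical non-IRC extension.
config_response = {
    "endpoints": [
        "GET /v1/{prefix}/namespaces",
        "POST /v1/{prefix}/namespaces/{namespace}/tables",
        "GET /example/v1/{prefix}/extra-capability",
    ]
}

supported, extensions = split_endpoints(config_response)
```

A client written this way treats the extra entries as optional features. A client that validates the list strictly would reject the same response, which is the compatibility worry raised in the thread.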

Persistence architecture drew the second-largest thread. Alexandre Dutra introduced multiple datasources with runtime activation through a pull request and a follow-up message on pluggable persistence. The proposal lets a Polaris deployment define more than one backing datasource and activate them at runtime, and it connects to earlier work that sketched distinct persistence tiers. Dmitri Bourlatchkov, Romain Manni-Bucau, and Yufei Gu all engaged. The reason this matters is operational flexibility. Large deployments want to route different realms or workloads to different storage backends without redeploying, and a clean runtime-activation model makes that possible without turning the persistence layer into a tangle.

The thread most relevant to where the lakehouse is heading was Semantic Layer Support in Apache Polaris, which stayed active through the week. Yufei Gu framed the original problem clearly. AI agents, BI tools, notebooks, and query engines increasingly read the same data, and each system redefines the same metrics and dimensions in its own language. That duplication produces drift, where the same metric returns different numbers depending on which tool computed it. The proposal puts semantic definitions in the catalog so every consumer reads one source of truth. Adam Christian summarized offline conversations with Yufei, JB Onofré, and Dennis, signaling support while raising specific concerns about semantic drift and scope. Adnan Hemani pushed on the boundaries of the proposal, asking what belongs in the catalog versus the engine. This is the catalog reaching beyond table metadata toward business meaning, and it lines up directly with the way agentic analytics needs a stable, shared definition of what a number means.

Scan planning surfaced as a related extension of the catalog's role. Tornike Gurgenidze proposed scan planning with optional caching layers, adding Iceberg REST-compliant scan planning to Polaris. Yufei Gu, Prashant Singh, Adnan Hemani, and Dmitri Bourlatchkov all weighed in. Yufei's main concern was load. Scan planning, done centrally in the catalog, can become a heavy workload that competes with the catalog's core duties unless some delegation mechanism offloads it. The thread treated a phased approach as sensible, starting with the basic planning support and adding caching and delegation as the load picture becomes clear. Server-side scan planning is one of the more powerful ideas in the Iceberg REST world because it lets thin clients ask the catalog where the data is rather than computing it themselves, and Polaris is now working through how to host it without buckling.

JB Onofré moved a long-running design forward with the Polaris Directories proposal, backed by a pull request. The proposal grew out of months of discussion involving directories and table sources, and it gives Polaris a structured way to organize and source catalog contents. Yufei Gu welcomed the progress and committed to reviewing the pull request. Structural primitives like this set the shape of the catalog for years, so the careful, iterative path here is the right one.

Authentication and cloud parity got attention through the GCP counterpart to AWS STS session tags, where Adnan Hemani, Anand Kumar Sankaran, Dmitri Bourlatchkov, and Sung Yun worked through how Polaris passes scoped identity into Google Cloud the way it already does for AWS. Multi-cloud credential vending is table stakes for a catalog that wants to serve every major provider, and this closes a real gap.

Observability and events ran as a connected set of threads. EJ Wang and Dmitri Bourlatchkov discussed REST endpoints for table metrics and events. Alexandre Dutra raised forwarding Iceberg scan and commit metrics through events. Adnan Hemani followed up on the OpenLineage proposal and opened a thread on an OpenTelemetry event listener. Taken together these point at a catalog that emits structured signals about what is happening inside it, which is what platform teams need to monitor, audit, and bill for catalog activity.

Several engineering threads kept the project's foundations solid. Robert Stupp, Yong Zheng, Dmitri Bourlatchkov, and Yufei Gu chased down a broken setup-gradle in CI. Alexandre Dutra proposed deprecating TreeMapMetaStore and explained why Polaris needs the server-test-runner Gradle plugin. A contributor flagged a missing entity cache in the NoSQL backend, and Dmitri opened threads on handler-level idempotency for createTable and Iceberg table encryption support. Zhiyang Chen and Sung Yun discussed entity-level filtering for list operations, and Adam Szita raised multiple StorageConfigurationInfos per catalog.

Scale and access control showed up through the GitHub-bridged discussion threads. One asked about the feasibility of one realm per tenant at a scale of 10,000 tenants, a real question for anyone planning to run Polaris as multi-tenant infrastructure. Realms are Polaris's tenancy boundary, and whether the design holds at five-figure tenant counts decides whether Polaris fits large platform deployments without custom sharding. Another thread raised fine-grained branch and tag creation control, pushing Polaris toward the kind of granular permissions that data branching workflows need. A contributor also flagged a cleanup opportunity around nullability in PolarisResolutionManifest, the sort of internal tidiness that keeps a fast-growing codebase maintainable.

For a team evaluating Polaris right now, the week's threads add up to a clear picture. The project is no longer just answering whether it can serve the Iceberg REST API. It is answering what a catalog should own beyond tables: semantic definitions, scan planning, event streams, multi-cloud credentials, and a tenancy model that scales. That ambition is the right one for where the lakehouse is heading, and the open debates about IRC compliance and central-planning load are exactly the questions a serious catalog has to resolve before it asks production workloads to depend on it.

On the release side, the list noted that the 1.6.0 release is targeted for around June 26. Alexandre Dutra and Adnan Hemani joined the vote thread. A 1.6.0 release puts the recent run of catalog improvements into users' hands and sets the baseline for the semantic-layer and scan-planning work still in design.

Apache Arrow

Arrow's most active thread had nothing to do with the columnar format and everything to do with keeping the project's benchmarks honest. Wes McKinney restarted work on the status of Arrow conbench data and the conbench open-source project after noticing the old benchmarking host had gone offline. Rok Mihevc reported that conbench now lives at conbench.arrow-dev.org, that the repositories were forked to the arctosalliance GitHub organization to protect them, and that the historical benchmark database was preserved in a migrated AWS account. Antoine Pitrou and Jacob Wujciak joined the discussion. Jacob made the practical case that improving conbench and its orchestration frees up cloud credits and, more valuable still, the engineer time that goes into fixing outages. Continuous benchmarking is how a performance-sensitive project like Arrow catches regressions before they ship, so restoring this infrastructure is real work with real payoff.

A cross-format question returned with the Variant type support thread. Micah Kornfield and Gang Wu noted that several teams are building variant support in Arrow C++ in parallel, with Gang's colleague Zehua having worked on it for a while and an open issue tracking the effort. The detail that ties this to the wider ecosystem is that iceberg-cpp depends on Arrow's variant type to support Iceberg v3. Variant is the semi-structured type that lets columnar formats carry flexible, JSON-like data without giving up columnar benefits, and it is showing up across Parquet, Arrow, and Iceberg at once. Coordinating the duplicate efforts into one solid implementation is the task in front of the Arrow community here.

The Rust side shipped on schedule. Andrew Lamb proposed the release of Apache Arrow Rust Object Store 0.14.0 RC1, and the vote passed with Andrew posting the result. Adam Reeve, Kevin Liu, L.C. Hsieh, and Raúl Cumplido took part in the verification. The object-store crate is a quiet workhorse, giving Rust-based data systems a uniform interface to S3, GCS, Azure Blob, and local files, and its steady release cadence is part of why the Rust data ecosystem keeps gaining ground. Kevin Liu and Sutou Kouhei also opened an Arrow-side thread on consumption of ASF shared GitHub-hosted runner resources, the same CI-capacity concern that surfaced on the Iceberg list. Ian Cook kept the cadence going with the Arrow community meeting on June 17.

The list also handled a bit of community housekeeping. A thread titled suspicious auto-reply pulled in Alenka Frim, Ian Cook, Jacob Wujciak, Kevin Liu, and Raúl Cumplido to sort out an odd automated message hitting the list. It is a small thing, and it is also the daily reality of running an open mailing list at the center of a large ecosystem. Keeping the list clean keeps the signal high for the contributors who depend on it.

Arrow's week, read as a whole, was about foundations rather than features. Restoring continuous benchmarking, shipping the Rust object-store release, aligning on a shared variant implementation, and managing CI capacity are all infrastructure work. None of it makes a headline. All of it decides whether the projects that sit on top of Arrow, which by now include much of the Rust and C++ data ecosystem, can trust that performance holds and releases keep coming. The variant thread is the one with the longest reach, because a clean Arrow variant type unblocks Iceberg v3 support in the C++ world and keeps the semi-structured story consistent across the stack.

Apache Parquet

Parquet spent the week on questions of format evolution and bit-level correctness, which is exactly where a mature storage format earns its trust. The defining thread was Daniel Weeks on the future of Parquet versioning. The topic resurfaced at a community sync, and Daniel pulled together thoughts on how the format moves forward without breaking the enormous installed base of Parquet files and readers. Versioning a ubiquitous format is genuinely hard. Move too cautiously and useful features stall for years. Move too fast and you fragment the reader ecosystem, leaving files that some engines cannot open. The thread is the community trying to find a process that lets Parquet adopt new capabilities while protecting compatibility, and given how much of the world's analytical data sits in Parquet, the stakes are high.

A precise correctness question drew careful attention. Gang Wu and Gábor Szádovszky worked through clarifying NaN bit preservation in floating-point encodings. While implementing IEEE 754 total ordering and a NaN count in parquet-java, Gábor found that the spec underspecifies how floating-point NaN bit patterns are handled. NaN values carry payload bits, and whether those bits survive a write-and-read round trip affects ordering and statistics. This is the kind of corner that almost never matters until it does, and then it produces silent disagreement between engines about how the same column sorts. Pinning the behavior in the spec is how the format stays predictable across every implementation.
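
The payload issue is easy to see with raw bit patterns. The sketch below builds two NaN values that differ only in their payload and sorts them with a key that follows IEEE 754 total ordering. The helper names are invented for this example, it says nothing about what the Parquet spec will require, and it assumes a common 64-bit platform where Python keeps quiet-NaN payload bits intact through `struct`.

```python
import math
import struct

ALL_ONES = 0xFFFFFFFFFFFFFFFF
SIGN_BIT = 1 << 63


def double_to_bits(x):
    """Raw 64-bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def bits_to_double(bits):
    """Python float built from a raw 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def total_order_key(x):
    """Sort key that follows IEEE 754 total ordering on the bit pattern."""
    bits = double_to_bits(x)
    if bits & SIGN_BIT:
        # Negative values: flip every bit so larger magnitudes sort first.
        return bits ^ ALL_ONES
    # Non-negative values: set the sign bit so they sort above all negatives.
    return bits | SIGN_BIT


quiet_nan = bits_to_double(0x7FF8000000000000)
payload_nan = bits_to_double(0x7FF8000000000001)  # same NaN class, different payload

values = [payload_nan, 1.0, -0.0, quiet_nan, 0.0, float("-inf")]
ordered = sorted(values, key=total_order_key)
ordered_bits = [hex(double_to_bits(v)) for v in ordered]
```

Both NaN values report `math.isnan` as true and compare unequal to everything, yet total ordering places them at distinct positions because their bits differ. If a writer or reader normalizes the payload along the way, two engines can end up with different sort orders and different min/max statistics for the same column.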

A related ordering question moved toward a decision. Divjot Arora opened a vote on GH-583 to define ordering for INT96 timestamps, following earlier discussion on the list and in the GitHub issue. INT96 is the legacy timestamp representation that the format has tried to move past for years, yet a vast quantity of older data still uses it. Defining a clear ordering for it removes ambiguity for engines that still read those files. Voting on the change signals that the community wants this settled rather than left to per-engine interpretation.
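
A short sketch shows why the ordering needs to be written down. INT96 timestamps are commonly laid out as 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian. The code assumes that layout, and the helper names are invented for this example. It also takes chronological order as the comparison of interest, which is an assumption for illustration and not a statement of what GH-583 specifies.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(ts):
    """Encode an aware datetime as nanos-of-day (8 bytes) then Julian day (4 bytes)."""
    delta = ts - UNIX_EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    return struct.pack("<qi", nanos_of_day, JULIAN_DAY_OF_UNIX_EPOCH + delta.days)


def decode_int96(raw):
    """Decode 12 INT96 bytes back to an aware datetime (microsecond precision)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return UNIX_EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1_000)


def chronological_key(raw):
    """Compare by day first, then by time within the day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


early = encode_int96(datetime(2024, 1, 1, 0, 0, 0, 1, tzinfo=timezone.utc))
late = encode_int96(datetime(2024, 1, 2, tzinfo=timezone.utc))

bytewise = sorted([early, late])  # naive comparison of the raw bytes
chronological = sorted([early, late], key=chronological_key)
```

Here the raw-byte sort puts the later timestamp first, because the time-of-day bytes lead and the day comes last. An engine that compares the bytes directly and an engine that decodes first would compute different min/max statistics for the same file.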

The read-planning layer got its own thread. Jiayi Wang, Will Edwards, Andrew Lamb, Andrew Pikler, and Ed Seidl discussed the clarification on row-group and column-chunk layout. The starting point was how parquet-java assigns a row group to a file split using the row group's midpoint. Will suggested that splitting is a client feature rather than a spec mandate, and the thread worked through whether a single address per row group gives a clean and unambiguous split assignment. This is the plumbing that decides how a large Parquet file gets carved up across parallel readers, and small ambiguities here turn into uneven work distribution at scale.
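
The midpoint rule can be sketched in a few lines. This is a simplified illustration of the idea described in the thread, with fixed-size splits and hand-picked offsets, and `assign_row_groups` is a hypothetical helper rather than parquet-java's actual code.

```python
def assign_row_groups(row_groups, split_size):
    """Map each row group to the one split that contains its midpoint.

    row_groups: list of (start_offset, total_byte_size) pairs.
    Returns {split_index: [row_group_index, ...]}.
    """
    assignment = {}
    for index, (start, size) in enumerate(row_groups):
        midpoint = start + size // 2
        split_index = midpoint // split_size
        assignment.setdefault(split_index, []).append(index)
    return assignment


# Three row groups in a file read as 128-byte splits (sizes kept tiny on purpose).
row_groups = [(4, 100), (104, 100), (204, 60)]
assignment = assign_row_groups(row_groups, split_size=128)
```

The second row group straddles the boundary at byte 128, yet its midpoint falls in exactly one split, so exactly one reader picks it up. Any single well-defined address per row group gives the same property, which is why the thread focused on whether the spec should say which address that is or leave it to clients.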

Daniel Weeks also followed up on supporting non-contiguous pages in Parquet, summarizing progress and the path forward after discussion about the scope of the change. Divjot Arora raised a separate question about INT96 statistics, and Alkis Evlogimenos and Micah Kornfield discussed making path_in_schema optional. Jiayi Wang and Manu Zhang continued the Parquet Footer Working Group with session 3, the focused group working on footer-related improvements that affect how quickly a reader learns a file's structure. On the engineering-hygiene side, Eduard Tudenhöfner and Steve Loughran discussed adopting AssertJ for test assertions, a small change that makes the test suite clearer to read and maintain.

Apache DataFusion

DataFusion kept its fast release rhythm and grew its leadership. Andy Grove proposed the release of Apache DataFusion Comet 0.17.0 RC1, and the vote moved quickly with verification from Andrew Lamb, L.C. Hsieh, Marko Milenković, Martin Grigorov, and Bhargava Vadlamani, who verified the build on Apple's M5 hardware. Andy posted the passing result. Comet is the accelerator that pushes Spark execution onto a native DataFusion engine, and its steady cadence gives Spark users a path to faster execution without leaving the Spark API behind.

The project also welcomed new leadership. The PMC announced Matt Butrovich as a new DataFusion PMC member, with congratulations from Jeffrey Vo, Phillip LeBlanc, and Qi Zhu. Growing the PMC is how a fast-moving project spreads the load of review and governance across more hands. Andrew Lamb separately organized a PlusOne.apache.org interview, part of the community's ongoing work to tell the story of the people behind the code.

The DataFusion family sits at an interesting spot in the lakehouse picture. Comet brings native acceleration to Spark, the core DataFusion engine powers a growing list of query systems, and the project shares maintainers and design sensibilities with iceberg-rust and the Arrow Rust crates. That overlap is why a Comet release and a new PMC member belong in a lakehouse roundup rather than off to the side. The same people moving DataFusion forward are the ones moving Rust-native Iceberg and Arrow forward, and the steady release rhythm here is a leading indicator for how fast the non-JVM lakehouse stack matures overall.

Cross-Project Themes

Three patterns connected these lists this week, and they say more together than any single thread does alone.

The first is the catalog turning into a control plane. Polaris spent the week debating a semantic layer, server-side scan planning, metrics and event endpoints, OpenTelemetry and OpenLineage hooks, and directory structures. Iceberg, meanwhile, worked through global snapshot consistency so that a reader can pin a coherent view across many tables. These look like separate efforts, and they answer the same underlying demand. Agents and BI tools want one place that tells them what a metric means, where the data is, and how to read several tables as of one moment. The catalog is becoming that place. The work in Polaris and the snapshot-consistency work in Iceberg are two ends of the same arc toward a catalog that serves coordinated, consistent, well-described data to whatever sits above it.

The second is the variant type as a shared dependency. Arrow discussed variant support in C++, and the thread surfaced that iceberg-cpp needs it for Iceberg v3, while Parquet continues its own semi-structured data work. Variant is moving through all three projects at once because the same workloads, flexible event data, nested JSON, and schema-on-read sources, push against the rigid edges of columnar formats. The coordination challenge is real. Duplicate implementations in Arrow C++ this week are a reminder that a cross-format type needs cross-project alignment, not three separate solutions that drift apart.

The third is the steady maturation of non-JVM implementations alongside a heavy release season. This week alone brought the Iceberg C++ v0.3.0 announcement, an active push to catalog iceberg-rust production usage, the Arrow Rust Object Store 0.14.0 release, and the DataFusion Comet 0.17.0 vote, with Polaris 1.6.0 in the wings. The pattern is unmistakable. The lakehouse stack is shedding its assumption of a JVM at every layer. C++ and Rust implementations now carry real production weight, which widens where Iceberg, Arrow, and Parquet can run and lowers the cost of embedding them. Running underneath all of it, the shared-CI-runner discussions on both the Iceberg and Arrow lists are the unglamorous tax of this growth. More implementations and faster releases mean more pipelines competing for the same foundation resources, and the projects are starting to manage that cost out in the open.

The pattern across the Parquet list this week was a format acting its age in the best sense. INT96 ordering, NaN bit preservation, row-group split addressing, and versioning are not flashy. They are the contracts that let dozens of independent readers and writers agree on what a file means. Every one of these threads closes a gap where two engines once risked quietly disagreeing. For data teams, the takeaway is reassuring rather than urgent. The format underneath most of the world's analytical storage is being maintained with the care that its ubiquity demands, and the versioning discussion in particular is the one to track for anyone who needs Parquet to keep gaining features without breaking the readers already in production.

What It Means for Data Teams

Step back from the individual threads and a single message comes through. The open lakehouse is consolidating its control plane in the catalog while hardening its formats at the foundation. These two movements serve the same goal, which is making data trustworthy and reachable for the new wave of agentic and AI-driven consumers. A team planning its lakehouse strategy for the rest of 2026 can read this week as confirmation of a few bets. Standardizing on the Iceberg REST catalog interface is getting safer as Polaris fills in semantic, planning, and observability features. Investing in non-JVM tooling is getting safer as the C++ and Rust implementations take on production load. Building multi-table, agent-facing analytics is getting a real answer as the snapshot-consistency and semantic-layer discussions mature.

There is also a caution worth holding. Several of the most important threads, global snapshot consistency, primary keys, the Polaris semantic layer, and Parquet versioning, are still in design. They point at the direction of travel, and they are not settled spec. The right posture is to follow them closely, contribute where a real use case gives weight to the discussion, and avoid building hard dependencies on behavior that the community has not finalized. The lists are open, the contributors named here respond to good arguments backed by real workloads, and the fastest way to shape where the lakehouse goes is to bring a concrete problem to the thread that is already trying to solve it.

Looking Ahead

The next week has a clear marker. The Apache Polaris 1.6.0 release is targeted for around June 26, so watch the vote thread close and the release announcement land. The INT96 ordering vote in Parquet should resolve, settling a long-standing ambiguity for legacy timestamp data. The Parquet versioning discussion is the slower burn worth tracking, because the process the community lands on will shape format evolution for years.

On the design side, the Polaris semantic-layer and scan-planning proposals both have momentum and open questions, and the Iceberg global snapshot consistency thread is the one to follow for anyone building multi-table, agent-driven analytics. The Iceberg column-update threads from Gábor Kaszab and Péter Váry are worth tracking too, since column-level update tracking touches change-data and incremental workloads that many teams care about. On the Arrow and Parquet side, the variant work and the footer working group are the slow, structural efforts that pay off over quarters rather than weeks. Further out, Lakehouse Day EU in Glasgow this October and the early planning for Iceberg Summit 2027 give the community its next in-person checkpoints. The through-line for the rest of the summer is the catalog growing up, the formats tightening their guarantees, and the non-JVM implementations earning more of the production load. If you run a lakehouse, the most useful habit this season is to read the design threads in the projects you depend on and bring your real workloads into the conversation while the decisions are still open.


