Evolving our Tensor Storage Engine: A Preview of Icechunk 2

Sebastian Galkin

Staff Engineer

Earthmover is building the cloud platform for scientific data, focusing on weather, climate and geospatial use cases. In these domains, tensors, not tables, are the ideal data model. We have devoted major engineering effort for the past year to Icechunk, our open-source transactional tensor storage engine. Icechunk enables tensors as a first-class, production-ready data format for AI and analytics workloads. Icechunk works hand in hand with Zarr Python to provide these capabilities. Our customers, such as Brightband, Sylvera, and Kettle, are using Icechunk for the input and output for their AI-driven physical models of weather and climate.

The rapid adoption of Icechunk, and our own experience hosting customer data for the past year, have allowed us to identify important ways to improve Icechunk. For example, we weren’t expecting repositories to reach tens of thousands of commits so soon, nor were we expecting repositories with 100k groups or arrays. People have taken Icechunk in new and very interesting directions.

This is why we are excited to preview the upcoming release of Icechunk 2, with powerful new features and performance improvements.

New Features and Performance Benefits

Here is a list of new features you can use if you create an Icechunk 2 repository, or if you upgrade one of your existing repositories to the Icechunk 2 on-disk format:

  • Full log of every change made to the repo. This goes beyond the commit history and includes operations such as branch and tag creation and deletion, garbage collection, and expiration. It gives users a full picture of what happened to the repo and an easier way to recover from errors.
  • Node move/rename. Users of Zarr and Icechunk often struggle with designing their Zarr hierarchy. Today, moving an array from one group to another, or even just fixing a typo in an array or group name, requires an expensive read and rewrite of every chunk. In Icechunk 2, this operation becomes constant time and requires almost no I/O: only metadata is changed, and no chunks are rewritten.
  • Ability to shift/reindex chunks. In Icechunk 2, chunk references can be updated by changing only metadata, without rewriting chunks. This allows shifting arrays in any direction, inserting new elements in the middle, and so on, without intensive I/O.
  • Icechunk 2 introduces Session.amend as an alternative to Session.commit. Many users want Icechunk because it provides transactionality and atomicity, but they don’t necessarily need to generate long repo histories. amend allows them to “change” the previous commit without generating a new element in the repo ancestry, all without sacrificing any of Icechunk’s guarantees.
  • Rectilinear grids. Icechunk 1 only supports the most basic type of chunk grid, in which every chunk has the same size. Icechunk 2 supports the recently introduced rectilinear grids. This is important for a range of applications, such as making virtual datasets from existing, unevenly partitioned NetCDF files and appending data with uneven intervals (like months).
  • Ability to flag repositories with different statuses: on-line, off-line, read-only. This offers users more protection against accidental updates: by default, Icechunk 2 will reject operations that violate the current status of the repo.
  • Ability to host read-only Icechunk repositories on any HTTP server, not necessarily an object store. Also, a mechanism to use HTTP redirects to find the location of an Icechunk repository.
  • Metadata at the repository level. Icechunk 1 supports arbitrary metadata for commits, but we saw users who wanted to attach metadata to the repository as a whole. Icechunk 2 will allow arbitrary repository-level metadata that can be updated and retrieved.
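
To make the "only metadata is changed" idea behind node move/rename and chunk shifting more concrete, here is a minimal toy sketch in plain Python. It is not Icechunk's API or on-disk format: the ToyRepo class, its methods, and the object keys are all hypothetical, and a dictionary of chunk references stands in for the real metadata. The point is only that renaming a node or shifting its chunks can be done by re-keying references, without writing any chunk payloads.

```python
# Toy model only: this is NOT Icechunk's API or on-disk format.
# All names here (ToyRepo, move, shift, "chunks/obj-0") are hypothetical.
from dataclasses import dataclass, field

ChunkIndex = tuple[int, ...]


@dataclass
class ToyRepo:
    """Maps node path -> {chunk index -> key of an immutable chunk object}."""

    manifests: dict[str, dict[ChunkIndex, str]] = field(default_factory=dict)
    chunk_writes: int = 0  # counts how many chunk payloads were written

    def write_chunk(self, path: str, index: ChunkIndex, object_key: str) -> None:
        """Record a reference to a newly written chunk object."""
        self.manifests.setdefault(path, {})[index] = object_key
        self.chunk_writes += 1

    def move(self, old_path: str, new_path: str) -> None:
        """Rename a node by re-keying its manifest; chunk objects are untouched."""
        if new_path in self.manifests:
            raise ValueError(f"{new_path} already exists")
        self.manifests[new_path] = self.manifests.pop(old_path)

    def shift(self, path: str, axis: int, offset: int) -> None:
        """Shift every chunk reference along one axis by a whole number of chunks."""
        old = self.manifests[path]
        self.manifests[path] = {
            index[:axis] + (index[axis] + offset,) + index[axis + 1:]: key
            for index, key in old.items()
        }


repo = ToyRepo()
for i in range(3):
    repo.write_chunk("/temprature", (i, 0), f"chunks/obj-{i}")

writes_before = repo.chunk_writes
repo.move("/temprature", "/temperature")      # fix a typo in the array name
repo.shift("/temperature", axis=0, offset=2)  # make room for 2 chunks at the start

print(sorted(repo.manifests["/temperature"]))  # [(2, 0), (3, 0), (4, 0)]
print(repo.chunk_writes == writes_before)      # True: no chunk was rewritten
```

In this toy, the rename is a single dictionary re-key and the shift touches only references, which mirrors the claim in the list above; how Icechunk 2 actually stores and updates these references is not described in this post.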

Icechunk 2 also brings some performance benefits:

  • Improved Ancestry Performance. Ancestry no longer needs to download one file per commit from the object store. In Icechunk 2 the commit tree is stored in a single file so a single object store request can complete the operation.
  • Expiration and garbage collection are much faster in repos with long histories.
  • Faster size metrics. Obtaining the total size of the repo using total_chunks_storage is much faster, because it requires less I/O.
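
The ancestry improvement can be illustrated with a small toy model, assuming a dictionary that stands in for the object store and counts each lookup as one request. The key names ("commits/...", "commit-tree") and the layout of the stored data are hypothetical and chosen only for illustration; they are not Icechunk's real file names or formats.

```python
# Toy model: a dict stands in for the object store; each get() is one "request".
# Key names and data layout are hypothetical, not Icechunk's real format.
class CountingStore:
    def __init__(self, objects):
        self.objects = objects
        self.requests = 0

    def get(self, key):
        self.requests += 1
        return self.objects[key]


# 1000 commits, each pointing at its parent (None for the first commit).
parents = {f"c{i}": (f"c{i - 1}" if i > 0 else None) for i in range(1000)}


def ancestry_per_commit(store, tip):
    """One request per commit: fetch each commit object to learn its parent."""
    history, current = [], tip
    while current is not None:
        history.append(current)
        current = store.get(f"commits/{current}")
    return history


def ancestry_single_file(store, tip):
    """One request total: fetch the whole parent map, then walk it in memory."""
    tree = store.get("commit-tree")
    history, current = [], tip
    while current is not None:
        history.append(current)
        current = tree[current]
    return history


per_commit_store = CountingStore({f"commits/{c}": p for c, p in parents.items()})
single_file_store = CountingStore({"commit-tree": parents})

a = ancestry_per_commit(per_commit_store, "c999")
b = ancestry_single_file(single_file_store, "c999")
print(a == b, per_commit_store.requests, single_file_store.requests)  # True 1000 1
```

Both functions return the same history, but the number of round trips differs by a factor equal to the history length, which is why long histories benefit most.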

The Migration Path

First, a very important declaration: we understand the pain that format and database migrations can cause and are very explicitly designing Icechunk 2 to minimize disruption. Icechunk 2 will fully interoperate with Icechunk 1. The Icechunk 2 library will continue to support reading, writing, and managing Icechunk 1 repositories.

As you create new repositories, you may want to create them using the new Icechunk 2 on-disk format. This will allow you to use any of the new features and performance benefits. To take advantage of these features for existing Icechunk 1 repositories, these repos will need to be migrated to the Icechunk 2 on-disk format. This migration only modifies index files, does not duplicate data or modify data files, and is reversible. Upgrading repositories to the new Icechunk 2 format will be explicit and completely optional. The Icechunk 1 format will remain supported for the foreseeable future.

If you are ready to migrate your Icechunk 1 repositories to Icechunk 2, that’s very easy to do. The Icechunk 2 library includes a simple migration function you can execute, and we will have support in Arraylake to run the migration for you if you prefer. The important thing to remember is that if you migrate a repository to on-disk format version 2, or create a new repo using the new on-disk format, you won’t be able to open it with older Icechunk 1.x versions of the library. Our recommendation: don’t upgrade an existing repository until you have updated all pipelines that use it to the Icechunk 2 library version.
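
One way to act on that recommendation is to have each pipeline check its installed library version before anyone migrates the repository. The sketch below uses only the Python standard library; the helper names are hypothetical, the example version strings are made up, and it assumes the package is installed under the distribution name "icechunk" with a version string that starts with the major version number.

```python
# A minimal pre-migration guard, assuming the distribution name is "icechunk"
# and that its version string starts with the major version (e.g. "2.0.0a0").
# The helper names below are hypothetical and not part of Icechunk.
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    """Return the leading integer of a version such as '1.1.5' or '2.0.0a0'."""
    return int(version_string.split(".")[0])


def can_open_format_v2(version_string: str) -> bool:
    """Per the post, on-disk format version 2 needs a 2.x library, not 1.x."""
    return major_version(version_string) >= 2


def installed_icechunk_is_ready() -> bool:
    """Check the icechunk package installed in the current environment, if any."""
    try:
        return can_open_format_v2(version("icechunk"))
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    # Run this in every pipeline environment before migrating a shared repo.
    print("ready for format v2:", installed_icechunk_is_ready())
```

Running this in every environment that reads or writes the repository, and migrating only once all of them report ready, follows the ordering suggested above.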

You can try an alpha version of Icechunk 2 today! It’s currently on the main branch of the Icechunk GitHub repo and is installable via nightly wheels. We welcome feedback in the form of GitHub issues. We expect to officially release Icechunk 2 sometime in early 2026.
