
Variable length chunks in Zarr

Joe Hamman

CTO & Co-founder

Max Jones

Cloud Engineer @ DevSeed

Davis Bennett

Software Engineer (Freelance)

A new extension to Zarr has just landed supporting variable length chunking. In this technical post, we’ll dive into the new feature, what it enables, and how you can start using it in your workloads.

TL;DR

You can now specify arbitrarily sized chunks using the new Rectilinear Chunk Grid in Zarr-Python and Zarrs. This allows you to align chunk boundaries with the natural structure of your data, rather than forcing a regular chunk grid layout.

Experimental feature. Rectilinear chunk grids are experimental in Zarr-Python 3.2 and must be explicitly enabled via zarr.config.set({"array.rectilinear_chunks": True}) (or the environment variable ZARR_ARRAY__RECTILINEAR_CHUNKS=True). The feature is expected to stabilize in Zarr-Python 3.3.

import zarr

zarr.config.set({"array.rectilinear_chunks": True})

store = zarr.storage.MemoryStore()

# Create an array where chunks along the second axis vary in size
arr = zarr.create_array(
    store=store,
    shape=(10, 10),
    chunks=[[5, 5], [2, 4, 4]],  # Regular along dim 0, variable along dim 1
    dtype="i4",
    zarr_format=3,
)
arr[:] = 1

Introduction

Zarr is fundamentally a chunked array format. Historically, Zarr has defaulted to a regular chunk grid, meaning that every chunk in the array has the exact same shape (e.g., (100, 100)), with the possible exception of chunks at the edge of the array which are cropped to fit the array shape.

While regular chunking is efficient for many use cases, it doesn’t fit every dataset. Data is often naturally partitioned along irregular intervals. A common example is daily time series data that you want to partition into month-sized chunks. Because months vary in length (28, 30, or 31 days), a regular grid forces you to either split months across chunks or pad them. The same problem shows up when virtualizing legacy archives whose files have non-uniform extents along the concatenation axis — until now, the regular grid has been a hard wall for that workflow.
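The month-sized chunk list for a year of daily data can be built directly from the calendar. As a sketch using only the standard library:

```python
import calendar

# Chunk edge lengths for one year of daily data, one chunk per month.
# 2024 is a leap year, so February contributes 29 days.
year = 2024
month_chunks = [calendar.monthrange(year, m)[1] for m in range(1, 13)]
print(month_chunks)       # [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
print(sum(month_chunks))  # 366
```

A list like this can be passed directly as the chunk specification along the time axis, as shown in the examples below.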

The Zarr v3 specification introduced an extensible chunk grid abstraction, allowing us to define new ways to organize data. The rectilinear chunk grid extension is the first major feature to leverage this capability.

Design: The Rectilinear Chunk Grid

The regular chunk grid uses a single tuple to represent the chunk shape for the entire array. For example, an array of shape (10, 10) with a chunk shape of (5, 5) results in a 2 × 2 grid of chunks.

The new Rectilinear Chunk Grid extends this approach by allowing users to specify a list of chunk sizes for each axis. For that same (10, 10) array, you could specify the chunk structure as:

chunks = [[6, 4], [3, 3, 3, 1]]

This defines a grid where chunks along the first axis are size 6 and 4, and along the second axis are 3, 3, 3, and 1.

Regular chunk grid versus rectilinear chunk grid for the same 10 by 10 array

Run-Length Encoding (RLE) in stored metadata

For arrays with many chunks, an explicit list of every edge length would bloat the metadata document on disk. To keep the JSON compact, the chunk grid specification defines a Run-Length Encoding (RLE) form: the pair [value, count] represents count consecutive chunks of size value, and RLE pairs can be freely mixed with bare integers within a single axis specification.

For example, the second axis of chunks = [[6, 4], [3, 3, 3, 1]] is stored on disk as [[3, 3], 1] — “size 3 repeated 3 times, then a 1.” Other compact stored forms include:

  • [10, 20, 30] — three chunks with explicit sizes (no RLE needed)
  • [[10, 3]] — three chunks of size 10 (pure RLE)
  • [[10, 3], 5] — three chunks of size 10, then one chunk of size 5

Zarr automatically compresses repeated values into RLE format when writing metadata, and expands them back when reading. The RLE form lives entirely at the storage layer — when calling create_array from Python, pass the expanded edges ([3, 3, 3, 1]) and let Zarr handle the on-disk compression.
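The round trip between the two forms is simple. The helpers below are an illustration of the encoding rules, not Zarr's internal API:

```python
def expand_rle(spec):
    """Expand a mixed axis spec ([value, count] pairs and bare ints) into edge lengths."""
    edges = []
    for item in spec:
        if isinstance(item, list):      # [value, count] pair
            value, count = item
            edges.extend([value] * count)
        else:                           # bare integer
            edges.append(item)
    return edges

def compress_rle(edges):
    """Compress repeated edge lengths into [value, count] pairs, leaving singletons bare."""
    spec = []
    for value in edges:
        if spec and isinstance(spec[-1], list) and spec[-1][0] == value:
            spec[-1][1] += 1            # extend an existing run
        elif spec and spec[-1] == value:
            spec[-1] = [value, 2]       # second occurrence starts a run
        else:
            spec.append(value)
    return spec

print(expand_rle([[3, 3], 1]))     # [3, 3, 3, 1]
print(compress_rle([3, 3, 3, 1]))  # [[3, 3], 1]
```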

Edge cases and gotchas

A few constraints to be aware of:

  1. Zarr v3 only. This feature is only supported for arrays stored in the Zarr v3 format.
  2. Shape constraints. The sum of the chunk lengths along each axis must be greater than or equal to the shape of the array. If the sum exactly matches the shape, the array is fully covered. If the sum exceeds the shape, the trailing chunk is partially populated, similar to edge chunks in regular grids.
  3. No .chunks property. Because chunk sizes are not uniform, the familiar arr.chunks attribute is not available on rectilinear arrays. Use .write_chunk_sizes instead (see below).
  4. Resize semantics. Resizing is supported, but with different mechanics than regular grids — see the section below.
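The shape constraint (item 2 above) is easy to check by hand. This is a sketch of the rule, not a Zarr API:

```python
def check_coverage(shape, chunks):
    """Verify that the chunk edges along each axis cover the array extent."""
    for dim, (size, edges) in enumerate(zip(shape, chunks)):
        total = sum(edges)
        if total < size:
            raise ValueError(f"axis {dim}: chunks cover {total} < shape {size}")

check_coverage((10, 10), [[6, 4], [3, 3, 3, 1]])  # exact cover: OK
check_coverage((10, 10), [[6, 4], [3, 3, 3, 3]])  # sum 12 > 10: trailing chunk partial, OK
```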

Inspecting chunk sizes

The .write_chunk_sizes property returns the actual data size of each storage chunk along every dimension. It works for both regular and rectilinear arrays and returns a tuple of tuples (matching the dask Array.chunks convention):

zarr.config.set({"array.rectilinear_chunks": True})

z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(60, 100),
    chunks=[[10, 20, 30], [50, 50]],
    dtype="int32",
)
print(z.write_chunk_sizes)
# ((10, 20, 30), (50, 50))

When sharding is used, .read_chunk_sizes returns the inner chunk sizes instead. Calling arr.info on a rectilinear array reports Chunk shape: <variable>.

Resizing and appending

Rectilinear arrays can be resized. When growing past the current edge sum, a new chunk is appended covering the additional extent. When shrinking, existing chunk edges are preserved and the extent is re-bound — chunks beyond the new extent simply become inactive.

import numpy as np

z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(30,),
    chunks=[[10, 20]],
    dtype="float64",
)
z[:] = np.arange(30, dtype="float64")
print(z.write_chunk_sizes)  # ((10, 20),)

z.resize((50,))
print(z.write_chunk_sizes)  # ((10, 20, 20),)

z.append(np.arange(10, dtype="float64"))
print(z.shape, z.write_chunk_sizes)  # (60,) ((10, 20, 20, 10),)

This is intentionally a simple model — append always adds one chunk equal to the new extent. Future work (described below) will expose richer controls over how the grid evolves.

Codecs and filters

Rectilinear arrays work with the full codec pipeline — compressors, filters, and checksums all behave as expected, with each chunk processed independently. There are no rectilinear-specific restrictions on which codecs you can use.

Usage patterns

Both Zarr-Python and Zarrs now support this extension. Here are four patterns showing how it can be applied.

1. Daily appends without rewriting edge chunks

Time-series datasets are often chunked coarsely along time — a single chunk might span 30, 90, or 365 days — to optimize for analytical reads that span long windows. But the same datasets are frequently updated at a much finer cadence. ERA5, for example, is updated daily.

With a regular chunk grid, this is an expensive mismatch. To append one day of data to an array chunked at 365 days per chunk, you have to read the partial edge chunk, decompress it, concatenate the new day, recompress, and write the whole thing back. Every daily update touches the full edge chunk. Over a year, you’ve rewritten the same chunk 365 times.

Rectilinear chunks change the shape of this problem. New data can be appended as small chunks of size 1 (or whatever the update cadence dictates), leaving the existing time-series chunks untouched:

Coarse 365-day archive chunks alongside a tail of size-1 daily appends

import zarr
zarr.config.set({"array.rectilinear_chunks": True})

# Existing archive: 5 years of data in 365-day chunks
# (Plus the leap day handled as its own small chunk)
time_chunks = [365, 365, 365, 366, 365]

arr = zarr.create_array(
    store="era5_archive.zarr",
    shape=(sum(time_chunks), 180, 360),
    chunks=[time_chunks, [90, 90], [90, 90, 90, 90]],
    dtype="float32",
    zarr_format=3,
)

# Daily update: append one day as a new size-1 chunk
# No edge-chunk rewrite, no decompress/recompress cycle
arr.append(new_day_data, axis=0)

Over time, this produces a long tail of size-1 chunks at the leading edge. That’s fine for ingestion — each daily update is a single, cheap write — but eventually you’ll want to fold those small chunks back into the coarse time-series structure for read performance. That compaction step is reserved for future work (see below) and pairs naturally with Icechunk, which can perform the rewrite as a transactional operation without disrupting concurrent readers.

2. Virtualizing archives with irregular file boundaries

Tools like VirtualiZarr and kerchunk let you build a single Zarr-compatible view over a collection of legacy files (NetCDF, HDF5, GRIB) without rewriting the underlying bytes. Until now, this has come with a sharp constraint: every file in the collection had to contribute the same chunk shape along the concatenation axis. A monthly climate model archive — where January contributes 31 days, February 28 or 29, and so on — couldn’t be virtualized into a single array without padding, splitting, or rewriting.

Rectilinear chunks remove this constraint. Each source file maps to a chunk of its natural size, and the resulting virtual array faithfully reflects the on-disk layout:

import zarr
zarr.config.set({"array.rectilinear_chunks": True})

# Virtualizing 12 monthly files from a 2024 climate model run
# Each file becomes one chunk along the time axis
chunks_2024 = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

arr = zarr.create_array(
    store="model_run.zarr",
    shape=(sum(chunks_2024), 180, 360),
    chunks=[chunks_2024, [90, 90], [90, 90, 90, 90]],
    dtype="float32",
    zarr_format=3,
)

This is a meaningful unlock for any archive composed of files with non-uniform extents — daily forecast runs of varying length, satellite granules with variable swath widths, irregular reforecast cycles. The virtualization layer no longer has to lie about the data’s structure to fit a regular grid.

3. Chunking aligned with logical groupings

Many datasets carry a natural grouping that doesn’t map onto a regular grid. Genomic arrays partition by chromosome — chromosome 1 spans ~247 million bases, chromosome 22 only ~50 million. Hierarchical spatial schemes like HEALPix, S2, and H3 group cells by parent tile, with counts that vary across the surface — particularly for masked subsets like ocean-only products. HPC simulation outputs decompose into subdomains of differing extent. Catalogs split by instrument, campaign, or sky region.

In each case, the natural read-and-write unit is the group, not a fixed-size slab. A regular chunk grid forces a single size that fits no group exactly; rectilinear chunks let the chunk lengths along the partitioned axis come straight from the group sizes:

import numpy as np
import zarr

zarr.config.set({"array.rectilinear_chunks": True})

# `group_ids` carries the group label for each entry along the partitioned axis
# (chromosome, HEALPix parent tile, basin code, instrument, ...)
_, group_sizes = np.unique(group_ids, return_counts=True)

arr = zarr.create_array(
    store="grouped.zarr",
    shape=(group_ids.size, ...),
    chunks=[group_sizes.tolist(), ...],  # one chunk per group
    dtype="float32",
    zarr_format=3,
)

The Zarr-Python docs include a complete worked example using HEALPix parent-tile grouping on an ocean dataset, where the resulting chunk sizes span roughly 23 to 4096. The same construction applies wherever data is partitioned by an external hierarchy or category.

4. Rectilinear shard boundaries

Rectilinear grids can also define shard boundaries while keeping inner chunks regular. This is particularly useful for archives where shard size varies by region or time period — for example, a weather forecast archive where reforecast and operational segments have different forecast extents or ensemble dimensions — but you still want predictable, fixed-size chunks for query.

import zarr
zarr.config.set({"array.rectilinear_chunks": True})

z = zarr.create_array(
    store=zarr.storage.MemoryStore(),
    shape=(120, 100),
    chunks=(10, 10),                    # Regular inner chunks
    shards=[[60, 40, 20], [50, 50]],    # Rectilinear outer shards
    dtype="int32",
)

Each shard dimension must be divisible by the corresponding inner chunk size. Note that rectilinear inner chunks within a shard are not supported — only the outer shard boundaries can be rectilinear.
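The divisibility constraint can be verified before creating the array. As an illustration of the rule (not a Zarr API):

```python
def check_shard_alignment(inner_chunks, shard_edges):
    """Each rectilinear shard edge must be a whole multiple of the inner chunk size."""
    for dim, (chunk, edges) in enumerate(zip(inner_chunks, shard_edges)):
        for edge in edges:
            if edge % chunk != 0:
                raise ValueError(
                    f"axis {dim}: shard edge {edge} is not divisible by chunk size {chunk}"
                )

# The example above: shards of 60/40/20 and 50/50 over 10x10 inner chunks
check_shard_alignment((10, 10), [[60, 40, 20], [50, 50]])  # valid
```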

Future work

This release is just the beginning for variable chunking. Future developments we are exploring include:

  • Richer resize semantics. More sophisticated controls for append and resize operations, allowing users to define exactly how the grid should grow rather than relying on the default single-chunk-append behavior.
  • Compaction and regularization. A mechanism to “clean up” a grid. After writing irregular data, you might want to run a compaction step to regularize the chunks for faster reading. Such a utility would pair exceptionally well with Icechunk, providing a transactional framework to evolve arrays safely.
  • Rectilinear inner chunks within shards. Lifting the current restriction so that both shard and chunk grids can be rectilinear independently.

Conclusions

Variable length chunking unlocks new storage strategies for irregular data, aligning the physical storage of bytes with the logical structure of the information they represent.

This feature is available now in Zarr-Python 3.2 (behind the array.rectilinear_chunks config flag) and Zarrs, and is expected to stabilize in Zarr-Python 3.3. We are actively looking for feedback, so please test it out on your own datasets and let us know what you think!

Credits

This feature was prototyped with Lachlan Deakin and Davis Bennett at the Zarr Summit in Rome in October 2025 and completed by Max Jones (Development Seed) in Zarr-Python#3802. We’d like to thank the Navigation Fund for supporting the event that made this collaboration possible.

An early version of the feature was proposed in ZEP 0003, and Martin Durant developed an initial prototype that helped pave the way for this implementation.
