Zarr#
Zarr is an open standard for storing large multidimensional arrays. It is designed for cloud storage and random access: the data is divided into chunks. Influenced by HDF5, arrays can be grouped into named hierarchies and annotated with key-value metadata stored alongside the array.
- Has multiple compression options and levels built in
- Supports multiple backend data stores (zip, S3, etc.)
- Can read and write data in parallel* in n-dimensional compressed chunks
Why use Zarr?#
CSV and plain-text formats cannot store arrays with more than two dimensions. .npy can, but it does not scale to larger-than-memory datasets or to situations in which you want to read and/or write in parallel.
Save NumPy Arrays with Zarr#
Instead, save the NumPy arrays with Zarr.
A very good tutorial: link
Basic Zarr Workflow#
Create a Zarr Array#
import zarr
import numpy as np
# Create a Zarr array
# Data is stored in 100x100 chunks, optimizing I/O performance.
z = zarr.create(shape=(1000, 1000),
                chunks=(100, 100),
                dtype='f8')
# Write data to the array
z[0:500, 0:500] = np.random.rand(500, 500)
print(z[0:10, 0:10])
[[0.16792828 0.05855492 0.61850888 0.00334115 0.55858836 0.21650163
0.53774298 0.42341329 0.43683333 0.8620628 ]
[0.77786639 0.28022484 0.20837323 0.7247819 0.49771604 0.92462718
0.51060836 0.04236613 0.98047625 0.00410626]
[0.38210065 0.36627372 0.00582958 0.72082145 0.24324471 0.40437936
0.13279096 0.00191599 0.4308012 0.94889893]
[0.31733476 0.0967293 0.49516206 0.90658001 0.99522668 0.83354
0.87027598 0.11772291 0.65141924 0.15564434]
[0.98229912 0.62499048 0.17393857 0.30351276 0.47716583 0.14302452
0.39584313 0.17151204 0.71980751 0.85271517]
[0.03644305 0.33025467 0.05553023 0.13553851 0.89252233 0.94528132
0.3219762 0.86822442 0.95530793 0.55626303]
[0.24725786 0.55204327 0.80377698 0.06480157 0.90251993 0.92491466
0.37617816 0.4783279 0.88646059 0.35445961]
[0.15765295 0.66788802 0.0081186 0.33973027 0.75316576 0.72855571
0.12683921 0.17154689 0.21010142 0.40546485]
[0.14055949 0.31744168 0.54813014 0.46988983 0.07017627 0.83640945
0.82357796 0.09766152 0.34136162 0.35717373]
[0.03706085 0.4210289 0.42998954 0.98348018 0.05693666 0.51772671
0.05859623 0.69127757 0.61628705 0.53710781]]
Store Zarr Data on Disk#
A quick way: link#
zarr.save('data/data_1.zarr', z)
A more efficient way: link#
# Store zarr array to disk by opening a new file in write mode
z_data = zarr.open('data/data_2.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100), dtype='f8')
z_data[0:500, 0:500] = np.random.random((500, 500))
Comparing both methods#
# Open the saved array
z_disk = zarr.open('data/data_1.zarr', mode='r')
print(z_disk.info)
Type : zarr.core.Array
Data type : float64
Shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 8000000 (7.6M)
No. bytes stored : 1786823 (1.7M)
Storage ratio : 4.5
Chunks initialized : 100/100
# Open the saved array
z_disk = zarr.open('data/data_2.zarr', mode='r')
print(z_disk.info)
Type : zarr.core.Array
Data type : float64
Shape : (1000, 1000)
Chunk shape : (100, 100)
Order : C
Read-only : True
Compressor : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type : zarr.storage.DirectoryStore
No. bytes : 8000000 (7.6M)
No. bytes stored : 1752949 (1.7M)
Storage ratio : 4.6
Chunks initialized : 25/100
Chunks initialized is 25/100 in persistence mode but 100/100 in convenience mode: `zarr.save` materializes every chunk, including the untouched zero regions, while the persistent array only writes the chunks that were actually assigned. For sparse or incremental writes, the persistence mode is therefore the better choice.
Zarr Groups#
Zarr supports hierarchical data storage (like HDF5), where datasets can be grouped.
# Create a group
group = zarr.open_group('data/group_data.zarr', mode='w')
# Add datasets to the group
group.create_dataset('dataset1', shape=z.shape, dtype='f8')
group.create_dataset('dataset2', shape=z.shape, dtype='f8')
# Access the datasets and save data
group['dataset1'][:] = z
group['dataset2'][:] = z
Zarr with MinIO#
We can use s3fs.S3Map to map Zarr directly to S3 storage. The data must be stored in a Zarr group. In the following there is only one dataset, but we can have multiple datasets in the same group.
https://zarr.readthedocs.io/en/stable/tutorial.html#distributed-cloud-storage
import s3fs
fs = s3fs.S3FileSystem(
    key='minioadmin',
    secret='minioadmin',
    client_kwargs={'endpoint_url': 'http://localhost:9000'}
)
# Define the S3 path for zarr to store the data
# Use s3fs as the Zarr store directly
store = s3fs.S3Map(root='my-bucket-2/data/zarr_data', s3=fs, check=False)
# Create a Zarr group
root = zarr.group(store=store)
# Create a Zarr dataset within the group
dataset_3 = root.create_dataset('example_array', shape=z.shape, chunks=(100, 100), dtype='f8')
dataset_3[:] = z
print(fs.ls('my-bucket-2/'))
print(fs.ls('s3://my-bucket-2/data/'))
['my-bucket-2/data', 'my-bucket-2/images']
['my-bucket-2/data/data.csv', 'my-bucket-2/data/zarr_data']
The directory tree of the MinIO data directory will look like this:
tree -L 4
.
├── mc
├── minio
├── my-bucket
└── my-bucket-2
├── data
│ ├── data.csv
│ │ └── xl.meta
│ └── zarr_data
│ └── example_array
└── images
└── minui_web_ui.png
└── xl.meta
8 directories, 4 files
A nice tutorial to learn more about zarr: link
Xarray#
Zarr does not inherently handle labels or metadata beyond basic array attributes. Thus, Zarr is often used with xarray for labeled multi-dimensional large datasets.
Save xarray Dataset to Zarr#
import xarray as xr
# create a xarray dataset
data = xr.Dataset(
    {
        "temperature": (("x", "y"), np.random.rand(100, 100)),
        "pressure": (("x", "y"), np.random.rand(100, 100)),
    },
    coords={
        "x": np.arange(100),
        "y": np.arange(100),
    },
)
# Save to zarr
data.to_zarr('data/xarray_data.zarr')
<xarray.backends.zarr.ZarrStore at 0x77821091a8c0>
Load Zarr Dataset into xarray#
ds = xr.open_zarr('data/xarray_data.zarr')
print(ds)
<xarray.Dataset> Size: 162kB
Dimensions: (x: 100, y: 100)
Coordinates:
* x (x) int64 800B 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
* y (y) int64 800B 0 1 2 3 4 5 6 7 8 ... 91 92 93 94 95 96 97 98 99
Data variables:
pressure (x, y) float64 80kB dask.array<chunksize=(100, 100), meta=np.ndarray>
temperature (x, y) float64 80kB dask.array<chunksize=(100, 100), meta=np.ndarray>
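The payoff of the labels is selection by coordinate values rather than positional indices. A short sketch on the same kind of dataset (re-created here so the snippet is self-contained):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temperature": (("x", "y"), np.random.rand(100, 100))},
    coords={"x": np.arange(100), "y": np.arange(100)},
)

# Label-based selection; with coordinate labels, slices include the stop value
point = ds["temperature"].sel(x=10, y=20)
subset = ds["temperature"].sel(x=slice(0, 9))
print(subset.shape)  # (10, 100)
```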