Many geoscientific datasets, such as those produced by climate and weather models, are stored in the NetCDF file format. These datasets are typically very large and often strain institutional data storage resources. While lossy compression of scientific data has been increasingly studied and adopted in recent years, most advanced lossy approaches do not work easily or transparently with NetCDF files. For example, they may require a file format conversion, or they may not work correctly with the “missing values” or “fill values” that are often present in model outputs. Lossy quantization approaches such as BitRound and Granular BitRound are built into NetCDF and are quite easy to use, but for a fixed error metric they generally cannot reduce the data size as much as more advanced compressors like SPERR, ZFP, or SZ3.
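Quantization approaches like BitRound work by rounding away low-order mantissa bits so that the resulting bitstream compresses much better under NetCDF's lossless (deflate) stage. A minimal NumPy sketch of the idea, with illustrative rounding details rather than NetCDF's exact implementation:

```python
import numpy as np

def bitround(a, keepbits):
    """Round float32 values to `keepbits` explicit mantissa bits (1-22).

    The zeroed trailing bits make the array far more compressible by a
    lossless backend, at a bounded relative error of roughly
    2**-(keepbits + 1). Illustrative sketch, not NetCDF's exact code.
    """
    a = np.asarray(a, dtype=np.float32)
    bits = a.view(np.uint32)
    drop = 23 - keepbits                                # float32 has 23 mantissa bits
    half = np.uint32(1 << (drop - 1))                   # half of the dropped ulp
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF) # keep sign, exponent, top bits
    # round-to-nearest (half-up): add half the dropped ulp, then truncate
    return ((bits + half) & mask).view(np.float32)
```

For example, `bitround(x, 7)` keeps 7 mantissa bits, guaranteeing a relative error below about 0.4% while zeroing the 16 trailing bits of every value.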
We are particularly interested in reducing the size of the CONUS404 dataset. CONUS404 is a unique, publicly available, high-resolution hydro-climate dataset produced by Weather Research and Forecasting (WRF) Model simulations covering the CONtiguous United States (CONUS) for 40 years at 4-km resolution (a collaboration between the NSF National Center for Atmospheric Research and the U.S. Geological Survey Water Mission Area).
Here, we investigate one advanced lossy compressor, SPERR [1], together with its plugin for NetCDF files, H5Z-SPERR [2], in a Python-based workflow to compress and analyze CONUS404 data. SPERR is attractive because it supports quality control in terms of both maximum point-wise error (PWE) and peak signal-to-noise ratio (PSNR), enabling easy experimentation with storage–quality tradeoffs. Further, previous work has shown that, for a given target quality metric, SPERR typically produces the smallest compressed file size among advanced compressors. H5Z-SPERR leverages the HDF5 dynamic plugin mechanism, letting users stay in the NetCDF ecosystem with minimal to no change to existing analysis workflows: wherever a typical NetCDF file can be read, a SPERR-compressed one can be read too. And, importantly for our work, the plugin supports efficient masking of “missing values,” which are common in climate and weather model output. This support enables compression of many variables that other advanced compressors relying on HDF5 plugins do not naturally handle. Further, because H5Z-SPERR handles missing values directly, they can be stored in a much more compact form (and are restored during decompression), further improving compression efficiency. (Note that the built-in NetCDF quantization approaches also work with missing values.)
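Because H5Z-SPERR is loaded through HDF5's dynamic plugin mechanism, existing readers need no code changes; only the plugin search path must be set in the environment. A configuration sketch (the install path shown is a placeholder, not a standard location):

```shell
# Directory containing the compiled H5Z-SPERR shared library (path is illustrative).
export HDF5_PLUGIN_PATH=/opt/h5z-sperr/lib:${HDF5_PLUGIN_PATH}

# After this, HDF5-based NetCDF readers (e.g., ncdump, netCDF4-python, xarray)
# decompress SPERR-encoded variables transparently on read.
```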
Our experiments demonstrate the benefit of enabling advanced lossy (de)compression in the NetCDF ecosystem: adoption friction is kept to a minimum with little change to workflows, while storage requirements are greatly reduced.
[1] https://github.com/NCAR/SPERR
[2] https://github.com/NCAR/H5Z-SPERR