ESSI2.2 | High-performance computation with big data in the geosciences
EDI
Co-organized by HS13
Convener: Kor de Jong (ECS) | Co-conveners: Juniper Tyree (ECS), Clément Bouvier (ECS), Daniel Caviedes-Voullième, Arnau Folch, Corentin Carton de Wiart
Orals | Fri, 08 May, 16:15–18:00 (CEST) | Room -2.33
Posters on site | Attendance Mon, 04 May, 14:00–15:45 (CEST) | Display Mon, 04 May, 14:00–18:00 | Hall X4
Posters virtual | Wed, 06 May, 14:09–15:45 (CEST) | vPoster Discussion: vPoster spot 1b, Wed, 06 May, 16:15–18:00 (CEST)
Spatio-temporal Earth System Science (ESS) datasets are constantly growing in size, particularly those generated by high-resolution numerical models, due to increases in both extent and resolution. As a result, existing software for reading, storing, writing, and translating these datasets may no longer complete its work in a timely manner, while future investment in hardware is likely to remain constrained. This limits the potential of, for example, numerical simulation models and machine learning models. Yet these models and the larger datasets they produce are essential for advancing ESS, supporting critical activities such as climate change policymaking and weather forecasting in the face of increasingly frequent natural disasters.

In this session we bring together researchers working on novel software for processing and compressing large spatio-temporal datasets. By presenting their work to their colleagues, we aim to further strengthen the field of high-performance computation with big data in the geosciences.

We invite everyone who recognizes this problem and is working on ways to solve it to participate in this session. Possible topics include, but are not limited to:

- High-performance computing, parallel computing, distributed computing, cloud computing, asynchronous computing, accelerated computing, green computing
- Algorithms, libraries, frameworks
- Parallel I/O, data models, data formats, data cubes, HDF5, netCDF, Zarr, COG
- Data compression, including methods that provide guarantees for lossy compression
- Containerization, Docker, Kubernetes, Singularity, Apptainer
- Physically based modelling, physics informed machine learning, surrogate modelling
- Model coupling, model workflow management
- Large scale hydrology, remote sensing, climate modelling
- Lessons learned from case-studies

We recommend that authors highlight those (generic) aspects of their work that may be of particular interest to their colleagues.

Solicited authors:
Langwen Huang

To learn more about data compression and try out different compressors in practice, please also join the SC2.5 short course.

Orals: Fri, 8 May, 16:15–18:00 | Room -2.33

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairpersons: Daniel Caviedes-Voullième, Juniper Tyree
16:15–16:20
Part I: High Performance Computation
16:20–16:30
|
EGU26-19114
|
On-site presentation
Roc Salvador Andreazini, Xavier Yepes Arbós, Oriol Tintó Prims, Stella Paronuzzi Ticco, and Mario Acosta Cobos

The continuous increase in spatial and temporal resolution of Earth System Models (ESMs) is essential to better represent physical processes and extreme events. However, these advances come at a rapidly growing computational cost, pushing simulations towards unprecedented levels of parallelism on modern High Performance Computing (HPC) architectures. As a result, inefficiencies in load balance, communication, I/O, and memory usage increasingly limit scalability and scientific throughput.

Identifying and addressing parallel performance bottlenecks in large, multi-component climate models remains a complex and time-consuming task, often requiring specialized HPC expertise and manual profiling workflows. This represents a significant barrier for model developers aiming to efficiently exploit current and future exascale systems.

We present the Automatic Performance Profiling (APP) framework, an automated and extensible workflow designed to provide performance analysis of high-resolution ESMs. APP runs end-to-end profiling experiments and generates a comprehensive, multi-level performance report that combines high-level metrics (e.g., simulated years per day (SYPD) and scalability curves) with detailed insights into MPI communication patterns, cache behavior, and function profiling. This approach enables systematic identification of bottlenecks arising from extreme concurrency and fine spatial/temporal resolution demands.
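SYPD, one of the high-level metrics mentioned above, is simply simulated time divided by wallclock time, expressed in simulated years per wallclock day. A minimal sketch of the arithmetic (not part of the APP framework; the function name is ours):

```python
def sypd(simulated_years: float, wallclock_seconds: float) -> float:
    """Simulated years per wallclock day, a standard ESM throughput metric."""
    return simulated_years * 86400.0 / wallclock_seconds

# e.g. one simulated year completed in 4 hours of wallclock time
print(sypd(1.0, 4 * 3600.0))  # -> 6.0
```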

Integrated with the Autosubmit workflow manager, APP facilitates reproducible performance studies and comparisons across platforms, model configurations, and resolutions. Its modular design supports multiple climate models (NEMO and ECE4) and HPC systems (BSC’s MN5 and ECMWF’s HPC2020) and allows straightforward extension to new HPC platforms and models.

By lowering the barrier to parallel performance analysis, APP empowers the climate modelling community to improve scalability and resource efficiency, supporting the sustainable development of next-generation high-resolution ESMs.

How to cite: Salvador Andreazini, R., Yepes Arbós, X., Tintó Prims, O., Paronuzzi Ticco, S., and Acosta Cobos, M.: Enhancing Earth system models efficiency: Leveraging the Automatic Performance Profiling framework, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19114, https://doi.org/10.5194/egusphere-egu26-19114, 2026.

16:30–16:40
|
EGU26-16512
|
On-site presentation
Edwin Sutanudjaja, Saeb Faraji Gargari, and Oliver Schmitz

For environmental scientists such as hydrologists or ecologists, the performance of a model mostly refers to how well a simulation run mimics the modelled phenomenon, often evaluated by a broad range of measures comparing the simulated output to observed data. Increasing model performance is then an ongoing process of incorporating new environmental processes or refining the implementation of existing ones, possibly combined with using improved datasets at higher spatial and temporal resolutions. This, however, increases the computational burden of the simulations. Improving the computational performance of a model so that it runs efficiently on anything from stand-alone computers to HPC systems is typically not in the scope of an environmental scientist, yet a reduced runtime would benefit the entire modelling cycle.


The LUE (https://zenodo.org/records/16792016) environmental modelling framework is a software package for building HPC-ready simulation models. The Python bindings provide domain scientists with a large set of spatial operations for model building. All LUE operations are implemented in C++ using HPX (https://doi.org/10.5281/zenodo.598202), a library and runtime environment providing optimal asynchronous execution of interdependent tasks on both shared-memory and distributed computing systems. Models constructed with LUE can therefore run on HPC systems without further modification of the Python code and without explicit knowledge of programming HPC systems. In addition, the lue.pcraster Python sub-package provides an almost effortless transformation of existing PCRaster Python based models to LUE. In our presentation we showcase PCR-GLOBWB (https://doi.org/10.5194/gmd-11-2429-2018), a model simulating hydrology and water resources at a global scale, as an example of transforming an existing large scientific code base to LUE. We also demonstrate how efficiently the model now uses hardware ranging from one to thousands of CPUs, and is therefore prepared for global modelling studies at resolutions finer than 1 km.

How to cite: Sutanudjaja, E., Faraji Gargari, S., and Schmitz, O.: Good model performance?, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-16512, https://doi.org/10.5194/egusphere-egu26-16512, 2026.

16:40–16:50
|
EGU26-20436
|
On-site presentation
Jenny Wong, Vojtech Tuma, Harrison Cook, Corentin Carton de Wiart, Olivier Iffrig, James Hawkes, and Tiago Quintino

In-memory HPC workflows promise significant performance gains by reducing I/O, but achieving these gains requires precise scheduling of data-dependent task graphs on heterogeneous computing platforms. While existing Python frameworks such as Dask provide abstractions for parallel execution, they are not designed to fully exploit advanced topology-aware scheduling, natively support tightly coupled CPU-GPU task graphs in complex HPC environments, or utilise captured profiling information during scheduling. 

Earthkit-workflows is a Python library with a declarative API for constructing task graphs, and the capability to schedule and execute them on local or remote resources. It targets heterogeneous environments, enabling task-based parallelism across CPUs, GPUs, and distributed HPC or cloud systems. Expensive I/O operations and intermediate storage are minimised via shared memory and high-speed interconnects, allowing intermediate results to be exchanged efficiently during task-graph execution. Streaming outputs from tasks, such as stepwise forecasting, are given first-class support, allowing downstream tasks to start without delay. The library also offers an extensible graph-building interface with a plugin mechanism, allowing users to define custom operations, and it interoperates seamlessly with the wider earthkit ecosystem.
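To make the declarative task-graph idea concrete, here is a generic standard-library sketch (this is not the earthkit-workflows API; all names and the graph representation are hypothetical): tasks declare their dependencies, and each task runs as soon as its upstream results are available, with results shared in memory rather than via files.

```python
from concurrent.futures import ThreadPoolExecutor

# A task graph: node name -> (function, list of upstream node names).
graph = {
    "load_a": (lambda: 2, []),
    "load_b": (lambda: 3, []),
    "combine": (lambda a, b: a + b, ["load_a", "load_b"]),
    "scale": (lambda c: 10 * c, ["combine"]),
}

def execute(graph):
    """Run each task once its dependencies resolve, passing results in memory."""
    futures = {}
    with ThreadPoolExecutor() as pool:
        def submit(name):
            if name in futures:
                return futures[name]
            fn, deps = graph[name]
            dep_futs = [submit(d) for d in deps]   # ensure upstream tasks exist
            # The task waits for its inputs, then runs in the pool.
            futures[name] = pool.submit(lambda: fn(*[f.result() for f in dep_futs]))
            return futures[name]
        for name in graph:
            submit(name)
        return {name: fut.result() for name, fut in futures.items()}

print(execute(graph)["scale"])  # -> 50
```

A real scheduler would add topology awareness, GPU placement, and profiling-informed decisions, which this sketch deliberately omits.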

The task-graph construction and execution capabilities of earthkit-workflows are being applied in ECMWF’s next generation of data processing frameworks. Individual data processing functions are published as modular and reusable graphs, enriched with profiling measurements, and then combined together to form operational workflows. Two operational workflows which happen to have a subgraph in common, for example two subgraphs retrieving the same data as input, can be automatically merged for efficient resource utilisation. For operational robustness, checkpointing capability is also provided. 

Earthkit-workflows additionally serves as the core of Forecast-in-a-Box, ECMWF’s offering that combines data-driven weather forecasting models with meteorological product generation, in a manner portable to a personal workstation, a high-powered local device, or cloud computing, and aimed at non-technical users. GPU support is particularly critical, enabling efficient inference for data-driven weather forecasting models beyond HPC environments.

How to cite: Wong, J., Tuma, V., Cook, H., Carton de Wiart, C., Iffrig, O., Hawkes, J., and Quintino, T.: Minimising I/O, maximising throughput: earthkit-workflows, a task-graph engine for heterogeneous systems , EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20436, https://doi.org/10.5194/egusphere-egu26-20436, 2026.

16:50–17:00
|
EGU26-9822
|
ECS
|
On-site presentation
Siddhant Tibrewal and Nils-Arne Dreier

Kilometer-scale Earth System Model (ESM) simulations increasingly generate petabyte-scale datasets. The scientific return from such datasets remains constrained by their accessibility and heterogeneity, as well as the cost of their downstream analysis. Analysts often rely on ad-hoc workflows, and even analyses on reduced datasets require repeated access to high-resolution data, limiting scalability.

We present Hiopy (Hierarchical Output in Python), a tool for generating cloud-accessible, analysis-ready datasets directly from a km-scale ESM simulation using the ICON model, by computing hierarchical temporal and spatial aggregations in situ. Building on the work of Kölling et al. (2024, EGU), Hiopy produces multi-resolution, self-describing datasets that enable seamless access from coarse to native resolution using the Zarr format.

To mitigate the computational and communication overhead of in-situ aggregations, Hiopy uses YAC (Yet Another Coupler) to couple the model to the output component and configures the aggregates so that the model’s domain decomposition and the preferred Zarr chunking are aligned, distributing the workload evenly across the output processes. As a result, communication overhead is reduced and efficient parallel computation is possible without penalising the simulation throughput. Additional optimisations reduce communication buffers, eliminate redundant duplication in metadata handling, allow streaming the data directly to its final location, and ease configuration for varying requirements.

Hiopy supports native ICON model grids, regular latitude–longitude grids, and the HEALPix grid, and has been validated by producing publicly accessible datasets from km-scale ESM simulations across multiple projects. This work demonstrates a practical tool in the software stack of high-resolution climate modelling.
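The core idea of hierarchical temporal aggregation, each level averaging a fixed number of steps of the level below so analysts can start coarse and drill down, can be sketched independently of Hiopy (a conceptual illustration, not Hiopy's code; function and variable names are ours):

```python
import numpy as np

def temporal_pyramid(data, factor=2, levels=3):
    """Hierarchical temporal means: level 0 is native resolution, each
    further level averages `factor` consecutive steps of the previous one."""
    pyramid = [data]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        n = (prev.shape[0] // factor) * factor   # drop incomplete trailing window
        coarse = prev[:n].reshape(-1, factor, *prev.shape[1:]).mean(axis=1)
        pyramid.append(coarse)
    return pyramid

hourly = np.arange(24.0).reshape(24, 1)    # 24 time steps, 1 grid cell
levels = temporal_pyramid(hourly, factor=2, levels=3)
print([lvl.shape[0] for lvl in levels])    # -> [24, 12, 6]
```

In the in-situ setting, the aggregations are computed while the model runs and written per Zarr chunk, rather than post hoc over a finished array as here.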

How to cite: Tibrewal, S. and Dreier, N.-A.: Coupling km-Scale Earth System Model to Hierarchical Output for Analysis-Ready Dataset, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9822, https://doi.org/10.5194/egusphere-egu26-9822, 2026.

17:00–17:05
Part II: Data Compression
17:05–17:10
17:10–17:20
|
EGU26-10844
|
ECS
|
solicited
|
On-site presentation
Langwen Huang, Luigi Fusco, Jan Zibell, Florian Scheidl, Michael Armand Sprenger, Sebastian Schemm, and Torsten Hoefler

As the resolution of weather and climate simulations increases, the amount of data produced is growing rapidly from hundreds of terabytes to tens of petabytes. The huge size becomes a limiting factor for broader adoption, and its fast growth rate will soon exhaust all available storage devices. To address these issues, we present EBCC (Error Bounded Climate-data Compressor). It follows a two-layer compression approach: a base compression layer using JPEG2000 to capture the bulk of the data with a high compression ratio, and a residual compression layer using wavelet transform and SPIHT (Set Partitioning In Hierarchical Trees) encoding to efficiently eliminate long-tail extreme errors. EBCC outperforms other methods in the benchmarks at relative error targets ranging from 0.1% to 10%. In the energy budget closure and Lagrangian trajectory benchmarks, it can achieve more than 100× compression while keeping errors within the natural variability derived from ERA5 uncertainty members. We implement EBCC as a standalone C library which is seamlessly integrated with NetCDF and Zarr pipelines.
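The two-layer principle, a lossy base layer plus sparse corrections that enforce the error bound on the long tail, can be illustrated with a simplified stand-in. Note that EBCC's actual layers are JPEG2000 and SPIHT-coded wavelet residuals; this sketch substitutes coarse uniform quantization for the base layer and exact patches for the residual layer, purely to show how the bound is guaranteed:

```python
import numpy as np

def compress(data, rel_err=0.01):
    """Layer 1: coarse quantization. Layer 2: sparse corrections for points
    whose residual exceeds the error bound (the 'long tail')."""
    bound = rel_err * np.abs(data).max()
    step = 8 * bound                        # deliberately coarse base layer
    base = np.round(data / step).astype(np.int32)
    residual = data - base * step
    bad = np.abs(residual) > bound          # long-tail errors to patch
    patch_idx = np.flatnonzero(bad)
    patch_val = residual.flat[patch_idx]
    return base, step, patch_idx, patch_val

def decompress(base, step, patch_idx, patch_val):
    out = base.astype(np.float64) * step
    out.flat[patch_idx] += patch_val
    return out

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
rec = decompress(*compress(data, rel_err=0.01))
# The reconstruction error never exceeds the requested bound:
assert np.abs(rec - data).max() <= 0.01 * np.abs(data).max()
```

Real codecs additionally entropy-code both layers; here the point is only that the residual layer converts a high-ratio lossy base into an error-bounded scheme.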

How to cite: Huang, L., Fusco, L., Zibell, J., Scheidl, F., Sprenger, M. A., Schemm, S., and Hoefler, T.: EBCC: an Error Bounded Climate-data Compressor, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-10844, https://doi.org/10.5194/egusphere-egu26-10844, 2026.

17:20–17:30
|
EGU26-20630
|
ECS
|
On-site presentation
Clara Hartmann, Rafael Ballester-Ripoll, Julian A. Croci, Jorge Gacitua Gutierrez, Juan Jose Ruiz, Paola Salio, Alexandra Diehl, and Renato Pajarola

High-resolution numerical weather and climate simulations increasingly produce very large data with high dimensionality. Such datasets usually span three spatial dimensions, time, multiple physical variables, and ensemble members, leading to six-dimensional (6D) hypervolume datasets. Being grid-based, these datasets can be interpreted as 6D data tensors. The storage, processing, visualization, and analysis of such large data pose significant computational and memory storage challenges. Tensor decomposition and approximation methods have proven to be an efficient tool for the compression and reconstruction of such large, high-dimensional scientific datasets. Built on rigorous mathematical principles, tensor decompositions exploit the multi-linear structure and redundancy inherent in scientific data, leading to effective compression of the datasets while providing visually accurate results.

In this work, we investigate the applicability of tensor decompositions for the compression and efficient representation of 6D weather simulation data. We focus on two of the state-of-the-art low-rank tensor formats, tensor-train (TT) and Tucker decompositions. These methods generalize the singular value decomposition (SVD) to higher-order tensors, enabling compression of spatial, temporal, and physical modes through rank reduction. Therefore, the large high-dimensional tensor is factorized into multiple smaller, rank-reduced tensors with lower dimensionality, reducing the size of the original data significantly while preserving essential features. Such a reduced representation is also called a tensor approximation (TA).
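The rank-reduction idea is easiest to see in the matrix case, which TT and Tucker generalize to higher orders. A sketch with NumPy's SVD on a synthetic smooth field (not the COSMO-1E data), including the compressed-domain mean discussed below:

```python
import numpy as np

rng = np.random.default_rng(1)
# Smooth, nearly low-rank field: a sum of outer products plus small noise.
x = np.linspace(0, 1, 200)
field = (np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * x))
         + 0.5 * np.outer(x, x)
         + 1e-3 * rng.normal(size=(200, 200)))

U, s, Vt = np.linalg.svd(field, full_matrices=False)
r = 2                                   # keep only the leading ranks
approx = U[:, :r] * s[:r] @ Vt[:r]      # rank-r reconstruction
rel_err = np.linalg.norm(field - approx) / np.linalg.norm(field)
ratio = field.size / (U[:, :r].size + r + Vt[:r].size)
print(f"compression ratio {ratio:.0f}:1, relative error {rel_err:.4f}")

# Linear statistics can be evaluated in the compressed domain: the column
# mean of U*s*Vt equals mean(U, axis=0) * s @ Vt, with no reconstruction.
mean_compressed = (U[:, :r].mean(axis=0) * s[:r]) @ Vt[:r]
assert np.allclose(mean_compressed, approx.mean(axis=0))
```

TT and Tucker apply the same truncation idea mode by mode, which is why the 6D factorizations reach far higher ratios than a single matrix SVD.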

We apply the tensor decompositions to a real-world weather simulation dataset from the Alpine region of Switzerland (COSMO-1E), organized along longitude, latitude, vertical level, time, physical variables (such as temperature), and an ensemble dimension with 11 members. We evaluate the performance of the compression in terms of storage reduction, relative reconstruction error, peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), computational cost, and visual comparison to the original data. Our results demonstrate significant compression ratios while preserving high visual accuracy. For example, a TT-based compression with a compression ratio of 1:900 yields a relative error of only 0.0005, reducing the 4 GB original dataset to 4.6 MB. Lower compression ratios lead to even higher accuracy.

Beyond efficient data compression, the linear structure of the tensor decompositions allows for efficient application of filters in the tensor domain. The computation of the mean, standard deviation or similar linear operations along user-defined dimensions can directly be performed on the decomposed tensors, without ever having to reconstruct the large 6D dataset. Furthermore, the structure of the tensors allows for efficient partial reconstruction and visualization of slices or subsets of the dataset without reconstructing the complete dataset.

Overall, this work highlights tensor decompositions as a powerful tool for managing the growing size and complexity of high-dimensional weather simulation data. Their linear structure, which allows for efficient filter application in the compressed domain, makes them especially suitable for scientific analysis of complex datasets. Their integration into geoscientific data pipelines offers a promising pathway towards scalable and accurate data compression and analysis in numerical weather prediction and climate science.

How to cite: Hartmann, C., Ballester-Ripoll, R., Croci, J. A., Gacitua Gutierrez, J., Ruiz, J. J., Salio, P., Diehl, A., and Pajarola, R.: Compression and Reconstruction of High-Dimensional Weather Simulation Data Using Tensor Decompositions, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20630, https://doi.org/10.5194/egusphere-egu26-20630, 2026.

17:30–17:40
|
EGU26-20880
|
ECS
|
On-site presentation
Julian A. Croci, Marc Rautenhaus, Clara Hartmann, Jorge Gacitua Gutierrez, Juan Jose Ruiz, Paola Salio, Alexandra Diehl, and Renato Pajarola

Major challenges with modern weather and climate simulations are the resources required to store, analyze, and visualize the generated data. This storage problem forces scientists to compromise on data dimensionality, for example by discarding physical variables or by reducing the number of stored time steps.

Tensor decomposition and approximation (TA) methods have recently seen a revival in the context of neural networks, where they reduce the number of network parameters. However, TA methods also exhibit interesting properties favorable for the lossy compression of volumetric data. For example, for turbulence volumes created by simulations, compression ratios higher than 300 can be reached while preserving high precision. This allows for more efficient storage of large multi-dimensional data grids. Furthermore, tensor decompositions allow for partial reconstruction as well as the application of linear functions in the compressed domain, making these representations especially suitable for a variety of downstream analysis tasks such as statistical analysis. However, one open question, as for all lossy compression techniques, is how the loss influences the quality of these tasks.

For the operationalization of TA methods, another challenge is their parametrization. Various decomposition techniques exist and selecting the most appropriate one is non-trivial. Further, the data likely needs to be divided into smaller pieces, e.g. chunks, to achieve the best results, i.e. high compression ratios with as little error as possible. The division of the data in this context can mean both omitting dimensions (and hence reducing the dimensionality of the tensor) and splitting the data within dimensions. Finally, different tensor decomposition methods allow for different setups, further widening the compression parameter space to explore.

In this work we present an experimental setup that evaluates compression performance both in terms of error metrics computed directly on the data and in terms of the impact of compression losses on downstream visualization tasks. We use an offline TA-based compression scheme in which the data is reconstructed, i.e. decompressed, before being saved again in a standard format, so that it can easily be fed into downstream visualization applications such as Met.3D. Using this setup, we discuss how numerical error metrics, such as the relative error or the RMSE, are not always representative of errors in the visualization of the data in downstream tasks, especially for variables derived from the data. Further, we present different strategies for partitioning the data into chunks and motivate the effectiveness of tensor decomposition methods in the domain of numerical weather forecast data.

How to cite: Croci, J. A., Rautenhaus, M., Hartmann, C., Gacitua Gutierrez, J., Ruiz, J. J., Salio, P., Diehl, A., and Pajarola, R.: Evaluating Tensor Decomposition and Approximation as Lossy Compression for Weather Data Visualization Tasks, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20880, https://doi.org/10.5194/egusphere-egu26-20880, 2026.

17:40–17:50
|
EGU26-15706
|
On-site presentation
Fenwick Cooper, Shruti Nath, Antje Weisheimer, and Tim Palmer

1000-member ensemble forecasts of rainfall are compressed from ~230 MB to ~400 KB using lossy histogram compression. This level of compression allows fast download, analysis, and responsive display on a website, even when using obsolete laptop computers or basic smartphones. The information lost to achieve this level of compression matters in all but the most specialist of applications only negligibly, and the algorithm scales to much higher ensemble sizes with negligible additional storage. The method is currently in operation every day with national meteorological centres in East Africa.

 

Physics-based weather models are routinely used to produce ensemble forecasts with up to 100 members. These ensembles are an advance on single deterministic forecasts in that they indicate uncertainty, with larger ensembles providing more accurate distributions of forecast variables. The downside of large ensembles is their storage, transmission, and processing cost. Furthermore, machine learning models are being used operationally to generate very large forecast ensembles. For example, rainfall forecasts by ICPAC and national meteorology centres in East Africa are now routinely produced with 1000 ensemble members. Analysis and transmission of these forecasts using traditional methods is completely impractical given currently available hardware. Compression is necessary and can be achieved by storing the ensemble as a series of histograms, sacrificing spatial correlation information.
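The histogram idea can be sketched in a few lines: per grid point, the ensemble axis is replaced by fixed-bin counts, so storage no longer scales with ensemble size, at the cost of discarding spatial correlation between members (synthetic data; the bin layout and sizes here are illustrative, not the operational configuration):

```python
import numpy as np

def to_histograms(ensemble, edges):
    """ensemble: (n_members, n_points) -> per-point bin counts (n_points, n_bins)."""
    n_points = ensemble.shape[1]
    counts = np.empty((n_points, len(edges) - 1), dtype=np.uint16)
    for p in range(n_points):
        counts[p], _ = np.histogram(ensemble[:, p], bins=edges)
    return counts

rng = np.random.default_rng(2)
rain = rng.gamma(shape=2.0, scale=3.0, size=(1000, 500))  # 1000 members, 500 cells
edges = np.linspace(0.0, 50.0, 33)                        # 32 fixed bins
hist = to_histograms(rain, edges)

raw_bytes = rain.astype(np.float32).nbytes     # 1000 * 500 * 4 bytes
hist_bytes = hist.nbytes                       # 500 * 32 * 2 bytes
print(f"{raw_bytes} B -> {hist_bytes} B")
```

Doubling the ensemble size leaves `hist_bytes` unchanged, which is the scaling property the abstract highlights.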

How to cite: Cooper, F., Nath, S., Weisheimer, A., and Palmer, T.: Histogram compression of large ensemble forecasts, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15706, https://doi.org/10.5194/egusphere-egu26-15706, 2026.

17:50–18:00

Posters on site: Mon, 4 May, 14:00–15:45 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Mon, 4 May, 14:00–18:00
I/O
X4.93
|
EGU26-5035
|
ECS
Junxian Chew and Kor de Jong

Forward simulation of geographical systems typically involves time-step iterations of reading, computing, and writing temporal states until the target end time. As the spatial fidelity of geographical data continues to be refined to achieve simulations with higher accuracy, so does the number of read, compute, and write operations within each time step. Simulations at continental or global scale can only be completed within a reasonable time if the data can be distributed over multiple supercomputer nodes, in conjunction with parallel execution of the operations within each time step.

The LUE framework is designed as a general software platform that enables scientists to define custom computational models and achieve scalable performance on large-scale computing environments. Parallel implementations of its compute operations have demonstrated good scaling behaviour [1,2]. This is achieved in LUE by distributing small subsets of the global geographical dataset to available CPU threads across multiple supercomputer nodes in an asynchronous manner, with each subset having its own set of compute operations to be executed. The asynchronicity of the workload queueing allows a large number of subsets to be processed in parallel and ensures full occupancy of all available compute resources.
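The partition-and-queue pattern described above can be caricatured in plain Python (LUE/HPX operate in C++ with far finer-grained, distributed tasks; this sketch only shows the idea of splitting a raster into subsets and queueing one asynchronous task per subset):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def local_op(tile):
    return np.sqrt(tile) + 1.0           # stand-in for a spatial operation

def partitioned_apply(raster, tile_rows=64):
    """Split into row tiles, queue one task per tile, reassemble the result."""
    tiles = [raster[i:i + tile_rows] for i in range(0, raster.shape[0], tile_rows)]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(local_op, t) for t in tiles]   # async task queue
        return np.vstack([f.result() for f in futures])

raster = np.arange(256.0 * 4).reshape(256, 4)
assert np.allclose(partitioned_apply(raster), np.sqrt(raster) + 1.0)
```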

This advancement, however, inadvertently highlighted the inefficiency of serial handling of read/write operations. File access operations such as read and write are known as input/output (I/O) operations. Just as scalable computation requires parallel algorithms, scalable I/O requires parallel I/O libraries to distribute the I/O workload over multiple I/O-specific compute nodes. However, combining parallel I/O with asynchronously spawned computations, while ensuring that the resulting file output is correct, is challenging.

The challenge originates from the complexity of ensuring that data in memory is synced to the file storage system while the storage system is being acted on by all participating CPU threads. Careless management of I/O often results in unintended overwriting of file content due to concurrent accesses. This highlights the added difficulty of parallelizing file access compared to in-memory operations such as computations. As such, much care is needed in the design and planning of file access and synchronisation patterns to achieve meaningful gains in parallel I/O performance within an asynchronous many-task execution.

In this work, we attempt to implement a parallel read/write access pattern that works well with the asynchronous parallel compute paradigm deployed within the LUE modelling framework. Integrating parallel I/O into an asynchronous execution brings the additional benefit of interleaved compute and I/O tasks: part of the I/O latency can be hidden by concurrent compute workloads, which is harder to realize in a synchronous parallel execution. Success of this work will enable scalable computation and parallel file access for geoscience simulation workloads carried out via the LUE framework, reducing the overall computational resource consumption of large-scale simulations.
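One standard way to make concurrent writers safe is to precompute disjoint byte ranges and use positioned writes, so no shared file cursor exists to be overwritten. A sketch of that pattern (our illustration, not LUE's implementation; `os.pwrite` is POSIX-only):

```python
import os
import tempfile
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def write_partition(fd, offset, array):
    os.pwrite(fd, array.tobytes(), offset)   # positioned write: no shared cursor

# Eight partitions, each destined for its own precomputed byte range.
parts = [np.full(100, i, dtype=np.float64) for i in range(8)]
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
fd = os.open(path, os.O_WRONLY | os.O_CREAT)
with ThreadPoolExecutor() as pool:          # exiting the block waits for all writes
    for i, part in enumerate(parts):
        pool.submit(write_partition, fd, i * part.nbytes, part)
os.close(fd)

out = np.fromfile(path, dtype=np.float64)
assert np.array_equal(out, np.concatenate(parts))
os.remove(path)
```

Real parallel I/O libraries (e.g. MPI-IO underneath HDF5) generalize this to multiple nodes and collective buffering, but the disjoint-range invariant is the same.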

References:
1. https://doi.org/10.1016/j.cageo.2022.105083
2. https://doi.org/10.1016/j.envsoft.2021.104998

How to cite: Chew, J. and de Jong, K.: Parallel file access: the missing piece in efficient large scale geosimulation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-5035, https://doi.org/10.5194/egusphere-egu26-5035, 2026.

X4.94
|
EGU26-6903
|
ECS
Carlos Villalta López, Leonardo Mingari, Alexandros-Panagiotis Poulidis, and Arnau Folch

High-resolution meteorological data is essential for accurate volcanic ash dispersion modelling, particularly in regions with complex topography. However, performing fully dynamical atmospheric simulations at very fine spatial resolution is computationally expensive and may limit their applicability in contexts where urgent computing is required, such as operational forecasting. Diagnostic downscaling methods offer a potential alternative by enhancing coarse-resolution meteorological fields at a lower computational cost, but their added value relative to full dynamical nesting remains to be further explored. In this work, we assess the effectiveness of diagnostic meteorological downscaling using an integrated simulation workflow based on the MetPrep tool coupled with the FALL3D ash dispersion model. This approach is applied to the case study of the 2021 Tajogaite eruption (La Palma), comparing meteorological data from three WRF-ARW dynamically nested domains with increasing spatial and temporal resolution (domains d01, d02 and d03) against diagnostic downscaling applied to the coarser WRF domains (d01+MetPrep and d02+MetPrep). All dispersion simulations are run using identical eruptive parameters in order to isolate the impact of the meteorological downscaling method. The simulated ash deposits are compared against field observations using point-to-point validation metrics and spatial characterisation based on isopach area fits. In addition, physically motivated wind metrics, including vertical wind shear and wind-topography coherence, are analysed to interpret the effects introduced by diagnostic downscaling on the flow. Preliminary results show that diagnostic downscaling can partially bridge the gap between coarse and high-resolution dynamical simulations, improving the representation of near-surface flow and ash deposition patterns at a fraction of the computational cost.
The study highlights both the potential and the limitations of diagnostic downscaling as an alternative to full dynamical nesting for volcanic ash dispersion applications.

Funded by the European Union. This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Spain, Italy, Iceland, Germany, Norway, France, Finland and Croatia under grant agreement No 101093038, ChEESE-2P, project PCI2022-134973-2 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

How to cite: Villalta López, C., Mingari, L., Poulidis, A.-P., and Folch, A.: Evaluating a meteorological downscaling method for volcanic ash dispersion and deposition modelling, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6903, https://doi.org/10.5194/egusphere-egu26-6903, 2026.

X4.95
|
EGU26-15196
Max Jones, Joe Hamman, Davis Bennett, Kyle Barron, and Justus Magin

As geoscientific datasets continue to grow in size and complexity, the Zarr community has developed a modern, open-source solution for storage and I/O of multi-dimensional arrays and metadata. Zarr offers a high-performance, highly scalable, cloud-native container for scientific data, which allows scientists to transcend the constraints of individual files and think in terms of coherent datasets. Zarr’s potential has led to widespread adoption across government, industry, and academia. In this presentation, we offer practical guidance for how to leverage the latest and greatest features in the Zarr ecosystem, including:

  • Sharding to reduce the number of files, benefiting HPC users in particular
  • Virtualization via VirtualiZarr and Icechunk to enable high-performance access to data spread across NetCDF4/HDF5, GRIB, or GeoTIFF files
  • Custom data types, compression schemes, and variable chunk grids
  • Client-side (i.e., in-browser) rendering of large multidimensional geospatial datasets

Through concrete examples and best practices, we demonstrate how the Zarr ecosystem enables researchers to work with multi-terabyte datasets as seamlessly as small files.
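The sharding feature listed above can be understood with a little index arithmetic. The sketch below is illustrative only (it is not the zarr-python API): it shows how a shard groups many small chunks into one storage object, which is what reduces the file count for HPC users.

```python
# Illustrative sketch (not the zarr API): with chunk shape (100, 100) and
# shard shape (1000, 1000), each shard holds 10 x 10 = 100 chunks, so a
# (10000, 10000) array needs 100 shard objects instead of 10000 chunk files.

def shard_key(chunk_index, chunks_per_shard):
    """Map an n-D chunk index to the shard that stores it, plus the
    chunk's local position inside that shard."""
    shard = tuple(c // s for c, s in zip(chunk_index, chunks_per_shard))
    within = tuple(c % s for c, s in zip(chunk_index, chunks_per_shard))
    return shard, within

# Chunk (23, 7) with 10 x 10 chunks per shard lives in shard (2, 0),
# at local position (3, 7) inside that shard:
print(shard_key((23, 7), (10, 10)))  # -> ((2, 0), (3, 7))
```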

How to cite: Jones, M., Hamman, J., Bennett, D., Barron, K., and Magin, J.: Zarr at scale: virtualization, sharding, and performance optimizations for Earth science data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15196, https://doi.org/10.5194/egusphere-egu26-15196, 2026.

Algorithms
X4.96
|
EGU26-9317
Daniel Caviedes-Voullième, Pablo Vallés, José Segovia-Burillo, Mario Morales-Hernández, Sergio Iserte, and Antonio Peña

Flood dynamics are transitions between low-flow stages, which result in small wet areas, and high-flow stages, which naturally result in large flooded areas. The response of the dynamics of a flood to the time-varying forcing (be it a hydrograph or precipitation) is precisely what flood models attempt to simulate. Therefore, it is a priori unknown.

The computational load of 2D shallow water simulators is strongly dependent on the number of flooded cells, and thus the flooded area. Consequently, the dynamics of the flooded area translates into time-varying computational demands: low-flow stages can be simulated with fewer resources, whereas peak-flow stages demand significantly higher computational capacity. Typically, modellers will choose a set of computational resources which suits the problem size and demands based on experience and preliminary tests. However, these static resource sets (used throughout the simulation) either slow down computations when they are too small for the high-flow stages, or make inefficient use of resources when they are too large for the low-flow stages. It follows that dynamic resource allocation, based on the computational demands, would be optimal.

In this contribution we present the integration of the SERGHEI-SWE hydrodynamic model with the Dynamic Management of Resources library (DMRlib) to enable malleability (i.e., the runtime adjustment of MPI process counts and computational resources) to improve computational efficiency in shallow-flow simulations. By coupling SERGHEI-SWE with DMRlib, we enable the solver to dynamically expand or shrink its resource set during execution, adapting to these changing computational needs based on minimal heuristics.

SERGHEI-SWE is a high-performance, exascale-ready, scalable shallow water solver supporting CPUs and GPUs. DMRlib extends it with lightweight runtime support for process-level malleability, coordinating with the MPI runtime and job scheduler to manage resource adaptations. Within SERGHEI-SWE, resource reconfiguration is fundamentally a generalization of dynamic domain decomposition, to allow both the size and number of subdomains to change during execution. As a proof-of-concept, we implement minimal heuristics to trigger malleability based on wet-cell fractions: as flooded areas increase, additional resources are requested; when they decrease, resources are released.
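The wet-cell-fraction trigger described above can be sketched in a few lines. This is a toy illustration with hypothetical thresholds and a hypothetical doubling policy, not the SERGHEI-SWE/DMRlib implementation:

```python
def target_ranks(wet_fraction, current_ranks, min_ranks=1, max_ranks=64,
                 grow_at=0.6, shrink_at=0.2):
    """Toy malleability trigger (thresholds are placeholders): request
    more MPI ranks when the flooded fraction of the domain is high,
    release ranks when it is low, otherwise keep the allocation."""
    if wet_fraction > grow_at and current_ranks < max_ranks:
        return current_ranks * 2                    # expand the resource set
    if wet_fraction < shrink_at and current_ranks > min_ranks:
        return max(min_ranks, current_ranks // 2)   # shrink it
    return current_ranks

print(target_ranks(0.75, 8))  # peak flow -> 16
print(target_ranks(0.05, 8))  # recession -> 4
```

In the real system the new rank count would be negotiated with the MPI runtime and job scheduler, and the domain re-decomposed accordingly.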

The malleable SERGHEI-SWE was evaluated using dam-break, river flood, and catchment runoff tests. Numerical accuracy was preserved, with negligible differences relative to static (non-malleable) runs. Dynamic resource management improved computational efficiency relative to minimal fixed-resource configurations. However, performance remained below the best-case static maximum-resource setup, and communication overheads limited gains in low-demand phases. Nonetheless, the proof-of-concept demonstrates both feasibility and potential at larger scales.

The approach is accurate, robust, and promising for improving resource utilization in large-scale hydrodynamic modeling. Future work will focus on refining reconfiguration heuristics, improving understanding of overheads, and combining malleability with dynamic load balancing to better exploit scalable HPC environments.

How to cite: Caviedes-Voullième, D., Vallés, P., Segovia-Burillo, J., Morales-Hernández, M., Iserte, S., and Peña, A.: Simulating flood dynamics on dynamic HPC resource sets, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9317, https://doi.org/10.5194/egusphere-egu26-9317, 2026.

X4.97
|
EGU26-10512
Max Dormann, Mudassar Razzaq, Claudia Finger, and Erik H. Saenger

Numerical simulations of elastic or acoustic wave propagation usually assume a stationary background medium. In many practical situations, however, such as marine exploration or the inspection of engineered structures such as pipelines, elastic waves propagate in bodies of moving fluid as well. Ambient flow fields introduce changes to the wave field such as a direction-dependent wave propagation velocity or phase shifts that can be observed in real-world measurements. To obtain simulations that more faithfully represent elastic wave propagation in coupled systems of stationary solids and moving fluids, and that are better suited for comparison with experimental, laboratory, and field data in the future, a formulation is introduced in which the elastic wave equation is extended with a material derivative. The resulting partial differential equation is solved using an augmented rotated-staggered finite-difference scheme that combines the spatial operators of the rotated-staggered grid with a conventional central-difference approximation. The performance of this new formulation is examined on the propagation of elastic wave fields in ambient steady uniform and steady laminar flow fields in combined fluid-solid models, and compared to reference simulations with no moving background medium. The analysis focuses on travel-time variations and phase shifts, demonstrating that the numerical results are consistent with analytical expectations for wave propagation in moving media.

How to cite: Dormann, M., Razzaq, M., Finger, C., and Saenger, E. H.: Finite-Difference modeling of elastic wave propagation in solid-moving fluid systems, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-10512, https://doi.org/10.5194/egusphere-egu26-10512, 2026.

X4.98
|
EGU26-10878
|
ECS
Jan Clemens, Lars Hoffmann, Rolf Müller, Felix Plöger, Marvin Henke, Nicole Thomas, Sabine Grießbach, and Catrin Meyer

Models for the calculation of Lagrangian particle dispersion in the atmosphere or the ocean are indispensable tools for understanding natural and anthropogenic processes. These processes range from volcanic ash clouds through cloud microphysics to the study of the ozone layer on climate scales. With exascale machines at our disposal, such calculations can now be performed at significantly higher resolutions, both in terms of the driving wind field and particle number density.

Massive-Parallel Trajectory Calculations (MPTRAC) is a library designed to enable Lagrangian particle dispersion analysis for atmospheric transport processes in the free troposphere and stratosphere. It was developed with contemporary high-performance computing (HPC) systems in mind, ensuring high scalability across GPU and CPU clusters through an MPI-OpenMP/ACC hybrid parallelization approach. Its data structures are tailored to the multi-layered cache systems of modern compute nodes. MPTRAC is routinely executed on the JUWELS-Booster supercomputer and is planned for deployment on the JUPITER exascale machine.

This contribution outlines ongoing developments in MPTRAC. A central aspect of the presented work is the implementation of domain decomposition, which partitions wind field data and associated tracer particles across distributed subdomains. This methodology promises to enhance computational efficiency and scalability, particularly in the context of large-scale atmospheric transport simulations. Furthermore, we detail the integration of MPTRAC with the ICON modeling framework through its community interface. This extension enables the direct application of particle-based transport methods within ICON, supporting high-resolution climate and weather simulations.
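The partitioning idea behind such a domain decomposition can be illustrated with a toy example. The actual MPTRAC scheme partitions wind-field data and particles jointly across subdomains; the equal-longitude-band geometry and the function below are hypothetical:

```python
def assign_subdomain(lon, n_subdomains):
    """Toy 1-D domain decomposition: assign a particle to one of
    n_subdomains equal longitude bands spanning [-180, 180)."""
    band_width = 360.0 / n_subdomains
    return int((lon + 180.0) // band_width) % n_subdomains

# Four bands of 90 degrees each:
print(assign_subdomain(-170.0, 4))  # -> 0
print(assign_subdomain(10.0, 4))    # -> 2
```

Each MPI rank would then advect only the particles in its band, exchanging particles with neighbours when they cross a band boundary.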

The described developments are conducted within the scope of the WarmWorld Project, which aims to enable high-resolution calculations using ICON.

MPTRAC is available under an open-source licence: https://github.com/slcs-jsc/mptrac

How to cite: Clemens, J., Hoffmann, L., Müller, R., Plöger, F., Henke, M., Thomas, N., Grießbach, S., and Meyer, C.: MPTRAC: Domain-decomposed Massively-Parallel Trajectory Calculations, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-10878, https://doi.org/10.5194/egusphere-egu26-10878, 2026.

X4.99
|
EGU26-16260
|
ECS
Siddik Barbhuiya and Vivek Gupta

The development of hyper-resolution land surface modelling poses significant computational challenges. Detailed water balance assessments, ensemble-based uncertainty quantification, and climate scenario exploration all require running physics-based models like VIC, Noah-MP, and CLM at continental scales with high spatial resolution, long temporal spans, and multiple parameter configurations. The computational cost becomes prohibitive. Machine learning surrogates have recently emerged as potential solutions; however, existing LSTM and CNN approaches have fundamental architectural problems. Sequential processing prevents parallel computation, limited receptive fields miss long-range dependencies, and most approaches only predict single variables, which restricts comprehensive hydrological analysis.

We present a shifted-window transformer framework that simultaneously predicts multiple land surface fluxes (runoff, evapotranspiration, and soil moisture) while maintaining computational efficiency at continental scales. The hierarchical attention mechanism captures both local temporal patterns through windowed self-attention and global temporal context through shifted-window operations. This eliminates recurrent bottlenecks. We adapt vision transformers for hydrological regression by tokenizing meteorological sequences temporally, using relative position biases to encode lag-dependent hydrological relationships, and designing multi-task regression heads that preserve both nonlinear interactions and direct physical drivers.
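The windowed-and-shifted partitioning described above can be sketched for a 1-D temporal sequence. This is an index-level illustration only (the authors' model operates on learned token embeddings, not raw indices):

```python
def window_partition(seq_len, window, shift=0):
    """Toy shifted-window partition of a temporal sequence: roll the
    time indices by `shift`, then split into non-overlapping windows.
    Alternating shift=0 and shift=window//2 across layers lets windowed
    self-attention connect neighbouring windows."""
    idx = [(i + shift) % seq_len for i in range(seq_len)]
    return [idx[i:i + window] for i in range(0, seq_len, window)]

print(window_partition(8, 4))           # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
print(window_partition(8, 4, shift=2))  # -> [[2, 3, 4, 5], [6, 7, 0, 1]]
```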

We demonstrate the approach by emulating the VIC model across India's 76,390 land grid cells at 6 km resolution, spanning diverse climate regimes. Training uses sparse spatial sampling with only a small fraction of available locations. This allows us to evaluate how well the surrogate generalizes VIC's process behaviors to unseen regions and parameter configurations. We test multiple variants, including autoregressive formulations that incorporate previous timestep outputs, and benchmark everything against LSTM baselines to isolate the contributions of the architecture.

How to cite: Barbhuiya, S. and Gupta, V.: Breaking Computational Bottlenecks in Land Surface Modelling with Shifted-Window Transformers, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-16260, https://doi.org/10.5194/egusphere-egu26-16260, 2026.

X4.100
|
EGU26-17230
|
ECS
Bomi Kim, Hyungon Ryu, Seungsoo Lee, Jun-Hak Lee, and Seong Jin Noh

High-resolution urban flood modelling is increasingly critical for disaster mitigation, but simulations remain computationally expensive, particularly when applying meter-scale grids over large spatial domains. Such computational constraints often restrict the practical use of high-resolution simulations in operational forecasting and scenario-based analyses. To address this challenge, this study investigates the use of multi-GPU acceleration to improve computational efficiency in large-scale urban flood simulations. We present a multi-GPU implementation of the H12 2D urban flood model based on an MPI–OpenACC framework. The H12 2D model is a physics-based two-dimensional urban flood model that supports CPU-based parallel execution and is extended here to GPU architectures. The proposed approach employs directive-based parallelization, allowing a single code base to be executed on both CPU and GPU systems without extensive code modification. Domain decomposition is managed using MPI, while computationally intensive kernels are offloaded to GPUs through OpenACC directives. This hybrid design ensures portability across heterogeneous high-performance computing environments and enables efficient use of multiple GPUs. We evaluate performance using spatial resolutions ranging from 1 to 20 m over two contrasting domains: an urban catchment in downtown Portland, Oregon (USA), and a downstream reach of the Han River basin (Republic of Korea). We discuss how computational performance varies with model resolution, domain size, and the distribution of computational workload across multiple GPUs, with a focus on scalability and parallel efficiency. The improved computational efficiency achieved in this study can support pseudo real-time urban flood prediction for early warning applications.
In addition, the proposed framework facilitates large-scale, high-resolution simulations that can be used to generate ground-truth datasets for the development and validation of physics-informed or data-driven flood prediction models.

How to cite: Kim, B., Ryu, H., Lee, S., Lee, J.-H., and Noh, S. J.: Multi-GPU acceleration of high-resolution and large-scale urban flood modelling using MPI–OpenACC, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17230, https://doi.org/10.5194/egusphere-egu26-17230, 2026.

Datastructures
X4.101
|
EGU26-10395
|
ECS
Leo Kotipalo, Urs Ganse, Yann Pfau-Kempf, Jonas Suni, and Minna Palmroth

Vlasiator is a global hybrid-Vlasov space plasma simulation, modeling the velocity distribution of ions in a large region of near-Earth space. Due to the high memory and computation demands of the kinetic method as well as the large physical scale, optimisations are required to make simulation feasible. This presentation explores optimisations used in the spatial and velocity grids.

We first consider the spatial dimension. For this, Vlasiator utilises cell-based octree adaptive mesh refinement (AMR). Essentially, each spatial cell may be split in all three spatial dimensions to create eight smaller children in order to improve simulation accuracy in relevant regions. This can be repeated if necessary, with runs typically using four levels of refinement. Refinement may be done statically at the start of the simulation, or dynamically based on the plasma parameters.

Vlasiator uses a combination of several parameters for dynamic runtime refinement. These include scaled gradients of macroscopic variables to detect steep changes, the ratio of the current density to perpendicular magnetic field for current sheets and reconnection, as well as pressure anisotropy and vorticity for foreshock refinement.

For the velocity grid we use a somewhat similar method of stretching. In order to simplify translation, the velocity grid is static and identical in each spatial cell. To eliminate splitting of acceleration pencils, the size of cells in each coordinate direction is a function of that coordinate. Thus if we consider a grid with higher resolution around some point, the grid appears stretched along the coordinate axes when moving away from that point. The main purpose of the stretched grid is to enable modeling of colder distributions requiring a higher resolution without increasing resolution for the entire velocity grid.
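The stretched velocity grid described above amounts to making cell size a function of the coordinate itself. The sketch below uses a hypothetical linear stretching law (Vlasiator's actual stretching function differs); it only illustrates the idea of finest resolution near one point, coarsening away from it:

```python
def cell_size(v, dv_min=0.5, growth=0.02):
    """Toy stretched-grid spacing: finest resolution dv_min at v = 0,
    growing linearly with distance from the high-resolution point.
    Parameters are placeholders, not Vlasiator's."""
    return dv_min * (1.0 + growth * abs(v))

# Resolution is finest near the cold core of the distribution:
print(cell_size(0.0))    # -> 0.5
print(cell_size(100.0))  # -> 1.5
```

Because the spacing in each coordinate direction depends only on that coordinate, the grid remains a tensor product and acceleration pencils are never split.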

Combining these optimisations enables simulation on modern supercomputers with scale and resolution which would be unfeasible without them. This is achieved by limiting resources expended on regions where they are less critical for simulation accuracy and the scientific focus of a given run, while allowing higher fidelity in more important regions. These methods are applicable to other kinetic simulations, as well as grid-based simulations in general.

How to cite: Kotipalo, L., Ganse, U., Pfau-Kempf, Y., Suni, J., and Palmroth, M.: Improvements to 6D Grid Optimisation in Vlasiator, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-10395, https://doi.org/10.5194/egusphere-egu26-10395, 2026.

Modelling
X4.102
|
EGU26-7902
Mario Acosta, Sergi Palomas, Sophie Valcke, Pierre-Antoine Bretonnière, and Paul Smith

Global climate models are among the most computationally demanding scientific applications, with rapidly increasing resolution and complexity driving unprecedented requirements in high-performance computing. While model intercomparison efforts have traditionally focused on scientific output and physical fidelity, the computational performance, energy consumption and carbon footprint of climate simulations are becoming critical factors for the sustainability of next-generation modelling activities.

Building on previous coordinated work done for CMIP6, this work extends the scope towards a global assessment framework applicable to all major climate models. We present a list of metrics applicable to climate simulations to systematically quantify model performance, energy cost and associated carbon footprint using standardised and reproducible metrics across supercomputing platforms. The proposed framework combines workload analysis, runtime monitoring and workflow-level instrumentation to enable consistent comparisons between modelling systems.
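As a concrete illustration, cost metrics in the spirit of the established CPMIP set, such as simulated years per wall-clock day (SYPD) and core-hours per simulated year (CHSY), can be derived from basic run statistics. The function and all numerical figures below (power per core, grid carbon intensity) are placeholders, not values from the framework itself:

```python
def cost_metrics(sim_years, wallclock_hours, cores, watts_per_core, gco2_per_kwh):
    """CPMIP-style cost metrics from run statistics: simulated years per
    wall-clock day (SYPD), core-hours per simulated year (CHSY), and an
    estimated carbon footprint in kg CO2 (inputs are illustrative)."""
    sypd = sim_years / (wallclock_hours / 24.0)
    chsy = cores * wallclock_hours / sim_years
    kwh = cores * watts_per_core * wallclock_hours / 1000.0
    kgco2 = kwh * gco2_per_kwh / 1000.0
    return sypd, chsy, kgco2

# A hypothetical 10-year simulation on 2048 cores over 48 hours:
sypd, chsy, kgco2 = cost_metrics(10.0, 48.0, 2048, 5.0, 300.0)
print(f"SYPD={sypd}  CHSY={chsy}  kgCO2={kgco2:.0f}")
```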

This effort is conducted in the context of the World Climate Research Programme ESMO Infrastructure Panel (WIP), where a dedicated task team is coordinating the systematic collection of performance, energy and carbon footprint metrics from modelling centres participating in CMIP7, in collaboration with initiatives such as ESiWACE, ENES-RISe, Destination Earth and FUTURA. The objective is to establish community-endorsed metrics and monitoring practices that can be integrated into operational model development and production workflows, for CMIP7 and beyond.

By treating computational efficiency and carbon footprint as first-class metrics in climate model evaluation, this work aims to support informed decisions on model design, resource allocation and optimisation strategies, contributing to a more efficient and sustainable future for global climate modelling.

How to cite: Acosta, M., Palomas, S., Valcke, S., Bretonnière, P.-A., and Smith, P.: Towards standardised metrics of performance, energy and carbon footprint for CMIP experiments, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7902, https://doi.org/10.5194/egusphere-egu26-7902, 2026.

X4.103
|
EGU26-17426
Stefan Verhoeven, Bart Schilperoort, Peter Kalverla, and Rolf Hut

Running numerical models you are unfamiliar with is not always straightforward. The models have different kinds of interfaces, different programming languages, and different names for the same concepts. To standardize this, the Basic Model Interface (Hutton, 2020) was developed by the Community Surface Dynamics Modeling System (CSDMS). With the Basic Model Interface (BMI), users are presented with a standard set of functions to query and control numerical models. This standard interface also allows users to couple models together, allowing for the creation of standard components that can be coupled to create a full model (Peckham, 2013).
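The standard call pattern that BMI prescribes can be illustrated with a toy model. This sketch implements only a minimal subset of the functions (the bucket model and its parameters are invented for illustration; real implementations follow the full CSDMS BMI specification, where getters operate on typed arrays):

```python
class LeakyBucket:
    """Toy model exposing a subset of BMI-style control and getter
    functions: initialize / update / finalize, get_current_time,
    get_value. Illustrative only, not a conformant BMI implementation."""

    def initialize(self, config=None):
        self.storage = 10.0   # mm of water in the bucket
        self.k = 0.1          # recession coefficient per time step
        self.time = 0.0

    def update(self):
        self.storage -= self.k * self.storage
        self.time += 1.0

    def get_current_time(self):
        return self.time

    def get_value(self, name):
        if name == "storage":
            return self.storage
        raise KeyError(name)

    def finalize(self):
        pass

m = LeakyBucket()
m.initialize()
m.update()
print(m.get_current_time(), m.get_value("storage"))  # -> 1.0 9.0
```

Because every model exposes this same set of functions, a caller (or a coupler) never needs to know what is inside `update()`.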

However, coupling these models or components, whether they are written in C, C++, Fortran or Python, requires them to all share the same interpreter or (Python) environment. This is not always possible or viable and can require compilation on the end-user's side. This also prevents containerization of models. 

For cross-language and cross-container communication we developed grpc4bmi in 2018, making it possible to use the BMI over an HTTP connection. However, while highly performant, gRPC is not supported in many languages. To this end, we developed the new RemoteBMI protocol. RemoteBMI communicates with models through the Basic Model Interface using a RESTful API, making it easier to support any language; only an HTTP server and JSON parser implementation are required.

With grpc4bmi and RemoteBMI it is possible to package a model or model component inside a software container (e.g., Docker) and communicate with these models over an HTTP connection. This makes models more interoperable and reproducible, as container images can easily be archived and used by other people. It also enables running models on different machines than your own, and then directly communicating with them or coupling them to other models. 

With these technologies, you can now, for example, host models that require specific and difficult-to-share input data and provide them to anyone interested as a web-based service. This model-as-a-service (MaaS) architecture could also make it easier for end-users to try out your model in the browser before committing to installing it locally if they are interested. 

Currently, the grpc4bmi and RemoteBMI protocols are used by the eWaterCycle platform (Hut, 2022), allowing hydrologists and students easy access to containerized hydrological models through a common interface, accelerating both research and teaching. 

 --- 

Hutton, E.W.H., Piper, M.D., and Tucker, G.E., 2020. The Basic Model Interface 2.0: A standard interface for coupling numerical models in the geosciences. Journal of Open Source Software, 5(51), 2317, https://doi.org/10.21105/joss.02317. 

Peckham, S.D., Hutton, E.W., and Norris, B., 2013. A component-based approach to integrated modeling in the geosciences: The design of CSDMS. Computers & Geosciences, 53, pp.3-12, http://dx.doi.org/10.1016/j.cageo.2012.04.002. 

Hut, R., et al. (2022). The eWaterCycle platform for open and FAIR hydrological collaboration. Geoscientific Model Development, 15(13), 5371–5390. https://doi.org/10.5194/gmd-15-5371-2022  

How to cite: Verhoeven, S., Schilperoort, B., Kalverla, P., and Hut, R.: Enabling numerical Models as a Service (MaaS), EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17426, https://doi.org/10.5194/egusphere-egu26-17426, 2026.

X4.104
|
EGU26-12783
Carles Tena, Marc Guevara Vilardell, Johanna Gehlen, Paula Camps Pla, Oscar Collado, Luca Rizza, and Laura Herrero

Air pollution is one of the most critical environmental threats, contributing to respiratory and cardiovascular diseases and millions of premature deaths worldwide. To support air quality assessment, forecasting and planning efforts, chemical transport models (CTMs) need to be fed with robust, temporally and spatially resolved emission input data. 

Official annual national emission inventories prepared by countries to fulfill mandatory reporting obligations provide robust and consistent data. However, for their use in CTMs, emission data needs to be spatially distributed over a grid, temporally broken down into hourly resolution and chemically mapped to the species defined in the CTM's chemical mechanism. Bridging the gap between official inventory data and model-ready CTM emission inputs requires a scalable, transparent, and reproducible system that can process raw inventories into gridded, hourly and chemically speciated CTM-compatible datasets.
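The temporal disaggregation step mentioned above typically applies normalised weight profiles to an annual total. The sketch below is a generic illustration with flat placeholder profiles, not the HERMES_Δ implementation (which uses activity- and region-specific profiles):

```python
def hourly_emissions(annual_total, monthly_w, hourly_w, month, days_in_month):
    """Toy temporal disaggregation: break an annual emission total into
    hourly values for one day, using normalised monthly and hourly
    weight profiles (profiles and month lengths are illustrative)."""
    month_total = annual_total * monthly_w[month] / sum(monthly_w)
    day_total = month_total / days_in_month
    hour_sum = sum(hourly_w)
    return [day_total * w / hour_sum for w in hourly_w]

# Flat profiles reproduce a uniform split of the annual total:
flat = hourly_emissions(8760.0, [1.0] * 12, [1.0] * 24, month=0, days_in_month=30)
print(len(flat), round(sum(flat), 2))  # 24 hourly values summing to the daily total
```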

HERMES_Δ is an open-source emission model developed at the Barcelona Supercomputing Center (BSC) to address this challenge. Implemented in object-oriented Python and designed to run on High Performance Computing (HPC) infrastructures, it integrates temporal, spatial, vertical, and chemical disaggregation within a modular architecture. Configuration relies entirely on YAML or CSV files, allowing activity- and region-specific settings while maintaining traceability by preserving the connection between modeled emissions and their original reporting sources. Spatial disaggregation, which is the most computationally demanding step, is parallelized using MPI and optimized through domain decomposition. The produced output files are fully compatible with multiple state-of-the-art CTMs, including CMAQ, CHIMERE, MOCAGE, WRF-Chem and MONARCH.

To assess the performance of HERMES_Δ, multiple benchmark experiments were performed on the MareNostrum 5 and CIRRUS Spanish HPC facilities. All tests were performed considering a destination grid of 0.005° (~500 m) resolution covering Spain (peninsular Spain and the Balearic Islands), estimating hourly and speciated emissions for 24 time steps. Performance benchmarking, including time-to-solution and memory profiling, indicates good parallel scalability and resource efficiency. This enables the production of hourly gridded emissions for over 10 000 activity–region combinations, while maintaining reproducibility and strict Coordinated Universal Time (UTC) alignment.

In conclusion, HERMES_Δ provides a robust framework for processing official emission inventories to high spatial and temporal resolutions using geolocated activity proxies. By combining national emission inventories with efficient HPC methods, the system improves the representativeness of emissions in CTMs, strengthens collaboration between emission inventory compilers and air quality modellers, and enables more detailed and realistic simulations for policy development and operational forecasting.

HERMES_Δ is currently being implemented as the emission core of the official Spanish air quality forecasting system operated by the Spanish Meteorological Agency (AEMET).

How to cite: Tena, C., Guevara Vilardell, M., Gehlen, J., Camps Pla, P., Collado, O., Rizza, L., and Herrero, L.: HERMES_Delta: An open source, python-based, parallel software to process official emission inventories and support air quality modelling efforts in Spain, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12783, https://doi.org/10.5194/egusphere-egu26-12783, 2026.

X4.105
|
EGU26-14316
|
ECS
Abdulaziz Alabduljalil, Nada Alsulaiman, Yousef Alosairi, and Tahani Hussain

High-resolution coastal hydrodynamic models are increasingly used to support environmental assessment, crisis mitigation, and forecasting. Yet these models are often constrained by the available computing resources, forcing compromises in resolution and quality. High-Performance Computing (HPC) thus becomes essential to increase simulation speeds while maintaining high resolutions. In this study, we present a benchmarking of resource configurations for a coastal hydrodynamic model built with Delft3D Flexible Mesh (D-Flow FM), utilizing HPC resources while focusing on parallel performance, scalability, and efficiency. Benchmarking experiments were run comparing two MPI libraries, MPICH and Intel MPI, across multiple CPU core counts and partition combinations on both two-dimensional and three-dimensional model configurations, including barotropic and baroclinic setups. The results show how runtime performance varies with the hydrodynamic configuration, MPI implementation, and HPC parallel partition, and how HPC hardware can affect which combination is best. The goal is to provide guidance on finding optimal HPC configurations, including resource allocation and MPI library choice, when running coastal hydrodynamic models of high resolution and quality.
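The scalability and parallel-efficiency figures such a benchmark produces follow directly from the measured wall-clock times. A minimal sketch (the function name and the example timings are invented; the baseline is simply the smallest core count measured):

```python
def scaling_metrics(timings):
    """Compute parallel speedup and efficiency from wall-clock timings
    given as {core_count: runtime_seconds}. The smallest core count is
    taken as the baseline configuration."""
    base_cores = min(timings)
    base_time = timings[base_cores]
    out = {}
    for cores, t in sorted(timings.items()):
        speedup = base_time / t
        efficiency = speedup / (cores / base_cores)
        out[cores] = (round(speedup, 2), round(efficiency, 2))
    return out

# Hypothetical runtimes of one model configuration:
print(scaling_metrics({8: 1000.0, 16: 550.0, 32: 320.0}))
```

Comparing these curves across MPI libraries and partition layouts is what reveals the best combination for a given machine.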

How to cite: Alabduljalil, A., Alsulaiman, N., Alosairi, Y., and Hussain, T.: High-Performance Computing Benchmarking for Coastal Hydrodynamic Modelling Using Delft3D Flexible Mesh, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14316, https://doi.org/10.5194/egusphere-egu26-14316, 2026.

GIS / AI
X4.106
|
EGU26-14912
|
ECS
Gabriele Esposito, Roberta Ravanelli, and Mattia Crespi

This study is part of a broader effort to modernize the Italian gravimetric database and to support the computation of the new national geoid. Its primary aim is the integration of historical gravimetric measurements with modern observations, including data from ongoing airborne surveys, to establish a consistent framework for analyzing temporal variations of the gravity field across Italy. The current study addresses the initial phase of this effort, including the digitization of historical records, the transformation of legacy coordinates into the official Italian geodetic reference frame, a preliminary GIS-based visualization, and the design of a unified database for future spatial and temporal analyses.

Historical gravimetric records from major volumes edited by the former Italian Geodetic Commission (Ballarin, 1936; Cunietti & Inghilleri, 1955; Riccò, 1903; Solaini, 1939; Soler, 1930), covering the late 19th century to the 1960s, were digitized. Pages were scanned at high resolution, and image enhancement techniques, including noise reduction, contrast adjustment, and edge sharpening, were applied to improve legibility and data extraction.

Digitization employed AI-based optical character recognition (OCR) using DeepSeek OCR (Wei et al., 2025), supported by ChatGPT-4 and ChatGPT-5 (OpenAI, 2023, 2025) for table-structure interpretation. This workflow enabled accurate recognition of degraded or complex tables, merged cells, and inconsistent delimiters. Data were initially stored in editable Excel spreadsheets as an intermediate validation step to verify, correct, and standardize key parameters, including geographic coordinates, orthometric height, absolute gravity measurements, year of observation, and survey campaign information. Historical coordinates referring to old Italian datums (Roma1940, ED1950, or other local datums) were converted to WGS84 (EPSG:4326) to ensure compatibility with modern measurements. A key challenge stemmed from the heterogeneity of the legacy reference frames, which required accurate datum transformations for reliable integration with contemporary datasets.
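Datum conversions of this kind are commonly expressed as a seven-parameter Helmert transformation between geocentric Cartesian frames. The small-angle sketch below is generic: sign conventions vary between frameworks, and real Roma1940/ED1950-to-WGS84 parameters must come from official sources; all values here are placeholders.

```python
def helmert7(x, y, z, tx=0.0, ty=0.0, tz=0.0, rx=0.0, ry=0.0, rz=0.0, s=0.0):
    """Small-angle 7-parameter Helmert transformation: translations in
    metres, rotations in radians, scale in ppm. Illustrative only;
    operational conversions use officially published parameters/grids."""
    f = 1.0 + s * 1e-6
    xn = tx + f * (x - rz * y + ry * z)
    yn = ty + f * (rz * x + y - rx * z)
    zn = tz + f * (-ry * x + rx * y + z)
    return xn, yn, zn

# With all parameters zero the transformation is the identity:
print(helmert7(4641949.0, 1393045.0, 4133287.0))  # -> (4641949.0, 1393045.0, 4133287.0)
```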

Following digitization and coordinate conversion, historical data are being prepared for integration with modern gravimetric measurements from the national network and ongoing airborne surveys. Initial GIS-based visualization provides an early assessment of spatial coverage and potential inconsistencies. The unified database is designed to manage spatial variability and temporal evolution of gravity and is scalable to accommodate future datasets.

Once fully established, the dataset will undergo quality control and validation using statistical and geospatial methods. While temporal gravity modeling lies beyond the scope of this contribution, the proposed workflow lays a solid foundation for subsequent analyses.


References

Ballarin, S., 1936: Trentadue determinazioni di gravità relativa. Commissione geodetica italiana.

Cunietti, M., Inghilleri, G., 1955: Rete Gravimetrica Fondamentale Italiana. Commissione geodetica italiana.

OpenAI. 2023. GPT‑4 Technical Report: https://cdn.openai.com/papers/gpt-4.pdf.

OpenAI. 2025. GPT‑5 System Card (Technical Overview): https://cdn.openai.com/gpt-5-system-card.pd

Riccò, A., 1903: Determinazione della Gravità Relativa in 43 Luoghi della Sicilia Orientale delle Calabrie. Memorie della Società Degli Spettroscopisti Italiani.

Soler, E., 1930: Due Campagne Gravimetriche sul Carso. Università di Padova.

Solaini, L., 1939: Determinazione di gravità relativa eseguite a Castelnuovo Scrivia, Tortona, Alessandria, Valmadonna, S. Salvatore Monferrato e Sannazzaro De' Burgondi nell'anno 1939. Commissione geodetica italiana.

Wei, H., Sun, Y., Li, Y., 2025: DeepSeek-OCR: Contexts Optical Compression.



How to cite: Esposito, G., Ravanelli, R., and Crespi, M.: Integration of Historical and Modern Gravimetric Data to Model the Temporal Variation of the Gravity Field over Italy, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14912, https://doi.org/10.5194/egusphere-egu26-14912, 2026.

Data Compression
X4.107
|
EGU26-1872
|
ECS
Nicoletta Farabullini and Christos Kotsalos

As Earth System Sciences (ESS) datasets from high-resolution models reach petabyte scales, the scientific community encounters severe constraints in storage, transfer efficiency, and data accessibility. Identifying the right parameters for high compression ratios with strict scientific fidelity within the vast ecosystem of lossy and lossless compression algorithms is a complex and delicate technical challenge.

We present dc_toolkit (https://github.com/C2SM/data-compression): an open-source, parallelized pipeline designed to help researchers navigate this complex landscape. It provides a set of user-friendly, customizable command-line tools that let users make informed, data-driven decisions. By systematically evaluating over 40,000 combinations of compressors, filters, and serializers, it autonomously identifies the most suitable configuration for both structured and unstructured data with single or multiple variables.

The workflow comprises three stages: (1) Evaluation & Optimization: the toolkit leverages parallel processing (via Dask and mpi4py) to rapidly evaluate combinations while filtering out those that violate scientific precision requirements and user-defined error tolerances (L-norms). (2) Analysis & Visualization: to help scientists analyze the trade-offs between data reduction and information loss, the tool performs k-means clustering on the outputs to display clear and organized results. It also provides spatial error plotting to verify that domain-specific features (such as periodicity in global grids) are preserved. (3) Application & Interoperability: once the user has decided on a specific configuration, the toolkit handles the high-throughput compression of the dataset into Zarr-based storage. It ensures seamless integration into existing workflows by including utilities for tasks such as inspecting compressed files and converting compressed data back to standard NetCDF format.
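The evaluation stage can be pictured with a toy example: enumerate candidate lossy configurations, discard any that violate an L-infinity error tolerance, and keep the admissible one with the best compression ratio. The following is a minimal numpy/zlib sketch of that idea, not dc_toolkit's actual code; `bit_round`, `evaluate`, and the parameter names are illustrative.

```python
import zlib
import numpy as np

def bit_round(a: np.ndarray, keepbits: int) -> np.ndarray:
    """Keep only `keepbits` of float32's 23 mantissa bits (round-to-nearest).
    Assumes 1 <= keepbits <= 22."""
    bits = a.astype(np.float32).view(np.uint32)
    drop = 23 - keepbits
    half = np.uint32(1 << (drop - 1))                     # for rounding up
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)   # zeroes the dropped bits
    return ((bits + half) & mask).view(np.float32)

def evaluate(data, keepbits_options, linf_tol):
    """Score each candidate configuration, discard those violating the
    user-defined L-infinity error tolerance, and return the best survivor."""
    raw_size = data.astype(np.float32).nbytes
    admissible = []
    for keepbits in keepbits_options:
        approx = bit_round(data, keepbits)
        linf = float(np.max(np.abs(approx - data)))
        if linf > linf_tol:           # violates the precision requirement
            continue
        size = len(zlib.compress(approx.tobytes(), level=6))
        admissible.append({"keepbits": keepbits,
                           "ratio": raw_size / size,
                           "linf": linf})
    # most suitable admissible configuration = highest compression ratio
    return max(admissible, key=lambda r: r["ratio"])
```

In a real pipeline each candidate in the loop is independent, which is what makes the Dask/mpi4py parallelization of this stage natural.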

By providing a streamlined, automated, and verifiable method for selecting compression parameters, dc_toolkit lowers the entry barrier to lossy compression. It allows ESS researchers to apply data reduction strategies more easily, confident that the integrity of their downstream analyses remains intact. Accessibility is further enhanced through web-based tools and GUI implementations for users with diverse technical backgrounds.

How to cite: Farabullini, N. and Kotsalos, C.: dc_toolkit: A parallelized pipeline to navigate the complex ecosystem of compression algorithms, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1872, https://doi.org/10.5194/egusphere-egu26-1872, 2026.

X4.108
|
EGU26-9673
|
ECS
Juniper Tyree, Daniel Köhler, Robert Underwood, Clément Bouvier, Tim Reichelt, Heikki Järvinen, and Milan Klöwer

The volume of data produced by Earth System Science models, e.g. high-resolution weather and climate models, is increasing faster than the methods and budgets for storing, sharing, and analysing this data. To reduce data sizes, lossy data compression methods discard some quality, details, or precision of the original data. Even though some lossy compressors promise size reductions of 100x or more, the lack of trust in lossy compression, rooted in the fear of losing important information, has so far limited their adoption.

We introduce compression safeguards to help overcome this trust gap by

(i) enabling scientist users to precisely express their (general or specific) safety requirements for lossy compression, e.g. preserving specific values, regionally varying error bounds on the data or quantities derived from it, or any logical combination thereof,

(ii) securing any (existing) (lossy) compressor with the corresponding safeguards, which then

(iii) guarantee that the safety requirements are always met by the safeguarded compressor.

Compression safeguards thus provide a unified and flexible interface for specifying and guaranteeing user safety requirements that works with any existing compressor. They shift the burden of trust in fulfilling these requirements away from specific compressor implementations: with the appropriate safeguards, even untrusted, potentially unsafe compressors can be used safely. We hope that compression safeguards will give Earth System scientists the guarantees they need to adopt lossy compression without fear, thereby helping to unlock its data-reduction benefits for the Earth System Science community.
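The mechanism in (i)-(iii) can be illustrated with a toy safeguard that enforces an absolute pointwise error bound: after the untrusted compressor runs, every point that violates the bound receives an exact, sparsely stored correction, so the decompressed output is guaranteed to satisfy the requirement. This is a conceptual numpy sketch under our own simplifying assumptions, not the compression-safeguards implementation; the function names are illustrative.

```python
import numpy as np

def safeguard_compress(data, lossy_round, abs_bound):
    """Wrap an arbitrary (untrusted) lossy approximation `lossy_round` so the
    absolute pointwise error never exceeds `abs_bound`.
    Returns the lossy payload plus sparse corrections."""
    approx = lossy_round(data)
    bad = np.abs(approx - data) > abs_bound   # points violating the requirement
    idx = np.flatnonzero(bad)
    corrections = data.ravel()[idx]           # store exact values sparsely
    return approx, idx, corrections

def safeguard_decompress(approx, idx, corrections):
    """Apply the sparse corrections; the result always meets the bound."""
    out = approx.copy()
    out.ravel()[idx] = corrections
    return out
```

If the compressor already respects the bound almost everywhere, the corrections are sparse and the overhead on the compression ratio stays small, matching the behaviour reported below.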

We will showcase how our reference implementation, compression-safeguards (https://compression-safeguards.readthedocs.io/en/latest/), can be applied to safeguard important properties in several real-world meteorological examples; evaluate its impact on compression ratio (small when corrections are sparse) and on computational cost at compression (significant) and decompression (negligible) time; and discuss the future pathway towards safe and fearless lossy compression.

How to cite: Tyree, J., Köhler, D., Underwood, R., Bouvier, C., Reichelt, T., Järvinen, H., and Klöwer, M.: Compression Safeguards - Towards Safe and Fearless Lossy Compression of Earth System Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9673, https://doi.org/10.5194/egusphere-egu26-9673, 2026.

Posters virtual: Wed, 6 May, 14:00–18:00 | vPoster spot 1b

The posters scheduled for virtual presentation are given in a hybrid format for on-site presentation, followed by virtual discussions on Zoom. Attendees are asked to meet the authors during the scheduled presentation & discussion time for live video chats; onsite attendees are invited to visit the virtual poster sessions at the vPoster spots (equal to PICO spots). If authors uploaded their presentation files, these files are also linked from the abstracts below. The button to access the Zoom meeting appears just before the time block starts.
Discussion time: Wed, 6 May, 16:15–18:00
Display time: Wed, 6 May, 14:00–18:00
Chairperson: Andrea Barone

EGU26-6232 | Posters virtual | VPS22

Application of advanced lossy compression in the NetCDF ecosystem for CONUS404 data 

Shaomeng Li, Allison Baker, and Lulin Xue
Wed, 06 May, 14:09–14:12 (CEST)   vPoster spot 1b

Many geoscientific datasets, such as those produced by climate and weather models, are stored in the NetCDF file format. These datasets are typically very large and often strain institutional data storage resources. While lossy compression methods for scientific data have become more studied and adopted in recent years, most advanced lossy approaches do not work easily or transparently with NetCDF files. For example, they may require a file format conversion, or they may not work correctly with the “missing values” or “fill values” that are often present in model outputs. While lossy quantization approaches such as BitRound and Granular BitRound have built-in NetCDF support and are quite easy to use, they are generally not able to reduce the data size as much as more advanced compressors (for a fixed error metric), such as SPERR, ZFP, or SZ3.

We are particularly interested in reducing the size of the CONUS404 dataset. CONUS404 is a unique, publicly available high-resolution hydro-climate dataset produced by Weather Research and Forecasting (WRF) Model simulations covering the CONtiguous United States (CONUS) for 40 years at 4-km resolution (a collaboration between the NSF National Center for Atmospheric Research and the U.S. Geological Survey Water Mission Area).

Here, we investigate one advanced lossy compressor, SPERR [1], together with its plugin for NetCDF files, H5Z-SPERR [2], in a Python-based workflow to compress and analyze CONUS404 data. SPERR is attractive due to its support for quality control in terms of both maximum point-wise error (PWE) and peak signal-to-noise ratio (PSNR), enabling easy experimentation with storage-quality tradeoffs. Further, given a target quality metric, previous work has shown that SPERR likely produces the smallest compressed file size among advanced compressors. It leverages the HDF5 dynamic plugin mechanism to let users stay in the NetCDF ecosystem with minimal to no change to existing analysis workflows, wherever a typical NetCDF file can be read. And, importantly for our work, the SPERR plugin supports efficient masking of “missing values,” which are common in climate and weather model output. This support enables compression of many variables that are not naturally handled by other advanced compressors relying on HDF5 plugins. Further, because H5Z-SPERR directly handles missing values, they can be stored in a much more compact format (and are restored during decompression), further improving compression efficiency. (Note that the built-in NetCDF quantization approaches can also handle missing values.)
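For reference, the two quality metrics SPERR can bound, maximum point-wise error (PWE) and peak signal-to-noise ratio (PSNR), can be computed for an original/decompressed pair as follows. This is an illustrative numpy sketch (using the data range as the PSNR peak), not SPERR's internal code, and the function names are our own.

```python
import numpy as np

def max_pointwise_error(orig, recon):
    """Maximum absolute point-wise error (PWE), skipping NaN 'missing values'."""
    return float(np.nanmax(np.abs(recon - orig)))

def psnr_db(orig, recon):
    """Peak signal-to-noise ratio in dB; here 'peak' is the data range."""
    mse = float(np.nanmean((recon - orig) ** 2))
    peak = float(np.nanmax(orig) - np.nanmin(orig))
    return 10.0 * np.log10(peak ** 2 / mse)
```

Fixing a target on either metric and comparing compressed sizes is exactly the storage-quality experiment that SPERR's two quality-control modes make convenient.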

Our experiments demonstrate the benefit of enabling advanced lossy (de)compression in the NetCDF ecosystem: adoption friction is kept to a minimum with little change to workflows, while storage requirements are greatly reduced.

 

[1] https://github.com/NCAR/SPERR

[2] https://github.com/NCAR/H5Z-SPERR

How to cite: Li, S., Baker, A., and Xue, L.: Application of advanced lossy compression in the NetCDF ecosystem for CONUS404 data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6232, https://doi.org/10.5194/egusphere-egu26-6232, 2026.
