ESSI2.7 | Workflow approaches enabling scalable and reproducible state-of-the-art computation and data analysis in Earth System Sciences
EDI | Co-organized by BG9/GD6/GI3/SM9
Convener: Karsten Peters-von Gehlen | Co-conveners: Donatello Elia (ECS), Manuel Giménez de Castro Marciani (ECS), Ivonne Anders, Valeriu Predoi
Orals | Mon, 04 May, 16:15–18:00 (CEST) | Room 2.24
Posters on site | Attendance Mon, 04 May, 14:00–15:45 (CEST) | Display Mon, 04 May, 14:00–18:00 | Hall X4
Posters virtual | Wed, 06 May, 14:12–15:45 (CEST) | vPoster spot 1b | vPoster Discussion Wed, 06 May, 16:15–18:00 (CEST)
It has become more than evident that performing state-of-the-art Earth System Science (ESS), whether from a modeling or a pure data collection and analysis perspective, is increasingly complex and resource-intensive, and requires tools and methods to orchestrate, record and reproduce the technical and scientific process. To this end, workflows are the fundamental tool for scaling, recording, and reproducing both Earth System Model (ESM) simulations and large-volume data handling and analyses.

With the increase in the complexity of computational systems and data handling tasks, such as heterogeneous compute environments, federated access requirements, and sometimes even restrictive policies for data movement, there is a necessity to develop advanced orchestration capabilities to automate the execution of workflows. Moreover, the community is confronted with the challenge of enabling the reproducibility of these workflows to ensure the reproducibility of the scientific output in a FAIR (Findable, Accessible, Interoperable, and Reusable) manner. The aim is to improve data management practices in a data-intensive world.

This session will explore the latest advances in workflow management systems, concepts, and techniques linked to high-performance computing (HPC), data processing and analytics, the use of federated infrastructures and artificial intelligence (AI) application handling in ESS. We will discuss how workflows can manage otherwise unmanageable data volumes and complexities based on concrete use cases of major European and international initiatives pushing the boundaries of what is technically possible and contributing to research and development of workflow methods (such as Destination Earth, DT-GEO, EDITO and others).

On these topics, we invite contributions from researchers as well as data and computational experts presenting current scientific workflow approaches developed, offered and applied to enable and perform cutting-edge research in ESS.

Orals: Mon, 4 May, 16:15–18:00 | Room 2.24

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairpersons: Karsten Peters-von Gehlen, Valeriu Predoi, Donatello Elia
16:15–16:20 | Observations and NWP
16:20–16:40 | EGU26-21804 | solicited | Highlight | On-site presentation
Richard Hofmeister

The current era of Earth Observation (EO) is marked by an unprecedented increase in data volume and a growing number of satellite missions, driving a transition from dedicated processing infrastructure to cloud-native, distributed, and scalable orchestration. As Earth System Science, industry, and society increasingly rely on near-real-time EO data, efficient processing and workflow management have become critical components of modern ground segments. This presentation introduces an operational framework designed to meet the challenges of large-scale EO data processing. Examples from the Copernicus Sentinel programme and ESA’s Earth Explorer missions illustrate the framework’s scalable cloud deployment and operational performance. Common challenges - such as handling geospatial data formats, managing ground-segment anomalies, ensuring cybersecurity, providing standardized service interfaces, and leveraging public-cloud infrastructure - are addressed through a unified workflow approach. Operational experience from Copernicus payload data ground segment services, including monitoring via dashboards and control procedures, serves as a model for scientific missions and initiatives adopting these proven concepts. Scalability has emerged as a key feature, enabling efficient data transfers for the Copernicus Long-Term Archive, data access for Copernicus services, and higher-level processing workflows for scientific missions like BIOMASS. These orchestration strategies optimize resource use and energy efficiency for on-demand processing. The generic processing concepts demonstrated in the Copernicus and Earth Explorer programmes offer inspiration for new applications within the Earth System Science community, including hybrid approaches that integrate observations and simulation data.

How to cite: Hofmeister, R.: A unified framework for large-scale, operational data processing in Earth Observation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21804, https://doi.org/10.5194/egusphere-egu26-21804, 2026.

16:40–16:50 | EGU26-3821 | On-site presentation
Pratichhya Sharma, Hans Vanrompay, and Jeroen Dries

Earth Observation (EO) data plays a crucial role in research and applications related to environmental monitoring, enabling informed decision-making. However, the continuously increasing volume and diversity of EO data, distributed across multiple platforms and varying formats, pose challenges for easy access and the development of scalable and reproducible workflows.

openEO addresses these challenges by providing a community-driven, open standard for unified access to EO data and cloud-native processing capabilities. It enables researchers to develop interoperable, scalable and reproducible workflows that can be executed using various programming languages (Python, R or JavaScript).
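
As an illustration, here is a minimal sketch with the openEO Python client against the Copernicus Data Space Ecosystem backend; the collection and band names below are typical CDSE identifiers, used here only as an example.

    import openeo

    # Connect to a CDSE openEO backend and authenticate (browser-based OIDC).
    connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

    # Build a small NDVI workflow; nothing executes until download/execute.
    cube = connection.load_collection(
        "SENTINEL2_L2A",
        spatial_extent={"west": 5.00, "south": 51.20, "east": 5.10, "north": 51.30},
        temporal_extent=["2025-06-01", "2025-08-31"],
        bands=["B04", "B08"],
    )
    ndvi = cube.ndvi(nir="B08", red="B04")
    ndvi.reduce_dimension(dimension="t", reducer="mean").download("ndvi_mean.tif")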

openEO has become a cornerstone technology across major initiatives in agriculture, natural capital accounting, and land-cover monitoring. In ESA’s WorldCereal project, it provides the scalable framework needed to process global Sentinel-1 and Sentinel-2 time series and integrate advanced machine-learning models, enabling dynamic 10-meter cropland and crop-type maps. It also supports the Copernicus Global Land Cover service and its tropical forestry component by delivering consistent and repeatable processing chains for annual 10-meter land-cover products, which are crucial for policy reporting and SDG monitoring. Beyond land cover, openEO supports efforts like ESA's World Ecosystem Extent Dynamics project by creating reproducible ecosystem-extent mapping and change detection maps — key elements for biodiversity and environmental management.

Building on this foundation, the openEO Federation, now integrated within the Copernicus Data Space Ecosystem (CDSE), provides seamless access to distributed Earth observation data and processing resources through a single, unified interface. By connecting multiple backends, it removes the need to juggle separate accounts or APIs and enables cross-platform workflows over datasets hosted by platforms such as Terrascope and CDSE.

openEO also strongly supports FAIR (Findable, Accessible, Interoperable, Reusable) principles. It exposes rich metadata, relies on standardised processes, and encourages the use of reusable workflow definitions. This promotes transparency, reproducibility, and the sharing of algorithms and data across research and operational communities. The approach has been validated in several large-scale implementations, including ESA’s WorldCereal and the JRC’s Copernicus Global Land Cover and Tropical Forestry Mapping and Monitoring Service (LCFM), demonstrating its maturity for both research and production environments.

By enabling reusable, federated, and reproducible Earth observation workflows, openEO is helping to build a more interoperable and efficient computational ecosystem, one that supports scalable innovation, collaboration, and long-term operational monitoring. Therefore, in this session, we aim to spark discussion on how openEO enables federated, FAIR-compliant, and reproducible workflow approaches for large-scale Earth observation applications.

How to cite: Sharma, P., Vanrompay, H., and Dries, J.: Reproducible and Scalable cloud-native EO data analysis using openEO, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3821, https://doi.org/10.5194/egusphere-egu26-3821, 2026.

16:50–17:00 | EGU26-21909 | ECS | On-site presentation
Nina Burgdorfer, Christian Kanesan, Victoria Cherkas, Noemi Nellen, Carlos Osuna, Katrin Ehlert, and Oliver Fuhrer

Operational Numerical Weather Prediction (NWP) workflows are increasingly challenged by rapidly growing data volumes, expanding product diversity, and the need for timely and scalable access to model data. At the same time, modern Earth system services are evolving toward open data policies that require not only standardized access to model output for internal and external users, but also flexible mechanisms to extract and process relevant information in a FAIR (Findable, Accessible, Interoperable, and Reusable) manner. In this context, MeteoSwiss, in collaboration with the European Centre for Medium-Range Weather Forecasts (ECMWF), is developing a modernized data workflow to improve access to NWP model data for internal and external downstream users. 

The redesigned workflow shifts from a product-centric dissemination model toward a scalable data-as-a-service approach. Rather than relying on the generation and distribution of numerous predefined products, recent ICON forecast output is organized in the Field Database (FDB) and exposed through Polytope, which provides semantic data access and feature extraction capabilities. The workflow automates the ingestion, indexing, access control, and on-demand extraction of forecast fields, and integrates these steps into existing HPC-based production workflows and downstream processing pipelines. By replacing file-based product generation with database-backed access, the workflow enables deterministic data extraction, explicit provenance tracking, and consistent versioning of datasets, so that identical data requests can be reproduced reliably across time and environments. We present recent developments in Earthkit and Polytope that, for the first time, enable such automated workflows on the icosahedral grids used by ICON. Standardized interfaces and modern processing tools from the Earthkit Python ecosystem enable downstream users and applications to retrieve and process tailored subsets of NWP data on demand. 
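
A minimal sketch of this database-backed access pattern, assuming earthkit-data's polytope source; the collection name and request keys below are hypothetical placeholders for a MARS-like ICON request.

    import earthkit.data

    # Semantic, on-demand extraction: describe the fields, not the files.
    # Collection name and request keys are placeholders, not a real service.
    request = {
        "class": "od", "stream": "oper", "type": "fc",
        "date": "20260504", "time": "0000",
        "param": "t2m", "step": "0/to/24",
    }
    ds = earthkit.data.from_source("polytope", "meteoswiss-icon", request)
    print(ds.to_xarray())  # tailored subset, ready for downstream processing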

Our use of open-source, community-developed software (FDB, Polytope, Earthkit) as core workflow components illustrates how ECMWF technologies can be integrated into national weather service environments. Operational experience gained in this context contributes to improving the maturity and usability of these tools and supports their broader adoption by other ECMWF Member States, facilitating the transfer of FAIR, workflow-based data access concepts across the weather and climate community. 

How to cite: Burgdorfer, N., Kanesan, C., Cherkas, V., Nellen, N., Osuna, C., Ehlert, K., and Fuhrer, O.: Workflow Modernization for Open and Scalable Access to Operational NWP Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21909, https://doi.org/10.5194/egusphere-egu26-21909, 2026.

17:00–17:10 | EGU26-15002 | On-site presentation
Christopher Harrop and Isidora Jankov

The development of efficient, scalable, and interoperable workflow management systems is critical for supporting reproducible research to drive the scientific advancement of earth system modeling capabilities. Many workflow systems targeted for earth system science have been developed to meet that challenge, each having similar capabilities as well as some unique strengths. However, the earth system modeling community now faces additional challenges that impose new requirements. The landscapes of both high performance computing (HPC) environments and numerical modeling are evolving rapidly. HPC systems are composed of a growing diversity of hardware architectures that may be hosted on-prem or by a variety of cloud vendors. Earth system model components are also increasing in diversity as research to augment or replace traditional physics-based models with machine learning models progresses. Additionally, a growing diversity of end-users with varying levels of knowledge and expertise requires agentic workflows that can respond to their requests. A consequence of this rapid growth in diversity is a growing need to run workflows that span multiple systems in order to optimize data locality and access to resources that maximize performance of specific model components. The availability of, and requirement for, diversity naturally leads to a requirement for federated workflows that effectively harness the computational power of a diverse set of resources distributed both geographically and across multiple administrative domains. In this presentation, we introduce and report our progress with the development of Chiltepin, the first known federated numerical weather prediction workflow system within the National Oceanic and Atmospheric Administration (NOAA). Chiltepin is designed to address key challenges in numerical modeling, particularly those related to sustainable progress in a changing NWP landscape characterized by increasing diversity of technologies and use of high-performance computing resources distributed across both geographical and administrative boundaries.

How to cite: Harrop, C. and Jankov, I.: Toward Federated Agentic Workflows for Numerical Weather Prediction With Chiltepin, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15002, https://doi.org/10.5194/egusphere-egu26-15002, 2026.

17:10–17:15 | Digital Twins and km-scale ESMs
17:15–17:25 | EGU26-9928 | On-site presentation
Kameswarrao Modali, Karsten Peters-von Gehlen, Fabian Wachsmann, Florian Ziemen, Carsten Hinz, Rajveer Saini, and Siddhant Tibrewal

With the advancement of technical capabilities, Earth System Models (ESMs) are rapidly moving toward much higher spatial resolutions - down to kilometer scale - to better capture key processes and feedbacks needed for robust climate impact assessments. This growing model complexity places significant demands on data infrastructures, which must evolve to support widespread application of high-resolution simulations.

This evolution is needed across all stages of the ESM simulation data life cycle: the choice of variables to include in the simulation output, the output format, the residence period and transfer of data across the various active storage tiers, and the final movement to the cold storage tier (tapes) for long-term archival. Tools to handle the discoverability of these data must also be developed and implemented. The evolution of the infrastructure must furthermore take hardware constraints into account and should ideally be in line with the FAIR principles.

As part of the Warm World Easier project, these developments included adapting the model output to Zarr, a cloud-native format; developing bespoke tools such as ‘zarranalyzer’ to manage the movement of data across storage tiers by creating tarballs also suitable for tape; creating reference files for these tarballs in Parquet format to summarize the entire dataset; and ingesting these into a metadata catalog following the SpatioTemporal Asset Catalog (STAC) standard. Finally, a virtual machine was set up to host the STAC catalog with appropriate access rights for the data providers and data curators within the federated structure, as well as the end users.
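
A small sketch of the cataloguing step using pystac; the item id and asset href are invented for illustration and are not taken from the actual Warm World Easier catalog.

    from datetime import datetime
    import pystac

    # One STAC item describing an archived dataset; the Parquet reference
    # file is attached as an asset so users can locate and open the data
    # without touching the tarballs on tape. All names are hypothetical.
    item = pystac.Item(
        id="icon-esm-er_atm2d_2049",
        geometry={"type": "Polygon", "coordinates": [[[-180, -90], [180, -90],
                  [180, 90], [-180, 90], [-180, -90]]]},
        bbox=[-180, -90, 180, 90],
        datetime=datetime(2049, 1, 1),
        properties={},
    )
    item.add_asset("references", pystac.Asset(
        href="https://catalog.example.org/refs/icon-esm-er_atm2d_2049.parquet",
        media_type="application/x-parquet",
        roles=["index"],
    ))
    print(item.to_dict()["assets"])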

Applying this data handling concept to km-scale ESM data bridges the gap between infrastructures that produce flagship datasets and those that enable their efficient and reliable reuse by the community. For example, data generated at large, compute-focused HPC centers with limited storage could be transferred to partner centers that provide specialized data services for long-term access and reuse. 

Through the federated and seamless setup of the research data infrastructure, data handling matters are abstracted away from the data users. Hence, the developed setup provides an end-to-end solution, achieving the objective of providing km-scale ESM simulation output to a broader scientific community tackling the urgent societal problems arising from a warming planet.

How to cite: Modali, K., Peters-von Gehlen, K., Wachsmann, F., Ziemen, F., Hinz, C., Saini, R., and Tibrewal, S.: Research data infrastructure evolution for handling km scale simulations of a warming world, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9928, https://doi.org/10.5194/egusphere-egu26-9928, 2026.

17:25–17:35 | EGU26-12058 | ECS | On-site presentation
Pablo Goitia, Manuel Giménez de Castro Marciani, and Miguel Castrillo

Traditionally, climate simulations are executed on High-Performance Computing (HPC) platforms, organized in workflows that involve all the steps for the complete execution of the model, data processing, and management tasks. With the sustained increase in the computing capacity of these machines over the years, the accuracy and resolution of climate simulations have reached levels never seen before.

In this context, the European Commission launched the Destination Earth initiative, aimed at developing a digital twin of the Earth for adaptation to climate change. This initiative seeks to operationalize the running of very high-resolution climate simulations that are coupled with applications consuming their data as it is produced. To address the challenge of processing the hundreds of terabytes that a single simulation involves, the ClimateDT project implemented a data streaming approach. This means that any delay between the production time of the climate model data and its subsequent consumption by the post-processing applications results in a workflow misalignment, leading to unacceptable delays in the total execution time. This poses unprecedented challenges on the workflow management side.

One of the main causes of the misalignments that commonly occur lies in the long time that each of the many thousands of tasks of the workflow spends in the queues of the HPC job schedulers, such as Slurm. To address this issue, the community proposed to aggregate workflow tasks into a single submission to the HPC without altering their execution logic—a technique known as task aggregation. Previous studies have demonstrated the effectiveness of this approach for climate workflows, yielding promising results. However, the current implementation is limited, as the task execution within an allocation still relies on the workflow manager, which is not able to perform the fine-grained workflow orchestration that a dedicated tool could do in a convenient way.

To overcome this limitation, we propose in this work to integrate existing HPC software, such as the Flux Framework and Parsl, into the Autosubmit workflow manager to enable in situ orchestration of aggregated tasks. This integration aims to abstract both developers and users from the complexity of managing supercomputing resources, providing an easy-to-use interface. The proposed approach is validated using the Destination Earth workflow to enable more complex, structured forms of task aggregation while reducing queue times in large-scale simulations.
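
As a sketch of the in situ idea, the snippet below runs many small tasks inside a single allocation with Parsl; driving this from Autosubmit is the subject of the work itself, so the configuration shown is only a local stand-in.

    import parsl
    from parsl import python_app
    from parsl.config import Config
    from parsl.executors import HighThroughputExecutor

    # One pilot job (the aggregated allocation) executes many workflow
    # tasks internally, bypassing per-task scheduler queues.
    parsl.load(Config(executors=[HighThroughputExecutor(label="in_situ")]))

    @python_app
    def member_step(member: int) -> int:
        return member * member  # stand-in for one aggregated task

    futures = [member_step(m) for m in range(16)]
    print([f.result() for f in futures])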

How to cite: Goitia, P., Giménez de Castro Marciani, M., and Castrillo, M.: Optimizing the Destination Earth Workflow with in situ HPC Task Orchestration, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12058, https://doi.org/10.5194/egusphere-egu26-12058, 2026.

17:35–17:45 | EGU26-11128 | On-site presentation
Nicolas Choplain and Gaudissart Vincent

Antflow is a next-generation orchestration and publication framework designed to streamline the operational deployment of Earth Observation (EO) processing workflows, particularly within Digital Twin environments. By automating the transformation of scientific code into interoperable, shareable, and scalable services, Antflow removes the traditional barriers between algorithm development and production-grade execution.

At its core, Antflow enables scientists and developers to publish complex workflows directly from their Git repositories, using OGC Earth Observation Application Packages (EOAP) as the workflow definition mechanism. These EOAP descriptions allow Antflow to instantly expose workflows as OGC API Processes services, enriched with dynamic user interfaces and STAC-compliant cataloguing of outputs. This ensures that every workflow - no matter how experimental or mature - can be discovered, reused, and integrated across Digital Twin platforms.
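
For readers unfamiliar with the standard, here is a hedged sketch of the generic execute request an OGC API Processes service accepts; the endpoint, process id, and input names are hypothetical.

    import requests

    # Execute a published workflow via OGC API - Processes (async mode).
    # Endpoint, process id and input names are illustrative only.
    base = "https://antflow.example.org/ogcapi"
    body = {"inputs": {"aoi": [5.0, 45.0, 6.0, 46.0], "date": "2026-05-04"}}
    r = requests.post(f"{base}/processes/ndvi-composite/execution",
                      json=body, headers={"Prefer": "respond-async"}, timeout=60)
    r.raise_for_status()
    job_url = r.headers["Location"]  # poll this job resource for status/results
    print(job_url)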

Antflow’s hybrid orchestration engine distributes tasks across heterogeneous computing environments, from HPC clusters to cloud-native nodes. Git-based lineage guarantees traceability and scientific integrity, while integrated multi-provider retrieval mechanisms (EODAG) simplify access to EO data sources.

A key strength of Antflow is its ability to generate interactive user interfaces automatically. These interfaces allow domain experts, integrators, and end-users to parameterize, run, and monitor workflows through clean, intuitive views.

Antflow is currently used across several projects (CNES Digital Twin Factory, OGC Open Science Persistent Demonstrator). It acts as a middleware layer that bridges algorithm design, operational integration, and stakeholder consumption. By standardizing workflow publication, ensuring reproducibility, and supporting scalable execution, it accelerates the deployment of modelling chains such as 3D environmental reconstruction, forecasting, and multi-sensor analysis workflows.

How to cite: Choplain, N. and Vincent, G.: Antflow: Simplifying Workflow Sharing and Execution for Digital Twins, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11128, https://doi.org/10.5194/egusphere-egu26-11128, 2026.

17:45–17:55 | EGU26-17974 | On-site presentation
Stella Valentina Paronuzzi Ticco, Quentin Gaudel, Alain Arnaud, Jerome Gasperi, Mathis Bertin, and Victor Gaubin

The EDITO platform serves as the foundational framework for building the European Digital Twin of the Ocean. It seamlessly integrates oceanographic data and computational processes (non-interactive remote functions that take input and produce output) on a single platform that relies on both cloud and HPC (EuroHPC) resources. In this context, EDITO already provides many processes, such as OceanBench model evaluation and the ML-based GLONET 10-day forecast. To make scientists' work easier, we have developed a new way of generating processes on EDITO. We will use OceanBench evaluation as an example of a process that can be dispatched by the user on multiple targets, seamlessly handling the technical complexity of dealing with different hardware (cloud CPUs/GPUs, HPC, etc.). In our presentation we will explain how EDITO contributors will benefit from this new method of generating processes.   

How to cite: Paronuzzi Ticco, S. V., Gaudel, Q., Arnaud, A., Gasperi, J., Bertin, M., and Gaubin, V.: Multi-target process dispatch on the European Digital Twin of the Ocean, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17974, https://doi.org/10.5194/egusphere-egu26-17974, 2026.

17:55–18:00

Posters on site: Mon, 4 May, 14:00–15:45 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Mon, 4 May, 14:00–18:00
Chairpersons: Ivonne Anders, Karsten Peters-von Gehlen, Manuel Giménez de Castro Marciani
X4.109 | EGU26-5728 | ECS
Joppe Massant, Oscar Baez-Villanueva, Kwint Delbaere, Diego Fernandez Prieto, and Diego Miralles

The Global Land Evaporation Amsterdam Model (GLEAM) estimates daily land evaporation using a wide range of Earth observation forcing datasets. In the GLEAM-HR project, funded by the European Space Agency (ESA), we aim to create a global high-resolution daily evaporation dataset at 1 km for a period of eight years (2016–2023). To produce high-resolution evaporation estimates, all forcing data must be processed at 1 km resolution, requiring substantial computational resources. As the complete high-resolution forcing data no longer fits within the memory capacity of single HPC nodes, parallelization tools are necessary. To achieve this parallelization in a seamless way, a workflow orchestration ecosystem is designed that leverages Zarr, Apptainer and Nextflow.

The Zarr ecosystem allows for easily writing to a dataset in parallel. Nextflow is an orchestration tool that allows dynamic job submissions, where the configuration of jobs can depend on the outcome of earlier jobs, such as the spatial domain to be processed. Apptainer is a containerization tool developed for HPC environments, allowing a “build once, deploy anywhere” approach. Combining these tools allows building a workflow orchestration environment that enables the automation of these parallel workflows while optimizing the job sizes for a given HPC environment.
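
A minimal sketch of the parallel-write pattern Zarr enables; the array shape, chunking, and file name are illustrative only.

    import numpy as np
    import zarr

    # Allocate a chunked global array once (metadata only, no data yet).
    z = zarr.open("lai_1km.zarr", mode="w", shape=(2922, 18000, 36000),
                  chunks=(1, 3600, 3600), dtype="f4")

    def write_tile(day, y0, x0, tile):
        # Each parallel job writes one chunk-aligned window; no locking is
        # needed as long as concurrent writers never share a chunk.
        z[day, y0:y0 + tile.shape[0], x0:x0 + tile.shape[1]] = tile

    write_tile(0, 0, 0, np.zeros((3600, 3600), dtype="f4"))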

The use of containers allows this workflow to be ported to different hardware without the need to set up all the environments again, making the designed workflow fully reproducible independent of the computing environment. Combining this with Continuous Integration and Continuous Delivery (CI/CD) practices to automate the container building and deployment, code development and workflow execution can be cleanly separated.

In a first test case, this processing workflow is used to produce global datasets of LAI, FPAR and vegetation cover fractions at 1 km resolution. Future work focuses on extending this workflow to the other forcing datasets and on executing the entire pipeline.

How to cite: Massant, J., Baez-Villanueva, O., Delbaere, K., Fernandez Prieto, D., and Miralles, D.: Parallel HPC workflow orchestration with Nextflow, supported by CI/CD and containerization tools for global high resolution evaporation modelling, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-5728, https://doi.org/10.5194/egusphere-egu26-5728, 2026.

X4.110 | EGU26-11759 | ECS
Manuel Giménez de Castro Marciani, Mario Acosta, Gladys Utrera, Miguel Castrillo, and Mohamed Wahib

Modern experimentation with Earth System Models (ESMs) is accelerated by automated workflows that handle the multiple steps such as simulation execution, post-processing, and cleaning, all while being portable and tracking provenance. When executing on shared HPC platforms, however, users usually face long queue times, which increase the time to solution. The community has proposed aggregating workflow tasks into a single submission to save queue time, with promising results. But by doing this, the workflow manager has to deal with the remote task execution that would otherwise have been done by the HPC scheduler.

Therefore, we propose to integrate two workflow managers to create a versatile and general solution for the execution of these aggregated workflows: one that orchestrates the workflow globally and another that is in charge of running tasks within an allocation, which we refer to as "in situ."

In this work, we performed a qualitative and quantitative comparison of three suitable and representative workflow and workload managers running in situ, HyperQueue, Flux, and PyCOMPSs, on three of the top 20 HPCs: Lumi, MareNostrum 5, and Fugaku. We evaluated the portability and setup, failure tolerance, programmability, and provenance tracking of each of the tools in the qualitative part. In the quantitative part, we measured total runtime, task runtime, CPU and memory usage, disk write, and node imbalance of workflows running a memory-bound, a CPU-bound, and an IO-intensive application.

Our initial results yield recommendations to the community as to which workflow manager to use in situ. HyperQueue's easy installation and portability make it the best solution for non-x86 platforms. Flux had the easiest running setup due to its preparedness to run nested in Slurm. Finally, PyCOMPSs is the only tool of the three to provide provenance tracking with RO-Crates.

How to cite: Giménez de Castro Marciani, M., Acosta, M., Utrera, G., Castrillo, M., and Wahib, M.: Accelerating Earth System Workflows with In Situ Workflow Task Management, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11759, https://doi.org/10.5194/egusphere-egu26-11759, 2026.

X4.111 | EGU26-6238 | ECS
Robert Reinecke, Annemarie Bäthge, David Noack, Matthias Zink, Simon Mischel, and Stephan Dietrich

In situ and remote sensing data are crucial in earth sciences, as they provide complementary perspectives on environmental phenomena. In situ data, collected directly from the Earth’s surface, offer high accuracy and detailed insights into local conditions, enabling precise measurements of variables such as soil moisture, temperature, and pollutant levels. Conversely, remote sensing data provides extensive spatial coverage and the ability to monitor changes over time across vast areas, capturing large-scale patterns and trends that in situ data alone cannot reveal. By combining these two data sources and automatically preprocessing them into Analysis-Ready Data, researchers can enhance scientific insights, improve the robustness of machine learning applications, and refine models used to predict environmental changes or assess the impacts of human activity on natural systems. This integrated approach promotes a more comprehensive understanding of complex Earth processes, enabling better-informed decision-making and effective management strategies for sustainable development. However, preprocessing and combining in situ data from different sources can be highly complex, especially for global datasets. Joining this data with remotely sensed products may require substantial computational resources, given the increased number of observational records and high temporal resolutions. Here, we present a prototype of such a pipeline, CULTIVATE, an open-source data-processing pipeline that efficiently cleans in situ records and combines them with remote sensing data to create an automatically curated database. As new in situ data records are inserted, CULTIVATE updates only those records in the final database. In this presentation, we showcase CULTIVATE for over 200,000 global groundwater well observation time series that are merged with an extensive list of other time-series products, and we show how data curators can interact with the data processing pipeline. We further discuss how this prototype can serve as a blueprint for future architecture development for Research Data Infrastructures, how we can implement and enforce international standards, and how we can enable global datacenters to utilize automated data preparation in operational settings.
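
A toy sketch of the incremental-update idea described above (not CULTIVATE's actual code; the key columns are invented):

    import pandas as pd

    def upsert(db: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
        # Keyed on (well_id, timestamp); the last occurrence wins, so newly
        # inserted or corrected records replace their older counterparts
        # while all untouched records stay exactly as they were.
        key = ["well_id", "timestamp"]
        out = pd.concat([db, incoming]).drop_duplicates(subset=key, keep="last")
        return out.sort_values(key, ignore_index=True)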

How to cite: Reinecke, R., Bäthge, A., Noack, D., Zink, M., Mischel, S., and Dietrich, S.: A prototype Open-Source data-processing pipeline to efficiently combine in-situ data with remote-sensing observations of the Earth, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6238, https://doi.org/10.5194/egusphere-egu26-6238, 2026.

X4.112 | EGU26-21194 | ECS
Carmen Piñero-Megías, Laura Herrero, Artur Viñas, Johanna Gehlen, Luca Rizza, Ivan Lombardich, Oliver Legarreta, Òscar Collado, Paula Camps, Aina Gaya-Àvila, Marc Guevara, Paula Castesana, and Carles Tena

This work presents the sPanisH EmissioN mOnitoring systeM for grEeNhouse gAses (PHENOMENA), a Python-based, open-source, multiscale emission model that computes high-resolution (up to 1 km² and daily) and low-latency greenhouse gas (GHG) emissions for Spain. The system uses a bottom-up approach based on emission factors and activity data, and consists of four modules. First, the downloading module retrieves low-latency activity data from multiple sources, including APIs, open data repositories, websites, and private providers, with error handling and automatic retries to minimize manual intervention. Next, the preprocessing module standardizes the data and applies quality-control checks. The activity data is then combined with emission factors in the calculation module, which covers 11 emission sectors. Finally, the resulting emissions are post-processed to meet the requirements of an open web platform where the results are displayed.
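
The download-with-retries pattern of the downloading module might look like the following sketch; the function name and timings are illustrative, not PHENOMENA's actual code.

    import time
    import requests

    def fetch_activity_data(url, attempts=5, backoff_s=60):
        # Retry transient provider failures with a growing delay before
        # giving up, so most ingestion runs need no manual intervention.
        for i in range(attempts):
            try:
                resp = requests.get(url, timeout=60)
                resp.raise_for_status()
                return resp.content
            except requests.RequestException:
                if i == attempts - 1:
                    raise
                time.sleep(backoff_s * (i + 1))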

PHENOMENA is based on the object-oriented programming (OOP) paradigm and designed to run on High Performance Computing (HPC) infrastructures. While each of the emission sectors can run in parallel using MPI strategies, it is still not feasible to run all of them at the same time or download all the activity data at once, as different data providers have different temporal availability. Thanks to the modularity of the system, it can be split into different HPC jobs to handle the heterogeneous data frequencies, increase robustness through automatic retries, run different instances at the same time and automate monthly uploads to the web portal, using the Autosubmit workflow manager.

The resulting product is a web app which provides daily 1 km x 1 km gridded emission maps and emission totals aggregated per region and sector. The system's latency is determined by the availability of the activity data from external providers, ranging from daily updates to delays of up to four months.

PHENOMENA allows monitoring low-latency GHG emissions for Spain at high temporal and spatial resolution, providing information in an accessible way to support national to local policymakers. The system is scalable, robust against failures, and easily adaptable to new data providers, regions and emission sectors.

How to cite: Piñero-Megías, C., Herrero, L., Viñas, A., Gehlen, J., Rizza, L., Lombardich, I., Legarreta, O., Collado, Ò., Camps, P., Gaya-Àvila, A., Guevara, M., Castesana, P., and Tena, C.: PHENOMENA: a modular HPC model to facilitate automatic high-resolution greenhouse gas emission monitoring, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21194, https://doi.org/10.5194/egusphere-egu26-21194, 2026.

X4.113 | EGU26-7115 | ECS
Jan Linnenbrink, Jakub Nowosad, Marvin Ludwig, and Hanna Meyer

Spatio-temporal predictive modelling is a key method in the geosciences. Often, machine learning, which can capture complex, non-linear and interacting relationships, is preferred over classical (geo)statistical models. However, machine-learning models are often perceived as "black boxes", meaning that it is hard to understand their inner workings. Furthermore, several pitfalls are associated with the application of machine-learning models in general, and spatio-temporal machine-learning models in particular. One example is the spatial autocorrelation inherent in spatial data, which complicates data splitting for model validation.

Following from this, it is key to report spatio-temporal models transparently. Transparent reporting can facilitate interpreting, evaluating and reproducing spatio-temporal models, and can be used to determine their suitability for a specific research question. Standardized model protocols are particularly valuable in this context, as they document model parameters, decisions and assumptions. While such protocols exist for machine-learning models in general (e.g., Model Cards, REFORMS), as well as for specific domains like species distribution modelling (ODMAP), they are lacking in the general field of spatio-temporal modelling.

Here, we present ideas for STeMP (Spatio-Temporal Modelling Protocol), a protocol for spatio-temporal models that fills this gap. The protocol is designed to be beneficial for all parties involved in the modeling process, including model developers, maintainers, reviewers, and end-users. The protocol is implemented as a web application and is structured in three sections: Overview, Model and Prediction. The Overview section contains general metadata, while the following two sections go into more detail. The Model section includes modules describing, for example, the predictors, model validation procedures, and software. The optional Prediction section contains information about the prediction domain, map evaluation, and uncertainty assessment.

To make the protocol useful during model development, warnings are raised when common pitfalls are encountered (e.g., if an unsuitable cross-validation strategy is used). These warnings can be automatically retrieved from a filled protocol, spotlighting potential issues and helping authors and reviewers. Moreover, we optionally support generating automated reports and inspection figures from user-provided inputs (e.g., from model objects as well as from training and test data sets). The protocol is hosted on GitHub (https://github.com/LOEK-RS/STeMP) and hence open to flexible incorporation of feedback from the broader community.

With our presentation, we aim to encourage the discussion of our proposed model report in the spatio-temporal modelling community.

How to cite: Linnenbrink, J., Nowosad, J., Ludwig, M., and Meyer, H.: STeMP: Spatio-Temporal Modelling Protocol, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7115, https://doi.org/10.5194/egusphere-egu26-7115, 2026.

X4.114 | EGU26-14853 | ECS
Isidre Mas Magre, Hervé Petetin, Alessio Melli, James Petticrew, Michael Orieux, Miguel Hortelano, Luiggi Tenorio, and David Mathas

The integration of Machine Learning (ML) into Earth System Sciences has revolutionized predictive modeling. However, the transition from local prototyping to large-scale deployment is often hindered by fragmented codebases and the manual overhead of managing complex hyperparameter tuning on High-Performance Computing (HPC) clusters. We present AutoML, a framework developed to automate and standardize the ML lifecycle in HPC environments by leveraging the open-source Autosubmit workflow manager.

AutoML employs a configuration-driven architecture that decouples model logic from workflow execution. By utilizing Autosubmit’s proven capability to handle complex dependencies and remote HPC environments, AutoML allows researchers to scale experiments—from initial prototyping to production-level global pipelines—through a single configuration file. This approach directly addresses the challenge of experiment reproducibility and efficiency within ML projects. The framework automates critical steps in the typical ML workflow, including hyperparameter search space optimization, multi-node distributed training, and dynamic resource allocation on heterogeneous HPC architectures.
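
A sketch of what such configuration-driven experimentation can look like; the file layout and keys below are invented for illustration and are not AutoML's actual schema.

    import itertools
    import yaml

    # experiment.yml (hypothetical layout):
    #   model: gradient_boosting
    #   hyperparameters:
    #     learning_rate: [0.001, 0.0001]
    #     depth: [4, 8]
    cfg = yaml.safe_load(open("experiment.yml"))
    grid = cfg["hyperparameters"]
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        # In the real framework, each combination would become one
        # Autosubmit task submitted to the HPC.
        print(cfg["model"], params)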

We demonstrate the framework’s utility through Atmospheric Composition applications at the Barcelona Supercomputing Center (BSC). By providing a standardized structural template, AutoML fosters collaboration and ensures that advancements in machine learning for atmospheric science are scalable, computationally efficient, and transferable across research lines.

How to cite: Mas Magre, I., Petetin, H., Melli, A., Petticrew, J., Orieux, M., Hortelano, M., Tenorio, L., and Mathas, D.: AutoML: A Flexible and Scalable HPC Framework for Efficient Machine Learning in Atmospheric Modelling, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14853, https://doi.org/10.5194/egusphere-egu26-14853, 2026.

X4.115 | EGU26-17077
Maria Mirto, Marco De Carlo, Shahbaz Alvi, Shadi Danhash, Antonio Aloisio, and Paola Nassisi

Earth System Sciences (ESS) are increasingly characterized by large data volumes and high computational demands, which make complex analyses difficult to manage using ad hoc or manual solutions. This challenge is amplified when heterogeneous data sources, such as Internet of Things (IoT) infrastructures including wireless sensor networks, video cameras and drones, must be combined with high-performance computing (HPC) environments for climate modelling and advanced artificial intelligence (AI) algorithms.

The ARCA (Artificial Intelligence Platform to Prevent Climate Change and Natural Hazards) project, funded by the Interreg IPA ADRION Programme, was designed to respond to these challenges by providing a practical, workflow-based platform aimed at supporting climate change and natural hazard applications and, ultimately, reducing their impacts. The main objective of ARCA is to strengthen the cross-border operational capacity of stakeholders across the Adriatic–Ionian region, involving Italy, Croatia, Montenegro, Albania, Serbia and Greece. The platform supports the monitoring of forest ecosystems through AI-based tools, enabling continuous observation of forest areas and the prediction of multiple natural hazards, including droughts, wildfires and windstorms.

ARCA is built on a modular architecture centered on scientific workflows, which orchestrate the ingestion of multiple data types, processing, analysis and AI model execution in a consistent and reproducible manner. The platform integrates big data technologies, workflow management systems and AI components, allowing complex processing chains to be automated while ensuring full traceability of data provenance, computational steps and model configurations. This approach supports FAIR principles and promotes the reuse of data and workflows across different applications and computing environments.

A key strength of ARCA lies in its ability to shield users from much of the underlying technical complexity, such as heterogeneous computing resources, access constraints and large data volumes, while still enabling scalable AI-driven analyses. As a result, researchers and practitioners can focus on scientific and operational questions related to climate impacts and hazard prevention rather than on low-level technical orchestration. In this contribution, we present the overall ARCA architecture together with selected use cases, illustrating how workflow-based approaches can effectively support scalable, transparent and reproducible ESS research in a multinational and federated context like the Adriatic–Ionian region.

How to cite: Mirto, M., De Carlo, M., Alvi, S., Danhash, S., Aloisio, A., and Nassisi, P.: ARCA: A Scalable and Reproducible AI-Driven Workflow Platform for Climate Change and Natural Hazard Applications, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17077, https://doi.org/10.5194/egusphere-egu26-17077, 2026.

X4.116 | EGU26-20269
Peter Baumann, Dimitar Misev, Bang Pham Huu, and Vlad Merticariu

Datacubes are an acknowledged cornerstone for analysis-ready Big Earth Data as they allow more intuitive, powerful services than zillions of "scenes". By abstracting away technical pain points they offer two main advantages: users gain convenience, and servers can dynamically optimize, orchestrate, and distribute processing.
We propose a combination of datacube service enhancements which we consider critical for making data exploitation more powerful and more open to non-experts, summarized as "Federated AI-Cubes":

  • Location-transparent federation allows users and tools to perceive all datacube assets as a single dataspace, making distributed data fusion a commodity. Instrumental for this is automatic data homogenization performed at import and at query time, based on the open Coverage standards.
  • High-level datacube query languages, such as SQL/MDA and ISO/OGC WCPS, simplify analysis and open up data exploitation to non-programmers. Server-side optimization can automatically generate the individually best distributed workflow for every incoming query. At the same time, queries document workflows without low-level technical garbage, making them reproducible (see the sketch after this list).
  • The seamless integration of AI into datacube analytics plus AI-assisted query writing open up new opportunities for zero-coding exploitation. By not hardwiring a particular model, a platform for easy-to-use model sharing emerges. Model Fencing, a new research direction, aims at enabling the server to estimate the accuracy of ML model inference embedded in datacube queries.
  • Standards-based interoperability allows users to remain in the comfort zone of their well-known clients, from map browsing over QGIS and ArcGIS up to openEO, R, and python frontends.
  • Cloud/edge integration opens up opportunities for seamless federation of data centers with moving data sources, such as satellites, including flexible onboard processing.
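
As a flavour of the query-language bullet above, the sketch below submits a WCPS expression through the standard WCS processing binding; the endpoint and coverage name are hypothetical.

    import requests

    # A WCPS query is itself the reproducible workflow document:
    # it names the coverage, the subset, and the aggregation.
    query = """
    for $c in (S2_NDVI_CUBE)
    return encode(avg($c[ansi("2025-01-01":"2025-12-31")]), "json")
    """
    resp = requests.get("https://datacube.example.org/rasdaman/ows",
                        params={"service": "WCS", "version": "2.0.1",
                                "request": "ProcessCoverages", "query": query},
                        timeout=120)
    print(resp.text)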

In summary, these capabilities together have potential for empowering non-experts and making experts more productive, ultimately democratizing Big Earth Data exploitation and widening Open Science.
In our talk, we discuss these techniques based on their implementation in the rasdaman Array DBMS, the pioneer datacube engine, which is operational on multi-Petabyte global assets contributed by research centers in Europe, the USA, and Asia. We present challenges and results, supported by live demos, many of which are public. Additionally, as editor of the OGC and ISO coverage standards suite, we provide an update on recent progress and future developments.
This research is being co-funded by the European Commission through EFRE projects FAIRgeo and SkyFed.

How to cite: Baumann, P., Misev, D., Pham Huu, B., and Merticariu, V.: Federated AI-Cubes: Towards Democratizing Big Earth Datacube Analytics, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20269, https://doi.org/10.5194/egusphere-egu26-20269, 2026.

X4.117 | EGU26-19451 | ECS
Chathurika Wickramage, Fabian Wachsmann, Jürgen Kröger, Rohith Ghosh, and Matthias Aengenheyster

Kilometer-scale global climate simulations are now generating petabytes of output at such a rapid pace that data production is outstripping data standardization. Central ESM infrastructures have traditionally followed a “data warehouse” approach: extensive preprocessing, quality control, and formatting are performed before users receive self-describing, FAIR-aligned files. While this delivers highly standardized and interoperable products, it also creates a growing bottleneck, computationally and organizationally, so that routine actions like checking variables, extracting a region and time slice, or comparing experiments can become slow and hard to reproduce in practice. The EERIE project (https://eerie-project.eu/about/) is a clear example: its eddy-rich Earth System Models generate detailed and valuable output, but at a scale and pace that overwhelm traditional file-by-file workflows and delay usable access.

At DKRZ, we address this with an end-to-end workflow that transforms raw EERIE model output into analysis-ready datasets (ARD) that are easy to discover, subset, and analyze without requiring users to copy or download terabytes of files. The central element of this workflow is to create virtual Zarr datasets of the raw model output received from the modeling groups, by extracting chunk information and storing it in the kerchunk format with VirtualiZarr (https://virtualizarr.readthedocs.io/en/stable/index.html). These native-grid virtual datasets are published through both an intake catalog (https://github.com/eerie-project/intake_catalogues) and a STAC (SpatioTemporal Asset Catalog; https://discover.dkrz.de/external/stac2.cloud.dkrz.de/fastapi/collections/eerie?.language=en) interface, enabling users to examine variables, time periods, regions, etc., and retrieve only the subset they need while the bulk remains in place. Alongside the native model-grid resolution, the data is also provided on a common ¼ degree regular grid to facilitate inter-model comparison. Finally, we employ widely used standards and publish standardized products through established climate-data services (ESGF; https://esgf-metagrid.cloud.dkrz.de/search and WDCC; https://www.wdc-climate.de/ui/project?acronym=EERIE). We also aim to publish the processing scripts used throughout the pipeline, enabling others to build on the lessons learned from the EERIE approach.
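
A condensed sketch of the virtual-dataset step; the file names are placeholders, and the calls follow VirtualiZarr's documented API.

    import fsspec
    import xarray as xr
    from virtualizarr import open_virtual_dataset

    # Producer side: extract byte-range chunk references from one native-grid
    # file and store them as kerchunk references (no data is copied).
    vds = open_virtual_dataset("icon_atm_2d_2050.nc")
    vds.virtualize.to_kerchunk("icon_atm_2d_2050.json", format="json")

    # Consumer side: open the references lazily, read only the needed chunks.
    fs = fsspec.filesystem("reference", fo="icon_atm_2d_2050.json")
    ds = xr.open_zarr(fs.get_mapper(""), consolidated=False)
    print(ds)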

How to cite: Wickramage, C., Wachsmann, F., Kröger, J., Ghosh, R., and Aengenheyster, M.: Making Kilometer-Scale Earth System Model (ESM) simulations usable: A workflow approach from European Eddy RIch ESMs (EERIE) project., EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19451, https://doi.org/10.5194/egusphere-egu26-19451, 2026.

X4.118 | EGU26-18841 | ECS
Donatello Elia, Gabriele Tramonte, Cosimo Palazzo, Valentina Scardigno, and Paola Nassisi

The amount of data produced by Earth System Models (ESMs) is continuously growing, driven by their higher resolution and complexity. Approaches for efficient data access, management, and analysis are thus needed now more than ever to tackle the challenges related to these large volumes. Moreover, data generated by ESM simulations is often organized in a way that is not the most effective for data analytics, slowing down scientists’ productivity. In this context, novel data formats and proper chunking strategies can significantly speed up access and processing of Earth system data and, in turn, the whole analysis workflow.

In the scope of ESiWACE3 - the Centre of Excellence in Simulation of Weather and Climate in Europe - we investigated the impact of different data formats and chunking configurations on high-performance data analytics operations and workflows. In particular, we evaluated the performance of the well-known NetCDF format and the more recent cloud-native Zarr format, which is increasingly used in Earth Science data analytics workflows and machine learning applications. Results show that the use of a proper data format and structure can noticeably reduce the time required for executing these analytics workflows, provided the structure (e.g., chunking) is carefully tuned.
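
The kind of restructuring evaluated here can be expressed in a few lines with xarray; variable names and chunk sizes are illustrative only.

    import xarray as xr

    # Rewrite a NetCDF source as Zarr with analysis-oriented chunks.
    ds = xr.open_dataset("tas_day_1950-2014.nc", chunks={"time": 365})
    # The chunk layout is the tuning knob: long time chunks favour
    # time-series extraction, large spatial chunks favour map operations.
    ds.chunk({"time": 3650, "lat": 90, "lon": 90}).to_zarr("tas_day.zarr", mode="w")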

The work presents the main outcomes of such evaluation and how we are exploiting this knowledge to enhance Earth system data management workflows. In particular, the results achieved have contributed to enabling a more efficient access, delivery and analysis of large-scale data in CMCC’s tools and services, which are involved in different initiatives, including the ICSC - National Centre on High Performance Computing, Big Data and Quantum Computing.

How to cite: Elia, D., Tramonte, G., Palazzo, C., Scardigno, V., and Nassisi, P.: Efficient large-scale data structuring to support Earth System Science analytics workflows, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18841, https://doi.org/10.5194/egusphere-egu26-18841, 2026.

X4.119 | EGU26-22151 | ECS
Franklyn Dunbar, Mike Gottlieb, Rachel Akie, and David Mencin

Earth System Science increasingly depends on scalable, reproducible computational workflows to manage complex data processing across heterogeneous environments and cloud infrastructure. In seafloor geodesy — a domain where high-resolution geodetic time series and acoustic ranging techniques are essential for understanding submarine tectonic and deformation processes — the need for robust, automated tooling is acute. We present Earthscope Seafloor Geodesy Tools, an open-source Python library developed by the Earthscope consortium that supports preprocessing and GNSS-A processing workflows for seafloor geodesy data collected via autonomous wave glider platforms.
Earthscope Seafloor Geodesy Tools provides modular utilities to translate, organize, validate, and prepare raw observational data for integration with GNSS-A positional solver inversion software (e.g., GARPOS), enabling reproducible data pipelines within research and operational contexts. By encapsulating domain-specific processing steps into composable components, Earthscope Seafloor Geodesy Tools enables workflow orchestration, large-scale data processing across environments (i.e., local vs. remote), and reproducibility of results.

How to cite: Dunbar, F., Gottlieb, M., Akie, R., and Mencin, D.: Earthscope Seafloor Geodesy Tools, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-22151, https://doi.org/10.5194/egusphere-egu26-22151, 2026.

Posters virtual: Wed, 6 May, 14:00–18:00 | vPoster spot 1b

The posters scheduled for virtual presentation are given in a hybrid format for on-site presentation, followed by virtual discussions on Zoom. Attendees are asked to meet the authors during the scheduled presentation & discussion time for live video chats; onsite attendees are invited to visit the virtual poster sessions at the vPoster spots (equal to PICO spots). If authors uploaded their presentation files, these files are also linked from the abstracts below. The button to access the Zoom meeting appears just before the time block starts.
Discussion time: Wed, 6 May, 16:15–18:00
Display time: Wed, 6 May, 14:00–18:00
Chairperson: Andrea Barone

EGU26-6022 | ECS | Posters virtual | VPS22

Geo2Gmsh: A Scalable Workflow for Automated Mesh Generation of Geological Models Using Gmsh 

Harold Buitrago, Juan Contreras, and Florian Neumann
Wed, 06 May, 14:12–14:15 (CEST) | vPoster spot 1b

Numerical modeling is a fundamental tool for understanding physically driven processes in geosciences. In multiparametric settings, the Finite Element Method is widely used because it can accommodate irregular geometries and complex boundary conditions. However, this advantage critically depends on the quality of the computational mesh, which must faithfully represent geological features such as faults, stratigraphic interfaces, and wells. In practice, mesh generation remains a major bottleneck, requiring specialized expertise and significant manual effort.

We present Geo2Gmsh, an automated, lightweight workflow built on Gmsh (Geuzaine & Remacle, 2009) that generates geological meshes directly from simple text-based descriptions of topological elements, including surfaces, lines, and points. These elements correspond to geologically meaningful features, allowing users to define faults, horizons, wells, and domain boundaries in a transparent, reproducible, and solver-independent way. The workflow is demonstrated using two contrasting case studies: (1) Ringvent, an active sill-driven hydrothermal system in the Guaymas Basin, and (2) the Eastern Llanos Basin, a foreland basin in eastern Colombia. To evaluate solver compatibility, we solved the heat equation in SfePy (https://sfepy.org/doc-devel/index.html) using the Eastern Llanos Basin model as the computational domain. Although the simulation is illustrative and not calibrated to observations, it confirms that meshes produced by Geo2Gmsh can be readily incorporated into numerical solvers.

By explicitly embedding wells, faults, and geological interfaces in the mesh, Geo2Gmsh enables boundary conditions to be applied directly to physically meaningful features and allows model outputs to be extracted along them, simplifying both model setup and post-processing. Meshes can be exported in standard formats (e.g., VTK, MSH, and Exodus via meshio), ensuring broad interoperability. Overall, Geo2Gmsh provides a lightweight, scalable, and reproducible workflow that dramatically lowers the technical barrier to geological mesh generation. This contribution establishes a practical foundation for reproducible, open-source numerical modeling in geosciences, facilitating the integration of geological knowledge into high-fidelity computational simulations.
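
For orientation, here is a minimal example of the kind of Gmsh Python scripting that Geo2Gmsh automates; the geometry values are invented. A rectangular cross-section is meshed and exported to MSH.

    import gmsh

    # A toy 2-D domain: four corner points, boundary lines, one surface.
    gmsh.initialize()
    gmsh.model.add("section")
    corners = [(0, 0), (5000, 0), (5000, 2000), (0, 2000)]
    pts = [gmsh.model.geo.addPoint(x, y, 0, 100.0) for x, y in corners]
    lines = [gmsh.model.geo.addLine(pts[i], pts[(i + 1) % 4]) for i in range(4)]
    loop = gmsh.model.geo.addCurveLoop(lines)
    gmsh.model.geo.addPlaneSurface([loop])
    gmsh.model.geo.synchronize()
    gmsh.model.mesh.generate(2)
    gmsh.write("section.msh")  # other formats (e.g. VTK) via file extension
    gmsh.finalize()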

How to cite: Buitrago, H., Contreras, J., and Neumann, F.: Geo2Gmsh: A Scalable Workflow for Automated Mesh Generation of Geological Models Using Gmsh, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6022, https://doi.org/10.5194/egusphere-egu26-6022, 2026.
