HS3.8 | Managing and Processing Heterogeneous and Imperfect Data: current practices and challenges in Hydrology and Geosciences
Co-organized by ESSI1/GI2
Convener: Nanee Chahinian | Co-conveners: Franco Alberto Cardillo, Batoul Haydar, Cécile Gracianne, Franca Debole
Orals | Thu, 07 May, 14:00–15:45 (CEST) | Room 2.31
Posters on site | Attendance Thu, 07 May, 10:45–12:30 (CEST) | Display Thu, 07 May, 08:30–12:30 | Hall A
Data imperfection is a persistent and multi-faceted challenge in hydrology and more broadly in geosciences. Researchers and practitioners regularly work with datasets that are incomplete, imprecise, erroneous, heterogeneous, or redundant—whether originating from in-situ measurements, remote sensing, modelling outputs, or participatory sources.
While traditional statistical methods have long been used to address these limitations, the growing complexity and diversity of hydrological and environmental data have created new demands—and opportunities—for innovation. Advances in artificial intelligence, data fusion, knowledge representation, and reasoning under uncertainty now allow for more robust integration and interpretation of heterogeneous information.
This session aims to gather contributions that explore how we can move from imperfect, fragmented data toward coherent and actionable hydrological and environmental knowledge. We welcome abstracts on:

• Applications and case studies in hydrology or other domains, addressing missing data imputation, model inversion, uncertainty propagation, or multi-source integration, using time series, spatial data, imagery, videos, etc. The case studies may focus on hydrological and natural hazards (floods, droughts, earthquakes, landslides, marine submersion, etc.) or resource management (water supply, treatment, etc.)
• Methodological developments in data fusion, completion, uncertainty quantification, and AI-based knowledge extraction from heterogeneous data.
• Cross-disciplinary approaches that connect geosciences, and specifically hydrological sciences, with AI, data mining, and knowledge systems, including citizen science, crowd-sourced data, or opportunistic sensing.
• Experimental contributions in hydrology and geosciences relying on AI, such as novel models and algorithms, explainable methods, and comparative studies on domain-specific datasets.
• Feedback from initiatives integrating data into domain-specific or cross-disciplinary repositories.

We particularly encourage contributions that highlight novel practices or conceptual frameworks for dealing with imperfect and multi-source data in complex environmental systems.

Orals: Thu, 7 May, 14:00–15:45 | Room 2.31

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
14:00–14:05
14:05–14:15 | EGU26-17687 | Highlight | On-site presentation
Jeremy Rohmer

Managing multi-source data requires flexible approaches and tools to model the many types of imperfections surrounding them and, more broadly, to deal with uncertainties of multiple origins, namely aleatory uncertainty (representing randomness) and epistemic uncertainty (related to imperfect knowledge). While the first origin can be adequately represented using classical probabilities, there is no simple, single answer for epistemic uncertainty. New theories of uncertainty based on "imprecise probabilities" have been developed in the literature to go beyond the systematic use of a single probabilistic law. In this communication, I analyze the application of these methods for quantifying uncertainty in various real-world cases of natural hazard assessment (earthquakes, floods, rockfalls) in terms of their advantages and disadvantages compared to the traditional probabilistic approach. On this basis, I draw lessons to support decision making under uncertainty and identify open questions and remaining challenges, in particular the integration of spatio-temporal geodata, the use of full-process high-fidelity numerical models, and interfacing with AI-based approaches.
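The contrast between a single probabilistic law and imprecise probabilities can be sketched with a minimal probability box (p-box): when a distribution parameter is only known to lie in an interval (epistemic uncertainty), the CDF is bounded rather than unique. The sketch below is illustrative only and is not taken from the abstract; the normal flood-depth model and all parameter values are assumptions.

```python
import math

def normal_cdf(x, mu, sigma):
    # Standard normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def p_box(x, mu_low, mu_high, sigma):
    """Lower/upper CDF bounds when the mean is only known to lie in
    [mu_low, mu_high] (epistemic), with sigma fixed (aleatory spread)."""
    # The normal CDF decreases as mu increases, so the bounds come
    # from the two ends of the interval.
    lower = normal_cdf(x, mu_high, sigma)
    upper = normal_cdf(x, mu_low, sigma)
    return lower, upper

# P(flood depth <= 2 m) when the mean depth is somewhere in [1.0, 1.5] m
lo, hi = p_box(2.0, mu_low=1.0, mu_high=1.5, sigma=0.5)
```

A decision maker then works with the interval [lo, hi] instead of a single probability; when the epistemic interval collapses to a point, the p-box reduces to the classical single CDF.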

I acknowledge financial support of the French National Research Agency within the HOUSES project (grant N°ANR-22-CE56-0006).

How to cite: Rohmer, J.: Dealing with imperfect knowledge in natural hazard assessments: beyond classical probabilities and challenges, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17687, https://doi.org/10.5194/egusphere-egu26-17687, 2026.

14:15–14:25 | EGU26-19993 | ECS | On-site presentation
Tom Keel, Matt Fry, and Sam Counsell

Reliable rainfall datasets are an essential foundation for hydrological research. The most extensive rainfall information is collected from rain gauge networks, which provide high-frequency observations of rainfall intensity at gauge locations; gauge data can also be interpolated onto a regular grid to provide consistent region-wide estimates.

For the UK, there are two major daily gridded rainfall products: (1) CEH-GEAR developed by the UK Centre for Ecology & Hydrology, and (2) HadUK-Grid developed by the Met Office. In each case, they are built from a selection of rain gauges from a multi-nation rain gauge network spanning Great Britain. Decisions made at each stage of rainfall data preparation, about collection, formatting, quality control and then gridding, introduce uncertainty into the resulting gridded rainfall products.

In this talk, we discuss plans for CEH-GEAR 15 min, a new sub-daily 1 km product developed as part of the UK’s multi-year Flood & Drought Research Infrastructure (FDRI) project. We detail each step of its production, from raw rain gauge to gridded rainfall estimates, and systematically discuss the sources of uncertainty introduced at each stage. 15-minute rainfall measurements tend to be highly variable in space and time, and intense storms or long dry periods create practical challenges for preparing gridded rainfall estimates. So, we quantify the sensitivity of those estimates to decisions made about quality control and data blending during notable rain events across the UK. We also present the associated open-source tools developed as part of FDRI, including RainfallQC, that aim to support reproducible rainfall data processing and alleviate some of the challenges in sub-daily rainfall data preparation.
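RainfallQC's own API is not documented here, so the sketch below only illustrates the kind of rule-based checks such quality control involves: a physical-range check and a stuck-sensor (flatline) check on a 15-minute series. The function name and thresholds are hypothetical, not the package's.

```python
def qc_flags(rain_mm, max_intensity_mm=40.0, flatline_len=12):
    """Flag suspect 15-min rainfall values. Returns one flag per value:
    'ok', 'range' (physically implausible), or 'flatline' (the same
    non-zero value repeated too long, a common stuck-sensor signature)."""
    flags = ['ok'] * len(rain_mm)
    # Range check: negative or implausibly intense 15-min totals.
    for i, v in enumerate(rain_mm):
        if v < 0 or v > max_intensity_mm:
            flags[i] = 'range'
    # Flatline check: scan runs of identical values (sentinel ends last run).
    run_start, run_val = 0, None
    for i, v in enumerate(rain_mm + [None]):
        if v != run_val:
            if run_val not in (None, 0.0) and i - run_start >= flatline_len:
                for j in range(run_start, i):
                    flags[j] = 'flatline'
            run_start, run_val = i, v
    return flags
```

Long runs of exact zeros are deliberately not flagged, since dry spells are legitimate; a real QC suite would add many further checks (neighbour comparison, accumulation offloads, units).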

How to cite: Keel, T., Fry, M., and Counsell, S.: Uncertainty produced in a 15-minute gridded rainfall product for the UK, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19993, https://doi.org/10.5194/egusphere-egu26-19993, 2026.

14:25–14:35 | EGU26-3102 | ECS | On-site presentation
Nina Houngue

Where extensive and functional ground observation networks are lacking, satellite-based rainfall products offer an alternative. However, these datasets require prior evaluation. This study investigates the performance of four satellite- and gauge-based rainfall products: the Climate Hazards Group Infrared Precipitation with Station data, version 2.0 (CHIRPS); Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks Climate Data Record (PERSIANN); Tropical Applications of Meteorology using Satellite data and ground-based observations (TAMSAT); and the Global Precipitation Climatology Centre full daily data (GPCC).

The assessment was conducted using grid-to-point comparisons at different time scales, and hydrological modelling over the Mono River Basin, located in the Republics of Benin and Togo. To assess the suitability of the four products for flood purposes, a two-step approach was applied: (1) a satellite-only approach in which each product was used as input to the HBV-light hydrological model for runoff simulation, and (2) an observation-satellite approach in which gaps in observation data were filled using each product prior to the hydrological modelling. In all simulations, areal precipitation was derived with kriging before being input into HBV-light. On the one hand, the simulation with CHIRPS-only showed poor performance (NSE = -0.08 during calibration and -0.22 during validation), while the simulations with PERSIANN-only, TAMSAT-only, and GPCC-only yielded moderate performance, with NSE values ranging from 0.5 to 0.67. On the other hand, simulations with the observation-satellite combinations also showed moderate performances, with NSE values between 0.55 and 0.69, including for the observation-CHIRPS case.

The poor performance of the CHIRPS-only simulation, combined with the similar performance of all observation-satellite combinations, indicates that the quality of the satellite product used for gap filling plays a limited role. Moreover, the absence of significant improvement when using observation-satellite combinations compared to their satellite-only counterparts (except for CHIRPS) suggests that gap filling with satellite products does not necessarily enhance data quality. These results indicate that, in the Mono River Basin, gap filling may not be necessary when spatial interpolation methods such as kriging are applied.
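The NSE values quoted above are Nash–Sutcliffe efficiencies, which are straightforward to compute from paired observed and simulated flows; a minimal sketch (the data below are illustrative, not the study's):

```python
def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than the mean of the observations, negative is worse."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den
```

This is why values such as -0.08 or -0.22 signal poor performance: the simulation explains less variance than simply predicting the observed mean.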

How to cite: Houngue, N.: When More Data Is Not Better: Evaluating Satellite Rainfall Products in a Data-Scarce River Basin, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3102, https://doi.org/10.5194/egusphere-egu26-3102, 2026.

14:35–14:45 | EGU26-3221 | On-site presentation
Rocky Talchabhadel, Sunil Bista, Saurav Bhattarai, Subash Poudel, Amisha Bhandari, Sandhya Khanal, Aashish Gautam, Yogesh Bhattarai, Sanjib Sharma, and Nawa Raj Pradhan

Meteorological forcings under different climate scenarios exert substantial control over hydrological processes in watersheds and river systems. This study presents a comprehensive assessment of uncertainty in hydrologic projections by integrating a wide range of climate forcings, multiple bias-correction approaches, and several Shared Socioeconomic Pathways (SSPs). Specifically, we (i) quantify the total uncertainty in projected hydrologic responses, (ii) attribute uncertainty to individual sources, and (iii) examine how uncertainty propagates along the hydroclimatic modeling chain. The analysis is demonstrated for a range of watersheds using a fully calibrated Soil and Water Assessment Tool (SWAT) model. The hydrologic simulations are forced by outputs from thirty global climate models (GCMs) participating in the Coupled Model Intercomparison Project Phase 6 (CMIP6), obtained from the NASA Earth Exchange Global Daily Downscaled Projections (NEX-GDDP-CMIP6) dataset at a spatial resolution of 0.25° (~25 km) under two SSPs. To further refine the climate inputs, a linear bias-correction method is applied to daily temperature and precipitation time series to align long-term mean monthly values during the reference period (1985–2014) with PRISM observations. A total of four bias-correction scenarios are considered: (1) original NEX-GDDP-CMIP6 data, (2) precipitation-corrected data, (3) temperature-corrected data, and (4) jointly corrected temperature and precipitation data. This framework yields four forcing scenarios for each GCM–SSP combination, resulting in a total of 240 simulations (4 × 30 GCMs × 2 SSPs) for each watershed. Streamflow changes are evaluated for the near-future period (2031–2060) and far-future period (2061–2090), relative to the historical baseline (1985–2014). Changes in probability distributions and cumulative distribution functions are analyzed across climate models, bias-correction methods, and SSPs.
In addition, the relative contributions of individual uncertainty sources are quantified at monthly, seasonal, and annual time scales. By systematically accounting for uncertainties arising from climate forcings, bias-correction techniques, and socioeconomic pathways, this study provides a robust characterization of the range of plausible hydrologic futures. Such uncertainty-informed streamflow projections are essential for water-resources planning, flood and drought risk management, and the development of effective long-term water-management strategies.
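The linear bias correction described above, aligning long-term mean monthly values with observations, can be sketched for the additive (temperature-style) case. The exact scheme used in the study is not detailed in the abstract, so the helper names and the additive form here are assumptions.

```python
def monthly_offsets(model_monthly_mean, obs_monthly_mean):
    """Additive offsets that align each calendar month's long-term
    model mean with the observed mean over the reference period."""
    return {m: obs_monthly_mean[m] - model_monthly_mean[m]
            for m in model_monthly_mean}

def apply_offsets(series, months, offsets):
    """Shift each daily value by its calendar month's offset.
    series: daily values; months: calendar month (1-12) of each day."""
    return [v + offsets[m] for v, m in zip(series, months)]
```

For precipitation, a multiplicative scaling (observed mean divided by model mean) is the usual analogue, to avoid producing negative rainfall.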

How to cite: Talchabhadel, R., Bista, S., Bhattarai, S., Poudel, S., Bhandari, A., Khanal, S., Gautam, A., Bhattarai, Y., Sharma, S., and Pradhan, N. R.: Disentangling Sources of Uncertainty in Hydrologic Projections Using Multiple Climate Forcings, Bias-Correction Techniques, and Shared Socioeconomic Pathways, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3221, https://doi.org/10.5194/egusphere-egu26-3221, 2026.

14:45–14:55 | EGU26-11636 | ECS | On-site presentation
Kenneth Gutiérrez, Gunnar Lischeid, Gökben Demir, Maren Dubbert, Alexander Knohl, and Christian Markwitz

Data imperfection, characterized by fragmentation, sensor failures, and high-dimensional noise, remains a persistent challenge in environmental monitoring. As observation networks expand to capture heterogeneous soil–atmosphere interactions, traditional quality control methods based on rigid statistical thresholds often struggle to distinguish between sensor errors and genuine, non-linear system dynamics. This study presents a methodological development for knowledge extraction from imperfect and fragmented data, employing a multivariate visualization workflow that combines Principal Component Analysis (PCA) and Self-Organizing Maps (SOM) with Sammon Mapping.

We applied this unsupervised learning approach to a high-dimensional dataset (~100 variables) from a field-scale agricultural system, including measurements of soil moisture and temperature, eddy covariance-derived CO2, energy fluxes, radiation, wind, precipitation, groundwater level and discharge.

This allowed us to compare a discontinuous period in 2024 against a continuous period in 2025. The results demonstrate the method's robustness in extracting coherent structural patterns despite data incompleteness. While PCA effectively isolated the dominant thermodynamic baselines from high-frequency hydrologic events, the topological SOM projection provided a rapid, visual plausibility check.

The visualization facilitated the identification of possible sensor irregularities as spatial outliers in the 2024 dataset, enabling rapid anomaly detection without manual time-series inspection. Furthermore, the method successfully captured shifts in system dynamics, such as the decoupling of surface moisture from groundwater, validating its utility for identifying physical regimes in heterogeneous data. We conclude that this visual workflow offers a scalable, data-driven solution for moving from raw, imperfect observations toward actionable system diagnostics, bridging the gap between data acquisition and process understanding in complex environmental observatories.
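A minimal numpy sketch of the PCA part of such a workflow: project standardized observations onto the leading principal components and flag rows far out in score space. The SOM/Sammon stage is omitted, and the robust distance threshold is an assumption, not the authors' criterion.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project standardized observations onto the leading principal
    components via SVD of the standardized data matrix."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def flag_outliers(scores, k=3.0):
    """Flag rows whose distance in PC space exceeds k robust scales
    (median + k * 1.4826 * MAD), a simple stand-in for visual checks."""
    d = np.sqrt((scores ** 2).sum(axis=1))
    med = np.median(d)
    mad = np.median(np.abs(d - med)) + 1e-12
    return d > med + k * 1.4826 * mad
```

On a ~100-variable station dataset, rows flagged this way are candidates for closer inspection, not automatic rejection, since genuine extreme events also land far from the bulk.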

How to cite: Gutiérrez, K., Lischeid, G., Demir, G., Dubbert, M., Knohl, A., and Markwitz, C.: Unsupervised pattern recognition for imperfect datasets: a visual workflow for plausibility checks and regime diagnosis in high-dimensional environmental time series, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11636, https://doi.org/10.5194/egusphere-egu26-11636, 2026.

14:55–15:05 | EGU26-6933 | ECS | On-site presentation
Omar Et-targuy, Carole Delenne, Salem Benferhat, and Ahlame Begdouri

Wastewater network management relies on geographic data from multiple sources, which creates significant integration challenges: spatial inconsistencies, incomplete coverage, and varying levels of precision.

Although different data sources may cover the same portion of the network, they are generally produced in different contexts or at different times. This can result in discrepancies in the descriptions of the physical infrastructure of the wastewater network: some elements may be accurately represented in one source but absent in another, while other objects may be described slightly differently across sources. Furthermore, for certain parts of the network, the structure itself may vary depending on the source. Consequently, any operation to merge datasets or build a global network representation requires matching the objects described by each source in order to identify those corresponding to the same physical element, to recognize objects present in multiple sources, and to distinguish those with no correspondence in other datasets.

In this work, we propose a data integration methodology to address disparities among these data sources and to match the various elements of wastewater networks. This approach establishes correspondences between multiple datasets representing the same infrastructure from different sources. By combining spatial and structural information, the method identifies matching components across datasets and produces a unified representation that leverages the complementary information from each source while resolving conflicts and inconsistencies.

The approach has been validated on real-world wastewater network data from multiple sources and covering different time periods. The results demonstrate high integration accuracy. This methodology enables a complete and consistent representation of wastewater networks, addressing the challenges of data heterogeneity inherent in multi-source infrastructure management.
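The abstract does not spell out the matching algorithm, so the following is only a greedy nearest-neighbour sketch of the core idea: pair objects from two sources by planar distance and report the leftovers on each side. All names and the distance threshold are illustrative; the actual method also uses structural information.

```python
import math

def match_objects(source_a, source_b, max_dist=2.0):
    """Greedy one-to-one matching of network objects from two sources
    by planar distance. Each object: (id, x, y). Returns matched pairs
    plus the objects with no correspondence in the other dataset."""
    pairs, used_b = [], set()
    for ida, xa, ya in source_a:
        best, best_d = None, max_dist
        for idb, xb, yb in source_b:
            if idb in used_b:
                continue
            d = math.hypot(xa - xb, ya - yb)
            if d <= best_d:
                best, best_d = idb, d
        if best is not None:
            pairs.append((ida, best))
            used_b.add(best)
    matched_a = {p[0] for p in pairs}
    only_a = [ida for ida, _, _ in source_a if ida not in matched_a]
    only_b = [idb for idb, _, _ in source_b if idb not in used_b]
    return pairs, only_a, only_b
```

The three outputs correspond to the three cases named in the abstract: objects present in both sources, objects only in one, and objects only in the other.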

How to cite: Et-targuy, O., Delenne, C., Benferhat, S., and Begdouri, A.: Integration and Alignment of Multiple Water Network Data Sources, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6933, https://doi.org/10.5194/egusphere-egu26-6933, 2026.

15:05–15:15 | EGU26-7251 | On-site presentation
Thanh Ma, Salem Benferhat, Minh Thu Tran Nguyen, Nanée Chahinian, Carole Delenne, and Thanh-Nghi Do

Geographic Information Systems (GIS) are reference tools for representing, storing, analyzing, and visualizing geolocated data, particularly those related to urban infrastructures such as water networks. In addition to GIS reference data, there exists a significant amount of complementary data, referred to here as external data, generally produced in specific contexts such as urban network maintenance. When properly exploited, these external data sources, which are rich in information, can enhance GIS and help address the issue of missing data. However, these external data are often not geolocated, which makes their integration into GIS particularly complex.

The main objective of this work is to propose artificial intelligence–based methodologies to geolocate non-georeferenced external data, particularly maps related to urban water networks, by leveraging multisource data cross-analysis. The proposed approach relies on the joint exploitation of geolocated GIS data and external data lacking geolocation. It consists in analyzing maps using object detection techniques to extract characteristic elements, such as buildings or specific structures, which are then matched with corresponding entities available in the relevant GIS. By exploring different geographic areas of the same spatial extent as the maps and assessing the degree of similarity between the extracted elements and those referenced in the GIS, the method enables the identification of the most plausible area of correspondence and, ultimately, the geolocation of the maps in question.

This work addresses several major challenges in the context of geolocating external data using GIS data. The first challenge concerns the identification and selection of relevant elements capable of effectively guiding the search within available GIS. The second challenge lies in accounting for the sometimes limited reliability of object detection systems during the matching process. The third challenge involves defining appropriate similarity measures and selecting sufficiently discriminative elements for the matching process. Finally, the fourth challenge is algorithmic in nature, given that a map generally represents only a limited portion of a GIS, which raises issues similar to those encountered in large-scale matching approaches.

Acknowledgments:
This work was supported by the CHIST-ERA project ATLAS "GeoAI-based augmentation of multi-source urban GIS" under grant numbers CHIST-ERA-23-MultiGIS-02 and ANR-24-CHR4-0005 (French National Research Agency).

How to cite: Ma, T., Benferhat, S., Tran Nguyen, M. T., Chahinian, N., Delenne, C., and Do, T.-N.: Cross-analysis of Multisource Data for Geolocation of Non-georeferenced Urban Infrastructure Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7251, https://doi.org/10.5194/egusphere-egu26-7251, 2026.

15:15–15:25 | EGU26-11107 | On-site presentation
Jordan Labbe, Hélène Celle, Julie Albaric, Pierre Nevers, Gilles Mailhot, Jean-Luc Devidal, and Nathalie Nicolau

Water management is becoming an increasingly complex task that must account not only for climate change but also for socio-economic pressures. This is particularly true in the case of alluvial aquifers, which are often connected to surface waters, thus requiring a watershed-scale policy. Conflicts of use might emerge, especially during droughts, which are occurring more frequently. In this context, the alluvial aquifer of the Allier River (France) is an interesting case study. It is a major regional resource for drinking water, industry and irrigation, extending 210 km between Langeac and the confluence with the Loire River. The Naussac dam keeps the Allier River at a minimum flow rate and secures water uses downstream, but the summer drought of 2023 was extreme and the dam was almost completely emptied. If this situation were to repeat itself over a longer period, the consequences on the productivity of pumping fields installed in the alluvial aquifer are unknown. This work is part of the MODALL² project, in which we propose to build a transient model of the alluvial aquifer using MODFLOW (Groundwater Vistas 8). One of the main challenges is to gather and organize a set of often heterogeneous data (incomplete time series, sparsely distributed spatial data, etc.) from various sources. With the intention of improving the existing network, 50 additional water loggers have been deployed for groundwater level monitoring. 30 Electrical Resistivity Tomography (ERT) profiles were carried out to refine the thickness of alluvial deposits on the well-fields and thus the geometry of the model. Given the elongated shape of the alluvial aquifer, the study area is divided into 9 sub-models with which a ‘cascade modelling’ is performed. The purpose is to better understand how droughts spread across the whole hydrosystem and to what extent the pumping fields will be affected.
ERT surveys have revealed that the thickness of alluvial deposits varies significantly from one site to another, ranging from 5 to 15 m downstream where the alluvial plain is more widespread. Hydrodynamic data show the influence of the river on groundwater level variations depending on the distance from the river. Lastly, the heterogeneity of the input datasets introduces uncertainty into the model that will need to be estimated. Beyond the potential to use modeling to anticipate future water crises, this work also proposes a methodology for handling large-scale heterogeneous datasets.

How to cite: Labbe, J., Celle, H., Albaric, J., Nevers, P., Mailhot, G., Devidal, J.-L., and Nicolau, N.: The effects of droughts on pumping fields at the watershed scale: building a model from a heterogeneous dataset, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11107, https://doi.org/10.5194/egusphere-egu26-11107, 2026.

15:25–15:35 | EGU26-11579 | ECS | On-site presentation
Wendinkonté Fabrice Cédric Sawadogo, Romain Chassagne, and Olivier Atteia

Well-log datasets commonly contain missing values due to acquisition issues, operational constraints, and economic limitations, which complicate quantitative subsurface analysis and the extraction of useful information for geothermal and, more broadly, subsurface characterisation. Imputation is therefore a key preprocessing step, yet many existing approaches primarily focus on within-well continuity and treat the problem as a depth-wise or time-series task, often overlooking spatial redundancy between neighbouring wells.

In this contribution, we compare three complementary modeling paradigms for well-log imputation: tabular machine-learning methods, sequential deep-learning models, and spatially informed graph-based approaches. The comparison is conducted within a unified and reproducible experimental framework based on cross-well validation and realistic missingness scenarios, including isolated gaps as well as extended block-wise and complete log-wise gaps.

Results highlight clear differences in behaviour across modeling families. Tabular methods exhibit limited robustness when missing values become structured, while sequential models improve depth-wise continuity but remain sensitive to large gaps and absent logs. In contrast, spatially informed graph-based models show increased stability by exploiting inter-well relationships, leading to more coherent reconstructions at the field scale.

These results suggest that evaluating imputation quality solely through local error metrics is insufficient for realistic subsurface applications. By emphasizing the importance of spatial coherence and inter-well information, this study supports the use of spatially aware formulations as a valuable alternative to purely depth-wise approaches for geothermal and broader subsurface characterization workflows.
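The "realistic missingness scenarios" above can be emulated with a simple block-wise mask, scoring an imputation only at the artificially hidden samples. The helper names below are hypothetical; the study's actual benchmark protocol is not detailed in the abstract.

```python
import random

def make_blockwise_mask(n, block_len, n_blocks, seed=0):
    """Simulate structured (block-wise) missingness in a depth-indexed
    log: returns a boolean list, True where the value is hidden."""
    rng = random.Random(seed)
    mask = [False] * n
    for _ in range(n_blocks):
        start = rng.randrange(0, n - block_len)
        for i in range(start, start + block_len):
            mask[i] = True
    return mask

def rmse_on_masked(truth, imputed, mask):
    """Score an imputation only at the hidden samples, where the
    ground truth is known because the gap was introduced artificially."""
    errs = [(t - p) ** 2 for t, p, m in zip(truth, imputed, mask) if m]
    return (sum(errs) / len(errs)) ** 0.5
```

As the abstract argues, such local error metrics should be complemented by field-scale coherence checks, since a method can score well point-wise while producing spatially implausible reconstructions.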

How to cite: Sawadogo, W. F. C., Chassagne, R., and Atteia, O.: A comparative benchmark of tabular, sequential, and graph-based models for well-log imputation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11579, https://doi.org/10.5194/egusphere-egu26-11579, 2026.

15:35–15:45 | EGU26-14708 | ECS | On-site presentation
Imadeddine laouici and Fatma Chamekh

Understanding the subsurface relies on integrating heterogeneous geological information originating from geological maps, geological models, and textual sources such as reports and scientific publications. In current practice, these sources remain rarely homogenized and are reconciled manually by domain experts, mostly in the context of 3D geomodel construction projects. Even when information is reconciled, existing methods offer limited support for expert knowledge integration, traceability of interpretations, and automated holistic consistency checking.

We propose SemTrack, an ontology-based integration approach designed to formalize, reconcile, and exploit multi-source geological information within a unified knowledge graph. In this framework, SemTrack integrates structured information extracted from maps and numerical geological models with unstructured knowledge derived from textual documents, all aligned through a dedicated modeling ontology. The resulting knowledge graph supports logical reasoning and knowledge inference using SWRL rules to ensure the consistency of geological constraints, and allows expert interpretations to be explicitly encoded and recorded. This enables the automated detection of conceptual inconsistencies, transparent inference of implicit geological relationships, the completion of missing information across multiple sources, and advanced complex querying of initially heterogeneous geological information.
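The kind of inference an SWRL rule encodes declaratively can be sketched procedurally: forward-chaining closure of a transitive geological relation, with a cycle check standing in for consistency checking. The relation name "overlies" and the triples are assumed examples, not SemTrack's ontology.

```python
def infer_transitive(facts, relation='overlies'):
    """Forward-chaining closure of a transitive relation over
    (subject, relation, object) triples, until no new fact appears."""
    triples = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(triples):
            for (c, r2, d) in list(triples):
                if r1 == r2 == relation and b == c \
                        and (a, relation, d) not in triples:
                    triples.add((a, relation, d))
                    changed = True
    return triples

def inconsistent(triples, relation='overlies'):
    """A unit cannot overlie itself: a cycle in the merged sources
    surfaces as a reflexive fact after closure."""
    return any(a == b for (a, r, b) in triples if r == relation)
```

In an ontology-based system the same logic would be expressed as a declarative rule and run by a reasoner over the knowledge graph, with provenance attached to each inferred triple.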

How to cite: laouici, I. and Chamekh, F.: A Knowledge Graph–Based Approach for reconciling Geological information, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14708, https://doi.org/10.5194/egusphere-egu26-14708, 2026.

Posters on site: Thu, 7 May, 10:45–12:30 | Hall A

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Thu, 7 May, 08:30–12:30
A.31 | EGU26-1739 | ECS
Mahmoud Hashoush, Emmanuelle Cadot, and Franco Alberto Cardillo

Missing data represents a challenge in large-scale epidemiological studies, as it can introduce a strong and negative bias in the final estimates when not handled appropriately. This issue is particularly relevant in environmental health research due to the complex relationships between exposure to risk factors and delayed outcomes. In this work, we evaluate the effectiveness of statistical and Machine Learning (ML) approaches to fill in missing values in data we collected to assess the potential impact on public health of gold mining activities in the Ecuadorian Amazon.

There is growing concern regarding the adverse effects on human health in the Ecuadorian Amazon caused by the environmental impact of gold mining activities in the area. To investigate potential associations with adverse birth outcomes, we collected data published by the Ecuadorian National Institute of Statistics and Census (INEC) on annual live births and fetal deaths from 2014 to 2023. As is typical in large-scale epidemiological studies, the data contain a proportion of missing values, likely related to registration and the data entry process.

Addressing missing values is important both for the correct assignment of cases and for the characterisation of risk factors. Furthermore, it enables the modelling process when searching for associations between exposure and outcome, without erroneous under- or over-reporting of odds ratios (Type I and Type II errors). Currently, the most common approach in epidemiology is to use statistical methods and, specifically, Multivariate Imputation by Chained Equations (MICE), normally instantiated with parametric conditional models. MICE imputes missing values by repeatedly predicting each incomplete variable from the others using standard regression models. In most applications, these predictions rely on linear or generalised linear relationships between variables, which can reduce their effectiveness in the presence of complex, non-linear interactions among variables. Machine Learning represents an interesting alternative: ML methods can capture complex, non-linear relationships beyond the linear models typically assumed in MICE, are more flexible with respect to departures from missing-at-random patterns, and reduce the risk of model misspecification by relying on data-driven, implicit model selection rather than requiring the analyst to pre-specify an imputation model.

In this study, we present a robust experimental comparison between MICE and several ML-based imputation approaches applied to the Ecuadorian birth data. We assess their performance and discuss the respective strengths and limitations within an epidemiological context.
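The chained-equations idea can be sketched with plain linear regressions: initialize missing cells with column means, then iteratively re-predict each incomplete column from the others. This is a single-imputation linear sketch of the MICE idea, not the full method (no multiple imputations, no posterior draws) and not the ML variants compared in the study.

```python
import numpy as np

def mice_like_impute(X, n_iter=10):
    """Minimal chained-equations imputation on a 2-D float array with
    NaNs marking missing cells. Each incomplete column is repeatedly
    regressed (ordinary least squares) on all other columns, and its
    missing entries are refreshed with the predictions."""
    X = X.astype(float).copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):          # crude initial fill
        X[miss[:, j], j] = col_means[j]
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(np.arange(X.shape[1]), j)
            A = np.c_[np.ones(len(X)), X[:, others]]   # intercept + predictors
            obs = ~miss[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X
```

A full MICE implementation would additionally draw from the predictive distribution and repeat the whole procedure to produce several completed datasets, so that imputation uncertainty propagates into the final odds-ratio estimates.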

How to cite: Hashoush, M., Cadot, E., and Alberto Cardillo, F.: Missing data imputation in epidemiology: a comparison between MICE and Machine Learning methods, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-1739, https://doi.org/10.5194/egusphere-egu26-1739, 2026.

A.32 | EGU26-9507 | ECS
Batoul Haydar, Nanée Chahinian, and Claude Pasquier

Urban sewer networks are critical infrastructures that support residents' everyday lives and ensure the collection and transportation of wastewater and stormwater. Yet operational datasets describing these networks are frequently imperfect: pipes may be missing, connectivity may be fragmented, and flow direction may be inconsistent due to incomplete attributes (e.g., invert levels, slope) or digitizing errors. We present a topology-focused study that transforms sewer data into a directed network by combining (i) graph-based representation and (ii) geometry-based consistency checks and rules. Starting from a directed (multi)graph built from available pipe and node geometries (the graph's edges and nodes, respectively), we detect topological anomalies including disconnected components, missing connections, dead ends, and closed loops.

When two pipes converge at a manhole with no outgoing pipe, the manhole forms a non-outlet sink. To resolve this, we apply a two-stage methodology: edge orientation to reduce flow inconsistencies and resolve any sink nodes, followed by targeted edge addition to reconnect remaining disconnected components when reversals alone are insufficient. We test the feasibility of the approach on a large open-access urban sewer dataset. The results illustrate how topology-oriented methods can still be applied to establish a well-connected network when data attributes are missing or unreliable.
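The sink-resolution stage can be illustrated with a toy networkx graph. The edge-selection heuristic below (prefer reversing the pipe whose upstream end is a known outlet) is a simplified stand-in, not the authors' actual rules:

```python
import networkx as nx

# Toy sewer network: manholes as nodes, pipes as directed edges. The pipe
# between C and the outlet E was digitised backwards, so C is a non-outlet
# sink: pipes converge there but none flows out.
G = nx.DiGraph([("A", "C"), ("B", "C"), ("E", "C")])
outlets = {"E"}

def non_outlet_sinks(g):
    return [n for n in g.nodes if g.out_degree(n) == 0 and n not in outlets]

# Stage 1: resolve each sink by reversing one incoming edge, preferring an
# edge that connects toward a known outlet (placeholder selection rule).
for s in non_outlet_sinks(G):
    u, _ = sorted(G.in_edges(s), key=lambda e: e[0] not in outlets)[0]
    G.remove_edge(u, s)
    G.add_edge(s, u)

print(sorted(G.edges))  # the repaired network now drains A, B -> C -> E
```

Stage 2 (targeted edge addition) would then operate on any components that remain disconnected after such reversals.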

How to cite: Haydar, B., Chahinian, N., and Pasquier, C.: From Imperfect Sewer Data to Coherent Topology: A Graph-Based Approach, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9507, https://doi.org/10.5194/egusphere-egu26-9507, 2026.

A.33
|
EGU26-7685
|
ECS
Flavien Baudu, Carole Delenne, Thibault Catry, Sophie Ricci, Ludovic Cassan, Vincent Herbreteau, and Renaud Hostache

Floods are among the most destructive and costly natural disasters. While risk assessment and management have helped mitigate their impact in recent decades, climate change is expected to increase both their frequency and severity. This underscores the urgent need for predictive tools to better anticipate and prevent the adverse effects of flooding. Two-dimensional Shallow-Water (SW) hydraulic models offer a reliable solution for flood prediction, providing critical information such as floodplain extent, water levels and flow velocities. However, these models require boundary conditions (such as input flows), precise topography and bathymetry (i.e. riverbed geometry), as well as parameters to be calibrated (such as terrain roughness). Unfortunately, such data are often sparse or entirely unavailable in many regions due to the high cost and logistical challenges of in situ measurements. In particular, while the topography can be obtained through LiDAR-derived Digital Terrain Models, the bathymetry remains inaccessible because the LiDAR signal does not penetrate the water surface.

In this context, Data Assimilation (DA)—a method that optimally combines uncertain models with observations—becomes particularly valuable for estimating such missing data or parameters. Our study proposes an innovative approach to reconstruct riverbed geometry by assimilating flood extent information derived from satellite imagery, specifically Synthetic Aperture Radar (SAR) data, which can reliably detect floodwater extents.

To account for observational uncertainty, we generate a probabilistic flood map from SAR images, where each pixel’s value represents its probability of being water, based on observed backscatter. Using a tempered particle filter (TPF), we assimilate multiple SAR-derived probabilistic flood maps into an ensemble of hydraulic simulations (referred to as "particles"). These simulations share the same model architecture but incorporate randomly sampled riverbed geometries. 
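The tempering-and-resampling idea behind the TPF can be reduced to a few lines. Everything below (the Gaussian pseudo-likelihood standing in for the SAR flood-map comparison, the bed elevation of 10.2 m, the two tempering stages) is invented for illustration and is not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Ensemble of candidate riverbed elevations ("particles"); 10.2 m plays the
# role of the unknown true bed level in this toy problem.
particles = rng.uniform(8.0, 12.0, 500)
truth = 10.2

def likelihood(bed):
    # Stand-in for comparing simulated flood extents against the SAR-derived
    # probabilistic flood map: geometries reproducing the observation score higher.
    return np.exp(-0.5 * ((bed - truth) / 0.5) ** 2)

# Tempering: the likelihood is applied gradually (here two stages of exponent
# 0.5) to avoid weight degeneracy, resampling and jittering between stages.
for _ in range(2):
    w = likelihood(particles) ** 0.5
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    particles = particles[idx] + rng.normal(0.0, 0.05, len(particles))

print(particles.mean())  # ensemble concentrates near the reference bed level
```

In the actual study each "particle" is a full hydraulic simulation with its own randomly sampled riverbed geometry, not a scalar.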

To evaluate our methodology, we conduct a synthetic twin experiment based on a real-world case study of the River Severn near Tewkesbury, UK—a region prone to frequent flooding. We first perform a hydraulic simulation (the "control run") using a reference riverbed geometry and realistic boundary conditions. From this simulation, we generate several synthetic probabilistic flood maps, which are then assimilated into a second simulation to estimate the riverbed geometry using the TPF.

Our results demonstrate the effectiveness of this approach: the estimated riverbed geometry closely matches the reference. Additionally, contingency maps reveal strong agreement between the flood extents predicted by the control run and those obtained through the DA experiment.

How to cite: Baudu, F., Delenne, C., Catry, T., Ricci, S., Cassan, L., Herbreteau, V., and Hostache, R.: Data assimilation to retrieve unknown bathymetry in shallow water model, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7685, https://doi.org/10.5194/egusphere-egu26-7685, 2026.

A.34
|
EGU26-7776
|
ECS
Tristan Bourgeois, Nicolas Flipo, Marie Pettenati, and Hervé Noel

Water resource management is a major challenge for the coming decades. Its effective application across diverse territories relies on an accurate representation of hydrological processes, generally achieved through physically based distributed hydrological models which in turn depend on spatially consistent and representative hydroclimatic forcing. At regional scales, capturing local variability in hydroclimatic drivers (precipitation, temperature, evapotranspiration) often requires combining datasets with different spatial resolutions and methodological assumptions.

Within the Eau-SPRA project (ADEME, France 2030 Programme), the CaWaQS model (Flipo et al., 2022; Flipo et al., 2023) is applied to the Loire River basin to support socio-hydrological modelling from regional to local scales. CaWaQS is a coupled distributed surface–subsurface hydrological model simulating both river discharge and groundwater dynamics. It currently lacks an explicit snow representation, which can significantly affect hydrological dynamics across scales, particularly in large river basins such as the Loire and under climate change conditions (Valéry et al., 2014).

To address these challenges, we developed CawSAR (CaWaQS Snow Accounting Routine), an open-source Python-based preprocessing framework designed to harmonize multi-source climate data (e.g. reanalysis products, radar observations) over a target study area. Based on a 3D matrix representation (time, x, y) of climate fields, it integrates multiple functionalities within a single, reproducible workflow. Climate data are harmonized through systematic downscaling, upscaling and regridding performed on a grid-cell basis using physical external-drift adjustments (altimetric gradient). CawSAR also enables cross-comparison of climate data sources across different spatio-temporal scales and implements a degree-day snow model to compute snow accumulation and melt. Finally, it generates liquid input time series (sum of liquid rainfall and snowmelt) fully compatible with the CaWaQS core model, ensuring direct integration into hydrological simulations.
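A degree-day snow routine of the kind CawSAR implements can be sketched as follows; the thresholds and degree-day factor are illustrative defaults, not CawSAR's calibrated values:

```python
import numpy as np

def degree_day_snow(precip, temp, t_snow=0.0, t_melt=0.0, ddf=3.0):
    """Degree-day snow accounting (illustrative parameter values).

    precip, temp : daily precipitation [mm] and air temperature [degC]
    t_snow : temperature at or below which precipitation falls as snow
    t_melt : melt threshold temperature
    ddf    : degree-day factor [mm/degC/day]
    Returns the liquid input series (liquid rainfall + snowmelt).
    """
    swe = 0.0  # snow water equivalent stored in the snowpack [mm]
    liquid = np.zeros_like(precip)
    for i, (p, t) in enumerate(zip(precip, temp)):
        snowfall = p if t <= t_snow else 0.0
        rain = p - snowfall
        swe += snowfall
        melt = min(swe, ddf * max(t - t_melt, 0.0))  # melt limited by storage
        swe -= melt
        liquid[i] = rain + melt
    return liquid

precip = np.array([10.0, 5.0, 0.0, 0.0])
temp = np.array([-2.0, -1.0, 2.0, 5.0])
liquid_input = degree_day_snow(precip, temp)
print(liquid_input)
```

Note that the routine conserves mass: over a period ending with an empty snowpack, the liquid input sums to the total precipitation, which is what makes the output directly usable as CaWaQS forcing.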

Applied to the Loire basin, CawSAR illustrates how physically based preprocessing and multi-source harmonization enhance hydroclimatic forcing consistency for regional-scale hydrological modelling.

How to cite: Bourgeois, T., Flipo, N., Pettenati, M., and Noel, H.: CawSAR: an open-source framework for preprocessing hydroclimatic data in physically based hydrological modelling, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7776, https://doi.org/10.5194/egusphere-egu26-7776, 2026.

A.35
|
EGU26-14633
Peter Lünenschloß, David Schaefer, and Jan Bumberger

Quality control (QC) and data cleaning remain major bottlenecks in geoscientific data analysis as data volumes, dimensionality, and heterogeneity continue to increase. While machine- and deep-learning-based approaches have demonstrated impressive performance in selected applications, their practical adoption is often constrained by the availability of sufficiently large labelled training datasets and by the effort required to calibrate and adapt model hyperparameters across datasets and domains, particularly in unsupervised flagging scenarios. Conversely, rule-based, deterministic, and statistical QC approaches offer greater transparency and interpretability, but are frequently tailored to specific data structures and lack the flexibility required to robustly generalise to varying observational contexts and non-ideal data distributions.

We present a software framework that addresses this gap by enabling the formulation of QC pipelines in terms of a small set of basic anomaly descriptions, such as outliers, noisy regimes, and data gaps. These anomaly notions are intuitively understood by domain experts, while their systematic combination allows the representation of a wide range of anomaly patterns encountered in geoscientific observations.

The parameters of these compositions are then automatically calibrated with the data at hand, resulting in an instantiated QC pipeline. By internally reducing the calibration problem to the fitting of individual anomaly descriptions defined by only a small number of well-understood parameters, the optimisation achieves robust convergence even with a limited number of supervised examples. Within the framework, such examples can be generated interactively during pipeline construction by domain specialists themselves or imported from existing sources. This design lowers the entry barrier for effective automated quality control while enabling the explicit integration of domain knowledge into the calibration process.

The framework is implemented as a new module within the open-source quality-control software SaQC, thereby integrating seamlessly with existing data import, preprocessing, and flag management workflows. Calibrated QC pipelines can be exported and stored as portable, human-readable configuration files in a tabular format. These configurations can subsequently be loaded and applied using the SaQC application to new and unseen datasets, enabling reproducible and automated quality control.

In the poster, we present the conceptual design of the framework and demonstrate its application to a hydrological dataset, highlighting the transparent, combinatorial configuration interface and the integrated supervision workflow.

 

How to cite: Lünenschloß, P., Schaefer, D., and Bumberger, J.: Composing Transparent Quality Control Pipelines from Basic Anomaly Descriptions, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14633, https://doi.org/10.5194/egusphere-egu26-14633, 2026.

A.36
|
EGU26-23039
Gracianne Cécile, Youssef Fouzai, Mirga Bokidingo, Caterina Negulescu, Yves Lucas, Gilles Grandjean, and Fatima Chamekh

Assessing exposure and vulnerability to natural hazards increasingly relies on national geospatial reference datasets. However, these datasets are often incomplete, heterogeneous and inconsistent across spatial scales, which limits their direct usability for multi-hazard risk analysis. In France, the BD TOPO building database exemplifies these challenges, with a large share of buildings lacking key attributes such as usage type, despite their importance for vulnerability assessment.
This contribution presents the approach developed within the CERES project (Cartography and Characterization of Exposed Elements from Satellite Imagery) to address reference data incompleteness and multi-source integration challenges in a geoscience risk context. Focusing on a large study area in the Centre-Val de Loire region, we first quantify and analyze the spatial and semantic gaps of BD TOPO building attributes, showing that more than 40% of buildings are labelled with unknown usage. We then demonstrate how deep learning applied to very high-resolution aerial imagery can be used to probabilistically infer missing semantic information, significantly reducing uncertainty while explicitly accounting for classification ambiguities.
Beyond data completion, we highlight the difficulties encountered when jointly exploiting heterogeneous datasets originating from national mapping agencies, land cover products, socio-economic statistics and hazard layers. These include spatial misalignments, inconsistent scales of representation, varying levels of reliability, and the absence of a shared data model. To address these issues, CERES proposes a multi-scale data structuring framework combining data modelling and processing designed to preserve data provenance, uncertainty and semantic traceability across sources.
By articulating reference data analysis, machine-learning-based enrichment and database design, this work provides a concrete illustration of current practices and challenges in managing imperfect geospatial data for geoscience applications. The results underline the necessity of coupling data-driven approaches with explicit data governance and modelling strategies to produce robust, transparent and reusable datasets for territorial risk assessment.

How to cite: Cécile, G., Fouzai, Y., Bokidingo, M., Negulescu, C., Lucas, Y., Grandjean, G., and Chamekh, F.: Managing Incomplete Urban Reference Data for Risk-Oriented Geoscience Applications: Lessons from the CERES Project, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-23039, https://doi.org/10.5194/egusphere-egu26-23039, 2026.

A.37
|
EGU26-15597
TaeWoong Ok, ChiYoung Kim, KiYong Kim, and ChanWoo Kim

In South Korea, river stage gauging stations operate redundant water level gauges to mitigate instrument malfunctions and anomalous measurements. Currently, redundant gauges are installed at over 60% of gauging stations, reflecting their widespread implementation; however, their quality management and practical utilization remain limited. In many cases, installation and operational conditions are not fully accounted for in observed water levels, leading to significant discrepancies between primary and redundant gauges. These discrepancies may arise from river characteristics, artificial configuration errors, or site-specific conditions.

 

This study investigates the causes of discrepancies between primary and redundant gauges and proposes appropriate correction methods. Anomaly detection was first conducted on redundant gauge measurements using limit tests, duration tests, and regression tests to ensure data reliability. Based on this, the relationships between primary and redundant gauge readings were analyzed using simple regression, multiple regression, and nonparametric LOESS (Locally Estimated Scatterplot Smoothing) regression. These procedures not only facilitated the derivation of site-specific correction methods but also supported the preliminary development of a real-time quality control program, moving beyond conventional manual, non-real-time quality management.
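The LOESS-based correction step might look like the following sketch, using statsmodels' lowess on a synthetic gauge pair; the stage-dependent offset and the smoothing fraction are assumptions for illustration, not the study's fitted values:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
# Synthetic stage readings: the redundant gauge carries a smooth,
# stage-dependent offset plus measurement noise (values in metres).
primary = np.sort(rng.uniform(0.5, 4.0, 300))
redundant = primary + 0.1 * np.sin(primary) + rng.normal(0.0, 0.02, 300)

# Fit LOESS of the primary reading on the redundant reading; the fitted
# curve then serves as a site-specific correction function.
fit = lowess(primary, redundant, frac=0.3, return_sorted=True)
corrected = np.interp(redundant, fit[:, 0], fit[:, 1])

print("raw error:", np.abs(redundant - primary).mean(),
      "corrected error:", np.abs(corrected - primary).mean())
```

Because LOESS is nonparametric, the same recipe accommodates the site-specific, non-linear discrepancies the abstract mentions without choosing a regression form in advance.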

 

Nevertheless, because the causes of discrepancies and installation conditions vary by site, site-specific correction strategies are required, and ongoing monitoring and refinement of measurements and corrections remain necessary. Furthermore, real-time utilization of redundant gauges is challenging at newly established stations. Despite these limitations, the proposed correction strategies have the potential to go beyond simple substitution of primary gauge readings, enabling higher-quality hydrological data production and improved quality control. These strategies are expected to enhance real-time hydrological monitoring systems and strengthen the reliability of national hydrological data management frameworks.

Keywords: Redundant, Water Level Gauging, Uncertainty, Operational Monitoring

 

Acknowledgements

This work was supported by the Korea Environment Industry & Technology Institute (KEITI) through the Research and Development on the Technology for Securing the Water Resources Stability in Response to Future Change Project, funded by the Korea Ministry of Climate, Energy, Environment (MCEE) (RS-2024-00332300).

How to cite: Ok, T., Kim, C., Kim, K., and Kim, C.: Quality Control of Redundant Water Level Gauges in South Korea River Gauging Stations, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15597, https://doi.org/10.5194/egusphere-egu26-15597, 2026.

A.38
|
EGU26-15843
Chi Young Kim, Chanwoo Kim, and Taewoong Ok

Complete daily streamflow time series are essential for sustainable water resources management and reliable hydrological modelling; however, even short data gaps can substantially reduce the usability of streamflow records. Recurrent missing data may lead to inefficient model calibration, decreased reliability of peak and low-flow estimates, and biased hydrological statistics. Therefore, rather than leaving missing values unfilled, it can be beneficial to infill daily streamflow using appropriate methods and to provide flags indicating imputed periods. 
In South Korea, streamflow monitoring prior to 2008 primarily focused on flood-related observations, resulting in relatively limited daily streamflow records; since then, the production of continuous daily streamflow data for water resources management has expanded. As of 2024, daily streamflow records from more than 420 gauging stations are managed and disseminated, yet a non-negligible number of stations still contain missing values due to various causes such as river works and uncertainties in stage–discharge relationships associated with the operation of hydraulic structures. 
This study comparatively evaluates gap-filling techniques using paired upstream–downstream gauging stations located in basins with diverse rainfall regimes and hydrological characteristics. We assess conventional methods widely used in practice (scaling, linear regression, and equi-percentile/quantile-based approaches) under different missing-data conditions and benchmark them against an extended long short-term memory (extended LSTM) time-series model designed for streamflow infilling. Performance is evaluated using the Nash–Sutcliffe efficiency (NSE), root mean square error (RMSE), and percent bias (PBIAS). In addition, flow duration curves (FDCs) are compared to examine each method’s ability to reproduce the post-infilling flow regime distribution. The outcomes are expected to support condition-dependent selection of gap-filling strategies and to improve the reliability of daily streamflow datasets with explicit quality flags.
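Of the conventional methods listed, the equi-percentile approach is the least obvious to implement: each missing day's donor flow is mapped through the two stations' empirical flow-duration curves. The sketch below conveys the general idea under simplified assumptions (perfectly concurrent records, plain empirical quantiles), not the authors' exact formulation:

```python
import numpy as np

def equipercentile_fill(target, donor):
    """Fill gaps in `target` by quantile mapping from a paired `donor` gauge."""
    target = np.asarray(target, dtype=float)
    donor = np.asarray(donor, dtype=float)
    valid = ~np.isnan(target)
    # Empirical flow-duration curves built from the concurrent valid period
    t_sorted = np.sort(target[valid])
    d_sorted = np.sort(donor[valid])
    probs = np.linspace(0.0, 1.0, valid.sum())
    filled = target.copy()
    for i in np.flatnonzero(~valid):
        p = np.interp(donor[i], d_sorted, probs)   # donor non-exceedance prob.
        filled[i] = np.interp(p, probs, t_sorted)  # map onto the target's FDC
    return filled

target = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
donor = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
filled_series = equipercentile_fill(target, donor)
print(filled_series)
```

Flagging the imputed positions (here, the indices where `target` was NaN) is what allows downstream users to distinguish observed from infilled flows, as the abstract recommends.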

How to cite: Kim, C. Y., Kim, C., and Ok, T.: Comparative Evaluation of Daily Streamflow Gap-Filling Using Paired Upstream–Downstream Gauges, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15843, https://doi.org/10.5194/egusphere-egu26-15843, 2026.

A.39
|
EGU26-7760
Carole Delenne, Ti-Hon Nguyen, Minh-Thu Tran-Nguyen, and Salem Benferhat

Data related to urban infrastructures often come from multiple sources and exist in a wide variety of formats, such as Geographic Information Systems (GIS), textual information, numerical databases, images, or videos, which can make their processing, querying, and analysis complex. This work falls within this context and aims to propose new approaches for the management of heterogeneous data in stormwater and wastewater networks.

More specifically, we focus on video data, particularly Closed-Circuit Television (CCTV) inspection videos of sewer pipelines. These videos are essential for the management and maintenance of urban networks. On the one hand, they enable the identification of anomalies that may affect the integrity of pipelines, such as blockages or structural degradation. On the other hand, they provide key information on the structural properties of pipelines and networks, including pipe diameter and the direction of wastewater flow.

We propose a classification algorithm for wastewater inspection videos aimed at detecting major anomalies in CCTV inspection sequences of sewer networks, with a particular emphasis on identifying variations in pipe diameter, internal cracks, chemical corrosion, and the presence of turbid water within the pipelines. This task is crucial for predictive maintenance and hydraulic modeling of sewer systems. Information related to the identification of variations in pipe diameter can also be leveraged to enrich and complete missing pipe diameter attributes in Geographic Information Systems.

Our approach is based on the Video Vision Transformer (ViViT) and TimeSformer architectures, which effectively capture both spatial and temporal relationships in video data. We also describe various methodologies for generating training datasets from a subset of manually annotated images. Experimental results obtained on real-world CCTV sewer inspection videos provided by Montpellier Méditerranée Métropole demonstrate promising performance in anomaly detection.

How to cite: Delenne, C., Nguyen, T.-H., Tran-Nguyen, M.-T., and Benferhat, S.: Anomaly detection in wastewater pipeline videos using self-attention, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7760, https://doi.org/10.5194/egusphere-egu26-7760, 2026.

A.40
|
EGU26-17357
Franco Alberto Cardillo, Angela Andrigo, Francesco De Biasio, Franca Debole, Marco Favaro, Alvise Papa, Umberto Straccia, and Stefano Vignudelli

High water events in Venice are a recurrent phenomenon, as the city is located only slightly above mean sea level and is directly influenced by water-level variations within the lagoon. Flooding occurs when several physical processes act in combination. The astronomical tide determines the baseline water level, which is subsequently modulated by seiche oscillations in the Adriatic Sea, meteorological forcing (e.g. wind stress and atmospheric pressure), and slower, low-frequency geophysical processes and sea level rise. When these factors co-occur, even if individually moderate, large portions of the city may experience flooding.

Repeated flooding has significant economic and social impacts, limits pedestrian and naval traffic and contributes to the degradation of buildings and cultural heritage. To mitigate these effects, a range of protective measures is implemented and coordinated by an early warning system. The effectiveness of these measures depends on their timely activation. However, mitigation actions are associated with substantial economic costs and may themselves generate negative impacts if deployed unnecessarily. For instance, interruptions to public transport services affect daily activities, while the operation of the MOSE barrier entails considerable financial costs. Accurate and reliable forecasts are therefore essential to balance flood protection with the economic and social costs of mitigation measures.

Current forecasting systems primarily estimate water levels and peak values, typically at a limited number of locations. These systems are based on sophisticated statistical and hydrodynamic models. Although they perform well in most situations, their accuracy can be affected by uncertainties in atmospheric forcing and by limitations in representing the full variability of high water events. This work explores the potential of complementary approaches based on the analysis of observational data rather than explicit physical modelling.

Data-driven approaches, in particular Machine Learning (ML) methods, analyze historical data without relying on predefined, human-designed model structures. ML models are able to capture recurring patterns and complex feature interactions that are difficult to incorporate into traditional numerical models. Among these approaches, clustering techniques aim to identify recurrent types of events based on similarities in their temporal evolution and associated meteorological conditions. This enables events characterized by similar water levels to be differentiated according to the combinations of underlying meteorological drivers, thereby providing additional information to support forecasting and response planning.

In this work, we present a preliminary analysis based on several clustering approaches, including k-means, DBSCAN, and deep learning–based methods, applied to a multi-decadal atmospheric dataset and to the longest available reconstructed hourly sea-level records for the northern Adriatic Sea, specifically developed for this study. We compare the resulting event classifications and discuss how cluster-derived information may complement existing forecasting systems in support of flood-mitigation strategies for the city of Venice.

How to cite: Cardillo, F. A., Andrigo, A., De Biasio, F., Debole, F., Favaro, M., Papa, A., Straccia, U., and Vignudelli, S.: A Preliminary Analysis of High Water Events in Venice Based on Multi-Decadal Observations and Clustering, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-17357, https://doi.org/10.5194/egusphere-egu26-17357, 2026.

A.41
|
EGU26-6637
Salem Benferhat, Nanée Chahinian, Carole Delenne, Ines Couso Blanco, Luciano Sanchez Ramos, and Zoltan Kato
This presentation addresses a major challenge: fully leveraging the potential of geospatial data to improve Geographic Information Systems (GIS). Using urban flooding as a case study, it aims to integrate heterogeneous data sources of varying nature and quality levels in order to enhance both the expressiveness and reliability of GIS.
 
This work presents ongoing and planned research activities within the ATLAS CHIST-ERA project, which is entirely dedicated to this objective through a multidisciplinary approach. The project mobilizes complementary expertise in GIS, artificial intelligence, machine learning, computer vision and 2D/3D image analysis and object detection, statistics, urban network mapping, as well as geoalignment techniques.
 
The presentation is structured around two main objectives, both oriented toward GIS enrichment, with direct applications for flood risk management.
 
The first objective consists of combining and integrating external data within GIS. This approach enables seamless data integration and facilitates the revision, completion, and enrichment of existing datasets, while improving their expressiveness, particularly through the introduction of 3D representations. Such enriched representations are essential for accurately modeling surface runoff, flow paths, and hydraulic connectivity in urban environments subject to flooding.
 
The second objective focuses on integrating imperfect or uncertain data, such as amateur videos, crowdsourced observations, or data lacking precise georeferencing. To address these limitations, the project relies notably on the use of variational autoencoders for processing imprecise data, and proposes uncertainty and imprecision management mechanisms aimed at improving data quality by reducing inaccuracies and explicitly modeling confidence levels.
 
Acknowledgments:
This work was supported by the CHIST-ERA project ATLAS "GeoAI-based augmentation of multi-source urban GIS" under grant numbers CHIST-ERA-23-MultiGIS-02 and ANR-24-CHR4-0005 (French National Research Agency).

How to cite: Benferhat, S., Chahinian, N., Delenne, C., Couso Blanco, I., Sanchez Ramos, L., and Kato, Z.: GeoAI-based augmentation of multi-source urban GIS, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-6637, https://doi.org/10.5194/egusphere-egu26-6637, 2026.

A.42
|
EGU26-11928
|
Virtual presentation
Hamza Khyari and Salem Benferhat

Data completion is a major challenge in many applications, particularly in Geographic Information Systems (GIS) for water networks. Numerous approaches have been proposed to address this problem, ranging from classical statistical methods to artificial intelligence-based techniques.

In this presentation, we address the problem of missing or imprecise data in water network GIS by proposing a clustering-based data completion approach. For a given attribute with missing or uncertain values, each possible value in the attribute domain is considered as a candidate for completion. Each candidate is evaluated by analyzing its impact on the clustering of the entire dataset: inserting a candidate value induces a specific global clustering, whose quality is assessed using appropriate clustering validity criteria. The value that yields the highest-quality clustering, namely the one that best captures the intrinsic structure of the data, is selected as the final completion value.
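The candidate-evaluation loop can be sketched with k-means and the silhouette index as the validity criterion; both are placeholder choices, since the abstract does not name specific clustering algorithms or quality measures:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated groups of records; one extra record has a missing
# value in its second attribute.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
incomplete = np.array([5.1, np.nan])
candidates = [0.0, 2.5, 5.0]  # discretised domain of the missing attribute

def clustering_quality(value):
    # Insert the candidate value, cluster the completed dataset, and score
    # the induced global clustering with a validity criterion.
    data = np.vstack([X, [incomplete[0], value]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return silhouette_score(data, labels)

best = max(candidates, key=clustering_quality)
print(best)  # the value that best preserves the data's intrinsic structure
```

In this toy setting the record's known attribute already places it near the second group, so the completion that yields the cleanest clustering is the one consistent with that group.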

To cope with the combinatorial explosion resulting from multiple attributes with missing values and large domains, several strategies are employed to reduce the number of candidate completions, including aggregation mechanisms, while maintaining both the effectiveness and efficiency of the proposed approach.

How to cite: Khyari, H. and Benferhat, S.: A Disjunctive Interpretation Approach to Missing Data Based on Clustering Quality, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11928, https://doi.org/10.5194/egusphere-egu26-11928, 2026.
