ESSI1.11 | Geospatial Foundation Models for Earth Observation and Earth Sciences: Current Solutions and Future Perspectives
Convener: Nicolas Longépé | Co-conveners: Begüm Demir, Gabriele Cavallaro, Rahul Ramachandran, Valerio Marsocci
Orals | Wed, 06 May, 14:00–15:45 (CEST) | Room -2.33
Posters on site | Attendance Tue, 05 May, 14:00–15:45 (CEST) | Display Tue, 05 May, 14:00–18:00 | Hall X4
Posters virtual | Wed, 06 May, 14:00–15:45 (CEST) | vPoster Discussion: vPoster spot 1b, Wed, 06 May, 16:15–18:00 (CEST)

Foundation Models (FMs) are set to revolutionize domains like Earth Observation (EO) and Earth Sciences. Trained on vast unlabeled datasets via self-supervised learning, they can uncover complex patterns and latent information. Once pre-trained, Geospatial FMs can be adapted to diverse tasks with minimal fine-tuning or additional data. As a result, this paradigm shift is set to reshape the entire information value chain, with far-reaching implications for industry, research and development, and the broader scientific community.

This session aims to share the latest research and technological advances and to discuss practical solutions for effectively integrating FMs into the Earth Observation and Earth Sciences ecosystems. We encourage interdisciplinary collaboration and welcome submissions from AI researchers, EO and Earth data scientists, and industry experts, as well as from stakeholders in the High-Performance Computing (HPC), Big Data, and EO application communities.

The main topics for the session are:
● Latest Advances in AI Foundation Models: FMs can process data from various sensors, including multi- or hyper-spectral, SAR, LiDAR, and more, enabling holistic analysis of the Earth's dynamics. Recent progress marks a shift from sensor-specific models toward sensor-aware or sensor-agnostic architectures.
● Benchmarking and Evaluating Foundation Models: Establishing standardised, fair evaluation metrics and benchmarks to assess the performance and capabilities of FMs, ensuring reliability and efficiency while moving beyond simplistic or canonical use cases.
● Embedding and Geospatial Semantic Data Mining: FMs enable advanced geospatial semantic mining by leveraging latent space embeddings to uncover meaningful patterns and relationships. This enhances interpretation while reducing the need for large volumes of raw data across time and space.
● Implications of Foundation Models for the Community: Understanding the potential societal, environmental, and economic impacts of FMs, fostering informed decision-making and resource management. Seamless integration with downstream systems such as digital twins, public dashboards, and early warning platforms, including deployment at the edge (e.g. onboard satellites), is essential. The emerging role of Agentic AI, in synergy with Large Language Models (LLMs), opens new pathways for autonomous, context-aware EO applications.

Orals: Wed, 6 May, 14:00–15:45 | Room -2.33

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears 15 minutes before the time block starts.
Chairpersons: Begüm Demir, Valerio Marsocci, Rahul Ramachandran
14:00–14:05
Next-generation EO representations
14:05–14:15
|
EGU26-13425
|
On-site presentation
Mikolaj Czerkawski, Marcin Kluczek, and Jędrzej S. Bojanowski

The current landscape of geospatial AI models is expanding rapidly, with new open-source models released nearly every month. With this rising number of potential general-purpose models, many of which claim state-of-the-art performance, it is often difficult to judge their suitability for specific tasks or spatiotemporal contexts. A key step towards the democratisation of model benchmarking can be made by releasing large-scale datasets of pre-computed embeddings, such as those shared within the Major TOM project.

Yet, even with easy access to global and dense embeddings of a given model, it is not clear how to evaluate them on a global scale, given the scarcity and spatiotemporal biases of high-quality labels. This work explores a set of evaluation tests that can be conducted on a global scale, moving beyond canonical use cases to understand the inherent biases of individual models.

First, a set of proxy tasks with worldwide coverage is introduced. In this benchmark prototype, several sensitivity variables are tested, including time, location (estimation of spatiotemporal context), and VIIRS nightlights data (a proxy for human activity). Though these are not traditional downstream tasks, the three variables have the advantage of uniform quality across the entire dataset. This allows for standardised, fair evaluation of representations extracted from Sentinel-2 and Sentinel-1 data across a range of pre-trained encoders as part of the Major TOM Embedding suite.
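As a rough illustration of such a proxy-task probe, the sketch below fits a linear model on pre-computed embeddings against a globally uniform target such as nightlight intensity; all file names, shapes, and the choice of a ridge regressor are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch: linear probing of pre-computed embeddings against a
# globally uniform proxy variable (e.g. VIIRS nightlights). File names and
# shapes are illustrative placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

embeddings = np.load("majortom_embeddings.npy")   # (n_cells, dim), assumed file
nightlights = np.load("viirs_proxy.npy")          # (n_cells,), assumed file

X_tr, X_te, y_tr, y_te = train_test_split(
    embeddings, nightlights, test_size=0.2, random_state=0)

probe = Ridge(alpha=1.0).fit(X_tr, y_tr)          # frozen encoder, linear head
print("proxy-task R^2:", r2_score(y_te, probe.predict(X_te)))
```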

Secondly, a suite of techniques for comparing internal representation geometries of latent space vectors from multiple models is introduced to evaluate the similarities and differences between individual models. This approach does not require any reference labels, enabling a deeper understanding of geospatial semantic relationships encoded by different architectures.
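The abstract does not name its comparison technique; one widely used label-free measure of representational similarity is linear centered kernel alignment (CKA, Kornblith et al., 2019), sketched below as an illustrative stand-in.

```python
# Illustrative label-free comparison of two models' embedding geometries via
# linear CKA; not necessarily the technique used in the abstract.
import numpy as np

def linear_cka(X, Y):
    """Similarity in [0, 1] of two embedding matrices (n_samples x dim)
    computed on the same set of inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))
```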

Ultimately, this work advances the large-scale evaluation of deep learning models for Earth observation data, utilizing these model comparisons to develop a set of recommendations for future benchmarking efforts within the Earth Science community.

How to cite: Czerkawski, M., Kluczek, M., and Bojanowski, J. S.: Time, Space, and Nightlights: Global Evaluation of Major TOM Earth Embeddings, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-13425, https://doi.org/10.5194/egusphere-egu26-13425, 2026.

14:15–14:25
|
EGU26-19720
|
On-site presentation
Karin Mora, Julia Peters, Konstantin Ntokas, Martin Reinhardt, Gunnar Brandt, Teja Kattenborn, Guido Kraemer, David Montero, Clemens Mosig, and Miguel D. Mahecha

Monitoring and understanding Earth system dynamics and their response to climate change and human activity requires innovative approaches to analyse complex and multivariate remote sensing data. However, the current trend is towards large models that demand substantial memory and computational power to train. The DeepFeatures project addresses this challenge by developing an embedding approach to create Feature Data Cubes, which capture the underlying spatio-temporal ecosystem dynamics as a low-dimensional representation in latent space. These reduced representations enable the use of simpler, resource-efficient downstream models, which are easier to train and require minimal computational resources.

Specifically, the project builds on the rationale that each spectral index (SI), calculated from spectral bands and representing certain surface properties such as vegetation greenness, reflects a specific aspect of ecosystem behaviour. Despite the development of over two hundred spectral indices, current studies often narrow their focus to individual SIs, overlooking the broader context of land surface processes represented by the SIs left unconsidered. The DeepFeatures project addresses this challenge by adopting a spatio-temporal multivariate approach. The SIs are derived from Sentinel-2 observations to generate an SI Data Cube. A deep learning embedding algorithm is applied to reduce the SI dimension and extract a latent space, creating the Feature Data Cubes.
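A minimal sketch of this idea, assuming a per-pixel stack of SIs compressed by a small autoencoder whose latent activations form the Feature Data Cube; dimensions and architecture are illustrative, not the project's actual embedding algorithm.

```python
# Hedged sketch: compress a stack of spectral indices per pixel/time step to a
# low-dimensional latent vector with a small autoencoder. All sizes assumed.
import torch
import torch.nn as nn

n_indices, latent_dim = 200, 8   # e.g. >200 published SIs -> 8 latent features

class SIAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_indices, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_indices))

    def forward(self, x):
        z = self.enc(x)            # latent features -> the Feature Data Cube
        return self.dec(z), z

model = SIAutoencoder()
si_cube = torch.rand(1024, n_indices)          # (pixels*time, indices), dummy
recon, latent = model(si_cube)
loss = nn.functional.mse_loss(recon, si_cube)  # reconstruction objective
```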

To demonstrate the potential of the Feature Data Cubes, the project focuses on inference across a range of scientific applications, including modelling gross primary production, analysing tree mortality and greening trends, biodiversity monitoring for conservation, comparing phenological features using satellite and crowd-sourced data, and studying the ecological impacts of open-pit lignite mining.

DeepFeatures emphasises the deployment of transparent and reproducible workflows, from generating Sentinel-2-derived Training Data Cubes to creating Feature Data Cubes. It aims to provide an accessible, extensible, and modifiable framework for diverse applications, fostering broad community engagement and enabling open exploration of Earth system dynamics.

This presentation will showcase the methodology, scientific cases, and transformative potential of the DeepFeatures framework, highlighting its contributions to Earth observation and climate research.

The project DeepFeatures is funded by ESA’s AI4Science activity. Website: https://rsc4earth.de/project/deepfeatures/ 

How to cite: Mora, K., Peters, J., Ntokas, K., Reinhardt, M., Brandt, G., Kattenborn, T., Kraemer, G., Montero, D., Mosig, C., and Mahecha, M. D.: DeepFeatures: Learning Latent Representations from Spectral Indices for Ecosystem Monitoring, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19720, https://doi.org/10.5194/egusphere-egu26-19720, 2026.

14:25–14:35
|
EGU26-10800
|
ECS
|
On-site presentation
|
Miguel Espinosa, Eva Gmelich Meijling, Valerio Marsocci, Elliot J. Crowley, and Mikolaj Czerkawski

The work presented herein showcases early results of COP-GEN, a general-purpose diffusion model supporting flexible zero-shot translation between a number of popular data modalities related to the Copernicus programme: Sentinel-2 (both L1C and L2A), Sentinel-1 RTC, Copernicus DEM-30, Land Use Land Cover Maps, Cloud Masks, geospatial coordinates, and timestamps.

COP-GEN is designed as a diffusion model with a transformer backbone, which offers two concrete advantages. Firstly, the diffusion formulation respects the stochastic nature of cross-modal translation tasks: nearly every conditional generation query can be satisfied by a diverse range of plausible outputs rather than a single deterministic sample. Secondly, the sequence-based architecture facilitates the integration of diverse data modalities by flattening their latent representations, along with modality-specific diffusion timesteps, into a single sequence of tokens. Consequently, COP-GEN is capable of synthesising missing data from any subset of modalities in a zero-shot manner.
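The following sketch (not the released COP-GEN code) illustrates the described sequence construction: per-modality latent tokens are concatenated into one sequence, each carrying its own modality and diffusion-timestep embedding, so conditioning modalities can stay clean (timestep 0) while the others are denoised.

```python
# Illustrative token-sequence construction for a multimodal diffusion
# transformer; modality names, sizes, and embedding scheme are assumptions.
import torch
import torch.nn as nn

d_model = 256
modalities = ["s2_l2a", "s1_rtc", "dem"]          # illustrative subset
t_embed = nn.Embedding(1000, d_model)             # diffusion timestep embedding
m_embed = nn.Embedding(len(modalities), d_model)  # modality identity embedding

def build_sequence(latents, timesteps):
    """latents: name -> (n_tokens, d_model) tensor; timesteps: name -> int.
    Conditioning modalities keep timestep 0 (clean); generated ones are noised."""
    parts = []
    for i, name in enumerate(modalities):
        tok = (latents[name]
               + m_embed.weight[i]
               + t_embed(torch.tensor(timesteps[name])))
        parts.append(tok)
    return torch.cat(parts, dim=0)                # one flat token sequence
```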

The model is pre-trained at global scale on MajorTOM, using over one million paired, geographically distributed samples spanning diverse climate zones, land-cover types, and acquisition conditions. By training jointly on matched data modalities, COP-GEN can, for example, estimate Land Use Land Cover, cloud coverage, atmospheric correction, and the spatiotemporal context of the available observations.

The first set of results indicates strong generative capability and high output diversity across modalities. The work concludes by discussing the available open-source implementation along with potential use cases.

How to cite: Espinosa, M., Gmelich Meijling, E., Marsocci, V., Crowley, E. J., and Czerkawski, M.: COP-GEN: Stochastic Generative Modelling of Copernicus Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-10800, https://doi.org/10.5194/egusphere-egu26-10800, 2026.

14:35–14:45
|
EGU26-7697
|
ECS
|
On-site presentation
Charly Zimmer, Josefine Umlauft, Guido Kraemer, David Montero, and Miguel D Mahecha

Earth observation datasets, especially those derived from remote sensing, are often characterized by significant data gaps. However, the pretraining of Geospatial Foundation Models requires mostly complete samples, leading to very selective sampling strategies that leave out large parts of the original observations. The problem is exacerbated in spatiotemporal data where these restrictions apply to the entire time series. Systems like Prithvi-EO-2.0 allow very small gap regions that can be addressed with interpolation during preprocessing. But a solution for integrating significant gap areas (>20% of the sample) into the pretraining process is yet to be established. We introduce an architecture that builds upon the random masking strategies in popular MAE-style architectures by additionally force-masking patches that contain gaps. Doing so requires a BERT masking scheme where masked patches are encoded instead of being removed from the sequence. Custom loss functions are introduced to account for the gaps in both the targets and the masked patches. While the resulting encoder-only architecture does not benefit from the reduced computational complexity in MAE-style masking, we mitigate this effect by using factorized space-time attention in the Video Vision Transformer (ViViT) backbone, thus creating a simple and lightweight model that is easily scalable. We demonstrate the potential of the architecture by performing spatiotemporal representation learning in a multivariate setup involving global Land Surface Temperature (LST) observations. The model is embedded in a framework that provides customizable sampling strategies for large-scale Earth observation datasets, including control over parameters like the maximum gap ratio per sample, the sampling strides, and the involved variables in shared-grid datasets like Earth System Data Cubes (ESDC). This flexibility in sampling enables the generation of training datasets with millions of samples, thus exposing the full volume of information stored in Earth observation data to Geospatial Foundation Models.
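A hedged sketch of the two core ingredients described above, with NaNs standing in for data gaps; names and shapes are illustrative, not the authors' implementation.

```python
# Gap-aware masking: gap patches are force-masked on top of random masking
# (BERT-style, so masked patches stay in the sequence), and the reconstruction
# loss ignores gap pixels in the targets.
import torch

def gap_aware_mask(patches, mask_ratio=0.75):
    """patches: (n, patch_dim) with NaNs marking gaps. Returns a boolean mask."""
    has_gap = torch.isnan(patches).any(dim=1)           # force-mask gap patches
    rand_mask = torch.rand(patches.shape[0]) < mask_ratio
    return has_gap | rand_mask

def masked_loss(pred, target, mask):
    """MSE over masked patches only, excluding NaN (gap) target pixels."""
    valid = ~torch.isnan(target) & mask.unsqueeze(1)    # masked, non-gap pixels
    return ((pred - target)[valid] ** 2).mean()
```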

How to cite: Zimmer, C., Umlauft, J., Kraemer, G., Montero, D., and Mahecha, M. D.: Gap-Aware Transformer-Based Foundation Model Pretraining for Spatiotemporal Earth Observation Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7697, https://doi.org/10.5194/egusphere-egu26-7697, 2026.

14:45–15:00
Rethinking EO Foundation Models: stability, benchmarking, and multimodal generalization
15:00–15:10
|
EGU26-2602
|
ECS
|
On-site presentation
Mehmet Ozgur Turkoglu and Helge Aasen

Recent progress in Earth Observation (EO) foundation models has raised expectations that large-scale pretraining will yield general-purpose representations comparable to those in natural language processing and computer vision. In this work, we show that this promise has not yet been realized. We introduce spectral stability as a principled criterion for foundation models, measuring the extent to which the principal singular subspaces of pretrained weights are preserved during fine-tuning. Through this lens, we conduct a comparative analysis of several EO foundation models, including AnySat and Presto, alongside established models from vision and language, namely DINOv2 and BERT. Our analysis reveals a stark contrast between domains. BERT and DINOv2 exhibit strong spectral stability, with fine-tuning primarily inducing rotations within a small low-rank subspace. In contrast, EO models display severe spectral instability, where fine-tuning substantially rewrites their dominant singular directions. We show that this instability explains two key limitations of current EO foundation models. First, pretraining does not consistently accelerate downstream learning. Second, low-rank adaptation methods such as LoRA can fail or collapse, as the pretrained subspaces are only partially useful. Using extensive experiments on the TimeMatch benchmark for cross-regional crop classification, we demonstrate that despite strong performance claims, pretrained EO models yield inconsistent or marginal improvements over random initialization and do not achieve state-of-the-art performance. These findings indicate that current EO models lack the representational universality characteristic of true foundation models. We conclude that spectral stability is a critical property for robust transfer learning in Earth Observation, and we argue that future EO foundation models should prioritize spectral coherence through improved pretraining objectives and architectural designs that better capture the underlying structure of geospatial data.
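One plausible instantiation of such a spectral-stability measure, sketched below, is the normalised overlap between the top-k left singular subspaces of a weight matrix before and after fine-tuning; the paper's exact metric may differ.

```python
# Illustrative subspace-preservation score: 1.0 means the top-k singular
# subspace of the pretrained weights is fully preserved after fine-tuning.
import torch

def subspace_overlap(W_pre, W_ft, k=8):
    U_pre, _, _ = torch.linalg.svd(W_pre, full_matrices=False)
    U_ft, _, _ = torch.linalg.svd(W_ft, full_matrices=False)
    # Frobenius overlap of the two rank-k subspaces, normalised to [0, 1]
    M = U_pre[:, :k].T @ U_ft[:, :k]
    return (torch.linalg.norm(M, "fro") ** 2 / k).item()
```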

How to cite: Turkoglu, M. O. and Aasen, H.: Are Earth Observation Foundation Models Really Foundation Models? Investigation Based on Spectral Analysis, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-2602, https://doi.org/10.5194/egusphere-egu26-2602, 2026.

15:10–15:20
|
EGU26-12736
|
ECS
|
On-site presentation
Lucia Gordon, Serge Belongie, Christian Igel, and Nico Lang

Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal downstream tasks with 12 input modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models on MMEarth-Bench and find that multimodal models generally perform best. While pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor and using multimodal inputs at test time can sometimes lead to geographic overfitting. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time, regardless of whether the pretrained model accepts them as input. We show that TTT-MMR improves model performance on both random and geographic test splits, and that geographic batching (TTT-MMR-Geo) leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at https://lgordon99.github.io/mmearth-bench.
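A schematic of the test-time-training pattern described above, with placeholder modules; the released TTT-MMR implementation will differ in detail.

```python
# Schematic test-time training: before predicting on a test batch, a copy of
# the encoder is briefly tuned with a self-supervised multimodal-reconstruction
# loss. encoder, task_head, and recon_head are placeholder modules.
import copy
import torch

def ttt_predict(encoder, task_head, recon_head, test_batch, steps=10, lr=1e-4):
    enc = copy.deepcopy(encoder)                  # adapt a per-batch copy
    opt = torch.optim.Adam(enc.parameters(), lr=lr)
    for _ in range(steps):
        z = enc(test_batch["inputs"])
        loss = torch.nn.functional.mse_loss(      # label-free objective over
            recon_head(z), test_batch["all_modalities"])  # all test modalities
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return task_head(enc(test_batch["inputs"]))   # adapted prediction
```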

How to cite: Gordon, L., Belongie, S., Igel, C., and Lang, N.: MMEarth-Bench: Global Environmental Tasks for Multimodal Geospatial Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12736, https://doi.org/10.5194/egusphere-egu26-12736, 2026.

15:20–15:30
|
EGU26-21700
|
ECS
|
On-site presentation
Mehran Alizadeh Pirbasti, Gavin McArdle, and Vahid Akbari

Pretraining geospatial foundation models (FMs) is expensive, and architectural choices control inductive bias for multiscale context, cross-resolution behavior, and band/sensor variation. Therefore, benchmarking reduces the risk of scaling the “wrong” base. A model benchmarking framework for geospatial image segmentation is a critical prerequisite for developing robust and scalable geospatial FMs. In the emerging era of Earth observation FMs, success hinges on strong, well-characterized base architectures that can generalize across sensors, modalities, and geographies. The extreme heterogeneity of Earth observation vision data (different spectral bands, resolutions, and regions) makes such generalization especially challenging, underscoring the need for systematic, controlled benchmarking across diverse model families to identify viable architectures for different scenarios.
Our work rigorously evaluates a broad spectrum of segmentation architectures and backbones under consistent conditions. We benchmark classical convolutional architectures (U-Net, DeepLab, UPerNet, FPN, PAN, and LinkNet) alongside modern transformer-based models (Dense Prediction Transformer (DPT) and SegFormer). For this comparison, we use representative backbones from both CNNs (ResNet and MobileNet) and Mix Vision Transformer (MiT). By comparing these heterogeneous models on equal footing, we determine which architectural patterns and hybrid combinations yield representations most conducive to generalization. This diversity in evaluation identifies well-founded architectural bases for geospatial FMs.
To guide architecture selection and pipeline design, we deploy a comprehensive suite of metrics covering both accuracy and efficiency. We evaluate segmentation accuracy via IoU, Dice, and boundary F1-score, and also measure efficiency (convergence speed and inference latency). These holistic benchmarks reveal critical trade-offs. For instance, some lightweight CNN models excel in speed, while transformer models boost the boundary F1-score. By capturing such nuances, our benchmark informs which architectures are best suited as general-purpose base models. It highlights how certain encoder–decoder combinations optimally balance performance and efficiency, and flags architectures with high transfer-readiness for new tasks and domains.
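For reference, the two region-overlap metrics mentioned above have simple binary-mask forms (the benchmark's exact multi-class variants are not specified here):

```python
# Reference implementations of IoU and Dice for binary segmentation masks.
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2 * inter / denom if denom else 1.0
```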
The result is a reproducible, transferable model landscape that serves as a blueprint for FM development. Our benchmark framework effectively preconditions the FM pipeline, enabling researchers to enter the scaling phase with proven architecture candidates that have demonstrated cross-task and cross-sensor robustness. This “model landscape” allows subsequent large-scale pretraining to confidently build on architectures that ensure broad downstream generalization even in agentic (autonomous) deployment scenarios.
Finally, we situate this work within the broader trend toward sensor-agnostic, self-supervised FMs in Earth observation. We argue that intelligent architecture search must precede any massive self-supervised pretraining effort. Early vetting of architectures under diverse conditions ensures that large-scale training resources are invested in the most promising designs. In summary, we frame this hybrid benchmarking framework as a strategic new layer in the geospatial FM ecosystem. The insights extend beyond segmentation, providing a reference point for building fine-tunable, sensor-agnostic foundation models that can be readily adapted to various downstream tasks and even deployed onboard satellites or other edge platforms. By solidifying architecture evaluation as an essential step, this work makes a serious scientific and strategic contribution toward the next generation of Earth observation AI.

How to cite: Alizadeh Pirbasti, M., McArdle, G., and Akbari, V.: Segmentation Model Benchmarking: A Strategic Prerequisite for Robust Geospatial Foundation Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21700, https://doi.org/10.5194/egusphere-egu26-21700, 2026.

15:30–15:45

Posters on site: Tue, 5 May, 14:00–15:45 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Tue, 5 May, 14:00–18:00
Chairpersons: Begüm Demir, Rahul Ramachandran, Valerio Marsocci
Foundations and representation learning (Architectures, SSL, VLMs, comparisons)
X4.50
|
EGU26-2530
Mohanad Albughdadi, Marica Antonacci, Vasileios Baousis, Federico Fornari, Tolga Kaprol, and Claudio Pisa

Large-scale foundation models trained on multi-sensor satellite imagery have been driving recent advances in Earth Observation (EO) tasks. Although such models achieve impressive transferability across diverse downstream tasks, their computational and memory demands hinder accessibility, reproducibility, and deployment in resource-constrained environments. This work explores a compact and efficient alternative, introducing a metadata-aware Mixture-of-Experts Masked Autoencoder (MoE-MAE) for EO representation learning (Albughdadi, 2025).

The proposed MoE-MAE is a self-supervised transformer-based architecture with only 2.5 million parameters. It combines sparse expert routing and geo-temporal conditioning. The sparse routing allows token specialization while keeping active computation low. The geo-temporal conditioning injects information about latitude, longitude, and cyclic temporal attributes directly into the model. This design enables the model to exploit spatial and temporal regularities inherent in EO data without requiring dense and computationally costly transformers.
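A plausible sketch of such geo-temporal conditioning, mapping latitude, longitude, and day-of-year to continuous cyclic features before injection into the model; the paper's exact encoding may differ.

```python
# Illustrative geo-temporal feature vector: angles are encoded as sin/cos
# pairs so that, e.g., 31 December lands next to 1 January.
import numpy as np

def geo_temporal_features(lat, lon, day_of_year):
    lat_r, lon_r = np.radians(lat), np.radians(lon)
    doy = 2 * np.pi * day_of_year / 365.25
    return np.array([
        np.sin(lat_r), np.cos(lat_r),
        np.sin(lon_r), np.cos(lon_r),
        np.sin(doy), np.cos(doy),
    ])
```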

The model is pretrained on the BigEarthNet-Landsat (BEN-LS) dataset (Corley et al., 2025) using a masked reconstruction loss augmented with auxiliary unmasked and load-balancing losses to encourage stable expert utilization. The learned encoder representations are then evaluated using linear probing on two benchmark datasets: (1) BEN-LS, a multi-label land-cover dataset with explicit metadata, and (2) EuroSAT-Landsat (EuroSAT-LS) (Corley et al., 2025), a single-label classification dataset without metadata. Despite the encoder's small size (~2.3 M parameters), the proposed MoE-MAE achieves results competitive with models orders of magnitude larger. On BEN-LS, the frozen encoder achieves a micro mean average precision of 0.767, comparable to SSL4EO-L ViT-S/16 MoCo v2 (0.775) (Stewart et al., 2023). On EuroSAT-LS, the model maintains strong transferability, achieving 84.2% accuracy, even in the absence of geo-temporal metadata.

Ablation and visualization studies reveal expert specialization across spatial patterns: some experts respond primarily to vegetation, others to water or textured regions. This demonstrates interpretable behaviour and complementary feature learning. Additionally, only about half of the model's expert feed-forward capacity is activated per token, confirming computational sparsity in practice. These findings suggest that such models can retain strong representational power while substantially reducing training and inference costs.

This work presents a first step toward small-scale architectures for EO representation learning that integrate metadata and leverage sparse computation to approach the performance of massive transformers. Future work will extend this framework to multi-sensor and multi-temporal datasets to capture dynamic Earth processes efficiently.

Albughdadi, M. (2025). Lightweight Metadata-Aware Mixture-of-Experts Masked Autoencoder for Earth Observation. arXiv:2509.10919.

Stewart, A. J., Lehmann, N., Corley, I. A., Wang, Y., Chang, Y.-C., Braham, N. A. A., Sehgal, S., Robinson, C., & Banerjee, A. (2023). SSL4EO-L: Datasets and Foundation Models for Landsat Imagery. arXiv:2312.05241.

Corley, I., Sharma, L., and Crasto, R. (2025). Landsat-Bench: Datasets and Benchmarks for Landsat Foundation Models. arXiv:2506.08780.

How to cite: Albughdadi, M., Antonacci, M., Baousis, V., Fornari, F., Kaprol, T., and Pisa, C.: Efficient Earth Observation Representation Learning Using Metadata-Aware Mixture-of-Experts Masked Autoencoder, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-2530, https://doi.org/10.5194/egusphere-egu26-2530, 2026.

X4.51
|
EGU26-11394
|
ECS
Guosen Xu, Huanfeng Shen, Xinghua Li, Mingjie Xu, Dekun Lin, and Tao Jiang

The emergence of foundation models marks a transformative era in Earth observation, delivering powerful and adaptable tools to tackle the complexities of processing massive satellite imagery. Currently, land cover mapping faces two primary obstacles: 1) the prohibitive cost of data annotation and the heavy reliance on high-quality labels; and 2) the significant spectral and spatial variability of identical ground objects caused by differences in temporal phases, locations, and sensors. Visual Foundation Models (VFMs), with their potent generalization capabilities, offer a means to effectively bridge the domain gap. Motivated by this, we propose HiD-FM, a high-resolution remote sensing foundation model leveraging knowledge distillation and feature fusion. Specifically, HiD-FM undergoes self-supervised pre-training on a dataset of one million high-resolution unlabeled images. By synergizing knowledge distillation with feature fusion, it integrates the generalization power of pre-trained VFMs into a semi-supervised learning framework, thereby boosting performance on unlabeled data and enhancing fine-grained feature representation. Extensive experiments on semantic segmentation tasks demonstrate that HiD-FM consistently outperforms several remote sensing foundation models (such as RVSA, SMLFR, and CMID), particularly in data-scarce scenarios. On the LoveDA and GID-15 datasets, our method surpasses both specialized models and existing foundation models across various labeling ratios. Notably, using only 30% of the training data, HiD-FM achieved an overall accuracy (OA) of 83.19% on the GID-15 dataset. Furthermore, transfer learning experiments on GF-2 imagery across diverse spatiotemporal contexts yielded superior visualization results. HiD-FM enables rapid and cost-effective adaptation to target domains, thereby significantly advancing the field of remote sensing interpretation.

How to cite: Xu, G., Shen, H., Li, X., Xu, M., Lin, D., and Jiang, T.: HiD-FM: A High-Resolution Remote Sensing Foundation Model with Knowledge Distillation and Feature Fusion for Image Semantic Segmentation, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11394, https://doi.org/10.5194/egusphere-egu26-11394, 2026.

X4.52
|
EGU26-12871
|
ECS
Zhenshi Li, Xueliang Zhang, Pengfeng Xiao, and XiaoXiang Zhu

Vision-language foundation models (VLFMs), such as CLIP, have demonstrated remarkable generalizability across diverse downstream tasks, including both cross-modal and vision-centric tasks. Leveraging large-scale textual supervision, VLFMs capture a broad spectrum of visual concepts and achieve breakthrough performance in zero-shot image understanding. However, current remote sensing (RS)-specific VLFMs, while performing well on image-level tasks, exhibit limited capability in fine-grained tasks such as open-vocabulary semantic segmentation (OVSS). This limitation stems from their adherence to the CLIP training paradigm, which aligns image and text features only at the global level, thereby degrading performance in tasks requiring high-quality visual representations at the local level. Moreover, existing VLFMs that incorporate fine-grained alignment mechanisms still exhibit limited performance on remote sensing tasks, whether through direct transfer to RS scenarios or fine-tuning on RS image-caption datasets. This further underscores the need for developing RS-tailored fine-grained VLFMs.

To address this, we construct the first multi-granularity RS image-text dataset, MGRS-200k (Figure 1). MGRS-200k contains approximately 200k RS images, each annotated with both short and long global captions, as well as multiple object-level bounding boxes with corresponding categories, totaling over one million instances. We further investigate existing fine-grained VLFM training methods and find that their explicit region-text alignment strategies often disrupt semantic coherence, as their underlying assumptions do not hold in RS scenarios, and thus degrade fine-grained understanding.

Building on these, we propose FarSLIP, a Fine-grained Aligned RS Language-Image Pretraining framework (Figure 2). FarSLIP first employs patch-to-patch self-distillation to align local and global visual cues, enhancing feature discriminability while preserving semantic coherence. It then applies CLS token-based region-category alignment using the MGRS-200k dataset to further improve spatial awareness. FarSLIP achieves state-of-the-art performance in zero-shot RS image understanding, excelling not only on image-level tasks such as scene classification and image-text retrieval, but more importantly on fine-grained tasks like OVSS. Additionally, it serves as a strong foundation for multimodal large language models (MLLMs) in RS image comprehension.

Figure 1. Examples of our proposed MGRS-200k dataset.

Figure 2. Overall architecture of FarSLIP. The model is trained in a two-stage manner. In Stage I, FarSLIP is optimized with image-caption alignment and patch-to-patch self-distillation. In Stage II, image-caption alignment and region-category alignment are jointly employed on the MGRS-200k dataset.

How to cite: Li, Z., Zhang, X., Xiao, P., and Zhu, X.: FarSLIP: A Vision-Language Foundation Model for Fine-Grained Remote Sensing Understanding, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12871, https://doi.org/10.5194/egusphere-egu26-12871, 2026.

X4.53
|
EGU26-18737
|
ECS
Eva Gmelich Meijling, Valerio Marsocci, Frederick Schindlegger, Kenzo Bounegta, and Nicolas Longepe

This study presents a comparative analysis of two diverse Geospatial Foundation Models (GFMs) developed by consortia under the European Space Agency (ESA): THOR and TerraMind. THOR introduces a compute-adaptive architecture designed to handle heterogeneous sensors and variable patch sizes. This enables flexible compute–accuracy trade-offs and high performance in limited training data regimes. It is also the first GFM to extensively include Sentinel-1, -2, and -3 data. TerraMind, in contrast, is a multimodal GFM with both discriminative and generative capabilities, pretrained with a dual‑scale scheme that fuses token‑level context and pixel‑level detail, enabling any‑to‑any cross‑modal generation and Thinking‑in‑Modalities (TiM) to infer missing modalities during fine‑tuning and inference. The cross-comparison, aimed at understanding the maturity of European technologies in AI4EO, covers a collection of Earth Observation use cases provided by the two consortia, encompassing several tasks (segmentation, change detection, and classification) across diverse and overlooked domains, including climate disaster analysis, methane leak detection, forest biomass monitoring, and sea ice mapping. To ensure consistent preprocessing and evaluation of the two models and use cases, we benchmarked them in two widely used and acknowledged frameworks: PANGAEA and TerraTorch. The analysis focuses on task coverage, architectural capabilities, and performance metrics, highlighting differences in adaptability, modality integration, and downstream application effectiveness. Results provide insights into the strengths and limitations of current GFMs in various scenarios, offering lessons on GFM approaches that extend beyond THOR and TerraMind.

How to cite: Gmelich Meijling, E., Marsocci, V., Schindlegger, F., Bounegta, K., and Longepe, N.: TerraMind vs. THOR: A comparative analysis of ESA’s Geospatial Foundation Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18737, https://doi.org/10.5194/egusphere-egu26-18737, 2026.

X4.54
|
EGU26-19581
|
ECS
Anna Luise von Blohn, Miguel Mahecha, and Julia Peters

Machine learning models for Earth system prediction tasks differ substantially in their training strategy and the type of context they encode. Task-specific models are trained from scratch using a limited set of variables assumed to directly influence the prediction target. These models lack broad spatial and cross-variable context. In contrast, Earth system foundation models are pre-trained on large and heterogeneous data sets and are expected to capture richer environmental context that can be transferred to downstream prediction tasks.

In other machine learning domains, such as natural language processing, fine-tuning pre-trained foundation models has become standard practice due to consistent performance gains over models trained from scratch. Whether similar benefits arise for Earth system time-series prediction tasks remains unclear.

To address this gap, we compare task-specific transformer encoder models operating on pixel-level time series with fine-tuned Earth system foundation models across a set of time-series prediction tasks describing vegetation response to environmental change, including Gross Primary Productivity. This comparison isolates the effect of pre-training on predictive performance by keeping the prediction targets fixed. 

Our aim is to determine which modelling approach yields higher predictive accuracy for environmental time-series analyses.

How to cite: von Blohn, A. L., Mahecha, M., and Peters, J.: Foundation versus Task-Specific Models for Environmental Time-Series Prediction, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19581, https://doi.org/10.5194/egusphere-egu26-19581, 2026.

Embeddings, retrieval & similarity search
X4.55
|
EGU26-11672
|
ECS
Yijie Zheng, Weijie Wu, Bingyue Wu, Guoqing Li, Mikolaj Czerkawski, and Konstantin Klemmer

Recent advancements in Earth embeddings have opened up new frontiers for geosciences, enabling efficient analysis of vast volumes of geospatial data. However, the practical utilization of these embeddings is often hindered by complex software environments and the requirement for specialized computational expertise. To help democratize access to Earth embeddings, we introduce EarthEmbeddingExplorer, an open-source, web-based application designed to enhance the accessibility, understanding and interactivity of Earth embeddings for the broader geoscience community. 

EarthEmbeddingExplorer integrates multiple state-of-the-art foundation models, including SatCLIP, FarSLIP, and SigLIP, to support cross-modal retrieval of Sentinel-2 imagery via text, image, and geographic location queries. Our implementation leverages the MajorTOM Core-S2L2A dataset as the primary data source; we pre-computed approximately 250,000 embeddings per model based on a uniform spatial sampling of the MajorTOM grid. This approach ensures a representative global coverage of 1.2% of the Earth's land surface. To ensure accessibility, all models and datasets are hosted on open-source frameworks, specifically ModelScope and Hugging Face. The application provides an intuitive interface for visualizing the geographical distribution of the retrieved results, rendering top-match thumbnails, and exporting comprehensive metadata. Such transparent and low-cost access to large-scale embedding analysis is essential for identifying model-specific advantages and limitations. By enabling instant cross-model comparisons within specific spatiotemporal contexts, EarthEmbeddingExplorer allows users to evaluate model performance for their unique monitoring needs and domains of interest.
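Schematically, such cross-modal retrieval reduces to a cosine-similarity ranking over pre-computed, L2-normalised embeddings; the encoder call and file names below are placeholders for whichever model (e.g. SigLIP) serves the query.

```python
# Illustrative text-to-image retrieval over embeddings in a shared space.
import numpy as np

img_emb = np.load("s2_embeddings.npy")     # (n_images, dim), pre-computed
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

def retrieve(text_query, encode_text, k=5):
    q = encode_text(text_query)            # (dim,), from the matching text tower
    q = q / np.linalg.norm(q)
    scores = img_emb @ q                   # cosine similarity to every tile
    return np.argsort(-scores)[:k]         # indices of the top-k matches
```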

Ongoing development focuses on expanding EarthEmbeddingExplorer’s capabilities by integrating additional embedding models such as DINOv2, and increasing global spatial coverage. We are further implementing FAISS-based vector similarity search to enable near-instantaneous queries across tens of millions of global embeddings. Future iterations will prioritize modular software architecture, standardized APIs, and detailed documentation to facilitate community-driven contributions of new embedding models and datasets. The web applications are accessible at https://huggingface.co/spaces/ML4Sustain/EarthExplorer and at https://www.modelscope.cn/studios/VoyagerX/EarthExplorer.

How to cite: Zheng, Y., Wu, W., Wu, B., Li, G., Czerkawski, M., and Klemmer, K.: Sentinel-2 Image Retrieval with Global, Cross-modal Embeddings, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11672, https://doi.org/10.5194/egusphere-egu26-11672, 2026.

X4.56
|
EGU26-14376
|
ECS
Bartosz Augustyn, Marcin Kluczek, Jedrzej Bojanowski, and Mikolaj Czerkawski

Foundation Models enable rich semantic representations of Earth Observation data by using embeddings generated from large, heterogeneous, and often unlabeled datasets. One of their most impactful applications is semantic similarity search, which allows EO data discovery based on context and meaning rather than metadata alone.  

This work presents global EO embedding datasets deployed within the Copernicus Data Space Ecosystem (CDSE), enabling large-scale semantic and similarity search across satellite imagery. The embeddings are generated using multimodal Foundation Models that map EO imagery and textual queries into a shared space, allowing natural language to retrieve semantically related observations. This approach supports the discovery of complex geospatial patterns such as land cover types, human activities, or environmental phenomena without explicit labeling.  

To ensure global consistency and scalability, the embedding generation and indexing are supported by the Major TOM standard, which provides a unified geospatial reference framework based on a global grid of points. Major TOM enables sampling across EO missions while avoiding destructive preprocessing, thus preserving raw, undistorted pixel values.

Efficient similarity search over tens of millions of high-dimensional embeddings is achieved through FAISS vector indexing techniques, enabling immediate query results for global-scale datasets. Foundation Model embeddings, combined with standardized geospatial indexing and high-performance vector search, form a practical and scalable foundation for next-generation EO data discovery.
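A minimal FAISS pattern for this kind of search might look as follows (a flat inner-product index for clarity; at tens of millions of vectors an approximate index such as IVF or HNSW would likely be used):

```python
# Minimal FAISS similarity search over pre-computed embeddings; the file name
# and dimensionality are assumptions.
import faiss
import numpy as np

dim = 768
emb = np.load("embeddings.npy").astype("float32")  # (n, dim), assumed file
faiss.normalize_L2(emb)                            # inner product == cosine

index = faiss.IndexFlatIP(dim)
index.add(emb)

query = emb[:1]                                    # query with one embedding
scores, ids = index.search(query, 10)              # top-10 nearest neighbours
```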

 

How to cite: Augustyn, B., Kluczek, M., Bojanowski, J., and Czerkawski, M.: Similarity Search of Earth Observation Data Using Foundation Model Embeddings, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14376, https://doi.org/10.5194/egusphere-egu26-14376, 2026.

Applications, domain shift & label efficiency (operational EO)
X4.57
|
EGU26-548
|
ECS
Noopur Srivastava and Kamal Jain

Semantic Change Detection (SCD) plays a crucial role in understanding land surface dynamics, from urban expansion and deforestation to disaster impact assessment. However, despite the success of deep learning models in SCD tasks, their real-world deployment faces two critical challenges: significant domain shifts between geographically distinct regions and the prohibitive cost of data annotation for new locations. Models trained on public benchmarks, predominantly from developed countries, experience substantial performance degradation when applied to regions with different characteristics, such as Indian cities, due to inter-domain variance in sensors, atmospheric conditions, and landscapes. Additionally, substantial intra-domain variance within target regions compounds this problem, necessitating robust solutions that operate with limited labels.

To address these challenges, we propose SSLCD-Adapt, a novel hierarchical framework for label-efficient, cross-domain SCD that tackles inter-domain variance, intra-domain variance, and label constraints through a three-stage process. First, we employ Change-Enhanced Self-Supervised Pre-Training, where change representations are learned directly from unlabeled bi-temporal image pairs from the source domain using the FSC-180K benchmark dataset. By applying a Barlow Twins objective to fuse features from distorted views, the model learns invariant characteristics of change without manual annotation, providing superior initialization compared to ImageNet pre-training, whose natural images differ significantly from remote sensing imagery.

Second, Domain Alignment bridges the data distribution gap between the source (FSC-180K) and target (six Indian cities) domains. The source encoder remains frozen, while only three layers of the target encoder are trained within an adversarial setup. We employ a Domain-Adversarial Neural Network (DANN) that incorporates a Gradient Reversal Layer (GRL) with a Maximum Mean Discrepancy (MMD) loss to align feature distributions without requiring target labels. The domain classifier maximizes the H-divergence between the source and target domains, while the GRL reverses the gradients, forcing the target encoder to generate features similar to those of the source encoder, thereby achieving alignment in feature space and minimizing inter-domain variance.
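The gradient reversal layer itself is a standard construct (Ganin & Lempitsky, 2015); a generic PyTorch version, not the authors' exact code, is sketched below.

```python
# Standard gradient reversal layer as used in DANN-style adversarial alignment.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)            # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Negated (scaled) gradient on the backward pass, so the encoder is
        # trained to fool the domain classifier.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```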

Third, the trained target encoder undergoes Progressive Domain-Specific Fine-Tuning using limited labeled target data. The encoder trains for one-third of the epochs on target data, then for two-thirds of the epochs using city-specific batches with domain-specific batch normalization for each city, effectively minimizing intra-domain variance across the six Indian cities. Figure 1 shows the complete SSLCD-Adapt architecture.

Figure 1: Proposed SSLCD-Adapt Architecture

How to cite: Srivastava, N. and Jain, K.: SSLCD-ADAPT: A hierarchical framework for label-efficient cross-domain semantic change detection in complex environments, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-548, https://doi.org/10.5194/egusphere-egu26-548, 2026.

X4.58
|
EGU26-7472
|
ECS
Harald Kristen, Daniel Kulmer, and Manuela Hirschmugl

Effective protected area management requires frequent habitat monitoring to respond to rapid climate change and disturbances, yet traditional manual mapping methods cannot provide the temporal resolution needed for evidence-based policy decisions. We present a practical implementation of AI-driven change detection developed in collaboration with Gesaeuse National Park administration, Austria, to support operational habitat monitoring and management planning.

We address critical challenges in deploying AI technologies for complex environmental contexts: fuzzy class boundaries in natural habitats, highly imbalanced classes, and limited training data typical of protected areas. Using 15 years of high-resolution multimodal data (RGB, NIR, LiDAR, terrain attributes) covering 4,480 documented habitat changes across 15.3 km², we compare emerging geospatial foundation models (Clay v1.0, Prithvi-EO-2.0) against established U-Net architectures to identify the most robust approach for real-world application.

Results demonstrate that foundation models show superior cross-temporal robustness (Clay: 33% accuracy vs U-Net: 23% on unseen temporal data), a critical factor for operational monitoring systems. Integrating LiDAR improves detection accuracy from 30% to 50%. While overall accuracies are lower than in homogeneous agricultural landscapes, they reflect realistic performance for complex alpine environments and provide actionable information for park management.

To further enhance practical applicability for environmental agencies, we integrate object-based post-processing and physical constraints to filter misclassifications, making outputs directly usable for management decisions. This case study demonstrates practical strategies for implementing AI technologies in complex environmental monitoring contexts where traditional approaches face significant challenges. Building upon this work from the Habitalp 2.0 project, the BioDivAI project will extend these habitat mapping approaches to predict biodiversity impacts under various land use and land cover change scenarios, providing decision-makers with tools to assess trade-offs between economic activities and ecosystem protection.

How to cite: Kristen, H., Kulmer, D., and Hirschmugl, M.: Habitat and Land Cover Change Detection in Alpine Protected Areas: A Comparison of AI Architectures, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7472, https://doi.org/10.5194/egusphere-egu26-7472, 2026.

X4.59
|
EGU26-9080
|
ECS
Yunci Xu and Lizhen Lu

Accurate and scalable mapping of mariculture facilities is essential for coastal resource management, environmental monitoring, and sustainable aquaculture development. However, existing remote sensing–based segmentation approaches heavily rely on large amounts of annotated data or manual interaction, limiting their scalability and generalization. Recently, foundation models such as the Segment Anything Model (SAM) have demonstrated strong generalization ability across diverse visual domains. Nevertheless, SAM’s performance in remote sensing applications remains constrained by its reliance on manually selected prompts, which is impractical for large-scale or automated mapping tasks.

In this study, we propose an AutoPrompt-enhanced SAM framework (AutoPrompt-SAM) for the automated segmentation of mariculture facilities, specifically floating rafts and net cages, from high-resolution PlanetScope imagery. The proposed framework eliminates the need for human-provided prompts by introducing an AutoPrompt module that automatically generates high-quality point prompts for SAM, enabling prompt-free semantic segmentation in a fully automated manner.

As a foundation for this work, we construct a large-scale, high-quality mariculture facility segmentation dataset consisting of more than 1,000 manually annotated PlanetScope image patches with a spatial resolution of 3 m. Each sample is cropped to 256 × 256 pixels and includes pixel-level labels for floating rafts, net cages, and background. To the best of our knowledge, this dataset represents one of the first publicly usable high-resolution semantic segmentation benchmarks for mariculture facilities based on PlanetScope imagery.

The proposed AutoPrompt module learns to generate representative prompt points directly from image features, without requiring any human interaction during inference. These automatically generated prompts are then fed into SAM to produce segmentation masks. By leveraging SAM’s powerful pre-trained visual representations, our method effectively combines the generalization capability of foundation models with task-specific structural cues learned by the AutoPrompt module. Experimental results demonstrate that AutoPrompt-SAM achieves competitive performance compared with manually prompted SAM, while completely removing the need for human intervention.
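The AutoPrompt module is a trained network; as a hedged illustration, the step from a learned prompt heatmap to SAM-style point prompts could look like this (names and thresholds are assumptions):

```python
# Illustrative heatmap-to-prompt conversion: peaks of a learned heatmap become
# foreground point prompts in SAM's (x, y) convention.
import torch

def heatmap_to_prompts(heatmap, n_points=5, threshold=0.5):
    """heatmap: (H, W) in [0, 1] from the AutoPrompt module (assumed)."""
    flat = heatmap.flatten()
    scores, idx = flat.topk(n_points)
    keep = scores > threshold                  # drop low-confidence peaks
    ys = idx[keep] // heatmap.shape[1]
    xs = idx[keep] % heatmap.shape[1]
    points = torch.stack([xs, ys], dim=1)      # SAM expects (x, y) coordinates
    labels = torch.ones(len(points))           # all foreground prompts
    return points, labels
```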

Beyond mariculture mapping, we further investigate the transferability of the proposed framework. Without additional labeled data, AutoPrompt-SAM shows strong generalization performance when applied to other remote sensing segmentation scenarios, indicating that the learned prompt generation strategy captures transferable spatial and structural patterns. This highlights the potential of AutoPrompt-SAM as a label-efficient and domain-adaptive segmentation framework, capable of extending SAM to broader remote sensing applications.

Overall, this work makes three key contributions: (1) the construction of a large-scale, high-resolution PlanetScope mariculture facility segmentation dataset; (2) the proposal of an AutoPrompt-driven SAM framework that enables fully automated, prompt-free semantic segmentation while effectively exploiting SAM’s pre-trained knowledge; and (3) a demonstration of the framework’s strong transferability, offering a new pathway for reducing human intervention and annotation dependency in remote sensing segmentation tasks. The proposed approach provides a practical solution for adapting foundation models to large-scale Earth observation applications and paves the way toward more autonomous and scalable remote sensing analysis.

How to cite: Xu, Y. and Lu, L.: Towards Prompt-Free Segmentation of Mariculture Facilities Using an AutoPrompt-Enhanced Segment Anything Model, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9080, https://doi.org/10.5194/egusphere-egu26-9080, 2026.

X4.60
|
EGU26-11247
|
ECS
Damien Robert and Jan Dirk Wegner

The growing availability of Earth Observation (EO) data enables monitoring of terrestrial ecosystems at unprecedented spatio-temporal resolutions. In practice, however, effective use of EO data remains constrained by substantial technical barriers. Working with raw, multi-modal EO imagery requires specialised domain expertise, large data transfers, access to high-performance computing infrastructure, and advanced machine learning (ML) skills. These requirements limit the accessibility of EO-based analytics for many downstream applications.

Geospatial foundation models (GFMs) provide a promising alternative by learning general-purpose representations from large volumes of unlabeled EO data. By decoupling representation learning from downstream task modelling, GFMs allow users to exploit expressive features from modern deep learning models with limited EO or deep learning expertise and modest computational resources.

In this work, we investigate embedding-based GFM workflows for forest disturbance monitoring, where timely inference and regional customisation are often more critical than maximising absolute predictive accuracy. Forest disturbances such as logging, windthrows, fires, pests, and diseases can occur abruptly and require rapid detection to support conservation, policy-making, and risk-management efforts.

Machine learning methods for forest disturbance detection from EO data are well established and have shown strong performance on regional benchmarks. However, much of this work remains confined to academic demonstrations and is rarely translated into operational monitoring systems. Existing forest monitoring tools, including those aggregated by Global Forest Watch, typically rely on region- and sensor-specific models with limited feature expressivity. These systems may benefit from the rich multi-modal and spatio-temporal representations learned by GFMs, provided such embeddings can be accessed through scalable and practical deployment pipelines.

We build on a pipeline designed to deliver on-demand, location- and time-specific geospatial embeddings as a service. Embeddings are generated server-side from raw EO data, compressed, and distributed as lightweight representations. End users interact only with these embeddings, which can be analysed using simple models such as linear probes or small decoders. This approach removes the need for the user to manipulate raw EO data, download large multi-modal datasets, or train and deploy large deep learning models, enabling rapid adaptation to local contexts with limited annotations and modest computational resources.
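In this workflow, the end user's entire model can be as small as a linear probe over the served embeddings; a sketch with placeholder file names and labels:

```python
# Hypothetical end-user workflow: served embeddings plus a logistic-regression
# probe for disturbed vs. intact forest.
import numpy as np
from sklearn.linear_model import LogisticRegression

emb = np.load("region_embeddings.npy")      # served, pre-computed (n, dim)
labels = np.load("disturbance_labels.npy")  # sparse local annotations (n,)

clf = LogisticRegression(max_iter=1000).fit(emb, labels)
prob_disturbed = clf.predict_proba(emb)[:, 1]   # per-location disturbance score
```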

We present preliminary results demonstrating the feasibility of this approach for forest disturbance detection and discuss its strengths and limitations relative to bespoke, fully supervised image-based models. While GFMs may not be optimal for applications with abundant annotations and stringent accuracy requirements, embedding-based services are particularly well suited to time-sensitive and regionally adaptive monitoring scenarios. Overall, this work illustrates how releasing geospatial embeddings as a product or service can lower barriers to EO-based forest monitoring and support faster, more inclusive environmental decision-making.

How to cite: Robert, D. and Wegner, J. D.: Forest Disturbance Monitoring with Geospatial Foundation Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-11247, https://doi.org/10.5194/egusphere-egu26-11247, 2026.

X4.61
|
EGU26-13546
|
ECS
Sarah Brood, Iris Dumeur, Jérémy Anger, Aurélien de Truchis, Ewan Sean, Ibrahim Fayad, Alexandre d'Aspremont, and Philippe Ciais


In a context of rapid environmental change, delivering robust tree species mapping is essential. It enables better quantification of forest biomass, facilitates climate change adaptation through improved forest management, and supports biodiversity preservation. However, the scarce existing ground-truth datasets suffer from geographic sparsity, semantic inconsistencies, and class imbalance, making current methods overfit to context and unsuitable for accurate large-scale tree species mapping. Therefore, it is imperative to design methods that learn spatially invariant representations for tree species mapping.

The surge of Earth Observation missions has unlocked vast amounts of Satellite Image Time Series (SITS), which capture phenology and spectral dynamics that are an asset for tree species classification. Leveraging this data, an increasing number of Foundation Models (FMs) pre-trained using Self-Supervised Learning (SSL) have been introduced. Yet, due to the prevalence of patch-level annotations in tree species datasets, FMs are primarily evaluated on classification tasks instead of segmentation, preventing the production of pixel-level maps. Furthermore, spatial generalization remains largely unexplored, partially explained by the geographic sparsity of the labels. As a result, current models often overfit to local context: they perform well on training areas but fail to generalize to new spatial domains. This work therefore focuses on rigorous spatial generalization evaluation and the development of methods to produce large-scale pixel-level tree species maps that overcome current spatial domain shifts.

To quantify this generalization gap, we propose a spatial zero-shot domain adaptation evaluation protocol, where frozen FMs are linearly probed through a segmentation task on a geographical region and tested on geographically distinct, unseen regions. We aligned 3 datasets in Europe (TreeSatAI, PureForest and a regional dataset covering Poland) into 6 classes to benchmark state-of-the-art FMs (AnySat, ALISE, Presto) pre-trained on SITS and introduce a new architecture addressing current limitations.
We propose a SSL framework based on the TimeSFormer backbone. It captures complex spatio-temporal dynamics using divided space and time attention. The model is pre-trained as a Masked Auto-Encoder on a European-scale unlabeled Sentinel-2 dataset to learn robust phenological features. To mitigate the observed spatial generalization gap, we investigate different strategies such as auxiliary conditioning and thermal temporal positional encoding. 
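For intuition, divided space-time attention alternates attention across time at each spatial location and across space at each time step; a compact, simplified sketch (without the class token and other details of the actual backbone):

```python
# Simplified divided space-time attention block in the TimeSformer spirit;
# dimensions and structure are illustrative.
import torch.nn as nn

class DividedSTBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, T, S, D)
        B, T, S, D = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = xt + self.t_attn(xt, xt, xt)[0]    # temporal attention per location
        x = xt.reshape(B, S, T, D).permute(0, 2, 1, 3)
        xs = x.reshape(B * T, S, D)
        xs = xs + self.s_attn(xs, xs, xs)[0]    # spatial attention per time step
        return xs.reshape(B, T, S, D)
```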

Our evaluation protocol reveals a significant accuracy drop for state-of-the-art models when they are applied to unseen regions. This decline suggests that current FMs capture geographically dependent features rather than intrinsic tree species characteristics, resulting in a spatial generalization gap.
Experiments confirm that the proposed architecture learns semantically rich features, as evidenced by its high capacity to reconstruct missing time steps of satellite time series.

By quantifying the spatial domain shift, proposing a resilient SSL architecture, and applying domain adaptation strategies, this work addresses the important challenge of generalization in label-scarce regimes. It supports high-resolution forest monitoring, a prerequisite for precise carbon accounting and forest biodiversity conservation.

How to cite: Brood, S., Dumeur, I., Anger, J., de Truchis, A., Sean, E., Fayad, I., d'Aspremont, A., and Ciais, P.: Addressing Geographical Domain Shift in Tree Species Mapping via Foundation Models using Satellite Image Time Series, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-13546, https://doi.org/10.5194/egusphere-egu26-13546, 2026.

X4.62
|
EGU26-19446
|
ECS
Rim Sleimi, Joao Vinholi, Florian Werner, and Albert Abelló

Field boundary delineation (FBD) is a foundational task in Earth Observation (EO), supporting a wide range of agricultural and environmental applications. Accurate, parcel-level boundaries enable field-level reporting, water productivity monitoring, and scalable decision-support systems. However, extracting reliable field geometries from medium-resolution satellite imagery remains challenging, particularly at 10 m resolution, where boundaries are thin, low-contrast, and often visually ambiguous. Adjacent parcels can appear similar, supervision data is frequently sparse or inconsistent across regions, and agricultural practices vary widely, introducing domain shifts that undermine generalizability. These factors make naïve “extent-only” approaches prone to merging neighboring fields, while “boundary-only” methods often fail to produce closed, stable instances when separators are weak or missing. Geospatial foundation models (FMs), pre-trained on large, multi-modal satellite archives, offer a promising solution by enabling transferable visual representations for EO tasks with limited supervision. Yet their application to geometry-sensitive tasks like FBD remains, to the best of our knowledge, unexplored.

This work presents a boundary-centric field delineation pipeline that demonstrates one of the first operational deployments of geospatial FMs for parcel mapping using Sentinel-2 imagery. At its core, the model leverages TerraMind, a modality-aware, self-supervised EO foundation model, as the feature encoder. This FM backbone enables the system to learn transferable, generic spatial representations from large-scale EO data. To enhance generalization across regions and seasons, the encoder is explicitly conditioned in both time and space. Temporal context is provided through a Day-of-Year (DOY) sinusoidal embedding, capturing phenological variability and seasonal appearance shifts across acquisitions. Spatial context is introduced via SatCLIP-based coordinate embeddings, which transform geographic patch-center coordinates into rich, location-aware priors using a frozen SatCLIP backbone and a lightweight projection.
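A minimal sketch of the two conditioning signals follows, assuming the resulting vector is simply added to the encoder tokens; the small MLP stands in for the frozen SatCLIP backbone and lightweight projection, and all dimensions are illustrative assumptions rather than the published configuration.

import math
import torch
import torch.nn as nn

def doy_embedding(doy, dim=64):
    """Sinusoidal Day-of-Year embedding, periodic over a 365-day year."""
    k = torch.arange(dim // 2, dtype=torch.float32)
    # Harmonics of the annual cycle capture phenological phase.
    ang = 2 * math.pi * doy.float().unsqueeze(-1) * (k + 1) / 365.0
    return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

# Stand-in for the frozen SatCLIP backbone + projection: a small MLP
# mapping (lon, lat) patch-centre coordinates to a location prior.
loc_proj = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64))

doy = torch.tensor([120, 240])            # two acquisition dates
lonlat = torch.tensor([[16.37, 48.21],    # illustrative patch centres
                       [2.35, 48.86]])
cond = torch.cat([doy_embedding(doy), loc_proj(lonlat)], dim=-1)
print(cond.shape)  # torch.Size([2, 128]) -> added to encoder tokens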

Built atop the TerraMind feature hierarchy is a Fractal ResUNet-style decoder that reconstructs fine boundary details while preserving global parcel topology. Operating over a multi-scale pyramidal representation, the decoder reshapes latent token embeddings into spatial maps and progressively upsamples them through skip-connected blocks. This design balances fine-grained localization with broad contextual reasoning, which is essential at 10 m resolution, where boundaries are thin and adjacent parcels are visually similar. The model produces three interrelated outputs through a coupled multi-task formulation: a probability map for field extent, a boundary likelihood map capturing separator ridges, and a continuous distance-to-boundary field that encodes interiorness. These outputs are supervised jointly, encouraging geometric coherence across predictions.
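The coupled multi-task formulation can be illustrated with a minimal head and joint loss, assuming a (B, 64, H, W) decoder feature map; the layer sizes, loss choices, and equal loss weighting below are assumptions, not the published configuration.

import torch
import torch.nn as nn

class DelineationHead(nn.Module):
    """Three coupled 1x1-conv outputs over shared decoder features."""
    def __init__(self, in_ch=64):
        super().__init__()
        self.extent = nn.Conv2d(in_ch, 1, 1)     # field-interior probability
        self.boundary = nn.Conv2d(in_ch, 1, 1)   # separator-ridge likelihood
        self.distance = nn.Conv2d(in_ch, 1, 1)   # distance-to-boundary field

    def forward(self, feats):
        return (torch.sigmoid(self.extent(feats)),
                torch.sigmoid(self.boundary(feats)),
                torch.relu(self.distance(feats)))  # non-negative distances

def joint_loss(preds, targets, bce=nn.BCELoss(), l1=nn.L1Loss()):
    """Supervise all three outputs jointly to encourage geometric coherence."""
    ext, bnd, dst = preds
    ext_t, bnd_t, dst_t = targets
    return bce(ext, ext_t) + bce(bnd, bnd_t) + l1(dst, dst_t)

head = DelineationHead()
feats = torch.randn(1, 64, 128, 128)            # toy decoder output
preds = head(feats)
tgts = tuple(torch.rand(1, 1, 128, 128) for _ in range(3))
loss = joint_loss(preds, tgts)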

To quantify performance, we evaluate delineation quality on a multi-country European validation set built from RapidCrops parcel-level labels. Across countries, the model reaches boundary and extent IoU in the ~0.75–0.91 range, with higher scores in landscapes dominated by larger, well-separated parcels and lower scores in regions characterized by small fields, weak visual separators, or incomplete ground truth. This variability highlights both the scalability enabled by FM features and the remaining performance ceiling imposed by 10 m resolution and label quality.

How to cite: Sleimi, R., Vinholi, J., Werner, F., and Abelló, A.: A Geometry-Aware Multi-Task Framework for Parcel Delineation with Geospatial Foundation Models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19446, https://doi.org/10.5194/egusphere-egu26-19446, 2026.

Posters virtual: Wed, 6 May, 14:00–18:00 | vPoster spot 1b

The posters scheduled for virtual presentation are given in a hybrid format for on-site presentation, followed by virtual discussion on Zoom. Attendees are asked to meet the authors during the scheduled presentation & discussion time for live video chats; onsite attendees are invited to visit the virtual poster sessions at the vPoster spots (equal to PICO spots). If authors uploaded their presentation files, these files are also linked from the abstracts below. The button to access the Zoom meeting appears 15 minutes before the time block starts.
Discussion time: Wed, 6 May, 16:15–18:00
Display time: Wed, 6 May, 14:00–18:00
Chairperson: Andrea Barone

EGU26-3619 | ECS | Posters virtual | VPS22

Democratizing landslide detection for vulnerable regions beyond resource-intensive foundation models 

Rodrigo Uribe-Ventura, Willem Viveen, Ferdinand Pineda-Ancco, and César Beltrán-Castañon
Wed, 06 May, 14:00–14:03 (CEST)   vPoster spot 1b

Landslides claim thousands of lives and cause billions in economic losses annually, with impacts disproportionately concentrated in developing regions across Asia, Africa, and Latin America. Paradoxically, the current trajectory of artificial intelligence in geohazard detection—characterized by billion-parameter foundation models requiring substantial computational infrastructure—risks widening, rather than closing, the gap between technological capability and operational deployment where it is needed most. We argue that this paradigm requires fundamental reconsideration, proposing domain adaptation on strategically curated geological datasets as a more equitable and effective path toward globally accessible landslide detection systems.

Foundation models like the Segment Anything Model (SAM), pre-trained on over one billion masks, demand computational resources (312 million parameters, 1,376 GFLOPs per inference, specialized GPU infrastructure) that remain inaccessible to disaster management agencies in resource-constrained regions. Beyond these practical constraints, we contend that the apparent generalization capabilities of such models reflect pattern coverage in training data rather than emergent understanding transferable to geological contexts. The SA-1B dataset, despite its scale, was not curated to systematically represent landslide morphological diversity, creating coverage gaps for rare failure types, unusual triggering mechanisms, and underrepresented terrain configurations, precisely where robust detection is operationally critical.

Given these limitations, we propose that effective generalization for geological applications emerges not from architectural scale but from strategic coverage of domain-relevant pattern space. We developed and tested GeoNeXt, a lightweight architecture that exploits the hierarchical transferability of geological features through targeted domain adaptation. Low-level representations (edges, spectral gradients) transfer universally across sensors and terrain; mid-level patterns (drainage networks, slope morphology) require adaptation to local expressions; and high-level configurations (failure geometries, trigger signatures) demand targeted training. Our results showed that this approach outperformed SAM-based methods across three independent benchmarks while requiring 10× fewer parameters (32.2M versus 312.5M) and a 62% reduction in computational cost. Zero-shot transferability to geographically distinct test sites (74–78% F1 score) emerged from the training dataset's systematic morphological diversity rather than parameter count. Inference at 10.6 frames per second on standard hardware, versus 3.0 frames per second for foundation model alternatives, transforms theoretical capability into deployable technology for resource-constrained environments. These findings suggest that strategic domain adaptation, rather than architectural scale, offers the most viable path toward operational landslide detection in vulnerable regions.
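The hierarchical transfer strategy can be sketched as follows, with a generic torchvision ResNet-18 standing in for GeoNeXt, whose exact architecture is not reproduced here; the layer groupings and learning rates are illustrative assumptions, not the authors' training recipe.

import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")

# Low-level filters (edges, spectral gradients) transfer universally:
# freeze them outright.
for p in list(model.conv1.parameters()) + list(model.layer1.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam([
    # Mid-level patterns (drainage networks, slope morphology):
    # gentle adaptation at a reduced learning rate.
    {"params": model.layer2.parameters(), "lr": 1e-5},
    {"params": model.layer3.parameters(), "lr": 1e-5},
    # High-level configurations (failure geometries, trigger signatures):
    # full training at the base learning rate.
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-4},
])

Freezing early layers both encodes the transferability hierarchy and cuts the trainable parameter count, which is consistent with the efficiency argument made above.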

How to cite: Uribe-Ventura, R., Viveen, W., Pineda-Ancco, F., and Beltrán-Castañon, C.: Democratizing landslide detection for vulnerable regions beyond resource-intensive foundation models, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3619, https://doi.org/10.5194/egusphere-egu26-3619, 2026.
