ESSI1.2 | Safe and Effective Use of Large Language Models in Scientific Research
Convener: Juan Bernabe Moreno | Co-conveners: Movina Moses, Rahul Ramachandran
Posters on site | Attendance Tue, 05 May, 16:15–18:00 (CEST) | Display Tue, 05 May, 14:00–18:00
Hall X4 | Tue, 16:15
Large language models (LLMs) and agentic workflows are rapidly transforming scientific research by enabling new capabilities in literature and data discovery, analysis, coding and insight generation. At the same time, their deployment requires rigorous attention to safety, reliability and trustworthiness in scientific contexts.

This session will highlight both the transformative applications and the critical challenges of using LLMs in science. Key topics include developing specialized guardrails against hallucination and bias; creating robust evaluation frameworks, including uncertainty quantification; ensuring scientific integrity, data governance and reproducibility; and addressing unique scientific risks.

We invite submissions on novel scientific applications of LLMs and agentic workflows, methods that ensure integrity and reproducibility, safety mechanisms (e.g., guardrails, risk mitigation, alignment), responsible AI frameworks (including human-in-the-loop design, fairness, and ethics) and lessons learned from real-world deployments. Our goal is to foster discussion on pathways toward safe, effective and trustworthy use of LLMs for advancing science.

Posters on site: Tue, 5 May, 16:15–18:00 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Tue, 5 May, 14:00–18:00
Chairpersons: Juan Bernabe Moreno, Rahul Ramachandran, Movina Moses
X4.29 | EGU26-22068
Rahul Ramachandran, Nidhi Jha, and Muthukumaran Ramasubramanian

We present Collaborative Agent Reasoning Engineering (CARE), a disciplined methodology for engineering Large Language Model (LLM) agents in scientific domains. Unlike ad-hoc trial-and-error approaches, CARE specifies behavior, grounding, tool orchestration, and verification through reusable artifacts and systematic, stage-gated phases. The methodology employs a three-party workflow involving Subject-Matter Experts (SMEs), developers, and LLM-based helper agents. These helper agents function as facilitation infrastructure, transforming informal domain intent into structured, reviewable specifications for human approval at defined gates. CARE addresses the "jagged technological frontier", characterized by uneven LLM performance, by bridging the gap between novice and expert analysts regarding domain constraints and verification practices. By generating concrete artifacts, including interaction requirements, reasoning policies, and evaluation criteria, CARE ensures agent behavior is specifiable, testable, and maintainable. Evaluation results from a scientific use case demonstrate that this stage-gated, artifact-driven methodology yields measurable improvements in development efficiency and complex-query performance.

How to cite: Ramachandran, R., Jha, N., and Ramasubramanian, M.: Collaborative Agent Reasoning Engineering (CARE): A Structured Methodology for Systematically Engineering AI Agents for Science, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-22068, https://doi.org/10.5194/egusphere-egu26-22068, 2026.

X4.30 | EGU26-7020
Alexander Wolodkin, Claus Weiland, Jonas Grieb, and Robert Brylka

Senckenberg’s natural history collections encompass over 45 million physical specimens distributed across 11 facilities, with 1.6 million digitized records accessible in 124 collections. Additional digital objects are stored in various infrastructures, such as Edaphobase, a digital repository for harmonized soil information (physical, chemical, and biological), and the WildlIVE Portal, a platform for FAIR (Findable, Accessible, Interoperable, and Reusable) data sharing of biodiversity monitoring with edge sensors such as camera traps. Managing this heterogeneous landscape, ranging from legacy specimen data from the pre-digital era to newly digitized objects, presents significant challenges regarding legal compliance, data sovereignty, and the implementation of FAIR principles.

To address this volume, the FAIRenrich workflow automates the semantic annotation and maintenance of existing digital collection data. Handling the complexity of such enrichment requires AI models suited for partially non-deterministic tasks, incorporating an optional human-in-the-loop mechanism. By executing these workflows on a distributed network of stationary and mobile edge computing devices ('last-mile AI'), the architecture ensures strict adherence to data sovereignty and privacy requirements.

Beyond data curation, FAIRenrich's distributed architecture enables systemic efficiency through resource pooling informed by industry practice. Institutional edge infrastructure is rarely fully utilized; by networking heterogeneous devices, the system dynamically reallocates idle capacity to enrichment tasks. This mirrors industry practice: major technology operators, such as Google, systematically redeploy hardware from their refresh cycles into secondary-use programs.

FAIRenrich extends this model to legacy hardware: rather than treating end-of-life equipment as waste, the system enables cost-effective redeployment for delay-tolerant semantic enrichment tasks, such as inference workloads without strict latency requirements. By aligning workload scheduling to renewable peaks (e.g., photovoltaic installations), the approach implements carbon-aware scheduling principles used by major technology operators, achieving both infrastructure cost reduction and extended hardware lifecycles. This creates a circular-economy model for research institutions, transforming refresh-cycle surplus into productive scientific infrastructure.
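The carbon-aware scheduling idea above can be sketched in a few lines. This is an illustrative greedy policy under an hour-based toy model, not the FAIRenrich implementation; `Task`, `schedule_carbon_aware`, and the deadline semantics are assumptions made for the example.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Task:
    # Lower deadline = more urgent; delay-tolerant tasks carry large deadlines.
    deadline_hour: int
    name: str = field(compare=False)

def schedule_carbon_aware(tasks, solar_hours, horizon=24):
    """Greedily place delay-tolerant tasks into renewable-rich hours.

    tasks: list of Task; solar_hours: set of hours with surplus PV power.
    Returns {task_name: hour}; tasks that cannot wait run at their deadline.
    """
    heap = list(tasks)
    heapq.heapify(heap)              # most urgent task first
    plan = {}
    free_green = sorted(h for h in solar_hours if h < horizon)
    while heap:
        task = heapq.heappop(heap)
        # Pick the earliest green hour that still meets the deadline.
        slot = next((h for h in free_green if h <= task.deadline_hour), None)
        if slot is None:
            slot = min(task.deadline_hour, horizon - 1)  # fall back to grid power
        else:
            free_green.remove(slot)
        plan[task.name] = slot
    return plan
```

A delay-tolerant enrichment job is deferred into a photovoltaic-surplus hour, while an urgent job runs immediately on grid power.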

This contribution demonstrates how FAIRenrich enables sustainable semantic annotation through a distributed edge architecture that simultaneously ensures data sovereignty, optimizes infrastructure utilization, and can realize cost-effective redeployment of legacy hardware. The approach exemplifies a scalable blueprint for research institutions seeking to decouple semantic enrichment from project-resource limitations through parallelization, temporal flexibility, and circular infrastructure practices.

How to cite: Wolodkin, A., Weiland, C., Grieb, J., and Brylka, R.: FAIRenrich: Distributed semantic annotation at the repository edge, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7020, https://doi.org/10.5194/egusphere-egu26-7020, 2026.

X4.31 | EGU26-12877 | ECS
Àlex R. Atrio, Antonio Lopez, Jino Rohit, Yassine Elhouadi, Marcello Politi, Vijayasri Iyer, Sébastien Bratières, Umar Jamil, and Nicolas Longépé

Recent advances in Large Language Models (LLMs) have created opportunities to support reasoning, discovery, and synthesis in Earth Observation (EO) and Earth Sciences, provided domain specificity and reliability can be ensured. In this work, we introduce Earth Virtual Expert (EVE), a comprehensive open-source initiative to develop, evaluate, and deploy a domain-specialized LLM for EO. EVE serves as a testbed for studying domain-adaptive training, grounded generation, and evaluation strategies tailored to scientific use, rather than general-purpose conversational performance.

As part of this initiative, we present EVE-instruct, a text-only, instruction-tuned and aligned LLM specialized for EO. Built on Mistral Small 3.2 (24B parameters) with a 128k context window, it focuses on domain-specific reasoning, question answering, and retrieval- and hallucination-aware generation, without significant tradeoff of general capabilities. We release all data used to train and evaluate EVE-instruct: a large-scale curated EO corpus of 3B tokens, synthetically generated fine-tuning datasets derived from this corpus (4B tokens), and manually created EO-specific evaluation test sets comprising 7,500 samples across multiple-choice and open-ended question answering, as well as factuality test sets.

To support trustworthy usage and deployment, we further develop a Retrieval-Augmented Generation (RAG) database from the curated corpus and a hallucination-detection module focused on factual consistency and scientific grounding. These components are integrated with EVE-instruct and deployed with a graphical user interface and an API, currently supporting more than 300 users from EO research and industry.

All models, datasets, and code are publicly released at: https://huggingface.co/eve-esa and https://github.com/eve-esa.

How to cite: R. Atrio, À., Lopez, A., Rohit, J., Elhouadi, Y., Politi, M., Iyer, V., Bratières, S., Jamil, U., and Longépé, N.: EVE: An Open Source Earth Science LLM for Researchers, Policymakers, and the Public, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-12877, https://doi.org/10.5194/egusphere-egu26-12877, 2026.

X4.32 | EGU26-15955 | ECS
Ahmed Derdouri and Yoshifumi Masago

Large language models (LLMs) promise to make systematic reviews more scalable and less costly, but the validity of LLM-assisted evidence synthesis depends not only on accuracy, but also on which parts of the literature are effectively visible to a deployed model and how reliably they are interpreted. We report a large-scale, domain-specific evaluation of an end-to-end LLM-assisted workflow for a systematic review of national-scale land use/land cover (LULC) prediction research (11,817 records; 11,688 after de-duplication), using a single hosted LLM deployment (Qwen Max) as a concrete case study. At title–abstract screening, the model behaved as a recall-oriented filter, excluding 9,891/11,688 records (84.7%) and routing 1,797 records for human follow-up; compared with the human baseline, it excluded fewer studies (84.7% vs 91.8%) and shifted more records into OK and POSSIBLE (OK: 4.2% vs 1.5%; POSSIBLE: 10.6% vs 5.5%). For full-text extraction, structured fields showed high agreement with expert coding across 342 benchmark papers (mean scores: 0.84 categorical, 0.85 temporal, 0.87 set-based), whereas free-text summaries were more variable (mean 0.79 overall; cosine similarity 0.51–0.87 across narrative fields despite high BERT-F1). In our case study, the workflow was completed in approximately one day on a single workstation for ~US$106 in API costs. Critically, full-text processing also produced explicit refusals: 7/2,084 candidate papers in deep screening and 2/345 papers targeted for insight extraction were blocked as “sensitive” geopolitical content. Although rare, these refusals were non-random and concentrated in contested regions, illustrating how LLM-specific constraints can introduce structured missingness that systematically removes or misinterprets evidence in precisely those settings where land-use conflict and governance are most salient. LLM-assisted reviews can therefore make previously prohibitive syntheses tractable. However, they must be embedded in transparent, human-led workflows that monitor and log model failures, including refusals, omissions, and misreadings, and apply targeted auditing to detect and correct systematic blind spots.
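The refusal auditing described above can be illustrated with a minimal log summary that surfaces where refusals cluster geographically. The record structure and the `audit_refusals` helper are hypothetical names for this sketch; a real audit would also track omissions and misreadings.

```python
from collections import Counter

def audit_refusals(records):
    """Summarize screening outcomes and flag regions where refusals cluster.

    records: list of dicts with keys 'id', 'region', 'status'
    ('ok' | 'excluded' | 'refused'). Returns per-region refusal rates,
    so structured missingness in contested regions becomes visible.
    """
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], Counter())[r["status"]] += 1
    rates = {}
    for region, counts in by_region.items():
        total = sum(counts.values())
        rates[region] = counts["refused"] / total
    return rates
```

A region whose refusal rate is far above the corpus-wide average is a candidate for targeted manual review.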

How to cite: Derdouri, A. and Masago, Y.: LLM Workflow for Land-Use Prediction Evidence Synthesis: Efficient Screening, Selective Refusals, Reportable Gaps, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-15955, https://doi.org/10.5194/egusphere-egu26-15955, 2026.

X4.33 | EGU26-19791 | ECS
Marco De Carlo, Maria Mirto, Italo Epicoco, Paola Nassisi, and Maria Vincenza Chiriacò

Large Language Models (LLMs) offer transformative capabilities for scientific workflows, enabling scalable analysis, evidence synthesis, and insight generation. We present an agentic LLM workflow, applied to the EU-funded SWITCH project, which investigates the environmental impact of dietary choices, including CO₂ emissions associated with food consumption.

Validated questionnaire responses from SWITCH participants are securely anonymized and processed with machine learning methods, including clustering and classification, whose outputs are interpreted using Explainable AI (XAI) to make feature contributions transparent. This yields food behavioral profiles, covering nutritional and environmental habits, while preserving individual privacy. These profiles guide the construction of structured agent directives, enriching contextual information and constraining the LLM to provide scientifically grounded answers and data-driven insights.

Responses are generated within a Retrieval-Augmented Generation (RAG) framework over a curated Data Lake of revisioned documents, including project deliverables, scientific reports, and nutrition-environment datasets covering sustainable diets, CO₂ emissions, and European food policy. The combination of ML-generated profiles and the RAG context acts as a set of constraints, ensuring that LLM outputs remain traceable, grounded, and aligned with verified evidence.
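The grounding constraint of a RAG framework can be sketched in miniature: retrieve context, and refuse to generate when nothing relevant is found. This is a toy token-overlap retriever standing in for the project's curated Data Lake and embedding search; `retrieve`, `grounded_answer`, and the refusal message are illustrative names, not the system's API.

```python
def retrieve(query, corpus, k=2):
    """Rank documents by token overlap with the query (a stand-in for
    the vector search a production RAG system would use)."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_answer(query, corpus, generate):
    """Constrain the generator to retrieved context; refuse when no
    document shares any token with the query."""
    context = retrieve(query, corpus)
    if not any(set(query.lower().split()) & set(c.lower().split())
               for c in context):
        return "No grounded evidence found."
    return generate(query, context)
```

The `generate` callable receives only the retrieved context, which is what keeps outputs traceable to verified documents.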

Human-in-the-loop review ensures the quality and correctness of the ML-generated profiles, the construction of agent directives, the LLM outputs, and the revisioned documents used in the RAG framework, while metadata and traceability mechanisms ensure auditability, reproducibility, and risk mitigation.

Our results demonstrate that combining classical machine learning, structured agent directives guided by clustering and classification, RAG grounding, metadata and traceability, and human oversight enables trustworthy, effective, and transformative scientific analysis, highlighting the potential of agentic LLMs for scalable, insight-driven applications in research while ensuring responsible AI deployment.

How to cite: De Carlo, M., Mirto, M., Epicoco, I., Nassisi, P., and Chiriacò, M. V.: Explaining dietary CO₂ impact with trustworthy agentic LLMs, ML, and XAI, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19791, https://doi.org/10.5194/egusphere-egu26-19791, 2026.

X4.34 | EGU26-19885 | Highlight
Daniel Wiesmann, Jonas Solvsteen, Adam Pain, Alyssa Barrett, Ciaran Sweet, Ricardo Mestre, Daniel da Silva, Firza Riany, Fausto Pérez, Lane Goodman, Marc Farra, Soumya Ranjan, Tarashish Mishra, Sanjay Bhangar, and Sajjad Anwar

In this talk we will share lessons learned from building two production-grade agentic systems for data discovery, retrieval and analysis. For both applications, trustworthiness, reliability and reproducibility were key criteria that we took into account from the start.

Producing insights and surfacing data in a repeatable and transparent way is not simply a nice-to-have feature; it is indispensable for adoption. In practice, even if a chatbot's answer is scientifically correct, users will only rely on its outputs for decision making if they trust the system. Users by now have enough experience with chatbots to know that LLMs tend to exaggerate and hallucinate, which has created a healthy skepticism toward output from agentic systems. In scientific domains it is therefore not sufficient to guarantee a correct result; the result also has to be presented in a way that is transparent and reproducible. Despite all advances in the LLM domain, this remains a challenge in building agentic systems.

We tackled this challenge by adopting a series of techniques that we will illustrate with concrete examples from building production-ready agentic applications. One of the main principles is to rely on the LLM primarily for orchestration of well-known tools rather than on the generative capabilities of the models. For analysis, we built the systems so that transformations on the original data are reproducible. One technique is to use LLMs to write analysis code that can be stored and used to reproduce the results, allowing end-to-end tracing of where the data comes from and how it was transformed to produce insights through statistics and charts. We will also describe our approach to evaluating the agent and share insights from early user research performed for these systems.
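The store-the-generated-code technique can be sketched as a small provenance registry: the analysis code and a hash of its input are recorded together, so a result can later be replayed and traced to the exact data it came from. `record_analysis` and `replay` are hypothetical names for this sketch, not the authors' system, and `exec` here assumes the code has already passed human review.

```python
import hashlib
import json

def record_analysis(code: str, data: bytes, registry: dict) -> str:
    """Store LLM-generated analysis code keyed by a content hash, so the
    exact transformation applied to the exact input can be replayed."""
    run_id = hashlib.sha256(code.encode() + data).hexdigest()[:12]
    registry[run_id] = {"code": code,
                        "data_sha256": hashlib.sha256(data).hexdigest()}
    return run_id

def replay(run_id: str, data: bytes, registry: dict):
    """Re-execute a recorded analysis, refusing if the input data changed."""
    entry = registry[run_id]
    if hashlib.sha256(data).hexdigest() != entry["data_sha256"]:
        raise ValueError("input data differs from the recorded run")
    scope = {}
    exec(entry["code"], scope)          # trusted, human-reviewed code only
    return scope["analyse"](json.loads(data))
```

Because the run identifier covers both code and data, any tampering with either invalidates the replay, which is the property end-to-end tracing needs.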

We will illustrate these principles with concrete examples from the two agentic systems outlined below. 

The Destination Earth Digital Assistant, built in collaboration with ECMWF, provides general information about Destination Earth and helps users to discover and retrieve data from the DestinE Digital Twins.

Global Nature Watch, built in collaboration with WRI and the Land and Carbon Lab, provides governments, companies, and communities with trusted, open data and intelligence-driven insights on land conditions and land-use change to enable efficient and evidence-based decisions for nature protection and restoration.

Global Nature Watch is publicly accessible today, and the DestinE Assistant is planned to launch publicly before EGU26.

How to cite: Wiesmann, D., Solvsteen, J., Pain, A., Barrett, A., Sweet, C., Mestre, R., da Silva, D., Riany, F., Pérez, F., Goodman, L., Farra, M., Ranjan, S., Mishra, T., Bhangar, S., and Anwar, S.: Making Scientific Data Accessible with LLMs while Preserving Authority and Reliability: Lessons Learned from Building Production Grade Agentic Systems, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19885, https://doi.org/10.5194/egusphere-egu26-19885, 2026.

X4.35 | EGU26-20948 | ECS
Chen Wang, Gerald Corzo, Clarine Van Oel, and Chris Zevenbergen

The exponential expansion of academic literature across complex environmental domains has created a gap where the volume of research outpaces human capacity for effective integration. While Large Language Models (LLMs) offer a transformative solution to bridge this gap, their deployment in rigorous scientific inquiry is frequently compromised by model stochasticity, potential for hallucination, and the opacity of automated reasoning. Addressing the critical imperative for dependable and reproducible AI, this study presents a robust workflow designed to ensure methodological rigor and evidential integrity in the rapid and reliable synthesis of large-scale scientific literature.

We operationalised this framework within the domain of urban water management, specifically to analyse complex Knowledge Transfer (KT) strategies from a corpus of over 1,500 unstructured articles. To mitigate the risks inherent in generative AI, we developed a multi-layered validation protocol. First, we deployed an AI-assisted screening mechanism to filter the initial corpus down to 115 highly relevant articles, ensuring data relevance. Second, we implemented a Human-in-the-Loop design to iteratively synthesise a comprehensive analytical framework. By refining LLM-generated insights against domain expertise, we consolidated 24 operational attributes that specifically characterise the operational mechanisms of learning strategies from the corpus, preventing ungrounded inference while capturing emerging learning dynamics. Third, we addressed model variability through iterative Multi-LLM Triangulation (utilising Gemini, ChatGPT, and Deepseek). By repeatedly coding the 115 articles with the framework, we quantified qualitative insights to analyse how distinct learning strategies manifest their operational mechanisms. Finally, we employed Multiple Correspondence Analysis (MCA) and Hierarchical Clustering (HAC) to analyse the quantified results, categorising the eight identified learning strategies into three distinct clusters based on their functions and usage contexts, thereby effectively harnessing the LLM-generated insights.
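The multi-LLM triangulation step can be illustrated as a majority vote with an agreement score, assuming each model's coding is a simple article-to-label mapping; `triangulate` is an illustrative helper for this sketch, not the study's actual protocol.

```python
from collections import Counter

def triangulate(codings):
    """Majority vote across model codings for each article.

    codings: dict model_name -> {article_id: label}. Returns, per article,
    the consensus label and an agreement score (share of models voting for
    it), so low-agreement items can be routed to human review.
    """
    articles = set().union(*(set(c) for c in codings.values()))
    consensus = {}
    for art in sorted(articles):
        votes = Counter(c[art] for c in codings.values() if art in c)
        label, n = votes.most_common(1)[0]
        consensus[art] = (label, n / sum(votes.values()))
    return consensus
```

Items with agreement below a chosen threshold (e.g. unanimity) are exactly where human-in-the-loop coding earns its cost.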

Beyond this specific application, this research contributes a methodological blueprint for responsible AI integration in scientific inquiry. It demonstrates that combining theory-driven constraints with statistical verification is essential to elevate LLM-generated insights to the standard of reproducible scientific evidence.

How to cite: Wang, C., Corzo, G., Van Oel, C., and Zevenbergen, C.: Toward trustworthy AI in systematic reviews: a statistically validated AI-augmented framework for analysing knowledge transfer strategies in urban water management, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20948, https://doi.org/10.5194/egusphere-egu26-20948, 2026.

X4.36 | EGU26-20967
Zahra Fardhosseini, Andrea Ackermann, and Beate Oerder

Problem:

The elicitation of user requirements is a critical first step in the successful planning of software projects, particularly when a diverse set of use cases must be considered. In practice, this process is often carried out through oral interviews and handwritten notes, which is time-consuming and error-prone and makes structured processing and subsequent analysis difficult.

Approach:

We present an integrated, automated pipeline framework that supports the systematic collection and analysis of user requirements, from capturing user perspectives to model-supported analysis. The goal is to gather requirements consistently across different stakeholder roles, including project leads, technical staff, and data scientists. In the first step, requirements are collected using a web-based, structured questionnaire. In the second step, the questionnaire serves as a guideline in follow-up interviews for detailed case descriptions that further refine the requirements.

Figure 1. The pipeline for requirements elicitation, LLM-based analysis, and human-in-the-loop review.

The interviews are recorded and automatically transcribed using a domain-adapted Language Recognition Component (LRC) based on open-source Automatic Speech Recognition (ASR) models. The resulting transcripts are combined with questionnaire responses and initial analysis artifacts, such as charts and diagrams, and processed within a Large Language Model (LLM) pipeline. After requirements have been collected, the pipeline supports the systematic inspection of individual requirements and their consideration in project planning.

Using a dedicated prompt schema, the LLM-based analysis supports the identification of functional and non-functional requirements, highlights open needs, clusters related issues, and organizes the results according to relevant work contexts. A human-in-the-loop review module enables targeted corrections, quality assurance, and iterative improvement of the analysis results.

 

Implementation test:

To validate our end-to-end requirements-engineering pipeline, we applied it to the IACS-AI Data-Management-Remodeling project. A web-based survey (Nov–Dec 2024) yielded 53 responses, giving an initial, structured view of user requirements. Subsequently, we held seven interviews with 18 participants (project managers, engineers, data scientists), producing over 460 minutes of video that were afterward transcribed with the freeware tool Scraibe.

A prompt-engineering routine fed these inputs to a Large Language Model (Llama 3.3), which detected semantic clusters, classified requirements, and identified problem statements. For each requirement, we kept the highest-probability class for further review. The resulting insights shaped the next milestone: the design and implementation of a data-pipeline architecture that fulfills the extracted functional and non-functional requirements.
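The keep-the-highest-probability-class step can be sketched as follows, with `classify` standing in for the Llama 3.3 call; the function name, label set, and grouping-by-class step are assumptions made for illustration.

```python
def classify_requirements(reqs, classify):
    """For each requirement keep the highest-probability class, then group
    requirements by class for human review.

    reqs: list of requirement strings.
    classify: callable returning {label: probability} (stands in for the
    LLM classification call).
    """
    kept, clusters = {}, {}
    for text in reqs:
        scores = classify(text)
        label = max(scores, key=scores.get)      # highest-probability class
        kept[text] = (label, scores[label])
        clusters.setdefault(label, []).append(text)
    return kept, clusters
```

Keeping the probability alongside the label lets reviewers prioritize low-confidence classifications during the human-in-the-loop pass.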

Conclusion:

The reproducible design of the pipeline ensures traceability by documenting when, by whom, and in which context requirements were expressed, as well as how project decisions are derived from them. This results in a lightweight yet structured approach to requirements elicitation that improves transparency as well as reproducibility and reduces manual effort and errors.

Because the pipeline is generic, it is well suited to contexts with many stakeholders, heterogeneous use cases, and strong documentation-traceability needs. Beyond our scientific implementation test, it can therefore also be applied to enterprise software, AI-driven data projects, e-government systems, and regulated domains such as healthcare or finance.

 

How to cite: Fardhosseini, Z., Ackermann, A., and Oerder, B.: Automating Requirements Elicitation using Large Language Models and Speech Processing, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-20967, https://doi.org/10.5194/egusphere-egu26-20967, 2026.
