Modern Earth system research relies on integrating heterogeneous datasets such as reanalysis, satellite observations, in situ measurements, climate model ensembles, and reforecasts, yet these data are often stored in fragmented, inconsistent, and hard-to-reuse forms. This limits reproducibility, slows modelling workflows, and constrains the development of operational digital twins for water and climate risk management.
This contribution presents a scalable, FAIR-aligned data lake architecture implemented on the EVE high-performance computing environment. The system transforms a large, unstructured source pool of more than two million files into a curated, duplication-free, metadata-rich repository designed for hydrological modelling, machine learning, and climate analytics. The architecture follows a four-stage lifecycle (raw, curated, database-ready, and ancillary GIS layers) that reflects data governance practices used by major climate centres.
A reproducible ingestion workflow classifies, deduplicates, and standardizes datasets from ERA5, ERA5-Land, MERRA-2, PRISM, E-OBS, GPM IMERG, CMIP6, ISIMIP3, ECMWF reforecasts, MODIS, CHIRPS, GFED, GRDC, GSIM, and other sources. A Python-based metadata extractor, built on the CF conventions, automatically captures variables, units, dimensions, spatial resolution, temporal coverage, coordinate reference systems, and checksums. Metadata are stored both as dataset-level JSON and as a global inventory, enabling transparent provenance tracking and rapid dataset discovery.
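As an illustration of the checksum and deduplication step described above, the following is a minimal sketch rather than the workflow's actual code. It assumes SHA-256 as the checksum and uses hypothetical function names (`sha256_of`, `build_inventory`, `write_inventory`) and a hypothetical record layout. It covers only file-level fields; extracting variables, units, and dimensions would additionally require a NetCDF or GRIB reader, which is omitted here.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never read fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_inventory(source_root: Path) -> dict:
    """Map each unique checksum to one kept file and a list of its duplicates.

    Files are visited in sorted path order, so the first path seen for a
    given checksum is the one that is kept (an assumption of this sketch).
    """
    inventory = {}
    for path in sorted(p for p in source_root.rglob("*") if p.is_file()):
        checksum = sha256_of(path)
        record = inventory.setdefault(
            checksum,
            {
                "path": str(path),
                "size_bytes": path.stat().st_size,
                "duplicates": [],
            },
        )
        if record["path"] != str(path):
            # Same content under a different path: record it, do not keep it.
            record["duplicates"].append(str(path))
    return inventory


def write_inventory(inventory: dict, target: Path) -> None:
    """Persist the global inventory as human-readable JSON."""
    target.write_text(json.dumps(inventory, indent=2, sort_keys=True))
```

Keying the inventory on the content hash rather than the file name means that byte-identical files downloaded under different names collapse into a single record, while the `duplicates` list preserves where each copy was found.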
The curated data hub is implemented under /data/db/earth_system and organized by scientific domain, temporal resolution, spatial extent, and processing stage. The system supports SLURM-based workflows, HPC-native processing, and cloud-optimized formats such as Zarr.
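One way such a layout could be encoded is sketched below. The root path comes from the text; everything else is an assumption made for illustration, including the nesting order of the four organizing keys, the directory spellings of the lifecycle stages, and the helper name `dataset_dir`.

```python
from pathlib import PurePosixPath

# Root of the curated data hub, as stated in the text.
DATA_HUB_ROOT = PurePosixPath("/data/db/earth_system")

# Hypothetical directory names for the four lifecycle stages.
STAGES = ("raw", "curated", "database_ready", "ancillary_gis")


def dataset_dir(domain: str, temporal_resolution: str,
                spatial_extent: str, stage: str) -> PurePosixPath:
    """Build a dataset directory path from the four organizing keys.

    The nesting order (domain / resolution / extent / stage) is assumed.
    """
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    parts = (domain, temporal_resolution, spatial_extent)
    for part in parts:
        # Reject empty or nested components so each key maps to one level.
        if not part or "/" in part:
            raise ValueError(f"invalid path component {part!r}")
    return DATA_HUB_ROOT.joinpath(*parts, stage)
```

For example, `dataset_dir("precipitation", "daily", "global", "curated")` resolves to `/data/db/earth_system/precipitation/daily/global/curated`. Generating paths through a single function, instead of assembling them by hand in each job script, keeps the directory convention consistent across batch workflows.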
This work demonstrates how a single researcher can design and operationalize a modern, HPC-native data infrastructure that accelerates hydro-climate research and forms the backbone of an emerging Digital Hydro Twin. The approach is transferable to institutions seeking to modernize their data ecosystems and improve reproducibility in environmental modelling.