Invited talk

Speaker

Johannes Blaschke

Johannes Blaschke, Scientific Computing at EIT Oxford, England

Johannes Blaschke is the Head of Scientific Computing for the Generative Biology Institute at EIT Oxford. His group is building the next-generation hybrid On-Prem + Cloud HPC system for biology and life-science researchers. Prior to joining EIT, Johannes led the NERSC Science Acceleration Program (NESAP) at Lawrence Berkeley National Lab and developed Exascale cross-facility workflows for three generations of HPC systems. Johannes is a zealous advocate for user-centric HPC. His research interests include integrating HPC systems into laboratory operations, as well as urgent and interactive HPC.

A tale of two cities: bridging Laboratory workflows and exascale HPC

In many ways, experimental science and high-performance computing might as well come from completely different worlds. Historical examples (such as CERN and the WLCG, or the US Department of Energy’s Superfacility Project) show that one of the main differences stems from how data and compute are orchestrated. Laboratory workflows are driven by data that is imperfect, time-sensitive, and unpredictable; conventional HPC, by contrast, assumes a bulk-synchronous SPMD execution model in which the runtime can specialize around an a priori communication structure. The challenge of real-time data processing from light sources [1] is a good guiding example: data arrives in unpredictable bursts, each data set is inherently different, and every idle second wastes precious beamtime.

While advances in cloud-like HPC have allowed us to make great strides in “connecting” Labs and HPC [2], running exascale data analysis workflows remains a challenge [3]. This is mainly because large-scale data analysis workflows are irregular, dynamically discovered DAGs that are difficult to express as the monolithic, statically scheduled SPMD applications for which MPI+X is designed. Here we present our experiences interfacing real-world Laboratory science workflows with HPC systems, and discuss several research projects that aim to make HPC resource management, schedulers, and networks accessible to asynchronous many-task systems (such as Legion, Dask, and Dagger.jl). We also explore challenges in overall system design (e.g., using Kubernetes and Slinky), Slurm preemption for responsive queueing, and elastic event-driven jobs.

[1] Giannakou, A., Blaschke, J. P., Bard, D. & Ramakrishnan, L. Experiences with Cross-Facility Real-Time Light Source Data Analysis Workflows. 2021 IEEE/ACM HPC Urgent Decis. Mak. (UrgentHPC), 45–53 (2021). https://doi.org/10.1109/UrgentHPC54802.2021.00011

[2] Antypas, K. B. et al. Enabling discovery data science through cross-facility workflows. 2021 IEEE Int. Conf. Big Data (Big Data), 3671–3680 (2021). https://doi.org/10.1109/BigData52589.2021.9671421

[3] Blaschke, J. P. et al. ExaFEL: extreme-scale real-time data processing for X-ray free electron laser science. Front. High Perform. Comput. 2, 1414569 (2024). https://doi.org/10.3389/fhpcp.2024.1414569