Asynchronous Many-Task Systems for Exascale 2025


AMTE 2025
August 26, 2025

Held in conjunction with Euro-Par 2025
Dresden, Germany


LA-UR-25-23069

Keynote talk

Speaker

Dr. Peter Thoman, Assistant Professor, University of Innsbruck, Austria

Peter Thoman is an assistant professor with the Distributed and Parallel Systems working group at the University of Innsbruck, as well as the CTO and co-founder of PH3 GmbH. He is one of the original developers and designers of the Celerity runtime system for accelerator clusters and the Insieme C++ research compiler, and was involved in multiple international research projects in this capacity. In particular, he led core technical work packages in the Ligate FET HPC project, the Horizon 2020 AllScale project, and the D-A-CH Celerity project.

Peter’s research interests include API and runtime system design for parallelism, accelerator computing, fine-grained task parallelism, and compiler-supported optimizations. He is a representative for the University of Innsbruck in Khronos, and a SYCL working group member. Currently, he is leading a project investigating automatic and transparent full-stack data compression in GPU cluster computing.

So, I Heard You Like Graphs: The Journey to Strong Scaling on Hundreds of GPUs in Celerity

Programming efficient software for distributed memory accelerator clusters remains a complex challenge. The prevalent MPI+X model confronts application software developers with architectural details and with the need to understand and manage multiple layers of parallelism, memory hierarchies, and data distribution. Furthermore, even when this challenge is overcome - often through the sheer willpower of generations of PhD candidates - the resulting software is frequently rigid and does not allow algorithmic changes to be explored at scale without substantial rewriting effort.

High-level runtime systems and programming models seek to abstract away much of this complexity, but this almost invariably comes at a cost to performance, generality, or both. A variety of design and implementation choices and optimizations have been explored to minimize this cost, as the need for more flexible and user-friendly programming models in HPC remains pressing.

This talk approaches these topics from the perspective of the Celerity programming model and runtime system, which is designed to enable high-performance distributed memory accelerator computing. It provides a single, unified, high-level C++-based abstraction across both distributed memory clusters and accelerator devices. When Celerity was first introduced in 2019, it was capable of scaling small benchmarks with rather simple data access patterns up to 8 GPUs, with its true claim to fame being exceedingly low programming overhead over a single-GPU SYCL program. Since then, we have aimed to maintain the advantage in development efficacy afforded by Celerity, while dramatically improving the design and scalability of its runtime system, and enabling the programming model to support a wider range of applications.
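For readers unfamiliar with the model, the sketch below illustrates the style of program the talk refers to: a single high-level C++ source from which the Celerity runtime distributes work across cluster nodes and accelerator devices. It is a minimal, illustrative example based on Celerity's publicly documented buffer/accessor API; exact class names and signatures vary between Celerity versions, and the kernel name `fill_kernel` and the buffer size are chosen here purely for illustration.

```cpp
#include <celerity.h>

// Minimal, illustrative Celerity sketch (API details differ between versions;
// recent releases use celerity::queue in place of celerity::distr_queue).
int main() {
	celerity::distr_queue queue; // one queue spanning all nodes and devices

	// A virtualized buffer; the runtime decides where its data actually lives.
	celerity::buffer<float, 1> buf{celerity::range<1>{1024}};

	queue.submit([&](celerity::handler& cgh) {
		// The range mapper (one_to_one) declares which buffer elements each part
		// of the kernel accesses, so the runtime can derive data requirements
		// and split the work across devices without explicit message passing.
		celerity::accessor acc{buf, cgh, celerity::access::one_to_one{},
		                       celerity::write_only, celerity::no_init};
		cgh.parallel_for<class fill_kernel>(celerity::range<1>{1024},
			[=](celerity::item<1> item) {
				acc[item] = static_cast<float>(item[0]); // fill with the global index
			});
	});

	return 0;
}
```

Roughly speaking, the range mappers express per-work-item data requirements, from which the runtime constructs its task and command graphs; keeping the handling of these graphs efficient at the scale of hundreds of GPUs is the central theme of the talk.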