AMTE 2025
August 26, 2025
Held in conjunction with Euro-Par 2025
Dresden, Germany
Peter Thoman is an assistant professor with the Distributed and Parallel Systems working group at the University of Innsbruck, as well as the CTO and co-founder of PH3 GmbH. He is one of the original developers and designers of the Celerity runtime system for accelerator clusters and the Insieme C++ research compiler, and was involved in multiple international research projects in this capacity. In particular, he led core technical work packages in the Ligate FET HPC project, the Horizon 2020 AllScale project, and the D-A-CH Celerity project.
Peter’s research interests include API and runtime system design for parallelism, accelerator computing, fine-grained task parallelism, and compiler-supported optimizations. He represents the University of Innsbruck in the Khronos Group and is a member of the SYCL working group. Currently, he is leading a project investigating automatic and transparent full-stack data compression in GPU cluster computing.
Programming efficient software for distributed memory accelerator clusters remains a complex challenge. The prevalent MPI+X model confronts application software developers with architectural details and the need to understand and manage multiple layers of parallelism, memory hierarchies, and data distribution. Furthermore, even when this challenge is overcome, often through the sheer willpower of generations of PhD candidates, the resulting software is frequently rigid and does not allow algorithmic changes to be explored at scale without substantial rewriting effort.
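As a rough illustration of this layered complexity, the sketch below shows the kind of bookkeeping an MPI+X code (here with SYCL as the on-node "X") spells out by hand; the problem size, halo pattern, and kernel are purely hypothetical and simplified for brevity.

```cpp
// Illustrative MPI+X fragment (X = SYCL): the developer explicitly manages
// rank-level data distribution, halo exchange, and host/device hand-off.
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <cstddef>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const std::size_t global_n = 1 << 20;
    const std::size_t local_n = global_n / size;   // manual data distribution
    std::vector<float> in(local_n + 2, 1.0f);      // +2 halo cells
    std::vector<float> out(local_n, 0.0f);

    // Manual halo exchange with neighboring ranks (periodic, for brevity).
    const int left = (rank + size - 1) % size, right = (rank + 1) % size;
    MPI_Sendrecv(&in[1], 1, MPI_FLOAT, left, 0,
                 &in[local_n + 1], 1, MPI_FLOAT, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&in[local_n], 1, MPI_FLOAT, right, 1,
                 &in[0], 1, MPI_FLOAT, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Explicit host <-> device handling for the node-local accelerator.
    sycl::queue q;
    {
        sycl::buffer<float, 1> in_buf(in.data(), sycl::range<1>(in.size()));
        sycl::buffer<float, 1> out_buf(out.data(), sycl::range<1>(out.size()));
        q.submit([&](sycl::handler& cgh) {
            sycl::accessor r{in_buf, cgh, sycl::read_only};
            sycl::accessor w{out_buf, cgh, sycl::write_only, sycl::no_init};
            cgh.parallel_for(sycl::range<1>(local_n), [=](sycl::id<1> i) {
                w[i] = 0.5f * (r[i[0]] + r[i[0] + 2]);  // simple 1D stencil
            });
        });
    } // buffer destruction writes results back into `out`

    MPI_Finalize();
    return 0;
}
```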
High-level runtime systems and programming models seek to abstract away much of this complexity, but almost invariably this comes at a cost to performance, generality, or both. Various design and implementation choices and optimizations have been explored with the aim of minimizing this cost, as the need for more flexible and user-friendly programming models in HPC remains pressing.
This talk approaches these topics from the perspective of the Celerity programming model and runtime system, which is designed to enable high-performance distributed memory accelerator computing. It provides a single, unified, high-level C++-based abstraction across both distributed memory clusters and accelerator devices. When Celerity was first introduced in 2019, it could scale small benchmarks with rather simple data access patterns to up to 8 GPUs, with its main claim to fame being its exceedingly low programming overhead compared to a single-GPU SYCL program. Since then, we have aimed to maintain the development productivity that Celerity affords, while dramatically improving the design and scalability of its runtime system and enabling the programming model to support a wider range of applications.
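For contrast with the MPI+X sketch above, the following is a minimal sketch of what this unified abstraction looks like in practice, modeled on the buffer/accessor interface in Celerity's public documentation; exact names (e.g. celerity::queue versus the older celerity::distr_queue) and constructor signatures vary between releases, so treat it as an approximation rather than the definitive API.

```cpp
#include <celerity.h>
#include <vector>

int main() {
    // A single queue transparently represents all GPUs across all cluster nodes.
    celerity::queue queue;

    std::vector<float> host(1024, 1.0f);
    celerity::buffer<float, 1> in{host.data(), celerity::range<1>{1024}};
    celerity::buffer<float, 1> out{celerity::range<1>{1024}};

    queue.submit([&](celerity::handler& cgh) {
        // Range mappers (here one_to_one) declare which buffer regions each kernel
        // chunk accesses, so the runtime derives data distribution and transfers
        // automatically; there are no explicit MPI calls or halo buffers.
        celerity::accessor r{in, cgh, celerity::access::one_to_one{}, celerity::read_only};
        celerity::accessor w{out, cgh, celerity::access::one_to_one{}, celerity::write_only, celerity::no_init};
        cgh.parallel_for(out.get_range(), [=](celerity::item<1> item) {
            w[item] = 2.0f * r[item];  // illustrative element-wise kernel
        });
    });

    return 0;
}
```

The same kernel code runs unchanged on one GPU or on a cluster of accelerator nodes; the programming overhead relative to a single-GPU SYCL program is essentially limited to the range mapper annotations.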