Asynchronous Many-Task Systems for Exascale 2026


AMTE 2026
August 30 - September 2, 2026

Held in conjunction with PPAM 2026
Poznań, Poland


LA-UR-25-23069

Accepted papers

Work Stealing for the 2D-Mesh Topology of Satellite Constellations in Low Earth Orbit

Authors: Mia Reitz, Dorian Chenet, Jonas Posner

Abstract: Asynchronous Many-Task (AMT) is a parallel programming model used in High Performance Computing (HPC). An AMT runtime can distribute fine-grained tasks across processing units called workers, through work stealing: when a worker has no tasks left to process, it tries to steal tasks from other workers. Workers are not restricted to a single compute node but can also be distributed across multiple nodes of an HPC cluster. Existing AMT runtimes assume a fully connected network with low, uniform latency and perform global work stealing, selecting another worker at random from all workers in the system. Space Edge Computing (SEC) uses constellations of satellites in Low Earth Orbit (LEO) as distributed compute clusters. Unlike HPC clusters, LEO satellites communicate through inter-satellite links that form a sparse mesh topology. Reaching a distant satellite requires multiple hops, each adding latency. As a step toward adapting AMT to SEC, this paper proposes a neighbor-only work stealing strategy in which workers steal exclusively from directly connected neighbors, avoiding multi-hop communication. An analytical model shows that restricting stealing this way yields a per-attempt latency advantage that grows with constellation size. Experiments on an HPC cluster with an emulated mesh over uniform low-latency links isolate the effect of victim selection: the neighbor-only strategy performs within +/-2.2% of global stealing on both balanced and irregular workloads, indicating that restricting the victim set does not harm load balancing. Taken together, the experiments show that neighbor-only stealing is on a par with global stealing, and the model shows that neighbor-only stealing is superior at scale.
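The difference between the two victim-selection strategies described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a rectangular mesh without wraparound links, workers numbered row by row, and unit cost per hop, and all function names are hypothetical.

```python
import random


def mesh_neighbors(worker, rows, cols):
    """Ids of the workers directly linked to `worker` in a rows x cols mesh."""
    r, c = divmod(worker, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * cols + nc for nr, nc in candidates
            if 0 <= nr < rows and 0 <= nc < cols]


def hops(a, b, cols):
    """Number of link hops between two workers (Manhattan distance)."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    return abs(ra - rb) + abs(ca - cb)


def pick_victim_global(thief, rows, cols, rng):
    """Global stealing: any other worker, chosen uniformly at random."""
    return rng.choice([w for w in range(rows * cols) if w != thief])


def pick_victim_neighbor(thief, rows, cols, rng):
    """Neighbor-only stealing: a directly connected worker, chosen at random."""
    return rng.choice(mesh_neighbors(thief, rows, cols))


def mean_hops_global(thief, rows, cols):
    """Expected hop count of one global steal attempt issued by `thief`."""
    others = [w for w in range(rows * cols) if w != thief]
    return sum(hops(thief, w, cols) for w in others) / len(others)


if __name__ == "__main__":
    rng = random.Random(0)
    for side in (3, 9, 27):
        center = (side // 2) * side + side // 2
        victim = pick_victim_neighbor(center, side, side, rng)
        # Neighbor-only attempts always cost one hop; global attempts cost more on average.
        print(side, hops(center, victim, side), mean_hops_global(center, side, side))
```

Under these assumptions a neighbor-only attempt always travels one hop, while the mean hop count of a global attempt grows with the side length of the mesh. This mirrors the per-attempt latency argument in the abstract.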

Protecting Futures against Silent Data Corruption - Efficient Task Replication for Dynamic Data Dependencies

Authors: Rudiger Nather, Claudia Fohry, Mia Reitz

Abstract: As the size of computational problems grows, so does the likelihood of Silent Data Corruptions (SDCs). A common defense is replication, where the computation is repeated and correct results are determined by majority voting. Asynchronous Many-Task (AMT) runtimes are generally well suited for this approach, since the inputs and outputs of task replicas can be compared, and the tasks can be recomputed if necessary. Most existing SDC protection schemes assume static tasks and dependencies. Dynamic settings are more challenging, especially in clusters, since the tasks/data must be tracked for the comparisons. This paper considers a particularly dynamic setting with task spawning at runtime, task communication through C++11-like promises/futures, conditional touches, and cluster-wide load balancing via work-first work stealing. We propose an approach that closely couples original and replica computations by cross-validating all outgoing effects when interacting with the runtime system. The approach selectively recomputes affected tasks only. We implemented the approach in the ItoyoriFBC runtime system and conducted experiments with Fibonacci and H-matrix LU decomposition benchmarks. Results show a factor of less than two increase of failure-free running times, despite full replication, which is mainly due to improved opportunities for load balancing resulting from the higher number of tasks. The overhead for failure correction was about 0.5% of the overall running time per SDC.
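As a rough illustration of the generic replication-with-voting defense mentioned at the start of this abstract (not the cross-validation scheme the paper proposes, and not ItoyoriFBC code), the following sketch runs a task three times and takes the majority result. The fault injector and all names are hypothetical.

```python
from collections import Counter


def majority_vote(results):
    """Value returned by a strict majority of replicas, or None if there is none."""
    value, count = Counter(results).most_common(1)[0]
    return value if 2 * count > len(results) else None


def run_replicated(task, args, replicas=3):
    """Run `task` several times and vote on the results (results must be hashable)."""
    return majority_vote([task(*args) for _ in range(replicas)])


def fib(n):
    """Naive Fibonacci, used here only as a small deterministic task."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


def make_faulty(task, corrupt_on_call):
    """Hypothetical fault injector: flips the low bit of the result of one chosen call."""
    calls = {"n": 0}

    def wrapped(*args):
        calls["n"] += 1
        result = task(*args)
        return result ^ 1 if calls["n"] == corrupt_on_call else result

    return wrapped


if __name__ == "__main__":
    faulty_fib = make_faulty(fib, corrupt_on_call=2)
    # One of the three replicas is corrupted; the vote still yields fib(10).
    print(run_replicated(faulty_fib, (10,)))
```

A vote of this kind needs at least three replicas to correct an error. With two replicas a mismatch can only be detected, which is why schemes based on comparison fall back on recomputation.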

From Fork-Join to Asynchronous Tasks: Parallelizing Tiled Cholesky Decomposition with OpenMP and HPX

Authors: Alexander Strack, Alexander Van Craen, Dirk Pfluger

Abstract: Fork-join parallelism, popularized by OpenMP, remains the dominant model for shared-memory parallel programming, but its implicit synchronization barriers can penalize algorithms with inhomogeneous workloads. Asynchronous many-task (AMT) runtimes sidestep these barriers by expressing work as a dependency graph of fine-grained tasks. Yet, the actual performance benefit over a carefully written fork-join baseline is rarely quantified. In this work, we introduce Cholesky-Bench and use it to revisit the tiled Cholesky decomposition, a canonical irregular kernel, comparing four parallelization variants of the right-looking algorithm across two runtimes: the OpenMP implementations shipped with GCC and LLVM, and the HPX AMT runtime. The variants span classical fork-join, a collapsed fork-join that exposes additional inner-loop parallelism, synchronous tasking, and asynchronous tasking with explicit data dependencies. We benchmark all eight combinations on a dual-socket 128-core AMD Zen 2 node across multiple tile sizes and problem sizes. Our results show that across all variants, HPX outperforms OpenMP at the optimal tile size by 16%-39%. Specifically, asynchronous HPX tasks are up to 26% faster than their OpenMP counterparts, largely due to lower task-management overhead. Furthermore, the collapsed fork-join variants close most of the gap to synchronous tasking. Removing redundant synchronization barriers yields an additional improvement of 9% (OpenMP) to 15% (HPX). A GCC-versus-LLVM comparison further reveals compiler-specific differences in fork-join scheduling and task-creation overheads.
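To make the notion of a dependency graph of tile tasks concrete, the sketch below enumerates the kernels of a right-looking tiled Cholesky decomposition and assigns each one the earliest level at which it could run. It is not taken from Cholesky-Bench: the kernel names follow the usual BLAS/LAPACK naming, every kernel is assumed to have unit cost, and the function names are hypothetical.

```python
from collections import Counter


def cholesky_tasks(nt):
    """Tile kernels of a right-looking tiled Cholesky on an nt x nt tile grid.

    Each task is (kernel, extra_reads, written_tile); tiles are (row, col).
    The written tile is updated in place, so it is also read.
    """
    tasks = []
    for k in range(nt):
        tasks.append(("POTRF", [], (k, k)))               # factor diagonal tile
        for i in range(k + 1, nt):
            tasks.append(("TRSM", [(k, k)], (i, k)))      # solve panel tile
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):                 # trailing update
                kernel = "SYRK" if i == j else "GEMM"
                reads = [(i, k)] if i == j else [(i, k), (j, k)]
                tasks.append((kernel, reads, (i, j)))
    return tasks


def task_levels(tasks):
    """Earliest level of each task if it waits only for the last writer of its tiles."""
    last_writer_level = {}
    levels = []
    for _, reads, write in tasks:
        deps = [last_writer_level.get(tile, 0) for tile in reads + [write]]
        level = 1 + max(deps)
        last_writer_level[write] = level
        levels.append(level)
    return levels


if __name__ == "__main__":
    tasks = cholesky_tasks(4)
    levels = task_levels(tasks)
    # Number of tasks that could run concurrently at each level.
    print(len(tasks), max(levels), sorted(Counter(levels).items()))
```

Under the unit-cost assumption, the longest dependency chain for an nt x nt tile grid has 3*nt - 2 levels, and the number of tasks per level shows how much work is available to overlap at each point.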

Generated, Parallel, Scalable? A Study of Agentic AI-Generated Julia Code on Supercomputers

Authors: Linus Bantel, Anna-Lena Roth, Jonas Posner, Dirk Pfluger

Abstract: Julia is increasingly used in High-Performance Computing (HPC) as a single-language alternative to combining high-level scripting with low-level systems languages, but achieving scalable performance still requires expertise in parallel programming. Large Language Models (LLMs) are increasingly used for code generation and are advancing rapidly with each new version. Yet, existing studies focus on single-shot prompting rather than agentic settings, in which an LLM autonomously plans, generates, and refines code through tool use. Using an OpenCode-based agent extended with a Julia-documentation Model Context Protocol (MCP) server, we study agentic generation of parallel Julia code, focusing on task-based execution with Dagger.jl. We evaluate three LLMs, OpenAI GPT-5.5, Anthropic Claude Opus 4.7, and the open-weight Qwen3-Coder-Next, on three problems with distinct parallel structures: PI approximation, tiled general matrix multiplication, and tiled Cholesky decomposition. The generated Dagger.jl implementations are compared against agent-generated Base.Threads and MPI.jl baselines, with shared-memory experiments scaling to 192 cores and distributed-memory experiments on two nodes. The agents reliably produce executable code for small inputs but fail at larger scales due to deadlocks, oversubscription, or out-of-memory errors, with the open-weight model affected most severely. The two commercial models scale comparably on Base.Threads and MPI.jl, while their Dagger.jl implementations expose recurring weaknesses in task dependencies, granularity, and scheduling. Agentic AI is promising for producing parallel Julia code, but generating robust, performance-aware implementations for large-scale HPC systems remains an open challenge.

(Un)Prompted Futures - Unifying LLM Inference and Agent Orchestration with Tasks and Dynamic Data Dependencies

Authors: Rudiger Nather

Abstract: Agentic AI workflows, systems where multiple AI agents collaborate, invoke external tools, and make decisions through sequences of language model calls, exhibit structural properties familiar to the High-Performance Computing community: parallelism, data dependencies, dynamic task spawning, and heterogeneous resources. Yet current frameworks employ ad-hoc coordination mechanisms, such as shared mutable state, callbacks, or polling, without leveraging established Asynchronous Many-Task primitives. Compounding this, the field has evolved a layered architecture that separates inference engines from agent orchestration frameworks. This separation structurally limits the design space and prevents optimizations that require transparency between inference internals and the orchestration layer. We propose a unified future-based task graph model that dissolves this layering. In our model, both high-level workflows and low-level inference operations are expressed as composable graph patterns with explicit data dependencies. Futures serve as immutable, typed data carriers that track dependencies automatically, enabling arbitrary combinations of prompting techniques and orchestration patterns. We describe the model, show example task graph constructions for common patterns, and discuss optimization opportunities enabled by this unified approach.
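The idea of futures carrying data between workflow steps can be sketched with Python's standard `concurrent.futures` module. This is only an illustration of a future-based task graph, not the model proposed in the paper: `fake_llm` is a hypothetical stand-in for a language model call, and `spawn` is a hypothetical helper that resolves future-valued inputs before running a task.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def fake_llm(prompt):
    """Stand-in for a language model call; a real system would invoke an inference engine."""
    return f"<answer to: {prompt}>"


def spawn(pool, fn, *inputs):
    """Submit `fn` as a task; inputs that are futures are resolved before `fn` runs."""
    def run():
        args = [x.result() if isinstance(x, Future) else x for x in inputs]
        return fn(*args)
    return pool.submit(run)


def combine(a, b):
    """Join step: a further (fake) model call that consumes both drafts."""
    return fake_llm(f"combine {a} and {b}")


def run_workflow(question):
    """Fan-out/fan-in pattern: two independent drafts feed one combining step."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Tasks are submitted in dependency order, so a waiting task never
        # blocks the tasks it depends on.
        draft_a = spawn(pool, fake_llm, f"{question} (approach A)")
        draft_b = spawn(pool, fake_llm, f"{question} (approach B)")
        final = spawn(pool, combine, draft_a, draft_b)
        return final.result()


if __name__ == "__main__":
    print(run_workflow("What is an AMT runtime?"))
```

The dependencies are expressed purely by passing futures as arguments: the two drafts may run in parallel, and the combining task starts its work only once both results are available.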