Towards performance analysis of GPU-aware MPI over Angara interconnect

Abstract

One of the most important aspects of supercomputer development in the post-Moore era is the interconnect technologies that allow one to unite a multitude of processing elements into a well-synchronized computing system. Novel types of supercomputer interconnect require careful benchmarking and compliance with the requirements of modern hardware trends. GPU-based heterogeneous computing is one of the most important current avenues for building high performance computing systems, and the support of GPU-aware MPI technology is a requirement for any competitive interconnect. In this paper, we describe a UCX API based GPU-aware MPI implementation for the Angara interconnect. Performance analysis for peer-to-peer, MPI_Bcast and MPI_Reduce operations is presented, as well as for the rocHPL benchmark and for a typical biomolecular model within the LAMMPS molecular dynamics code. The deployment of the Desmos supercomputer equipped with both Angara and InfiniBand FDR allows us to make an accurate comparison of these two types of interconnect using the latter as a reference.

Keywords

low-latency communication, RDMA, broadcast, HPL, LAMMPS, performance measurement

Introduction

The growth in the computing power of supercomputers is provided not so much by the processor frequency as by the increase in the number of computational nodes and the number of cores. For this reason, the contribution of high-speed interconnect to the maximum computing performance of a supercomputer steadily increases. This trend makes it promising to create high-performance computing systems with interconnects that provide the lowest latency and the highest throughput.

There exist several interconnects that provide low latency (about 1μs) and high throughput (several tens of GBytes per second). Among the former leaders of this industry, we can mention Quadrics Petrini et al. (2002) (1996-2009), Myrinet Boden et al. (1995) (since 1995) and Intel Omni-Path Birrittella et al. (2015) (2015-2019). Currently, the market is dominated by InfiniBand (since 2000) with a small share of other types of interconnects.

InfiniBand is an open standard for high performance network communications. The most commonly used InfiniBand implementations rely on NVIDIA Mellanox hardware, with switches typically arranged in a fat tree topology. HDR 200 Gb/s is the sixth generation of the NVIDIA InfiniBand architecture. HDR has a low-level latency of 0.6μs, and the obtained MPI latency is 1μs Ruhela et al. (2020).

Cray (HPE) released the Slingshot interconnection network De Sensi et al. (2020). Slingshot uses an optimized Ethernet protocol, which allows it to be interoperable with standard Ethernet devices while providing high performance to HPC applications. Slingshot switches have ports with 200 Gbit/s each and support arbitrary network topologies; the default topology is Dragonfly Kim et al. (2008). The low-level latency of Slingshot is 1.85μs between two nodes.

Among other types of low-latency interconnects, we can mention the Extoll interconnect Nüssle et al. (2009); Neuwirth (2022), the Tofu interconnect Ajima et al. (2018) and the BullSequana eXascale Interconnect (BXI) Emmanuel et al. (2021). Currently, other types of interconnect aimed at exascale computing are in development Lu et al. (2022); Bossard (2023); Ammendola et al. (2024).

At the moment, we can refer to two types of interconnect in development in Russia: the Angara interconnect, developed by JSC NICEVT Mukosey et al. (2015); Simonov and Brekhov (2020), and the SMPO-10G, which is under development at the Russian Federal Nuclear Center – All-Russian Research Institute of Experimental Physics Basalov and Vyalukhin (2012). During the last few years, the Angara interconnect has had a history of practical deployment Akimov et al. (2018); Stegailov et al. (2019a); Goncharuk et al. (2022); Mukosey et al. (2024).

The rise of GPU computing increases the role of RDMA-based technologies that enable data exchanges among GPUs without explicitly copying the data from the GPU memory into the host memory. Such RDMA-based technology is therefore the basis for a GPU-aware MPI implementation. In this paper, we present the working prototype of such a technology for the Angara interconnect.

For the development of both low-level software and parallel algorithms, it is very beneficial to have tools for detailed benchmarking of their runtime behavior. Another instructive guide can be provided by a carefully designed comparison of new technologies with existing solutions. To use these design concepts in this work, we use the OSU benchmarks and the Score-P infrastructure to collect the low-level performance data. The performance of the MPI_Bcast collective operation is considered. We deploy the GPU-aware MPI based rocHPL implementation of the High Performance Linpack benchmark as our performance target and InfiniBand FDR interconnect as a reference in assessing the runtime behavior of the Angara interconnect.

Among different types of HPC application, molecular dynamics (MD) modeling is one of the most important use cases where GPU acceleration is very effective (e.g. Kondratyuk et al. (2021); Pavlov et al. (2024)). In this work, we use one of the most versatile MD codes, LAMMPS Thompson et al. (2022), with its Kokkos-based GPU backend to validate the capability of GPU-aware MPI over the Angara interconnect.

Related work

In the seminal paper of Ang Li et al. Li et al. (2019) the authors characterized and evaluated six types of modern GPU interconnects, including PCIe, NVLink-V1, NVLink-V2, NV-SLI, NVSwitch, and InfiniBand with GPUDirect-RDMA, using the Tartan Benchmark Suite over six GPU servers and HPC platforms: NVIDIA’s P100-DGX-1, V100-DGX-1, DGX-2, RTX2080-SLI systems, and ORNL’s SummitDev and Summit supercomputers, covering both Peer-to-Peer and Collective communication patterns.

Performance analysis tools for GPU-oriented HPC applications are a growing topic in the literature Zhou et al. (2020, 2021, 2022); Darche and Dagenais (2023). Visualization of profiling and tracing in CPU-GPU programs is a subject of current development Fiorini and Dagenais (2022), as well as analysis of the advantages and disadvantages of various GPU-related data-transfer modes Potluri et al. (2013); Li et al. (2023) in the broad context of the performance of HPC applications Mills et al. (2021); Azad et al. (2023); Tronge et al. (2023).

The development of the Score-P infrastructure was described by Dieter Mey et al. Mey et al. (2012) and its latest features aimed at performance-portable accelerated computing are discussed by Robert Dietrich and co-authors Dietrich et al. (2021). The HPCToolkit software exemplifies an alternative set of tools for versatile performance analysis of software running on supercomputers Adhianto et al. (2024).

The optimization of the collective broadcast operation is an active field of research Hoefler et al. (2007); Awan et al. (2018); Chu et al. (2018); Awan et al. (2019). MPI_Bcast optimization is shown to be beneficial for applications such as LAMMPS running on large-scale HPC systems Qi et al. (2025).

Using moderate-scale prototype supercomputers to benchmark novel hardware and software technologies was an idea behind the DEEP projects Kreuzer et al. (2018).

In this paper, we extend our recent preliminary work Ismagilov et al. (2025) and demonstrate results for a larger number of computing nodes over the Angara interconnect for the rocHPL benchmark. Here we consider the LAMMPS MD code as a real-life application that uses GPU-aware MPI. Comparison with InfiniBand FDR is extended to MPI_Bcast and MPI_Reduce collective operations. Two types of parallel performance analysis tools are used: Score-P and HPCToolkit. The results of microbenchmarks are compared with the behavior of real-life applications.

Hardware

The hybrid Desmos supercomputer in JIHT RAS was the first supercomputer based on the Angara network with a detailed analysis of its performance. In September 2018, Desmos (equipped with AMD FirePro S9150 GPUs) was ranked No. 45 in the Top50 list of supercomputers in Russia and CIS (the open-source HPL-GPU benchmark based on OpenCL Rohr et al. (2015) was used to run the HPL benchmark). The upgraded Desmos with AMD MI50 GPUs obtained No. 39 in the Top50 in March 2021 with the proprietary AMD HPL-GPU implementation and reached No. 37 in March 2023 with the open-source rocHPL implementation (in the last case the InfiniBand FDR interconnect was used, which was added to Desmos as an alternative interconnect to Angara).

The main details of the current configuration of the Desmos supercomputer used in this work are given in Table 1 and illustrated in Figure 1. A unique aspect of Desmos is the possibility to compare two different types of interconnect with all other hardware and software being identical. Despite the fact that the FDR generation of InfiniBand is not a modern one, it is instructive to use FDR as a reference since FDR links have a very similar data transfer speed to Angara links.

Table 1.

The main characteristics of the Desmos supercomputer.

Compute nodes	host [01–32]
Chassis	Supermicro 1018GR-T
Processor/Memory	Xeon E5-1650v3 6c/64 GB
GPU/ROCm version	AMD MI50 32 GB/5.3.3
Interconnect	Angara 4D-tor/InfiniBand FDR
OS	openSUSE 15.3
Kernel version	5.3.18-150300.59.106-default
MPI for Angara	OpenMPI 4.1.1 with UCX 1.10.0 over Angara API
MPI for InfiniBand FDR	OpenMPI 5.0 with UCX 1.14.1 over mlnx_ofed 4.9

Figure 1.

The scheme of the Desmos supercomputer with 32 nodes equipped with Angara, InfiniBand FDR and AMD MI50 GPUs. PCIe connectivity within a computational node is shown.

The benchmark results of this study have been obtained either with the default frequency policy of AMD MI50 GPUs computing units (that uses dynamic voltage and frequency scaling), or at a fixed computing units frequency of 1282 MHz (set via the rocm-smi options ‘setsclk 4’ and ‘setprofile COMPUTE’).

GPU-aware MPI over Angara

Low-level host-device data transmission mechanism

MPI is a standard API for distributed and parallel application development that can scale to multi-node clusters. To facilitate porting of applications to clusters with GPUs, the AMD ROCm software framework enables various technologies. Using these technologies, one can add GPU memory pointers to MPI calls and enable ROCm-aware MPI libraries to deliver optimal performance for both intra-node and inter-node GPU-to-GPU communication.

The AMD kernel driver exposes remote direct memory access (RDMA) through the PeerDirect interfaces. This allows network interface cards (NICs) to read and write directly to RDMA-capable GPUs device memory, resulting in high-speed direct memory access (DMA) transfers between GPUs and NICs. These interfaces are used for optimization of inter-node MPI message communications.

The Unified Communication Framework (UCX) is an open-source cross-platform framework that offers a unified set of communication interfaces for a variety of network programming models and interfaces. UCX utilizes ROCm technologies to execute different primitives of network operation. UCX is the standard communication library for InfiniBand and RDMA over Converged Ethernet (RoCE) network interconnects. To enhance data transfer performance, various MPI libraries (e.g. OpenMPI) can utilize UCX internally.

Khalilov et al. (2022) describes the first prototype implementation of an extended Angara API in UCX to support remote GPU memory reads and writes. For this purpose, the Linux kernel module angara_memreg was implemented to register both host and device memory. It allows users to pin allocated memory and get its physical address. This module was initially based on ROCm v3.7 Khalilov et al. (2022), but since ROCm v4.2 the ROCmRDMA module has been integrated into the ROCK kernel driver and has undergone some changes. As part of this work, the angara_memreg module has been adapted for ROCm v5.3.3.

In addition, in Khalilov et al. (2022), the implementations of three transfer protocols were presented for different message sizes: Eager, Segmentation-and-Reassembly (SAR) and Large Message Transfer (LMT). The Eager and SAR protocols, which are used for small and medium-sized messages, have a similar implementation using the angara_put blocking operation, which uses the CPU on a critical path when transferring data between GPUs. This is done by using an optimized version of the memcpy function to copy data between the device and host memory. The implementation of the LMT protocol, on the other hand, is based on put_zcopy/get_zcopy schemes using non-blocking versions of Put/Get operations (angara_offload_put/angara_offload_get), which allow sending data in a zero-copy fashion between GPUs without using a CPU and intermediate buffering. These operations are based on the angara_get non-blocking operation that implements data transmission from a source offset in one physical address space to a destination offset in another physical address space. If the destination of angara_get belongs to the processing element (PE) of the process that sent the request, it will be an angara_offload_get operation. In contrast, if the source of angara_get belongs to the PE of the process that sent the request, then it will be an angara_offload_put operation.

The data transmission confirmation mechanism for the put_zcopy/get_zcopy schemes presented in Khalilov et al. (2022) was found to be incorrect. Its deficiency resulted in data corruption or incomplete data transmission events. Figure 2 presents an updated mechanism for Angara, which determines the completion of sending and receiving operations based on Completion Markers (CMs). At first, a special CM memory is allocated on each node. This CM memory stores a predefined value. For the LMT protocol, using the get_zcopy scheme, a Get CM request is sent (via angara_offload_get) after all Get Data requests have been sent (also via angara_offload_get). The CM will be received only if all data on the node have been received, because angara_offload_get operations are performed in the order in which they were called. In the put_zcopy scheme, data is sent to another PE by the angara_offload_put operation. The PE then sends the CM to itself by the angara_offload_put operation to confirm that all data have been sent. The CM will be received only if all data have been sent, because angara_offload_put operations are performed in the order in which they were called.
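The ordering argument behind the Completion Markers can be illustrated with a toy model. The sketch below is not the Angara API: OrderedChannel, get_zcopy and CM_VALUE are hypothetical names, and the only property taken from the text is that requests complete strictly in the order in which they were issued.

```python
from collections import deque

CM_VALUE = 0xC0FFEE  # hypothetical predefined marker value stored in CM memory


class OrderedChannel:
    """Toy channel that completes requests strictly in issue order."""

    def __init__(self):
        self._pending = deque()

    def issue(self, action):
        """Queue a non-blocking request (a callable that performs the copy)."""
        self._pending.append(action)

    def progress(self, steps=1):
        """Complete up to `steps` pending requests, oldest first."""
        for _ in range(min(steps, len(self._pending))):
            self._pending.popleft()()


def get_zcopy(channel, remote_data, remote_cm, local_buf, local_cm, chunk):
    """Issue one Get Data request per chunk, then a single Get CM request."""
    for off in range(0, len(remote_data), chunk):
        def copy(off=off):
            local_buf[off:off + chunk] = remote_data[off:off + chunk]
        channel.issue(copy)

    def fetch_cm():
        local_cm[0] = remote_cm[0]

    channel.issue(fetch_cm)  # issued last, so it completes last
```

In this model, the receiver only needs to poll local_cm[0]: once it equals CM_VALUE, every earlier Get Data request has already completed.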

Figure 2.

The updated transmission confirmation mechanism for put_zcopy and get_zcopy data transfer schemes for Angara. Each Angara network endpoint is numbered and called a Processing Element (PE).

The get_zcopy scheme has a peculiarity: it is impossible to perform the angara_offload_get operation in its own memory. To resolve this issue, one of the ports on the Angara network adapter has been configured in debug mode, which returns all the data sent to that port (loopback). The Angara network software stack has been adapted to enable this functionality.

These corrections allowed the OSU Micro Benchmarks 7.5 communication tests, the rocHPL benchmark and LAMMPS to work using the GPU-aware MPI as implemented in OpenMPI with UCX.

MPI_Bcast and MPI_Reduce benchmarks

In the previous paper Ismagilov et al. (2025), a comparison of latency and bandwidth was presented for Host-to-Host and Device-to-Device peer-to-peer packet transfers over Angara and over InfiniBand FDR. Both interconnects showed similar results on Desmos. Here, we extend this comparison to the MPI_Bcast and MPI_Reduce collective operations.

OpenMPI includes different optimized algorithms for collective operations Bernholdt et al. (2024) (under the ‘coll tuned’ subset of the OpenMPI’s modular component architecture). Nine ‘coll tuned’ options for MPI_Bcast and seven options for MPI_Reduce have been compared both for Angara and for InfiniBand FDR (Figure 3). Moreover, NVIDIA provides an optimized standalone hcoll library for collective communication over the InfiniBand interconnect that can be integrated into any MPI. Figure 3 shows that using OpenMPI with hcoll with its default parameters gives the best performance for MPI_Bcast at small and large message sizes. However, for MPI_Reduce, the performance of hcoll is slightly worse than that of the algorithms in_order_binary (6) and rabenseifner (7).

Figure 3.

The benchmark results of osu_latency, osu_bw, osu_bcast and osu_reduce for Device-to-Device peer-to-peer packet transfers over Angara and over InfiniBand FDR (dashed lines show message sizes of about 100 Mb typical for broadcast operations in rocHPL runs considered and the 63 kB message size typical for our example model in LAMMPS).

Designing a collective communication algorithm that comprehensively takes into account hardware realities, such as network topology and support for network features, is inherently complex and goes beyond the scope of this work. Previously, we have estimated that topology-aware algorithms for the Angara interconnect will be beneficial at a number of nodes significantly higher than the 32 available in Desmos. In this study, we assess available implementations with their default settings.

MPI analysis: Score-P and HPCToolkit

In the rapidly growing field of high-performance computing (HPC), optimizing application performance is critical to achieving efficient use of computing resources. Performance analysis tools are essential for identifying bottlenecks, improving code efficiency, and ensuring scalability. Score-P (Scalable Performance Measurement Infrastructure for Parallel Codes) is one such tool that has gained significant traction in the HPC community. It is a universal and scalable performance measurement infrastructure designed to support a wide range of HPC applications. Score-P combines profiling, tracing and online parallel program analysis capabilities to provide a comprehensive environment for performance tuning.

Score-P supports various instrumentation techniques to suit different application needs and levels of user expertise. Automatic source code instrumentation uses tools like OPARI2 and PDT (Program Database Toolkit) to insert probes into the source code with minimal manual intervention. Compiler-based instrumentation directs the compiler to include performance measurement probes during the compilation process, ensuring precise and low-overhead instrumentation. Manual instrumentation allows advanced users to insert Score-P API calls into their code to measure specific regions of interest, providing fine-grained control over the performance data collected.

Another performance measurement tool we have used in our work is HPCToolkit, version 2023.03. While Score-P primarily uses automated source and binary instrumentation to collect detailed event traces, HPCToolkit relies on so-called asynchronous sampling of the program’s performance metrics, which is a fundamentally different approach.

When a hardware performance counter overflows or a timer event occurs, HPCToolkit interrupts the program execution, captures and unwinds the call stack, attributes the performance cost to the calling context, and then resumes the program. Later, these data can be analyzed to create a hierarchical profile.
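The attribution step can be sketched in a few lines of Python. This is an illustration of the general idea of building a hierarchical profile from sampled call stacks, not HPCToolkit code; the function name build_cct and the sample format are assumptions made for this example.

```python
from collections import defaultdict


def build_cct(samples):
    """Aggregate sampled call stacks into per-context costs.

    samples: iterable of (stack, cost), where stack is a tuple of frame
    names ordered from the outermost caller to the innermost callee.
    Returns (inclusive, exclusive) dictionaries keyed by calling context.
    """
    inclusive = defaultdict(float)
    exclusive = defaultdict(float)
    for stack, cost in samples:
        # Every prefix of the stack is a calling context that was active
        # when the sample fired, so it receives the inclusive cost.
        for depth in range(1, len(stack) + 1):
            inclusive[tuple(stack[:depth])] += cost
        # Only the innermost context receives the exclusive cost.
        exclusive[tuple(stack)] += cost
    return dict(inclusive), dict(exclusive)
```

A context that never appears in any sample gets no entry at all, which is the reason why short-lived or infrequent routines can be missed at low sampling rates.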

The main advantage of this approach is its low overhead Froyd et al. (2005); Lehr et al. (2017). It does not require any binary modification or recompilation for measurement. One only needs to compile the program with debug symbols, which can be done without disabling compiler optimizations, allowing for the analysis of highly optimized code.

One of the disadvantages is that, if the sampling rate is too low, it can miss short-lived performance events or infrequent routines. In contrast, Score-P’s instrumentation can provide precise counts for such events.

We have built the latest development version of Score-P (9.0-dev) due to multiple bugs related to ROCm found in older versions. For rocHPL and LAMMPS instrumentation we have built this tool with OpenMPI 4.1.1/5.0, ROCm 5.3.3 and the AMD clang v.15.0.0 compiler. To instrument rocHPL with Score-P support, we have edited the install.sh script (provided by the rocHPL developers for cmake). For compatibility reasons, -gdwarf-4 should be added to CMAKE_CXX_FLAGS at the cmake configuration stage of rocHPL. When rocHPL is compiled with make, the following Score-P instrumenter flags are used: --hip --thread=omp:ompt --mpp=mpi. For LAMMPS, the --hip --kokkos flags are used. On the Desmos supercomputer, the HPCToolkit version 2023.03-stable is deployed for analysis.

rocHPL performance analysis

The HPL benchmark plays a significant role in the evaluation of HPC systems performance. However, its open-source GPU-oriented implementations are scarce. There was an open-source HPL-GPU implementation based on OpenCL Rohr et al. (2015). This implementation was based on the GPU-offloading paradigm, and very careful tuning of the CPU-GPU workload distribution was required in order to reach maximum performance. AMD has published a new implementation, rocHPL, with the focus on HIP technology. This implementation uses GPU-aware MPI capabilities for data transfers between different GPUs. In this work, we compare the performance of rocHPL execution on Desmos over the InfiniBand FDR and Angara interconnects, use Score-P and HPCToolkit to collect the execution traces, and use Python toolchains for analysis of traces in OTF2 format.

rocHPL has two variants for broadcast data transfers in its HPL_bcast function: through peer-to-peer operations or through the MPI_Bcast collective operation. In the former case, one can choose one of the six algorithms (in HPL.dat configuration file for rocHPL). In the latter case, an algorithm of MPI_Bcast should be chosen among the variants provided by an MPI implementation (e.g. OpenMPI in our tests). rocHPL performance dependence on the GPU frequency (sclk) is shown in Figure 4.
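To make the peer-to-peer variant more concrete, the sketch below lists the point-to-point hops of an increasing-ring broadcast, which is the pattern behind the 1ring option in HPL. It is a simplified illustration: the function name is hypothetical, and the modified (M) and long (Blong) variants are not covered.

```python
def one_ring_hops(q, root=0):
    """Return the (src, dst) pairs of an increasing-ring broadcast.

    q is the number of ranks that take part in the broadcast and root is
    the rank that owns the data. Each rank forwards the message to its
    right-hand neighbor, so the message needs q - 1 sequential hops to
    reach the last rank.
    """
    if q < 1 or not 0 <= root < q:
        raise ValueError("need q >= 1 and 0 <= root < q")
    return [((root + i) % q, (root + i + 1) % q) for i in range(q - 1)]
```

For example, one_ring_hops(4, root=2) gives the hops 2 to 3, 3 to 0 and 0 to 1.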

Figure 4.

The total performance of rocHPL on 32 nodes of Desmos in TFlops for different sclk settings (0–8) of the AMD MI50 GPU driver (sclk level 4 is marked by dashed lines).

Benchmarking results

The rocHPL benchmark can be configured using the HPL.dat input file. The most significant parameters in HPL.dat are P, Q, N and NB. P and Q represent the number of rows and columns in the MPI grid. Product P × Q represents the number of MPI processes used by HPL. To achieve better efficiency, P should be configured to be smaller than or equal to Q. Setting P and Q as close to a perfect square as possible may also improve performance. N represents the total number of rows and columns in the global matrix, with the best performance coming from matrices that use about 80% or more of total memory. The value of NB, which represents the panel size in the blocking algorithm, can be configured on the basis of CPU architecture. We configure NB = 384 for our tests. In addition, the internal implementation of the broadcasting algorithm within the HPL_bcast function can be chosen. This algorithm can be configured as a collective MPI_Bcast if the HPL_USE_COLLECTIVES macro is defined, otherwise it will be based on MPI peer-to-peer transfers. The type of peer-to-peer algorithm depends on the value of the BCASTs parameter in HPL.dat: 1ring, 1ringM, 2ring, 2ringM, Blong and BlongM. We will compare the collective and peer-to-peer implementations of this algorithm. Also, for Angara, the UCX_RNDV_THRESH parameter is set to 2048. For InfiniBand FDR, this parameter is left at its default value (auto).

UCX_RNDV_THRESH specifies the threshold at which the UCX communication framework transitions to the rendezvous protocol for data transfer. In the case of Angara, this parameter was determined empirically through comparative throughput measurements. Given that the efficiency of the Eager and SAR protocols is strongly influenced by the CPU, the optimal threshold value is not universal and may vary significantly across different hardware.
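The size-based choice of protocol can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: that the LMT protocol is the one selected at and above the rendezvous threshold, and that the boundary between Eager and SAR is 256 bytes (a purely hypothetical value). Only the threshold of 2048 bytes comes from the configuration used for Angara.

```python
def select_protocol(msg_size, eager_limit=256, rndv_thresh=2048):
    """Pick a transfer protocol name for a message of msg_size bytes.

    eager_limit is a hypothetical Eager/SAR boundary chosen for this
    example; rndv_thresh mirrors UCX_RNDV_THRESH=2048 used for Angara.
    """
    if msg_size < 0:
        raise ValueError("msg_size must be non-negative")
    if msg_size >= rndv_thresh:
        return "LMT"    # zero-copy, no CPU on the critical path
    if msg_size > eager_limit:
        return "SAR"    # CPU-driven, segmented copy
    return "Eager"      # CPU-driven, single copy
```

Because Eager and SAR keep the CPU on the critical path, moving rndv_thresh shifts the point at which the transfer stops depending on host CPU speed.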

The first results of the rocHPL runs are summarized in Table 2. These results were obtained for up to 32 computing nodes with 6 OpenMP threads per MPI process. The matrix size N was selected so that more than 95% of the VRAM on each node is filled. For all tests in this table, the VRAM was 96% full. Also, the BCASTs parameter was set to BlongM. It can be seen that the performance results for all node combinations are similar for InfiniBand FDR and Angara.

Table 2.

Results of the rocHPL runs in TFlops at the default GPU driver settings (the subsets of Desmos nodes were identical for the runs via Angara and via InfiniBand FDR).

MPI ranks	P	Q	N	InfiniBand FDR	Angara
1	1	1	63360	4.8	4.8
2	1	2	89472	9.1	9.2
4	2	2	125952	17.5	17.6
8	2	4	178944	31.9	32.4
16	4	4	251904	64.1	64.9
32	4	8	356352	119.5	124.0
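The rules of thumb for P, Q and N can be written down as a short sketch. Both helper names are hypothetical. The grid rule reproduces the P and Q columns of Table 2; the matrix-order helper only accounts for the N × N double-precision matrix and ignores the additional workspace that rocHPL allocates, so it is not expected to reproduce the N values of Table 2 (which are, however, all multiples of NB = 384).

```python
import math


def process_grid(ranks):
    """Return (P, Q) with P * Q == ranks, P <= Q and P as large as possible."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p


def matrix_order(total_mem_bytes, fraction, nb):
    """Largest multiple of nb such that an N x N float64 matrix
    fits into `fraction` of the total memory (workspace is ignored)."""
    n_max = math.isqrt(int(fraction * total_mem_bytes) // 8)
    return (n_max // nb) * nb
```

For 32 ranks, process_grid returns (4, 8), the configuration used in the largest runs.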

In the case of the collective MPI_Bcast implementation in the HPL_bcast function, several OpenMPI --mca options can be tuned in an attempt to increase performance. There is the coll_tuned_bcast_algorithm option, which defines what kind of broadcast algorithm to use if the coll_tuned_use_dynamic_rules option is enabled. These algorithms include: ignore, basic_linear, chain, pipeline, split_binary_tree, binary_tree, binomial, knomial, scatter_allgather, and scatter_allgather_ring. It is also possible to switch the coll_hcoll_enable option for InfiniBand FDR. This option enables the use of the hcoll (Hierarchical Collectives) library, which accelerates collective blocking and non-blocking MPI operations. For Angara, hcoll is not supported by the hardware. We use the default hcoll options because the default settings for hcoll should be optimal for most InfiniBand systems. The results of the rocHPL runs for collective and peer-to-peer implementations of the broadcasting algorithm in the HPL_bcast function with coll tuned and hcoll OpenMPI options are summarized in Table 3. The table shows that the best performance is provided by a peer-to-peer implementation of the algorithm (1ring). This is followed by an MPI_Bcast implementation with hcoll enabled, and then an MPI_Bcast implementation with the coll tuned algorithm scatter_allgather. For the peer-to-peer implementation, the lowest performance is provided by the Blong and BlongM algorithms, and for the collective implementation, the worst performance is provided by the ignore and basic_linear OpenMPI coll tuned options. In addition, for all GPUs, the sclk parameter was fixed using the rocm-smi tool to 1282 MHz, and the profile was set to COMPUTE, which helps synchronize GPUs by setting the kernel clock frequency to a constant value. This avoids thermal throttling and synchronizes communication steps in the HPL algorithm (see below), leading to decreased fluctuations in final results.
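As an illustration of how these options are combined on the command line, the sketch below assembles an mpirun argument list. The helper name and the executable path ./rochpl are hypothetical; the MCA parameter names and the UCX_RNDV_THRESH variable are the ones discussed in the text, passed with the standard --mca and -x options of the OpenMPI launcher.

```python
def mpirun_command(ranks, executable, bcast_algorithm=None,
                   use_hcoll=None, rndv_thresh=None):
    """Build an mpirun argument list for one benchmark configuration."""
    cmd = ["mpirun", "-np", str(ranks)]
    if bcast_algorithm is not None:
        # A fixed algorithm is honored only when dynamic rules are enabled.
        cmd += ["--mca", "coll_tuned_use_dynamic_rules", "1",
                "--mca", "coll_tuned_bcast_algorithm", str(bcast_algorithm)]
    if use_hcoll is not None:
        cmd += ["--mca", "coll_hcoll_enable", "1" if use_hcoll else "0"]
    if rndv_thresh is not None:
        # Export the UCX setting to all ranks.
        cmd += ["-x", f"UCX_RNDV_THRESH={rndv_thresh}"]
    cmd.append(executable)
    return cmd
```

A run over Angara with scatter_allgather would then correspond to mpirun_command(32, "./rochpl", bcast_algorithm=8, rndv_thresh=2048), and the resulting list can be passed to subprocess.run.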

Table 3.

Results of the rocHPL runs in TFlops for collective and peer-to-peer implementations of the broadcasting algorithm in the HPL_bcast function with OpenMPI using either different algorithms within ‘coll tuned‘ or the hcoll library (32 nodes with sclk = 4, P = 4, Q = 8, N = 356352, NB = 384).

HPL_bcast algorithm	InfiniBand FDR	Angara
Peer-to-peer transfers
1ring	123.4	123.2
1ringM	122.5	123.1
2ring	122.6	123.1
2ringM	122.3	123.1
Blong	120.5	120.4
BlongM	119.2	121.1
MPI_Bcast with OpenMPI (coll tuned)
ignore (0)	68.8	101.4
basic_linear (1)	68.8	101.6
chain (2)	85.5	115.7
pipeline (3)	109.8	112.4
split_binary_tree (4)	116.3	118.0
binary_tree (5)	97.1	109.8
binomial (6)	85.4	110.8
knomial (7)	102.3	112.5
scatter_allgather (8)	120.9	121.6
scatter_allgather_ring (9)	120.5	119.2
MPI_Bcast with OpenMPI (hcoll)
default hcoll settings	122.1	—

Bold in Table 3 marks the best result in each group of algorithms.

To relate the osu_bcast results to the performance of rocHPL, it is necessary to note that: (1) the communication transmitted in rocHPL is determined by the parameter Q, which is why we present MPI_Bcast results for 8 nodes in Figure 3, and (2) the typical message sizes broadcast in the rocHPL runs are about 100 Mb. The rocHPL benchmarks described allow us to draw the following conclusions.

(1) There is a strong correlation between the performance of osu_bcast and the performance of rocHPL with respect to the choice of the communication algorithm. The scatter_allgather (8) algorithm provides the best results for Angara and for InfiniBand FDR.

(2) The scatter of rocHPL performance with respect to different communication algorithms is less pronounced for Angara interconnect than for InfiniBand FDR. In general, we can explain this fact by the better connectivity of nodes for the Angara case, where each node has 5 links that go to neighboring nodes.
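The second conclusion can be checked directly against the coll tuned rows of Table 3. The sketch below copies these ten values per interconnect and computes two simple measures of scatter; the choice of range and population standard deviation is ours and is only one possible way to quantify it.

```python
from statistics import pstdev

# rocHPL performance in TFlops for coll tuned algorithms 0-9 (Table 3)
INFINIBAND_FDR = [68.8, 68.8, 85.5, 109.8, 116.3, 97.1, 85.4, 102.3, 120.9, 120.5]
ANGARA = [101.4, 101.6, 115.7, 112.4, 118.0, 109.8, 110.8, 112.5, 121.6, 119.2]


def spread(values):
    """Return (range, population standard deviation) of the results."""
    return max(values) - min(values), pstdev(values)


ib_range, ib_std = spread(INFINIBAND_FDR)
angara_range, angara_std = spread(ANGARA)
```

The range of results is 52.1 TFlops for InfiniBand FDR and 20.2 TFlops for Angara.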

Table 4 summarizes the results of the rocHPL runs for Angara with and without Score-P, using some of the best parameters to achieve maximum performance (scatter_allgather for collective and 1ringM for peer-to-peer). These results were also obtained with the fixed GPU clock frequency and profile configuration. It is notable that Score-P instrumentation has very little influence on rocHPL performance. The influence of HPCToolkit is even less pronounced: it is below one percent, which is less than the statistical variations. That is why it is quite reasonable to discuss the statistical analysis of the rocHPL execution traces with respect to the real rocHPL performance.

Table 4.

Results of the rocHPL runs in TFlops for Angara with and without Score-P instrumentation (32 nodes with sclk = 4, P = 4, Q = 8, N = 356352, NB = 384).

HPL_bcast algorithm	With Score-P	Without Score-P
1ringM	121.3	123.2
scatter_allgather (8)	119.2	121.6

rocHPL execution traces

The fragments of the rocHPL execution traces obtained over Angara are presented in Figure 5. In order to analyze the relation between the interconnect behavior and rocHPL performance, we focus our attention on the HPL_bcast function of rocHPL. Figure 5 illustrates HPL_bcast calls among 32 MPI ranks in two steps of rocHPL execution (approximately 10 seconds after its start). We analyze the distribution of the HPL_bcast duration values (D). Figure 6 shows the distributions of the D values.

Figure 5.

The fragments of two rocHPL runs over Angara: the upper trace corresponds to the peer-to-peer 1ringM variant and the lower trace stands for the collective variant of HPL_bcast. The duration values of HPL_bcast for MPI-ranks 22 and 28 at the same HPL step (D₂₂ and D₂₈) are shown by arrows. For the peer-to-peer case different stages of HPL_bcast are marked: (a) — MPI_Waitany, (b) — MPI_Send + MPI_Waitany, (c) — MPI_Recv.

Figure 6.

The distributions of durations (D) collected for the whole rocHPL runs (32 nodes, P = 4, Q = 8, N = 356352, NB = 384) for the peer-to-peer case (left) and for the MPI_Bcast case (right). In each case, the comparison with the fixed GPU frequency (sclk = 4) is shown.

We hypothesize that the difference in HPL_bcast entry times is connected with the variation in the frequency of the GPU. This hypothesis is in agreement with the distributions obtained.

In both cases, the number of broadcasts that are not completed by 0.38 seconds is reduced by approximately a factor of two when the GPU clock is fixed. Therefore, frequency control could contribute not only to energy savings directly but indirectly as well via improvement of synchronization (e.g., see Bratek et al. (2023)).
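The tail statistic used here can be sketched as follows. The event tuples are a hypothetical stand-in for the enter and leave records of HPL_bcast extracted from an OTF2 trace; reading the trace itself is not shown, and the function names are ours.

```python
def bcast_durations(events):
    """Return the duration D of each HPL_bcast call.

    events: iterable of (rank, enter_time, leave_time) in seconds.
    """
    return [leave - enter for _rank, enter, leave in events]


def fraction_longer_than(durations, threshold):
    """Fraction of calls whose duration exceeds threshold (in seconds)."""
    if not durations:
        return 0.0
    return sum(d > threshold for d in durations) / len(durations)
```

Comparing fraction_longer_than(D, 0.38) for a run with the default frequency policy and for a run with the fixed GPU clock gives the ratio discussed above.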

LAMMPS performance analysis

As an example of a real-life problem, we consider molecular dynamics modeling in LAMMPS. The model is a biomolecular system: hEGFR Dimer of 1IVO and 1NQL, HECBioSim (2025). Different types of atoms are present in the model: 21,749 protein atoms, 134,268 lipid atoms, 309,087 water atoms and 295 ions. The total number of atoms is 465,399. The model combines both short-range interatomic potentials and long-range electrostatics. We use the Kokkos GPU backend of LAMMPS, which is based on GPU-aware MPI technology Thompson et al. (2022); Hagerty et al. (2023).

Here we consider this model in LAMMPS executed on 16 nodes of Desmos with fixed GPU frequencies (sclk 4, profile COMPUTE). Figure 7 shows two trace fragments of 200 μs (much shorter than one MD step) from the LAMMPS calculations considered. Vampir is used for visualization Williams and Brunst (2023). MPI communications in LAMMPS are dominated by peer-to-peer transfers. With the chosen system size and number of nodes, the prevailing message size is 63 kB, where osu_bw shows the largest difference between InfiniBand FDR and Angara (Figure 3). Interestingly, in these 200 μs fragments the data transfer rates given by osu_bw for this message size are reached neither for InfiniBand FDR nor for Angara. The peer-to-peer communication pattern is rather stochastic, and there are extended periods of time between the start of an MPI_Send and the start of the corresponding MPI_Irecv. Therefore, we can conclude that in the particular case considered, the data transfer rates measured by Score-P are determined by the LAMMPS data transfer algorithm and not by the interconnect itself. Figure 8 shows the maximum and average data transfer rates for the same MD runs on a scale of 13 s. Here we see that the average data transfer rate is identical for the two interconnects. However, Angara demonstrates somewhat higher maximum data transfer rates, in agreement with the osu_bw results for 63 kB.
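To make the notion of maximum and average data transfer rates per time bin concrete, the sketch below computes both from a list of message records. It is only an illustration under stated assumptions: each record is a hypothetical `(start, end, size)` tuple, a message is assigned to the bin in which it starts, 63 kB is taken as 63,000 bytes, and the sample timings are invented. It is not a description of how Vampir or Score-P compute their statistics.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (start time in s, end time in s, size in bytes).
Message = Tuple[float, float, int]


def binned_rates(messages: Iterable[Message],
                 bin_width: float) -> Dict[int, Tuple[float, float]]:
    """Map bin index -> (max rate, mean rate) in bytes/s.

    A message is assigned to the bin in which it starts; its rate is
    simply size divided by its duration.
    """
    per_bin: Dict[int, List[float]] = defaultdict(list)
    for start, end, size in messages:
        if end <= start:
            continue  # skip degenerate records
        per_bin[int(start // bin_width)].append(size / (end - start))
    return {b: (max(r), sum(r) / len(r)) for b, r in sorted(per_bin.items())}


# Invented timings for three 63,000-byte messages (NOT measured data).
example = [
    (0.0, 10e-6, 63_000),
    (20e-6, 50e-6, 63_000),
    (120e-6, 141e-6, 63_000),
]
rates = binned_rates(example, bin_width=100e-6)
for bin_index, (max_rate, mean_rate) in rates.items():
    print(f"bin {bin_index}: max {max_rate / 1e9:.2f} GB/s, "
          f"mean {mean_rate / 1e9:.2f} GB/s")
```

With such a definition, a few fast transfers raise the maximum in a bin while long waits between matching calls pull the average down, which is the kind of gap between the two charts discussed above.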

Figure 7.

Two 200 μs fragments of LAMMPS traces with MPI communications visualized by Vampir for the biomolecular model considered (hEGFR Dimer of 1IVO and 1NQL) running on 16 nodes of Desmos via InfiniBand FDR and via Angara. In each case, the lower bar chart is synchronized with the trace timeline and shows the data transfer rate statistics (the pink chart shows the maximum rates and the orange chart shows the average rates in the corresponding time bin).

Figure 8.

Two 13 s fragments of LAMMPS traces for the same MD runs via InfiniBand FDR and via Angara as in Figure 7. Similarly, the pink and orange charts show the maximum and average data transfer rates.

Statistical averaging over 5 MD runs gives the following calculation speed values: 4.77 ± 0.03 ns/day for FDR and 4.79 ± 0.02 ns/day for Angara. The percentage of communication time reported by LAMMPS is 8.44 ± 0.06% for FDR and 6.38 ± 0.02% for Angara. These small differences agree with the observed fact that for the case considered, the peer-to-peer exchanges in LAMMPS are limited neither by InfiniBand FDR nor by Angara data transfer speeds.
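The comparison of these averaged values can be written out explicitly. The sketch below is a simple illustration that assumes the quoted ± values are independent one-sigma uncertainties, an assumption not stated in the text, and combines them in quadrature. The helper name `difference_with_uncertainty` is hypothetical.

```python
import math
from typing import Tuple


def difference_with_uncertainty(a: float, sigma_a: float,
                                b: float, sigma_b: float) -> Tuple[float, float]:
    """Return (b - a, combined uncertainty), assuming independent errors."""
    return b - a, math.hypot(sigma_a, sigma_b)


# Values quoted in the text: FDR first, Angara second.
speed_diff, speed_err = difference_with_uncertainty(4.77, 0.03, 4.79, 0.02)  # ns/day
comm_diff, comm_err = difference_with_uncertainty(8.44, 0.06, 6.38, 0.02)    # percent

print(f"speed difference: {speed_diff:+.2f} +/- {speed_err:.3f} ns/day")
print(f"communication share difference: {comm_diff:+.2f} +/- {comm_err:.3f} points")
```

Under this assumption, the difference in calculation speed is smaller than its combined uncertainty, while the communication share differs by about two percentage points.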

Conclusions

The implementation of GPU-aware MPI technology for the Angara interconnect based on the UCX API has been described in detail.

In this work, our previous peer-to-peer communication tests Ismagilov et al. (2025) have been extended to the collective MPI_Bcast and MPI_Reduce operations. All broadcast and reduce algorithms implemented in OpenMPI have been considered for both Angara and InfiniBand FDR. The scatter_allgather algorithm shows the smallest latency for both interconnects, with Angara being slightly faster for large message sizes. The hcoll optimized broadcast implementation has also been tested for InfiniBand FDR and showed the smallest latency for large message sizes. For MPI_Reduce, the hcoll implementation does not provide the lowest latencies for InfiniBand FDR.

The rocHPL runs with default parameters of the GPU driver and OpenMPI show quite similar performance for both types of interconnect, with Angara giving slightly more TFlops. After fixing the GPU frequency, different variants of the broadcast implementation in rocHPL have been compared: the 1ring implementation based on peer-to-peer operations gives the best results both for Angara and for InfiniBand FDR. The ranking of rocHPL runs with the broadcast implementation based on the collective MPI_Bcast matches the ranking of the osu_bcast latencies for large message sizes quite well: hcoll gives the highest score, Angara with scatter_allgather is second, and InfiniBand FDR with scatter_allgather is last.

The rocHPL application has been instrumented with the Score-P infrastructure and additionally traced by HPCToolkit, and its performance has been studied for Angara and for InfiniBand FDR in terms of runtime traces for all 32 nodes of Desmos. The analysis shows that the overhead of using the Score-P infrastructure is rather small for rocHPL.

Statistical analysis of the rocHPL traces confirms that fixing the GPU operating frequency improves synchronization and decreases the duration of the longest collective data transfers.

An example of a biomolecular model in LAMMPS with the Kokkos GPU backend running on 16 nodes of Desmos has been considered as well. The peer-to-peer communication pattern in this case is rather stochastic, and the data transfer rates measured by Score-P are determined by the LAMMPS data transfer algorithm and not by the interconnect itself. The small difference in LAMMPS performance agrees with the observation that, for the case considered, the peer-to-peer exchange rates in LAMMPS are limited neither by InfiniBand FDR nor by Angara.

Footnotes

ORCID iDs

Timur Ismagilov

Felix Smirnov

Vladimir Stegailov

Alexey Timofeev

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was supported by the Russian Science Foundation grant No. 20-71-10127. The instrumentation of rocHPL and the analysis of its traces were performed by Felix Smirnov within the framework of the HSE University Basic Research Program. The analysis of LAMMPS performance was performed with the support of the Ministry of Science and Higher Education of the Russian Federation (State Assignment No. 075-00269-25-00).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author biographies

Timur Ismagilov graduated from the Lomonosov Moscow State University in 2011, received a PhD degree in 2017 and is a lead software engineer at the Joint Institute for High Temperatures of the Russian Academy of Sciences. His research interests include high-performance computing, software development, and low-level programming.

Anatoly Mukosey graduated from the Lomonosov Moscow State University in 2014, received a PhD degree in 2023, and is a software engineer at the Joint Institute for High Temperatures of the Russian Academy of Sciences. His research interests include high-performance computing, software development, and low-level programming.

Felix Smirnov graduated from the Moscow Institute of Electronics and Mathematics of the HSE University in 2025 and is an intern researcher in the International Laboratory for Supercomputer Atomistic Modeling and Multi-scale Analysis of the HSE University. His research interests include high-performance computing, low-level programming, and mathematical modeling.

Vladislav Galigerov graduated from the Moscow Institute of Electronics and Mathematics of the HSE University in 2025 and is now a junior researcher at the Joint Institute for High Temperatures of the Russian Academy of Sciences, Moscow, Russia. His research interests include high-performance computing, general-purpose GPU computing, and atomistic simulations.

Yuri Grishichkin graduated from the Moscow Institute of Electronics and Mathematics in 1997 and is a high-performance computing engineer at the Joint Institute for High Temperatures of the Russian Academy of Sciences. His research interests include low-level performance and fault tolerance tuning of computing infrastructure.

Vladimir Stegailov graduated from the Moscow Institute of Physics and Technology in 2004, received a PhD degree in 2005 and a DrSc degree in 2012. He is Head of Department at the Joint Institute for High Temperatures of the Russian Academy of Sciences, Moscow, Russia. He is a professor of HSE University and a leading researcher in the International Laboratory for Supercomputer Atomistic Modeling and Multi-scale Analysis. He is a professor in the Moscow Institute of Physics and Technology, where he leads the Laboratory of Supercomputer Methods in Condensed Matter Physics. His research interests include high-performance computing and atomistic and multiscale modeling and simulation.

Alexey Timofeev graduated from the Moscow Institute of Physics and Technology (MIPT) in 2010 and received his PhD in 2011. He currently serves as Head of Laboratory and Deputy Director for Research at the Joint Institute for High Temperatures of the Russian Academy of Sciences, as well as Associate Professor at MIPT and the HSE University. Also, he is a leading research fellow at the International Laboratory for Supercomputer Atomistic Modeling and Multi-scale Analysis in HSE University. His research interests include multi-scale modeling and high-performance computing.

References

Adhianto, Anderson, Barnett, et al. (2024) Refining HPCToolkit for application performance analysis at exascale. The International Journal of High Performance Computing Applications 38(6): 612–632.

Ajima, Kawashima, Okamoto, et al. (2018) The Tofu interconnect D. 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 646–654.

Akimov, Silaev, Aksenov, et al. (2018) FlowVision scalability on supercomputers with Angara interconnect. Lobachevskii Journal of Mathematics 39(9): 1159–1169.

Ammendola, Biagioni, Chiarini, et al. (2024) Outlines in hardware and software for new generations of exascale interconnects. In: EPJ Web of Conferences. EDP Sciences, Vol. 295, 10006.

Awan, Chu, Subramoni, et al. (2018) Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL? Proceedings of the 25th European MPI Users’ Group Meeting. ACM, 1–9.

Awan, Manian, Chu, et al. (2019) Optimized large-message broadcast for deep learning workloads: MPI, MPI+NCCL, or NCCL2? Parallel Computing 85: 141–152.

Azad MAK, Iqbal, Hassan, et al. (2023) An empirical study of high performance computing (HPC) performance bugs. In: 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR). IEEE, 194–206.

Basalov, Vyalukhin (2012) Adaptive routing system for the domestic interconnect SMPO-10G. VANT. Ser.: Mat. Mod. Fiz. Proc 3: 64–70.

Bernholdt, Bosilca, Bouteiller, et al. (2024) Taking the MPI standard and the Open MPI library to exascale. The International Journal of High Performance Computing Applications 38(5): 491–507.

Birrittella, Debbage, Huggahalli, et al. (2015) Intel® Omni-Path architecture: enabling scalable, high performance fabrics. In: 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects. IEEE, 1–9.

Boden, Cohen, Felderman, et al. (1995) Myrinet: a gigabit-per-second local area network. IEEE Micro 15(1): 29–36.

Bossard (2023) Torus-connected toroids: an efficient topology for interconnection networks. Computers 12(9): 173.

Bratek, Szustak, Wyrzykowski, et al. (2023) Reducing energy consumption using heterogeneous voltage frequency scaling of data-parallel applications for multicore systems. Journal of Parallel and Distributed Computing 175: 121–133.

Chu, Awan, et al. (2018) Exploiting hardware multicast and GPUDirect RDMA for efficient broadcast. IEEE Transactions on Parallel and Distributed Systems 30(3): 575–588.

Darche, Dagenais (2023) Low-overhead trace collection and profiling on GPU compute kernels. ACM Transactions on Parallel Computing 11: 1–24.

De Sensi, Di Girolamo, McMahon, et al. (2020) An in-depth analysis of the Slingshot interconnect. In: SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–14.

Dietrich, Winkler, Tschüter, et al. (2021) Enabling performance analysis of Kokkos applications with Score-P. In: Tools for High Performance Computing 2018/2019: Proceedings of the 12th and of the 13th International Workshop on Parallel Tools for High Performance Computing, Stuttgart, Germany, September 2018, and Dresden, Germany, September 2019. Springer, 169–182.

Emmanuel, Moy, Henrio, et al. (2021) S4BXI: the MPI-ready Portals 4 simulator. In: 2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 1–8.

Fiorini, Dagenais (2022) Visualization of profiling and tracing in CPU-GPU programs. Concurrency and Computation: Practice and Experience 34(23): e7188.

Froyd, Mellor-Crummey, Fowler (2005) Low-overhead call path profiling of unmodified, optimized code. Proceedings of the 19th Annual International Conference on Supercomputing. ACM, 81–90.

Goncharuk, Grishichkin, Semenov, et al. (2022) Evaluation of the Angara interconnect prototype TCP/IP software stack: implementation, basic tests and BeeGFS benchmarks. In: Voevodin, Sobolev, Yakobovskiy, et al. (eds) Supercomputing. Springer International Publishing, 423–435.

Hagerty, Melesse Vergara, Tharrington (2023) Studying performance portability of LAMMPS across diverse GPU-based platforms. Concurrency and Computation: Practice and Experience 35(28): e7895.

HECBioSim (2025) HECBioSim HPC Benchmarking Suite. https://www.hecbiosim.ac.uk/access-hpc/hpc-benchmarking-suite. Online; accessed 20-October-2025.

Hoefler, Siebert, Rehm (2007) A practically constant-time MPI broadcast algorithm for large-scale InfiniBand clusters with multicast. In: 2007 IEEE International Parallel and Distributed Processing Symposium. IEEE, 1–8.

Ismagilov, Mukosey, Smirnov, et al. (2025) Tracing of GPU-aware MPI applications: first benchmarks for the Angara interconnect. Proceedings of International Conference Parallel Processing and Applied Mathematics 2024 (PPAM-2024). Springer Nature, 256–270.

Khalilov, Timofeev, Polyakov (2022) Towards OpenUCX and GPUDirect technology support for the Angara interconnect. In: Voevodin, Sobolev, Yakobovskiy, et al. (eds) Supercomputing. Springer International Publishing, 591–603.

Kim, Dally, Scott, et al. (2008) Technology-driven, highly-scalable dragonfly topology. In: 2008 International Symposium on Computer Architecture. IEEE, 77–88.

Kondratyuk, Nikolskiy, Pavlov, et al. (2021) GPU-accelerated molecular dynamics: state-of-art software performance and porting from NVIDIA CUDA to AMD HIP. The International Journal of High Performance Computing Applications 35(4): 312–324.

Kreuzer, Eicker, Amaya, et al. (2018) Application performance on a cluster-booster system. In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 69–78.

Lehr, Iwainsky, Bischof (2017) The influence of HPCToolkit and Score-P on hardware performance counters. Proceedings of the 4th ACM SIGPLAN International Workshop on Software Engineering for Parallel Systems. ACM, 21–30.

Song, Chen, et al. (2019) Evaluating modern GPU interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect. IEEE Transactions on Parallel and Distributed Systems 31(1): 94–110.

Yadav, et al. (2023) Performance implications of async memcpy and UVM: a tale of two data transfer modes. In: 2023 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 115–127.

Lai, Chang (2022) A survey of high-performance interconnection networks in high-performance computer systems. Electronics 11(9): 1369.

Mey, Biersdorf, Bischof, et al. (2012) Score-P: a unified performance measurement system for petascale applications. In: Competence in High Performance Computing 2010: Proceedings of an International Conference on Competence in High Performance Computing, June 2010, Schloss Schwetzingen. Springer, 85–97.

Mills, Adams, Balay, et al. (2021) Toward performance-portable PETSc for GPU-based exascale systems. Parallel Computing 108: 102831.

Mukosey, Semenov, Simonov (2015) Simulation of collective operations hardware support for Angara interconnect. Vestnik Yuzhno-Ural’skogo Gosudarstvennogo Universiteta. Seriya “Vychislitelnaya Matematika i Informatika” 4(3): 40–55.

Mukosey, Semenov, Tretiakov (2024) Graph based routing algorithm for torus topology and its evaluation for the Angara interconnect. Journal of Parallel and Distributed Computing 183: 104765.

Neuwirth (2022) Assessment of the I/O and storage subsystem in modular supercomputing architectures. In: 2022 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 589–596.

Nüssle, Geib, Fröning, et al. (2009) An FPGA-based custom high performance interconnection network. In: 2009 International Conference on Reconfigurable Computing and FPGAs. IEEE, 113–118.

Pavlov, Galigerov, Kolotinskii, et al. (2024) GPU-based molecular dynamics of fluid flows: reaching for turbulence. The International Journal of High Performance Computing Applications 38(1): 34–49.

Petrini, Feng, Hoisie, et al. (2002) The Quadrics network: high-performance clustering technology. IEEE Micro 22(1): 46–57.

Potluri, Hamidouche, Venkatesh, et al. (2013) Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In: 2013 42nd International Conference on Parallel Processing. IEEE, 80–89.

Wang, Huang, et al. (2025) Improving LAMMPS performance for molecular dynamic simulation on large-scale HPC systems. The Computer Journal 68(6): 706–716.

Rohr, Neskovic, Lindenstruth (2015) The L-CSC cluster: optimizing power efficiency to become the greenest supercomputer in the world in the Green500 list of November 2014. Supercomput. Front. Innov.: International Journal 2(3): 41–48.

Ruhela, Manian, et al. (2020) Analyzing and understanding the impact of interconnect performance on HPC, big data, and deep learning applications: a case study with InfiniBand EDR and HDR. In: 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 869–878.

Simonov, Brekhov (2020) Architecture and functionality of the collective operations subnet of the Angara interconnect. In: Vishnevskiy, Samouylov, Kozyrev (eds) Distributed Computer and Communication Networks. Springer International Publishing, 209–219.

Stegailov, Dlinnova, Ismagilov, et al. (2019) Angara interconnect makes GPU-based Desmos supercomputer an efficient tool for molecular dynamics calculations. The International Journal of High Performance Computing Applications 33(3): 507–521.

Thompson, Aktulga, Berger, et al. (2022) LAMMPS - a flexible simulation tool for particle-based materials modeling at the atomic, meso, and continuum scales. Computer Physics Communications 271: 108171.

Tronge, Chen, Grubel, et al. (2023) An HPC-container based continuous integration tool for detecting scaling and performance issues in HPC applications. IEEE Transactions on Services Computing 17: 156–168.

Williams, Brunst (2023) Parallel performance engineering using Score-P and Vampir. Companion of the 2023 ACM/SPEC International Conference on Performance Engineering. ACM, 121–125.

Zhou, Krentel, Mellor-Crummey (2020) Tools for top-down performance analysis of GPU-accelerated applications. Proceedings of the 34th ACM International Conference on Supercomputing. ACM, 1–12.

Zhou, Adhianto, Anderson, et al. (2021) Measurement and analysis of GPU-accelerated applications with HPCToolkit. Parallel Computing 108: 102837.

Zhou, Anderson, Meng, et al. (2022) Low overhead and context sensitive profiling of GPU-accelerated applications. Proceedings of the 36th ACM International Conference on Supercomputing. ACM, 1–13.