623 Tflop/s HPCG run on Tianhe-2: Leveraging millions of hybrid cores

Abstract

In this article, we present a new hybrid algorithm to enable and scale the high-performance conjugate gradients (HPCG) benchmark on large-scale heterogeneous systems such as the Tianhe-2. Based on an inner–outer subdomain partitioning strategy, the data distribution between host and device can be balanced adaptively. The overhead of data movement from both the MPI communication and the PCI-E transfer can be significantly reduced by carefully rearranging and fusing operations. A variety of parallelization and optimization techniques for performance-critical kernels are exploited and analyzed to maximize the performance gain on both host and device. We carry out experiments on both a small heterogeneous computer and the world’s largest one, the Tianhe-2. On the small system, a thorough comparison and analysis has been presented to select from different optimization choices. On Tianhe-2, the optimized implementation scales to the full-system level of 3.12 million heterogeneous cores, with an aggregated performance of 623 Tflop/s and a parallel efficiency of 81.2%.

Keywords

Tianhe-2 HPCG conjugate gradients MIC heterogeneous computing

1. Introduction

The ever growing demand in extreme scale computing continues to push the performance limit of supercomputers to higher levels. Nowadays, heterogeneous architecture has become a competitive choice to build supercomputing systems that run at desired computational rates and fit within an affordable power envelope. Such systems are equipped with both multi-core CPUs and many-core accelerators; examples include Tianhe-2, which is comprised of both Intel Xeon CPUs and Intel Xeon Phi coprocessors, and Titan, which consists of both AMD Opteron CPUs and NVIDIA Tesla K20x GPUs. In order to fully exploit the computing capacity of a heterogeneous computer, people need to study on how to keep the load balance among different processing units, reduce data movement between heterogeneous resources, and conduct vectorization/parallelization in the many-core accelerator, among other considerations.

The high-performance conjugate gradients (HPCG) benchmark, recently introduced to the high-performance computing world, aims to better correlate computation and data access patterns found in many applications today. Compared to the high-performance LINPACK (HPL), which is the de facto standard to rank supercomputing systems on the TOP500 List, HPCG pays closer attention to real-world applications and contains more typical scientific computing operations, such as the Krylov subspace iteration, the sparse matrix–vector multiplication, the overlapping additive Schwarz preconditioner, the geometric multigrid V-cycle, and the symmetric Gauss–Seidel relaxation. It is a challenging task to optimize these operations on a leadership heterogeneous system with high computing power, yet relatively low data-moving capability.

Announced in June 2013, and taking the top place on the TOP500 List since then, Tianhe-2 remains the largest heterogeneous system to date. There are over 3 million hybrid cores available for Tianhe-2, with less than 10% coming from the Intel Xeon CPUs and the rest from the Intel Xeon Phi accelerators based on the many integrated core (MIC) architecture. Questions have arisen on how to leverage these millions of cores in extreme scale computing. We believe methodologies and technologies applied in optimizing HPCG on Tianhe-2 may shed some light on possible answers. After making a first attempt to optimize and scale HPCG on Tianhe-2 without using the MIC accelerators (Zhang et al., 2014), we proposed a hybrid CPU-MIC algorithm (Liu et al., 2014) for single-node HPCG optimization. Recently, we further polished the hybrid algorithm and pushed forward to enable a full-system run on Tianhe-2, achieving 623 Tflop/s, a new performance record on the HPCG List of 2014.

In this article, we will provide a detailed overview of the proposed hybrid algorithms and optimization methodologies applied in optimizing HPCG on Tianhe-2. First, based on an inner–outer subdomain partitioning strategy, we propose a new data partition algorithm in order to make a balanced utilization of host and device and reduce the data movement overhead at the same time. Second, different parallelization and optimization methods are exploited on the most time-consuming kernel in HPCG. Third, we list the various optimization techniques on both CPU and MIC for other performance-critical kernels. Finally, performance results on both a small heterogeneous system and the Tianhe-2 supercomputer are presented and analyzed.

2. Basic HPCG algorithm

Originally proposed by Dongarra and Heroux (2013), the HPCG benchmark focuses on measuring the capability of a supercomputer in solving a large sparse linear system. As detailed in Dongarra and Luszczek (2013), the linear system solved in HPCG is obtained from discretizing the Poisson equation $- Δ u = f$ , defined in a 3-D regular cubic domain with homogeneous Dirichlet boundary conditions. Based on a semistructured mesh with equidistant mesh length in each dimension, the discretization employed in HPCG leads to a second-order 27-point stencil as illustrated in Figure 1. To achieve high performance, it is allowed to use any data formats of both sparse matrices and dense vectors in HPCG, but it is prohibited from taking advantage of the specific structure of the problem to avoid the indirect memory access patterns in all sparse matrix kernels. We use the latest version of HPCG, v2.4, in our research.

Figure 1.

The 3-D 27-point stencil in HPCG.

In HPCG, the linear system $Ax = b$ is solved using a preconditioned conjugate gradients (CG) algorithm as described in Algorithm 1, where M⁻¹ is the preconditioner. In the main loop of Algorithm 1, the major cost consists of a sparse matrix–vector multiplication (SpMV) in line 7; a multigrid preconditioner (MG) in line 3; three vector dot products in lines 4, 7, and 10; and three vector updates (WAXPBY) in lines 6, 8, and 9.

Algorithm 1 Preconditioned CG for $Ax = b$
InputA, b, x₀, it_max, ε
1: $r_{0} \leftarrow b - A x_{0}$
2: $for i = 0, 1, \dots, i t_{\max} do$
3: $z_{i} \leftarrow M^{- 1} r_{i}$
4: $s_{i} \leftarrow (r_{i}, z_{i})$
5: $if (i = 0) p_{i} \leftarrow z_{i}$
6: $else p_{i} \leftarrow z_{i} + (s_{i} / s_{i} {s_{i}}_{- 1} {s_{i}}_{- 1}) {p_{i}}_{- 1}$
7: $α_{i} \leftarrow s_{i} / s_{i} (p_{i}, A p_{i}) (p_{i}, A p_{i})$
8: $x_{i + 1} \leftarrow x_{i} + α_{i} p_{i}$
9: $r_{i + 1} \leftarrow r_{i} + α_{i} p_{i}$
10: $if (∥ r_{i + 1} ∥_{2} / ∥ r_{i + 1} ∥_{2} ∥ r_{0} ∥_{2} \leq ε ∥ r_{0} ∥_{2} \leq ε) break$ ;
11: end for
Output $x_{i + 1}$

The preconditioner M⁻¹ used in HPCG is based on a V-cycle geometric multigrid method, as shown in Algorithm 2. In the algorithm, a symmetric Gauss-Seidel (SymGS) method is used as the coarsest level solver and the pre- and post-smoothers as well. Here SymGS(x, r) denotes a SymGS iteration, where r is the right-hand side and x is the initial guess of the solution vector. More specifically, SymGS(x, r) is equivalent to:

x \leftarrow {(L + D)}^{- 1} (r - U_{x}), / / forward sweep and

(1)

x \leftarrow (U + D)^{- 1} (r - Lx), / / backward sweep,

(2)

where D is the diagonal of A, and L and U are the strict lower and upper triangular parts of A, respectively. It is worth noting that the ordering of unknowns affects the behavior of the SymGS smoother. By default, a natural ordering of the grid points is used in HPCG. A user-defined ordering is allowed, including the red–black reordering techniques to improve the locality or increase the concurrency of the SymGS smoother. But caution has to be applied when using these methods because both the time cost on reordering the matrix and the possibly increased number of CG iterations will be counted as performance penalties.

Algorithm 2

V-cycle for $x^{h} = MG (r^{h})$ .

Input: $r^{h}$
1: if on the coarse level then
2: $x^{h} = SymGS (0, r^{h})$	⊲Coarse-level solver
3: else
4: $x^{h} = SymGS (0, r^{h})$	⊲Pre-smoothing
5: $r^{2 h} = I_{2 h}^{h} (r^{h} - A^{h} x^{h})$	⊲Restriction
6: $x^{2 h} = MG (r^{2 h})$	⊲Recursion
7: $x^{h} = x^{h} + I_{2 h}^{h} x^{2 h}$	⊲Prolongation
8: $x^{h} = SymGS (x^{h}, r^{h})$	⊲Post-smoothing
9: end if
Output: $x^{h}$

HPCG uses a 3-D domain decomposition strategy for distributed parallel computing. The whole domain is partitioned into smaller subdomains, each of which is assigned to a message passing interface (MPI) process. Two types of MPI communications exist in HPCG. One is the MPI allreduce communication required in each vector dot product. The other is the neighboring communication in the SpMV and SymGS sparse matrix kernels used to exchange the boundary data (i.e. halos) of each subdomain with its neighbors.

3. Hybrid CPU-MIC algorithm

To maximize the utilization of different computing resources in a heterogeneous system, a hybrid algorithm for data partitioning is required. The algorithm is not only responsible for assigning tasks among the host and device but also used to coordinate the asynchronous computation/communication between them. In this section, we propose and analyze a new data partitioning strategy, along with corresponding methods to reduce the data movement overhead.

3.1. Inner–outer subdomain partition

In each subdomain, there are two types of grid points. For points dependent on the halo data, the computation cannot be started until finishing the halo exchange procedure. The rest are, however, halo-independent points that can be processed locally without waiting on any MPI communication. Based on these facts, we propose an inner–outer partitioning method. In the new partitioning method, a subdomain is divided into one or several regular inner tasks and an irregular outer task. Each inner task is processed by an accelerator, while the whole outer part is processed by the CPU. Figure 2(a) shows a schematic diagram of the inner–outer partition for the case of three MIC devices.

Figure 2.

The inner–outer subdomain partitioning methods for the case of three MIC devices. Only the x–z cross section is shown for simplicity.

There are several advantages of the inner–outer partitioning strategy.

The inner task assigned to each accelerator is halo independent, therefore the computation done on the accelerator can be started immediately without waiting for the halo data from remote nodes to arrive.

The ratio of the inner–outer tasks can be adjusted according to the computing capacity of the host and device to improve the load balance between them.

Direct data transfer between accelerators, which may not be available on some platforms (e.g. for MIC devices when running on offload mode), is avoided by leaving slices for the CPU between inner tasks.

The required volume of transferred data between the host and device is roughly in proportion to size of the inner–outer interface, which is small as compared to the whole subdomain.

On Tianhe-2, since there are three MIC accelerators in each computing node, we have two possible ways to perform the inner–outer partition. One is the node-wise partition, as shown in Figure 2(a), in which each computing node is assigned one MPI process corresponding to a subdomain that is further partitioned to three MIC tasks and one outer task. The other method uses the accelerator-wise partition, where each computing node is assigned three MPI processes, with each one corresponding to a subdomain. The subdomain in the accelerator-wise parition is smaller as compared to the one in the node-wise partition, with each one partitioned to only one inner task for the MIC device and one outer task for the CPU. A schematic diagram of the accelerator-wise inner–outer partition is shown in Figure 2(b). There is a trade-off in selecting from the two alternatives of the inner–outer partitioning method. The node-wise version, successfully applied stencil computations (Xue et al., 2014) and later for the single-node optimization of HPCG (Liu et al., 2014), seems preferable to the accelerator-wise version because the total number of MPI processes is minimized. But further analysis of the cost of MPI communication suggests the opposite.

Suppose there are p computing nodes available, then the total number of MPI processes is p for the node-wise partition and 3p for the accelerator-wise partition. Further assume that in either case, the total number of points in the MIC task is the same, which is usually proportional to the local memory capacity of the MIC device. Then for the accelerator-wise partition, it is natural to set the size of the MIC task to be a regular cube with length n to minimize communication overhead. And for the node-wise partition, we denote the length of a MIC task to be n_x, n_y, and n_z in each dimension. Since the volume of the MIC task is the same in the two versions, we have $n_{x} n_{y} n_{z} = n^{3}$ . It is worth noting that the thickness of the subdomain boundary is much smaller than the length of the MIC task; we therefore ignore it in the following analysis. And for simplicity, we only count the communication of any given subdomain to its six immediate neighbors, since the amount of communication to all other neighbors is negligible. We then compare the communication amount from each subdomain in the two cases, denoted as S₁ and S₂, respectively. We have $S_{1} = 6 n_{x} (n_{y} + n_{z}) + 2 n_{y} n_{z}$ and $S_{2} = 6 n^{2}$ . Solving the optimization problem:

R = min \frac{S_{1}}{S_{2}} = min (\frac{n}{n_{y}} + \frac{n}{n_{z}} + \frac{1}{3} \frac{n_{y}}{n} \frac{n_{z}}{n})

(3)

leads to $n_{x} = 3^{- 2 / 3} n$ , $n_{y} = n_{z} = 3^{1 / 3} n$ , and $R = 3^{2 / 3} \approx 2.08$ . The above analysis shows that when comparing the node-wise partition with the accelerator-wise version, although the total communication amount over all MPI processes can be reduced by as much as $1 - (p S_{1} p S_{1} 3 p S_{2} - 3 p S_{2}) = 1 - R / R 3 - 3 \approx 31 %$ , the averaged communication amount per MPI process is increased by at least $R - 1 \approx 108 %$ . Note that in either case, the volume of the MIC task is assumed to be the same, which means that the time spent on computation is nearly a constant. The rapidly increased amount of MPI communication in the node-wise partition may eventually introduce substantial overhead that cannot be overlapped by local computation.

Another disadvantage of the node-wise partition comes from the allocation overhead, which introduces penalty to the final HPCG performance. In a previous work (Liu et al., 2014), we found that in the current offload programming environment, the allocation procedure on the three MIC devices cannot be completely done in a truly multi-threaded asynchronous way. The measured cost of doing so grows almost linearly as the number of MIC devices increases. When using the accelerator-wise alternative instead, allocating on the three MIC devices can be done independently with almost no extra cost. Based on the above facts, we choose to use the accelerator-wise version of the inner–outer partitioning method in what follows.

3.2. Load balance between host and device

In the inner–outer partitioning method, all the interior boundaries of a subdomain are of the same thickness on each grid level of the V-cycle multigrid preconditioner. Denote the length of the subdomain to be N_x, N_y, and N_z in the three dimensions, respectively. Let L be the thickness of the outer part at the finest grid level. The number of points in the MIC task is then $(N_{x} - 2 L) (N_{y} - 2 L) (N_{z} - 2 L)$ . Further denoting R_effective as the effective ratio of the number of points assigned to MIC and R_theoretical as the theoretical ratio of the peak performance contributed by MIC, we have:

R_{effective} = \frac{(N_{x} - 2 L) (N_{y} - 2 L) (N_{z} - 2 L)}{N_{x} N_{y} N_{z}}

(4)

and

R_{theoretical} = \frac{flop s_{mic}}{flop s_{cpu} + flop s_{mic}} .

(5)

To achieve an ideal load balance between CPU and MIC, we expect that the effective ratio is larger than, but close to, the theoretical ratio; that is, $R_{effective} ≳ R_{theoretical}$ . Otherwise, the work assigned to MIC will be inadequate. On Tianhe-2, since there are a total of 24 CPU cores and 3 MIC devices in each computing node, we may assign as many as 8 CPU cores to work along with each MIC card for each subdomain. The contributed performance from the 8 CPU cores is 140.8 Gflop/s and that from the MIC device is 1003.2 Gflop/s. Substituting these values into equation (5), we get $R_{theoretical} \approx 87.7 %$ . In practice, we leave at least one CPU core to handle the MPI communication of each subdomain, and the actual value of R_theoretical is slightly larger than 87.7%. We remark that if the memory bandwidth measured by the STREAM benchmark is used instead of the peak performance in the calculation of R_theoretical, the result will be around 87.8%, which does not change the conclusion we will make in the sequel.

Figure 3 lists some typical values of R_effective calculated according to equation (4) in the case that the subdomain is a regular cube with length N and boundary thickness L. From Figure 3, we observe that larger values of N help increase R_effective to the ideal value, but larger values of L dramatically worsen it. When the subdomain is not regular, then it is easy to verify that the effective ratio is smaller than R_effective in equation (4), with $N = {(N_{x} N_{y} N_{z})}^{1 / 3}$ . So the above observations still apply.

Figure 3.

The effective ratio (R_effective) for different subdomain size (N³) and boundary thickness (L).

For the boundary thickness L, we need to set it carefully to obtain a consistent partition on each level while maintaining a good load balance between the host and device at the same time. Since HPCG uses a four-level MG with a fine to coarse ratio of 2, it is natural to consider setting L on the finest level to be a multiple of 8. Figure 4(a) shows the data partition done on each level of the V-cycle MG when $L = 8, 16, 32, \dots$ . In practice, considering the memory capacity of the MIC device, the size of the subdomain usually falls within the range of $128^{3} \leq N_{x} N_{y} N_{z} \leq 256^{3}$ . From Figure 3, we can see that even for the smallest allowable value of the boundary thickness of L = 8, the effective ratio of an MIC task is still below the ideal. In order to improve the load balance, we have to sacrifice the grid consistency on the coarsest level by setting L = 4. In this case, a typical subdomain size such as 176×176×176 results in an effective ratio of 0.87, which is very close to expected. The revised task partitioning scheme is shown in Figure 4(b). A CPU-only partition is employed on the coarsest level instead. As a trade-off, some extra overhead is introduced to transfer data between CPU and MIC on the coarsest level. However, the overhead is negligible because of the relatively small data size and can be further reduced by carefully exploiting the data dependency and the execution order of operations done on the coarsest level.

Figure 4.

The V-cycle multigrid preconditioner and the corresponding hybrid inner–outer partitioning methods for different choices of the boundary thickness L on the finest level.

3.3. Reducing data movement overhead

There are two types of data movement that may significantly affect the overall performance of HPCG. One is the MPI neighboring communication for halo exchange among different computing nodes. The other is the local peripheral component interconnect express (PCI-E) transfer between CPU and MIC within each heterogeneous node. Both types of data movement can benefit from the proposed inner–outer subdomain partitioning method due to the reduced amount of data dependency between host and device. To further reduce the overhead of the data movement, we conduct some additional optimizations as follows. First, we notice that both types of data movement can be safely removed in the SymGS kernels for pre-smoothing and coarsest-level solving because the initial guess is 0. Then we go through all kernels that still require the two types of data movement and list them in Table 1. In order to seek an opportunity to overlap the data movement with computation, we also list in the table the preceding kernels that could be fused with the current ones. Without the loss of generality, in the following discussion, we take SpMV in the main CG loop as an example of the current kernel, which can be fused with the preceding kernel WAXPBY.

Table 1.

List of all kernel pairs that can be fused to reduce data movement overhead in HPCG.

Preceding kernel	Current kernel	Calling position
WAXPBY	SpMV	CG
SymGS (pre-smoother)	SpMV	MG (level 0–1)
Prolongation	SymGS (post-smoother)	MG (level 0–2)

For analysis, we divide each type of the data movement into four steps: (1) packing data, (2) starting data movement, (3) moving data, and (4) finishing data movement. Note that when using the blocking or synchronized data movement subroutines on the MPI or PCI-E level, the last three steps are not separated. In practice, we use the nonblocking and asynchronous functions to seek a better chance to reduce the overhead of data movement.

3.3.1. The unfused schedule

In the reference implementation of HPCG, the exchange of halo data is done at the beginning of the current kernel, and the computation is not started until the MPI communication is finished. Based on the reference implementation of HPCG, a simple hybrid schedule, as shown in Figure 5(a), can be designed by taking into account the four-step decomposition of the MPI communication. After packing the halo data in step (1), nonblocking point-to-point communications are posted in step (2). During the MPI communication of step (3), intra-node data transfers between host and device via duplex PCI-E transfer are inserted in an asynchronous way. Finally, in step (4), a synchronization point is introduced to finish the nonblocking MPI point-to-point communication. This simple schedule successfully allows all the heterogeneous units to work simultaneously and overlaps a portion of MPI neighboring communication with computation. However, at the beginning of the current kernel, before doing computation on the inner task, the MIC device has to wait for the end of a number of operations, including the packing and starting of the MPI communication, and all four steps to transfer data from host to device. The overhead introduced by these operations may highly degrade the sustained performance on the MIC accelerator.

Figure 5.

Heterogeneous schedules to reduce the data movement overhead from both MPI neighboring communication and PCI-E data transfer. The length of the dashed line with double-sided arrows represents the total consumed time. The “#” marked blocks on the CPU represent the time spent waiting for the MIC to finish the computation, and the blank space on the MIC indicates the idle time.

3.3.2. The fused schedule

Assuming that the computation done on CPU and MIC requires exactly the same time, when the unfused schedule in Figure 5(a) is used, MIC has to wait for the CPU to finish processing the SpMV kernel because the computation of the outer part is started later than the inner one. That is why we set $R_{effective} ≳ R_{theoretical}$ in Section 3.2 to keep load balance, which leads to the corresponding waiting time on the CPU side. The waiting time can be shortened by adjusting the ratio of the inner and outer parts to lessen the load on CPU. By properly fusing the current kernel with the preceding one, we then have a chance to exploit the CPU waiting time in the preceding kernel to pack and transfer the data from CPU to MIC and start the neighboring MPI communication as early as possible. The fused schedule is presented in Figure 5(b). In the fused schedule, the non-computing time on MIC is greatly reduced, which only includes packing and starting the data transfer from device to host, and possibly some extra time to wait for the data from the host to arrive. All other data movement related operators on the CPU are fused into the preceding kernel.

4. Parallelization and optimization of SymGS

The SymGS kernel in the MG preconditioner consumes a large portion of the running time of HPCG. Two major difficulties exist in optimizing and parallelizing SymGS. One is the strong data dependency required to update values of grid points in the SymGS iteration, and the other is the large memory footprint required to repeatedly access the large sparse matrix and the dense vectors. The strong data dependency makes SymGS hard to parallelize. Although a red–black relaxation can be used to enhance the concurrency, it often comes along with a performance penalty due to the increased number of iterations. The high cost of memory access in SymGS, on the other hand, calls for a carefully designed algorithm, possibly in combination with the red–black reordering method, to fuse certain operations such that the locality can be effectively improved.

4.1. Forward–backward fused SymGS

There are two sweeps in the SymGS smoother, namely, the forward and the backward sweeps as defined in equations (1) and (2), respectively. For a grid with N unknowns, the number of nonzeros in the sparse matrix A is almost 27N in total. On a system with 64-bit doubles and 32-bit integers, each sweep of SymGS involves about 12 ×27N bytes memory access from A and at least 16N bytes from the two vectors r and x, but only performs about 54N floating-point operations. The resulting computation to memory ratio of SymGS is less than 0.17 flop/byte, which makes HPCG seriously memory bound.

A key to improving the performance of SymGS is to increase the temporal locality of the data access. Because the forward sweep is immediately followed by the backward sweep in SymGS, it is possible to reuse the data of the sparse matrix by properly fusing the two sweeps. The fusion technique can be incorporated with any red–black relaxation in SymGS. For example, we do a red–black reordering only along dimension z. The update of the z_i plane only depends on the z_i plane itself and the two adjacent planes of the other color. We draw a dependency graph in Figure 6, where each node represents the update of a colored plane in the z dimension. The processing steps of the original red–black relaxation for SymGS are also shown in the figure. In each step of the red–black SymGS, concurrency is increased because all planes of the same color in the same “horizontal” level of the graph can be processed in parallel. To increase the temporal locality, we expect to reuse the result of the forward sweeps as early as possible. In the dependency graph, the data reuse can be improved by exploiting the graph in the “vertical” direction. For example, after finishing the forward sweeps of z₀ and z₂ red planes, we may perform the forward sweep of z₁ black plane, immediately followed by the backward sweep of the same z₁ black plane, then the backward sweep of the z₀ red plane. In this way, the computed result in the z₀ plane can be obtained as early as possible and the data reuse distance can be greatly shortened. Based on this observation, we propose a forward–backward fused SymGS (Zhang et al., 2014) as presented in Algorithm 3. It is worth noting that the fusing technique for the SymGS in HPCG, with certain generalizations, can be applied in other applications of stencil computing.

Figure 6.

Dependency graph among xy planes in SymGS after a red–black relaxation along dimension z.

Algorithm 3

Forward–backward fused SymGS

1 Forward sweep along

z_{0}

for i = 2, \dots, N_{z} - 2, step 2 do

3: Forward sweep along

z_{i}

4: Forward sweep along

z_{i - 1}

5: Backward sweep along

z_{i - 1}

6: Backward sweep along

z_{i - 2}

7: end for

8: Forward sweep along

z_{N_{z} - 1}

9: Backward sweep along

z_{N_{z} - 1}

10: Backward sweep along

z_{N_{z} - 2}

4.2. Parallelization of fused SymGS

Based on the forward–backward fused SymGS in Algorithm 3, we propose some architecture-aware parallelization methods, with the first two for the inner task of size $n_{x} \times n_{y} \times n_{z}$ and the last one for the outer task of size $N_{x} \times N_{y} \times N_{z}$ and boundary thickness L.

4.2.1. fusez-rby

This method employs another red–black relaxation along dimension y to enable multi-threading parallelization. Each thread performs computation in a number of successive x pencils, as shown in Figure 7(a), where circled numbers indicate the updating order. In practice, we spare one physical core in the MIC accelerator for the Intel Many-core Platform Software Stack (MPSS) and offload-related services and enable the four hardware threads on all the rest of the cores to maximize the performance. Considering the Intel Xeon Phi 31S1P in Tianhe-2, $n_{y} / n_{y} 2 2$ parallelism is obtained with a working set size of $3 n_{x} n_{y} / 3 n_{x} n_{y} 224 224$ per thread and $3 n_{x} n_{y} / 3 n_{x} n_{y} 56 56$ per core. The size of the last-level cache is expected to be large enough to hold all the needed data of the sparse matrix and vectors. For example, when $n_{x} = n_{y} = n_{z} = 192$ , the last-level cache should be around 671 KB per core at least, which is greater than the available capacity in the Intel Xeon Phi 31S1P device. Because a data dependency between the update of xy planes and x pencils exists, a total of 4n_z barriers are needed when implementing the algorithm in OpenMP. The overhead of synchronization among such a large number of threads is substantial. Due to the fact that the update of points assigned to a given thread only depend on neighboring points owned by adjacent threads, it is not necessary to perform a complete synchronization among all threads. We design a lightweight tool to conduct efficient thread communications like this. This tool supports two types of thread synchronizations. The first one, designed for the synchronization among neighboring threads, can be easily implemented since each thread only depends on its two neighbors. The other one, designed for the group synchronization to be used later, is implemented based on the tournament barrier algorithm that can be found in Graunke and Thakkar (1990) and Mellor-Crummey and Scott (1991).

Figure 7.

Different parallelization methods of SymGS with red–black relaxing and forward–backward sweep fusing in dimension z.

4.2.2. fusez-rby-2level

For a task size with a small n_y, fusez-rby suffers from both limited parallelism and load imbalance. On an Intel Xeon Phi 31S1P, there are 224 threads available on the platform, with every four running concurrently on the same core. We thus propose a two-level parallelization method, referred to as fusez-rby-2level, to split threads into groups according to their affinity. Let the two-tuple (gid, tid) identify a certain thread, where gid is the group id and tid is the local thread id in the thread group. Since the four hardware threads of the same core share the on-chip resources such as L2 cache and L2 translation lookaside buffer (TLB), we may assign them to the same group by setting $gid = id / id 4 4$ and $t i d = i d % 4$ , where id is the global id of the thread. Therefore, we can unroll along the z dimension four times and spread the threads as shown in Figure 7(b). In the two-level method, different xy planes are assigned to threads within a group, and the update of an xy plane is distributed across different thread groups instead. In this case, 2n_y parallelism is obtained at the expense of a significantly increased working set, the size of which is $3 n_{x} n_{y} / 3 n_{x} n_{y} 56 56$ per thread and $9 n_{x} n_{y} / 9 n_{x} n_{y} 56 56$ per core. By organizing more threads to a group, the parallelism can be further increased. Correspondingly, it requires a larger cache size to hold the working set, otherwise cache overflow or conflict may occur. In terms of synchronization, the number of OpenMP barriers in fusez-rby-2level is n_z, which is much smaller than that in the fusez-rby method. Besides, the overhead of the synchronization required both within and among thread groups can be also reduced using our customized thread communication tool.

4.2.3. fusez-rby for the outer task

Different from the MIC accelerator, the CPU has a very large L3 cache to handle the increased working set resulting from the fusion of forward–backward sweeps in SymGS. But Algorithm 3 cannot be directly applied due to the irregular shape of the outer task. To apply it, we divide the outer part into three regular parts. The top and bottom parts are regular cubic blocks. The medium part is the most complex, of which each xy plane is a hollowed rectangular. We store the outer part in a contiguous region in row-major order and propose a variant of Algorithm 3 to fuse the forward and backward sweep among the consecutive three parts. The simplified pseudo code is shown in Figure 8, along with the parallelization method used in each plane. Analogously, a red–black relaxation along dimension y is employed to enable parallelization for both types of planes. On the hollowed plane, we distribute the computational loads evenly by the number of data points among all CPU threads.

Figure 8.

Fused forward–backward sweeps in SymGS when processing the outer part. Here L is the boundary thickness of inner–outer partition.

4.3. Block multicolor reordering for SymGS

The application of red–black reordering destroys the inherent dependency among data points according to a certain pattern, thus leading to a convergence penalty. For example, when applying the previously mentioned red–black coloring along dimension z in SymGS, the number of CG iterations increases from 50 to about 67. When using a red–black reordering in dimension y without fusion, denoted as rby, the exploited parallelism is usually too low for the MIC platform and often suffers from load imbalance. To assess the trade-off between the improved concurrency and the convergence penalty, we propose a variant of the block multicolor reordering (Iwashita and Shimasaki 2003). The proposed variant is also based on the two-level parallelization idea, and it is feasible to choose another group size. As shown in Figure 9, we first distribute the tasks along dimension y among thread groups and then along dimension z among threads within a group. Afterward, a block four-color reordering is used within a thread to evenly split the data points into four parts along two dimensions and designate four colors. Points with the same color are smoothed one after another as the original order in the sequential SymGS. In this way, the required number of iterations are around 57 in most cases. The corresponding part in sparse matrices and vectors related to a thread can be stored contiguously to further improve the locality. Similar fast synchronization functions are also developed for this case.

Figure 9.

Block multicolor reordering relaxation applied in SymGS on the MIC platform.

All of the above parallelization methods are implemented in our optimized HPCG code. We choose, by experiment, the most appropriate version for each grid level to adapt with different task sizes.

5. Other optimization details

In addition to the parallelization and optimization of SymGS, we also exploit various optimization techniques for those performance-critical kernels on both CPU and MIC. These techniques are broadly employed in many references, therefore we only mention them briefly.

5.1. Optimizations on MIC

5.1.1. Sparse matrix storage format

The standard Compressed Sparse Row (CSR) format is employed in the HPCG reference implementation. The format is widely used but usually inefficient and unable to automatically exploit the regular patterns in the sparse matrices. Among various available storage formats, the ELLPACK format and its variants, such as the sliced ELLPACK (SELLPACK) format, are commonly applied on both graphics processing unit (GPU) and MIC (Monakov et al., 2010; Liu et al. (2013), because they not only save the array to store the row index of nonzeros but also are more efficient for single instruction, multiple data (SIMD) vectorization. To optimize sparse matrix kernels on MIC, we use a simplified SELLPACK format with a slice height of 16 to exploit the potential locality of the input vector in SpMV, where the slice height is determined by the 32-bit integer SIMD width of the MIC architecture. The matrix within a slice is stored in a column-major order. Note that we only make use of the SELLPACK format to split the dimension x into several blocks, and padding is done if the size of dimension x is not a multiple of 16. We can manually vectorize the code with the 512-bit wide SIMD intrinsics.

5.1.2. Vectorization of SpMV

In SpMV, elements of the input vector are collected by a gather instruction according to the column indices. All the arrays are well aligned on cache line boundaries so that aligned load and store instructions are used in most cases. The output vector is overwritten completely by the computed results in SpMV. Thus, the streaming store instructions are used instead to avoid wasting memory bandwidth by reading the original content from memory to cache.

5.1.3. Vectorization of SymGS

The optimized code of SymGS resembles greatly with that of SpMV. The points along dimension x are preserved for SIMDization, which is the reason we only parallelize SymGS along dimensions y and z in Section 4. In order to vectorize SymGS, we further apply a red–black relaxation along dimension x and rearrange the points with the same color contiguously in memory. As a result, every 16 points with the same color are grouped into a slice and smoothed simultaneously by a series of vectorized instructions, 236 exactly in our implementation.

5.2 Optimizations on CPU

5.2.1. Sparse matrix storage format

As for the outer part, the corresponding sparse matrix is stored in the ELLPACK format in row-major order because of the nonuniform length of dimension x. However, the stored number of nonzeros per row is padded to be a multiple of the double precision SIMD width (4 for AVX) to ensure the usage of efficient load and store instructions.

5.2.2. Vectorization

Due to the absence of the gather instruction on AVX-based architectures, we access the elements of the input vector using scalar instructions and then compose them to an SIMD vector. The reduction among the four doubles in a vector register is done by scalar instructions as well. The above facts suggest that vectorization on CPU will be less efficient than on MIC, which is consistent with our test results to be shown in Section 6.1. Apart from this, the vectorization in other cases is more or less the same as the one used on the MIC architecture.

5.3 Other common optimizations

In SpMV and all dense vector kernels, there is no data dependency in the computing order, and very little opportunity for data reuse. Parallelizations of these kernels are simply implemented by OpenMP pragma statements with a static schedule strategy. Each thread is bounded to a specified core according to the compact affinity. On the MIC side, we enable the 2MB pages, instead of the default 4 KB, to improve the TLB hit rate as well as reduce the overhead of data transfer and memory allocation.

In addition, we observe that the residual computed by SpMV in the MG is immediately used in a restriction. As a result, only 1/8 of the residual vector is transferred to and used by the coarser grid. Therefore, we fuse these two operations into one to eliminate the redundant 7/8 calculations as done in (Williams et al., 2012). The major advantage of this technique is the reduction of the computational cost. And the major disadvantage, on the other hand, is that it results in a more complicated memory access pattern (i.e. stride access instead of contiguous) on the fine levels. Optimizations such as loop unrolling and manual prefetching are employed to further enhance the overall performance.

6. Performance and analysis

Two heterogeneous systems are used in the tests, including a smaller one for single-node performance evaluation and, a much larger one, the Tianhe-2, for million-core scalability runs. Table 2 lists the detailed configuration of the two systems. On the small cluster, 8 CPU threads from the same socket and 240 MIC threads are working together on an inner–outer subdomain. On Tianhe-2, we enable 224 threads on each MIC accelerator and assign 6 CPU cores to cooperate with it. Consequently, the rest of the CPU cores in each computing node are left to handle the MPI communication.

Table 2.

The configuration of the two heterogeneous systems.

	Small heterogeneous computer		Tianhe-2
	Host	Device	Host	Device
Name	Intel Xeon E5-2650	Intel Xeon Phi 7110P	Intel Xeon E5-2692	Intel Xeon Phi 31S1P
Frequency	2.0 GHz	1.1 GHz	2.2 GHz	1.1 GHz
Sockets/cores/threads	2/8/16	1/61/244	2/12/24	1/57/228
SIMD width	4 double/8 integer	8 double/16 integer	4 double/8 integer	8 double/16 integer
Peak Gflop/s in D.P.	256	1073.6	422.4	1003.2
L1/L2/L3 cache	32 KB/256 KB/30 MB	32 KB/512 KB/–	32 KB/256 KB/30 MB	32 KB/512 KB/–
Memory capacity	32 GB	8 GB	64 GB	8 GB

D.P.: double-precision.

6.1. Effects of parallelizing SymGS

To compare the various parallelization and optimization choices of SymGS in Section 4, we conduct some experiments to assess their effects here. A simple four-color reordering in dimension y and z, rbzy, is also implemented and tested for comparison. The testing results are shown in Figure 10.

Figure 10.

Raw performance of SymGS on the finest grid when using various parallelization methods. The labels with suffix “: inner” represent the corresponding parallelization method used for the inner part while “: outer” for the outer part.

We observe that fusez-rby always outperforms rby for either the regular inner part or the irregular outer part. The averaged speedup of fusez-rby over rby is around 1.16× on MIC and around 1.37× on CPU. The higher speedup on CPU is expected because the CPU has a large L3 cache and MIC does not. From the algorithmic perspective, the fused algorithm can reduce the memory traffic by half, but the performance gain on the MIC side is greatly limited by the small L2 cache size.

A further observation indicates that in the case where the subdomain has a small length in dimension y, fusez-rby may suffer from the low degree of parallelism and result in load imbalance, even when a balanced loop distribution among different cores is enforced. The two-level parallelization method fusez-rby-2level can be used to improve the load balance. However, when the length in dimension y is large enough, the extra unrolling along dimension z in the two-level method leads to a larger working set, which may overflow the L2 cache.

Comparing in terms of parallelism, the general four-color reordering method rbzy provides the highest degree of parallelism, which is $n_{y} n_{z} / n_{y} n_{z} 4 4$ for any task size, among all tested methods. For rbzy, the data distribution among threads is more even than the block multicolor reordering method in Section 4.3, which is referred to as block-rbzy in the figure. However, the locality of rbzy is the worst among all methods because it requires plenty of discontiguous data access and is incapable of taking advantage of the task structure. Besides, not all the barriers in rbzy can be optimized with our customized thread communication tool. That is why it delivers a lower performance than block-rbzy does for most test cases.

In view of the convergence rate, two of the five tested methods, rby and block-rbzy, are able to achieve the expected convergence in around 57 iterations, and the rest all lead to much larger numbers of iterations. Among the two, block-rbzy can be considered as the two-level parallelization version of rby. Because of the fast convergence rate, the two parallelization methods are chosen to parallelize SymGS on the finest grid level for MIC. The other three methods are used to adapt with the size of the problem on other levels. As for the outer task on the CPU side, the fused method, fusez-rby, is always used since it barely affects the overall convergence rate. Because of the small data amount, a basic red–black relaxation method of SymGS along dimension z, rbz, is particularly applied to parallelize the coarsest-level SymGS. A complete list of the parallelization methods of SymGS on different grid levels for both MIC and CPU are presented in Table 3.

Table 3.

A complete list of the parallelization methods of SymGS applied on different grid levels for both MIC and CPU.

	MIC	CPU
Grid level 0	block-rbzy/rby	fusez-rby
Grid level 1	fusez-rby/fusez-rby-2level/rbzy	fusez-rby
Grid level 2	fusez-rby/fusez-rby-2level/rbzy	fusez-rby
Grid level 3	–	rbz

6.2. Single-node performance

We fix the subdomain size to 136 × 176 × 176 and evaluate the effect of different optimization techniques by incrementally adding them in our hybrid kernels. The results are illustrated in Figure 11. In the baseline implementation, the sparse matrix format is replaced with the optimized ones, but the SIMD vectorization is not applied. We analyze the results in the figures as follows.

Figure 11.

The raw performance breakdown in HPCG with different optimizations applied.

6.2.1. Vectorization (simd)

Compared to the baseline version, the SIMD vectorization leads to an averaged speedup of 4.92× on MIC and 1.23× on CPU. The speedup on MIC is high due to the vectorization-friendly SELLPACK sparse matrix format and the regular shape of the inner task. Note that some code transformations are done in sparse matrix kernels for efficient vectorizing.

6.2.2. OpenMP parallelization (omp)

Based on the red–black relaxation and other optimizations for the storage of matrices and vectors, the speedup due to OpenMP multi-threading is remarkable, which is averaged to 53.57× on MIC and 2.40× on CPU.

6.2.3. Customized thread communication (thread-comm-opt)

There is a notable 24% speedup on MIC when replacing the original OpenMP synchronization with our customized lightweight thread communicating tool in SymGS. It indicates that the synchronization overhead among hundreds of MIC threads is substantial and critical to the performance.

6.2.4. Fused schedule (schedule)

Using the fused schedule presented in Section 3.3, the data movement overhead is reduced, leading to another 4% improvement of the final performance.

6.3. Million-core performance

A full-system scale test is conducted on Tianhe-2 to 16,000 heterogeneous nodes. The results, including the aggregated HPCG performance and the parallel efficiency, are shown in Figure 12. In the test, the subdomain size is also fixed to 136 × 176 × 176. A single-node performance of 48 Gflop/s is achieved on Tianhe-2. Compared to the raw performance of 58 Gflop/s, the performance penalty is around 14% due to the convergence degradation and another around 5% due to the preparation of optimized data structures. In a previous work (Park et al., 2014b), a single-node performance of 46 Gflop/s on Tianhe-2 was reported, where an MIC-only approach was used. The performance improvement of our hybrid CPU-MIC algorithm over the MIC-only approach is below ideal due to the offload overhead for the cooperative heterogeneous computing and the fact that CPU is responsible for all MPI communication and therefore not working at the full capacity. But it is sufficient to help sustain a better performance at the extreme scale. The hybrid implementation successfully scales to 16,000 computing nodes, with an aggregate performance of 623 Tflop/s and a parallel efficiency of 81.2%.

Figure 12.

Aggregate performance and weak scalability of HPCG on Tianhe-2. Three MPI processes are running on a single node, with each one controlling a subdomain.

7. Related work

Stencil computation, often found in many scientific computing applications, is widely studied on x86 CPUs and emerging many-core architectures (Datta et al., 2008; Wellein et al., 2009; Wonnacott, 2000), of which the work mainly concentrates on methods such as temporal blocking and time skewing. A large number of optimization techniques, such as software pipelining, register blocking, row or column sorting, and dynamic load scheduling, are studied in Vuduc et al. (2005), Williams et al. (2009), Monakov et al. (2010), and Liu et al. (2013) to optimize SpMV, the basic operation in a sparse iterative solver. The above work plays an important role in guiding our optimizations of all sparse matrix kernels in HPCG.

Preconditioners are now commonly used to accelerate the convergence of iterative solvers. Thus, much work has been done on parallelizations and optimizations of preconditioners such as MG and SymGS on various platforms. For example, Kowarschik et al. (2000) and Wallin et al. (2006) employ fused updating of red and black points, temporally blocking, and fine-grained locking to optimize and parallelize Gauss–Seidel on multi-core CPUs. Courtecuisse and Allard (2009) focus on optimizing Gauss–Seidel on both multi-core CPUs and GPUs, while Williams et al. (2012) have made an important contribution on improving the performance of the geometric multigrid method on the Intel Xeon Phi coprocessor in the native execution mode. Their work is of great help in designing our optimized algorithms for HPCG, but special efforts must be made because SymGS in HPCG is based on sparse matrix kernels that have much more complex memory access patterns.

HPCG has drawn increasingly more attention from both academia and industry since it was announced. As a result, some preliminary research has emerged in the recent year, including Kumahata et al. (2014) on the K computer, Phillips and Fatica (2014) on NVIDIA GPU clusters, and Park et al. (2014a, 2014b) and Sridevi (2014) on Intel CPU and MIC clusters. Among them, by treating each Xeon Phi device as an independent node, the works of Park et al. (2014a, 2014b) and Sridevi (2014) support running HPCG on CPU-MIC heterogeneous systems in the symmetric programming mode. Overall, our work is distinct from the others in optimizing HPCG based on a heterogeneous CPU-MIC algorithm to enable the cooperative computing among hybrid resources.

8. Concluding remarks

In this work, we have proposed a hybrid CPU-MIC algorithm to enable and scale the HPCG benchmark on the world’s largest heterogeneous supercomputer, Tianhe-2. The hybrid algorithm is based on an inner–outer subdomain partitioning strategy that not only adapts to achieve a good load balance between host and device but also helps reduce the data movement overhead. Various parallelization and optimization techniques have been exploited and analyzed in detail to maximize the performance gain from both CPU and MIC. Among them, we pay special attention to the assessment of the trade-off between the locality and concurrency, especially in the symmetric Gauss–Seidel operation. A thorough performance analysis has been provided on a small CPU-MIC computer to compare among different optimization choices. The test results on Tianhe-2 show that our hybrid implementation of HPCG has successfully scaled to the full-system scale of 3.12 million heterogeneous cores, with an aggregated performance of 623 Tflop/s and a parallel efficiency of 81.2%, which is a new record on the HPCG List of 2014.

Footnotes

Acknowledgements

The authors would like to express their appreciations to the anonymous reviewers for the invaluable comments that greatly improved the quality of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was supported in part by NSF China (grants# 91530103, 61170075) and the National High-Tech R&D Program of China (grant# 2015AA01A302).

Author biographies

Yiqun Liu received her PhD in computer science from the Institute of Software, Chinese Academy of Sciences (ISCAS) in 2015. Currently, she is a senior research and development engineer at IDL, Baidu Inc. Her research mainly focuses on code optimization and deep learning.

Chao Yang is a professor at ISCAS. He received his PhD in computer science from ISCAS in 2007. His research interests include numerical analysis and modeling, large-scale scientific computing, and parallel numerical software.

Fangfang Liu is an associate professor at ISCAS. Her research interests are iterative methods and preconditioning techniques for sparse linear systems and high-performance numerical library. She received her MS in mathematics from the Jilin University in 2006.

Xianyi Zhang is a research fellow at the University of Texas at Austin (UT-Austin). Before joining UT-Austin in 2015, he had been working at ISCAS since 2007. He received his PhD in computer sicence from ISCAS in 2014 and obtained his MSc and BS from the Beijing Institute of Technology in 2007 and 2005, respectively.

Yutong Lu is a professor at National University of Defense Technology (NUDT), China. She received her BS, MS, and PhD in computer science from NUDT. Her research interests include large-scale system management, high-speed communication, distributed file system, and advanced parallel programming environments.

Yunfei Du is an associate professor at NUDT. His research interests focus on parallel and distributed systems, fault tolerance, and scientific computing. He received his PhD from NUDT in 2008.

Canqun Yang received his MS and PhD in computer science from NUDT in 1995 and 2008, respectively. Currently, he is a professor at the university. His research interests include programming languages and compiler implementation.

Min Xie is a professor at NUDT. His research interests include high-speed interconnect, programming model, and parallel and distributed computing. He received his PhD in computer science from NUDT.

Xiangke Liao received the BS from Tsinghua University in 1985, and MS from NUDT in 1988, both in computer science. Currently, he is a professor and the dean of College of Computer Science, NUDT. His research interests include high-performance architecture, operating systems, software engineering, and parallel and distributed processing.

References

Courtecuisse

Allard

(2009) Parallel dense Gauss–Seidel algorithm on many-core processors. In: Proceeding 11th International Conference on High Performance Computing and Communications (HPCC’09), Seoul, Korea, 25–27 June 2009, pp. 139–147.

Datta

Murphy

Volkov

. (2008) Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In: Proceedings of ACM/IEEE Conference on Supercomputing (SC’08), Austin, TX, 15–21 November 2008, pp. 4:1–4:12. Piscataway, NJ: IEEE.

Dongarra

Heroux

(2013) Toward a new metric for ranking high performance computing systems. Sandia Report SAND2013-4744. Albuquerque, NM: Sandia National Laboratories.

Dongarra

Luszczek

(2013) HPCG technical specification. Sandia Report SAND2013-8752. Albuquerque, NM: Sandia National Laboratories.

Graunke

Thakkar

(1990) Synchronization algorithms for shared-memory multiprocessors. Computer23(6): 60–69.

Iwashita

Shimasaki

(2003) Block red-black ordering: a new ordering strategy for parallelization of ICCG method. International Journal of Parallel Programming31(1): 55–75.

Kowarschik

Rüde

Weiss

. (2000) Cache-aware multigrid methods for solving Poisson’s equation in two dimensions. Computing64(4): 381–399.

Kumahata

Minami

Maruyama

(2014) HPCG on the K computer. ASCR HPCG Workshop, Bethesda, MD, 25 March 2014.

Liu

Smelyanskiy

Chow

. (2013) Efficient sparse matrix-vector multiplication on x86-based many-core processors. In: Proceedings of 27th International Conference on Supercomputing (ICS’13), Eugene, OR, 10–14 June 2013, pp. 273–282. New York: ACM.

10.

Liu

Zhang

Yang

. (2014) Accelerating HPCG on Tianhe-2: A hybrid CPU-MIC algorithm. In: Proceedings of 24th IEEE International Conference on Parallel and Distributed System ICPADS’14, Hsinchu, Taiwan, 16–19 December 2014, pp. 542–551.

11.

Mellor-Crummey

Scott

(1991) Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Transactions on Computer Systems9(1): 21–65.

12.

Monakov

Lokhmotov

Avetisyan

(2010) Automatically tuning sparse matrix-vector multiplication for GPU architectures. In: Patt

Foglia

Duesterwald

Faraboschi

Martorel (eds) High Performance Embedded Architectures and Compilers. Heidelberg: Springer, pp. 111–125.

13.

Park

Smelyanskiy

Sundaram

. (2014a) Sparsifying synchronization for high-performance shared-memory sparse triangular solver. In: Proceedings of 28th International Conference on Supercomputing (ICS’14), Leipzig, Germany, 22–26 June 2014, pp. 124–140. New York: Springer-Verlag.

14.

Park

Smelyanskiy

Vaidyanathan

. (2014b) Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14), New Orleans, Louisiana, 16–21 November 2014, pp. 945–955.

15.

Phillips

Fatica

(2014) A CUDA implementation of the High Performance Conjugate Gradient benchmark. In: Proceedings of the 5th International Workshop, PMBS 2014, New Orleans, LA, 16 November 2014, pp. 68–84.

16.

Sridevi (2014) Intel optimized technology preview for High Performance Conjugate Gradient benchmark. Technical report. Santa Clara: Intel.

17.

Vuduc

Demmel

Yelick

(2005) OSKI: a library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series16(1): 521.

18.

Wallin

Lof

Hagersten

. (2006) Multigrid and Gauss–Seidel smoothers revisited: parallelization on chip multiprocessors. In: Proceedings of 20th International Conference on Supercomputing (ICS’06), Queensland, Australia, 28 June–1 July 2006, pp. 145–155. New York: ACM.

19.

Wellein

Hager

Zeiser

. (2009) Efficient temporal blocking for stencil computations by multicore-aware wavefront parallelization. In: Proceedings of Computer Software and Applications Conference (COMPSAC’09), Vol 1, Seattle, WA, 20–24 July 2009, pp. 579–586.

20.

Williams

Kalamkar

Singh

. (2012) Optimization of geometric multigrid for emerging multi-and manycore processors. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12), Salt Lake City, Utah, 10–16 November 2012, pp. 96:1–96:11.

21.

Williams

Oliker

Vuduc

. (2009) Optimization of sparse matrix–vector multiplication on emerging multicore platforms. Parallel Computing35(3): 178–194.

22.

Wonnacott

(2000) Using time skewing to eliminate idle time due to memory bandwidth and network limitations. In: Proceedings of 14th IEEE International Parallel and Distributed Processing Symposium (IPDPS’00), Cancun, Mexico, 1–5 May 2000, pp. 171–180.

23.

Xue

Yang

. (2014) Enabling and scaling a global shallow-water atmospheric model on Tianhe-2. In: Proceedings of 28th IEEE International Parallel and Distributed Proceeding Symposium (IPDPS’14), Phoenix, AZ, 19–23 May 2014, pp. 745–754.

24.

Zhang

Yang

Liu

. (2014) Optimizing and scaling HPCG on Tianhe-2: early experience. In: Proceedings of 14th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP’14), Dalian, China, 24–27 August 2014, pp. 28–41. Switzerland: Springer.