Hybrid multi-projection method using sparse approximate inverses on GPU clusters

Abstract

State-of-the-art supercomputing infrastructures are equipped with accelerators, such as graphics processing units (GPUs), that operate as coprocessors for each workstation of the distributed memory system. The multi-projection type methods are a class of algebraic domain decomposition methods based on semi-aggregation techniques. Their convergence behavior improves as the number of subdomains increases, due to the corresponding augmentation of the semi-aggregated local linear systems with more coarse components, while the number of fine components is reduced. Moreover, the proposed method requires only a limited amount of communication among the workstations. The utilization of the available GPUs allows an increase in the number of subdomains along with finer-grained parallelism, leading to improved performance. A load-balancing algorithm that ensures the concurrency of the computations on multicore processors and GPUs is proposed. Flexible parallel preconditioned Krylov subspace iterative methods enhanced with multi-projection type methods have been designed to exploit the available CPUs and GPUs of the distributed memory system concurrently, in order to improve performance compared to CPU-only or GPU-only executions. The unsymmetric local linear systems are solved by the preconditioned Bi-Conjugate Gradient STABilized (BiCGSTAB) method enhanced with the modified generic factored approximate sparse inverse preconditioner, whereas the preconditioned conjugate gradient (CG) method along with the symmetric factored approximate sparse inverse preconditioner is used for the symmetric positive definite local coefficient matrices. Numerical results regarding the convergence behavior, the performance, and the scalability of the proposed method for several problems are given.

Keywords

Semi-aggregation, algebraic domain decomposition, high performance computing, hybrid solver, graphics processing units, load balancing

1. Introduction

Let us consider the sparse linear system

A x = b

where A is the coefficient matrix, b is the right-hand-side vector, and x is the solution vector. Linear systems of the form (1) often arise from the discretization of partial differential equations (PDEs) using methods such as finite differences or finite elements. The preconditioned Krylov subspace iterative methods have been widely used for solving such linear systems due to their parallel performance and reduced memory requirements compared to the class of direct methods (Saad, 2003). The preconditioned form of equation (1) is given as follows

M A x = M b

where M is a suitable preconditioning matrix. Domain decomposition methods are efficient parallel preconditioning schemes that have been used extensively for solving linear systems on distributed memory systems (Chan and Mathew, 1994; Toselli and Widlund, 2005). The domain can be partitioned either geometrically (Saad, 2003; Smith et al., 2004; Toselli and Widlund, 2005) or algebraically (Ferronato et al., 2014; Manguoglu, 2012; Zhu and Sameh, 2017) using graph partitioning algorithms such as METIS (Karypis and Kumar, 1998). The METIS partitioning algorithm utilizes the algebraic properties of the graph corresponding to the coefficient matrix A, partitioning it into disjoint subgraphs (subdomains) with minimum edge-cut. Algebraic domain decomposition methods can be combined with multilevel techniques, derived from the algebraic multigrid method, and have been shown to provide satisfactory convergence behavior and performance (Bank and Smith, 2002; Bank et al., 2015; Mitchell, 1997). The aforementioned methods compute fine–coarse subdomains and solve the local linear systems using multigrid cycle techniques. Aggregation-based multigrid (Notay, 2006, 2010) utilizes aggregated-coarse grids based on pairwise aggregation, retains an almost constant number of iterations for convergence to a prescribed tolerance, and has satisfactory weak scaling.

The graphics processing units (GPUs) of distributed memory systems have been used to improve the performance of domain decomposition methods. The finite element tearing and interconnecting method (Farhat et al., 2000) has been implemented for hybrid (CPU-GPU) systems, along with a dynamic load-balancing technique (Papadrakakis et al., 2011), having satisfactory parallel performance. Moreover, the domain decomposition methods combined with a communication-avoiding Generalized Minimum RESidual restarted (GMRES(m)) method have been shown to be scalable on GPU clusters (Yamazaki et al., 2014a, 2014b). Different approaches for implementing domain decomposition methods on GPUs have been examined on problems from several scientific fields (Babich et al., 2011; Luo et al., 2011; Osaki and Ishikawa, 2010).

Recently, a class of algebraic domain decomposition-type methods, based on semi-aggregation techniques, namely multi-projection methods (MPMs), has been proposed. The proposed class has improved convergence behavior compared to the classical domain decomposition methods, especially as the number of subdomains increases, as well as satisfactory scalability (Moutafis et al., 2017a, 2017b). The improved convergence behavior of the MPM preconditioning scheme, resulting from the increase of the number of subdomains, has been presented in the context of symmetric positive definite (SPD) problems and specifically for the Poisson 3-D model problem in Moutafis et al. (2017a). The MPM preconditioning scheme requires solving local linear systems, which consist of fine and coarse (aggregated) components. The additional aggregated components require slightly more computational work for solving the local linear systems; however, they lead to fewer iterations for convergence to the prescribed tolerance. In contrast to classical domain decomposition solvers (Smith et al., 2004), the aggregated components, related to the global information of the whole domain, provide a better approximation to the exact solution of the fine components; therefore, improved convergence behavior is expected, especially when the number of aggregates increases. The MPM can be described as an augmented restricted additive Schwarz (RAS) method (Cai and Sarkis, 1999), in which aggregated (coarse) components, instead of overlapping components, are involved in the local linear systems. The RAS method is based on overlapping domain decomposition techniques, whereas the MPM restricts the whole domain into coarse components by aggregating the fine components of each subdomain. A detailed comparison of the MPM preconditioning scheme with the RAS method, with various overlapping levels, and the Block Jacobi method has been given in Moutafis et al. (2017a). The proposed scheme has improved convergence behavior, performance, and scalability for solving various general large sparse linear systems compared to the aforementioned methods (Moutafis et al., 2017a). Aggregation-based multigrid (Notay, 2006, 2010) utilizes a hierarchy of aggregated (coarse) grids, whereas the MPM scheme requires the solution of multiple semi-aggregated (fine and coarse components) subdomains concurrently. The domain decomposition methods are suitable for GPU clusters; thus, the solution process of the local linear system of each subdomain can be further accelerated by the GPUs (Papadrakakis et al., 2011; Yamazaki et al., 2014b).

In the case of the MPM preconditioning scheme (Moutafis et al., 2017a), the local linear systems are solved by a direct method; however, the forward/backward substitution process required by direct methods cannot be efficiently parallelized on GPUs. In Anzt et al. (2015a), a block asynchronous Jacobi method has been used for approximating the forward/backward substitution process. Moreover, a recursive approach of the factorized sparse approximate inverse (FSAI) preconditioner (Kolotilina and Yeremin, 1993), providing better parallel performance than the classical FSAI for banded matrices, has been proposed (Bergamaschi and Martínez, 2012). Alternatively, a preconditioned iterative method can be used in conjunction with a sparse approximate inverse matrix; such a method is suitable for parallelization on many-core architectures, since it is composed of sparse matrix–vector products, inner products, and vector norms.

A hybrid parallel MPM-type preconditioning scheme is proposed along with a load-balancing algorithm, aiming to exploit the computational resources of a GPU cluster. The Bi-Conjugate Gradient STABilized (BiCGSTAB) method (Van der Vorst, 1992) enhanced with the Modified Generic Factored Approximate Sparse Inverse (MGenFAspI) preconditioner (Filelis-Papadopoulos and Gravvanis, 2016) has been used for solving the unsymmetric local linear systems, whereas the preconditioned conjugate gradient (CG) method (Hestenes and Stiefel, 1952), enhanced with the symmetric factored approximate sparse inverse (SFASI) preconditioner (Kyziropoulos et al., 2016), has been used for the SPD cases. Moreover, the hybrid parallel computing paradigm, using both multi-/many-core processing and GPUs, is expected to improve the performance compared to the CPU-only or the GPU-only cases. Load balancing significantly affects the parallel performance of the proposed method; therefore, an algorithm based on benchmark results has been designed and implemented to ensure an evenly distributed workload. The flexible variants of the preconditioned CG (Golub and Ye, 1999) (inexact preconditioned CG (IPCG)) and the GMRES(m) (Saad, 1993) (flexible GMRES(m) (FGMRES(m))) methods have been used along with the MPM preconditioning scheme, allowing the solution of the local linear systems with decreased tolerance. In Section 2, the MPM preconditioning scheme and the MGenFAspI preconditioner are described. In Section 3, the load-balancing technique is presented. In Section 4, implementation details concerning the FGMRES(m) method and the IPCG method, enhanced with the MPM preconditioning scheme, are given along with the description of the parallelization on the CPU/GPU cluster. The convergence behavior, the performance, and the scalability of the proposed method have been examined by several numerical experiments, which are presented in Section 5. The proposed method is compared to the RAS method (Cai and Sarkis, 1999).

2. MPM enhanced with the MGenFAspI

2.1. Multi-projection method

The class of multi-projection-type methods (Moutafis et al., 2017a, 2017b) utilizes semi-aggregation techniques to project a linear system of the form (1) into $n_{d o m s}$ (the number of subdomains) fine–coarse subdomains. The coarse components of each subdomain are the aggregates of the remaining subdomains, providing global information, in the form of aggregates, to each local linear system. Let us consider the domain $Ω$ and the nonoverlapping subdomains $Ω_{j}$ , which consist of $n c_{j}$ fine components such that

Ω = \cup_{j = 0}^{n_{d o m s} - 1} Ω_{j}

Moreover, let us also consider the fine–coarse subdomains $Z_{j}$ , whose components are the fine components of $Ω_{j}$ and ( $n_{d o m s} - 1$ ) coarse components corresponding to the aggregate of each subdomain $Ω_{k}$ , with $k \neq j$ . Let us denote with $V_{j} \in ℝ^{n \times (n c_{j} + (n_{d o m s} - 1))} : Z_{j} \to Ω$ and $V_{j}^{T} \in ℝ^{(n c_{j} + (n_{d o m s} - 1)) \times n} : Ω \to Z_{j}$ the prolongation and the restriction operators of the subdomain $Z_{j}$ , respectively. In Figure 1, an arbitrary graph, partitioned into three subdomains, and the corresponding fine–coarse subdomain $Z_{0}$ are shown.

Figure 1.

(a) An arbitrary graph, partitioned into three subdomains, and (b) the corresponding fine–coarse subdomain $Z_{0}$ are shown.

The local linear systems can be written as follows

V_{j}^{T} A V_{j} x_{j} = V_{j}^{T} b

or equivalently

A_{j} x_{j} = b_{j}

The prolongation matrices V_j are given by the following expression

(V_{j})_{:, i} = \begin{cases} e_{i}, & 0 \leq i < n c_{j} \\ p_{i - n c_{j}}, & n c_{j} \leq i < n c_{j} + j \\ p_{i - n c_{j} + 1}, & n c_{j} + j \leq i < n c_{j} + (n_{d o m s} - 1) \end{cases}

where $e_{i}$ is the suitable column vector of the $(n \times n)$ identity matrix, and $p_{k}$ corresponds to each of the remaining subdomains $Ω_{k}$ , with $k \neq j$ , and is given by the following expression

(p_{k})_{l} = \begin{cases} 1 / n c_{k}, & \text{if } l \in Ω_{k} \\ 0, & \text{if } l \notin Ω_{k} \end{cases}
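The column layout of the prolongation matrix can be illustrated with a minimal sketch. It assumes a partition vector `part` (a hypothetical input, such as the one a graph partitioner returns) and stores $V_{j}$ as a dense list of lists for readability; a practical implementation would use a sparse format.

```python
def build_prolongation(part, j):
    """Return V_j as a dense n x (nc_j + n_doms - 1) list of lists.

    part[l] is the subdomain index of global component l (assumed input).
    """
    n = len(part)
    n_doms = max(part) + 1
    members = [[l for l in range(n) if part[l] == k] for k in range(n_doms)]
    fine = members[j]
    others = [k for k in range(n_doms) if k != j]   # increasing k, skipping j
    V = [[0.0] * (len(fine) + len(others)) for _ in range(n)]
    # Fine columns: unit vectors for the components of Omega_j.
    for col, l in enumerate(fine):
        V[l][col] = 1.0
    # Coarse columns: p_k with entries 1/nc_k on Omega_k and zero elsewhere.
    for offset, k in enumerate(others):
        for l in members[k]:
            V[l][len(fine) + offset] = 1.0 / len(members[k])
    return V


part = [0, 0, 1, 1, 1, 2]          # six components, three subdomains
V1 = build_prolongation(part, 1)   # prolongation of the subdomain Z_1
```

Here `V1` has three fine columns (for components 2, 3, and 4) followed by two coarse columns, one for each of the subdomains 0 and 2.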

The local linear systems (4) can be solved by the BiCGSTAB method enhanced with the MGenFAspI preconditioner in the unsymmetric case, or by the CG method enhanced with the SFASI preconditioner in the SPD case. Factored approximate inverse matrices such as MGenFAspI or SFASI require only sparse matrix–vector products, in contrast to the forward/backward substitution required by traditional incomplete LU (ILU) preconditioning. Therefore, these preconditioned iterative methods are expected to have improved parallel performance on GPUs compared to ILU preconditioning. The local solution vector $x_{j}$ can be reordered in a block form as

x_{j} = \begin{bmatrix} x_{F_{j}} \\ x_{C_{j}} \end{bmatrix}

where $x_{F_{j}}$ and $x_{C_{j}}$ correspond to the fine and the coarse components of the subdomain Z_j , respectively. The global vector x is updated only by the fine components $x_{F_{j}}$ , whereas the coarse (aggregated) components $x_{C_{j}}$ are discarded. Solving the augmented semi-coarse local linear systems (5) leads to improved approximation of the exact solution x, since dependencies of local components to components of other subdomains are approximated using the aggregated values. Thus, improved convergence behavior is expected especially for large number of subdomains. The prolongation matrices that map x_F to domain $Ω$ and discard the auxiliary coarse part x_C are $W_{j} \in ℝ^{n \times (n c_{j} + (n_{d o m s} - 1))} : Z_{j} \to Ω$ . The MPM preconditioning scheme using the MGenFAspI can be described by Algorithm 1 (Moutafis et al., 2017a).

Algorithm 1.

MPM preconditioning scheme algorithm using the MGenFAspI-SFASI

It should be stated that the MPM preconditioning scheme requires an all-gather process to form the coarse part of each local right-hand-side vector $b_{j}$ . The required time for the aforementioned procedure is insignificant compared to the other parts of the algorithm, since the size of the coarse vector, which is sent, is only $n_{d o m s}$ . The MPM preconditioning scheme is used in conjunction with the FGMRES(m) method for unsymmetric coefficient matrices and with the flexible CG for SPD cases, and it is expected to be efficient for solving large linear systems on a distributed memory system.
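One application of the preconditioning step, as described in this subsection, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Algorithm 1: matrices are dense lists, the subdomains are processed in a sequential loop, and a small Gaussian elimination (`solve`) stands in for the preconditioned BiCGSTAB or CG local solver. All function names are hypothetical.

```python
def prolongation(part, j):
    """Dense V_j and the list of fine components of Omega_j."""
    n, n_doms = len(part), max(part) + 1
    fine = [l for l in range(n) if part[l] == j]
    others = [k for k in range(n_doms) if k != j]
    V = [[0.0] * (len(fine) + len(others)) for _ in range(n)]
    for col, l in enumerate(fine):
        V[l][col] = 1.0
    for off, k in enumerate(others):
        nc_k = part.count(k)
        for l in range(n):
            if part[l] == k:
                V[l][len(fine) + off] = 1.0 / nc_k
    return V, fine


def solve(A, b):
    """Gaussian elimination with partial pivoting (stand-in local solver)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def mpm_apply(A, r, part):
    """Return z, assembled from the fine parts of the local solutions."""
    n = len(r)
    z = [0.0] * n
    for j in range(max(part) + 1):      # independent local problems
        V, fine = prolongation(part, j)
        m = len(V[0])
        AV = [[sum(A[i][k] * V[k][c] for k in range(n)) for c in range(m)]
              for i in range(n)]
        A_j = [[sum(V[k][a] * AV[k][c] for k in range(n)) for c in range(m)]
               for a in range(m)]       # A_j = V_j^T A V_j
        b_j = [sum(V[k][a] * r[k] for k in range(n)) for a in range(m)]
        x_j = solve(A_j, b_j)
        for col, l in enumerate(fine):  # keep x_F, discard x_C
            z[l] = x_j[col]
    return z


A = [[2.0, -1.0, 0.0, 0.0], [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0], [0.0, 0.0, -1.0, 2.0]]
x_star = [2.0, 2.0, 5.0, 5.0]           # constant on each subdomain
r = [sum(A[i][k] * x_star[k] for k in range(4)) for i in range(4)]
z = mpm_apply(A, r, [0, 0, 1, 1])
```

In this small example the exact solution is constant on each subdomain, so it lies in the range of every $V_{j}$ and the step recovers it up to rounding. For a general right-hand side the result is only an approximation, which is why the scheme is used as a preconditioner.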

2.2. Modified generic factored approximate sparse inverse

The MGenFAspI matrix (Filelis-Papadopoulos and Gravvanis, 2016) of a coefficient matrix A is of the form

M = G H

where G is the upper triangular factor and H is the lower triangular factor of the approximate inverse M. The MGenFAspI matrix is computed based on the L and U factors, obtained by incomplete LU factorization procedure of a coefficient matrix A (Saad, 2003)

A = L U + E

where E is the error matrix. The setup phase of the MGenFAspI matrix requires the a priori knowledge of the sparsity pattern of the factor matrices G and H (Chow, 2000, 2001). Hence, the factors L and U are sparsified using a predefined drop tolerance (droptol) and then raised to a predefined power (level of fill), that is, lfill (Chow, 2000, 2001), to obtain the sparsity patterns. The factors $G_{d r o p t o l}^{l f i l l}$ and $H_{d r o p t o l}^{l f i l l}$ of the MGenFAspI are computed by solving the following linear systems, using a restricted solution process guided by the sparsity pattern

\begin{cases} U g_{:, j} = e_{j} \\ L h_{:, j} = e_{j} \end{cases}, \quad 1 \leq j \leq n, \quad \text{with} \quad \begin{cases} g_{i, j} = 0, & (i, j) \notin G_{d r o p t o l}^{l f i l l} \\ h_{i, j} = 0, & (i, j) \notin H_{d r o p t o l}^{l f i l l} \end{cases}

where n is the size of the coefficient matrix A, and $g_{:, j}$ , $h_{:, j}$ , and $e_{j}$ are the jth columns of the matrices $G_{d r o p t o l}^{l f i l l}$ , $H_{d r o p t o l}^{l f i l l}$ , and the corresponding identity matrix, respectively.
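The restricted solution process can be illustrated for the upper triangular factor. The sketch below is one possible reading of the description above, not the algorithm of Filelis-Papadopoulos and Gravvanis (2016): the pattern is taken from a power of the sparsified factor, and each column of G is obtained by a back substitution that skips the positions outside the pattern. Dense lists and the function names are assumptions made for illustration; the factor H would be treated analogously with forward substitution.

```python
def pattern_power(T, droptol, lfill):
    """Boolean sparsity pattern of the lfill-th power of the sparsified T."""
    n = len(T)
    # Sparsify: keep the diagonal and the entries larger than droptol.
    S = [[i == j or abs(T[i][j]) > droptol for j in range(n)]
         for i in range(n)]
    P = [row[:] for row in S]
    for _ in range(lfill - 1):
        P = [[any(P[i][k] and S[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return P


def restricted_upper_inverse(U, P):
    """Solve U g_j = e_j column by column, keeping only entries allowed by P."""
    n = len(U)
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):                      # columns are independent
        G[j][j] = 1.0 / U[j][j]
        for i in range(j - 1, -1, -1):      # back substitution
            if P[i][j]:                     # entries outside P stay zero
                s = sum(U[i][k] * G[k][j] for k in range(i + 1, j + 1))
                G[i][j] = -s / U[i][i]
    return G


U = [[2.0, 1.0, 0.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, 2.0]]
G1 = restricted_upper_inverse(U, pattern_power(U, 0.0, 1))
G2 = restricted_upper_inverse(U, pattern_power(U, 0.0, 2))
```

With `lfill = 1` the pattern equals that of U, so the entry in position (0, 2) is dropped. With `lfill = 2` the pattern covers the full upper triangle of this small matrix, and `G2` coincides with the exact inverse of U.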

It should be mentioned that the computation of the local MGenFAspI matrices is an inherently parallel process, since each core of the parallel system computes the corresponding MGenFAspI matrices ( $G_{j}$ and $H_{j}$ ) for each subdomain concurrently.

The main differences between MGenFAspI (Filelis-Papadopoulos and Gravvanis, 2016) and FSAI matrices are as follows. (a) The sparsity pattern of MGenFAspI is computed based on powers of sparsified matrices (Chow, 2000) of the factors L and U computed from incomplete factorization procedures, while in the case of FSAI, the sparsity pattern is computed by powers of the lower triangular part of the coefficient matrix. (b) In the case of MGenFAspI, computation is performed by a column-wise decoupled restricted solution procedure using the explicitly known factors L and U computed from incomplete factorization procedures, while FSAI is based on minimization of the Frobenius norm $∥ I - G L ∥_{F}$ , where L denotes the exact lower triangular factor of the coefficient matrix known only implicitly. (c) The standard FSAI is applicable to SPD coefficient matrices, while MGenFAspI is generic and can be applied to symmetric and unsymmetric coefficient matrices.

Further details concerning the MGenFAspI matrix are given in Filelis-Papadopoulos and Gravvanis (2016). Moreover, the corresponding symmetric version of MGenFAspI, namely SFASI, which is used for SPD coefficient matrices, is described extensively in Kyziropoulos et al. (2016).

3. Load balancing

The workload of the MPM preconditioning scheme should be evenly balanced among the available computational resources to allow concurrent computations. In the MPM preconditioning step, each workstation of the distributed memory system has to solve $n_{d o m s} / n_{w o r k s t a t i o n s}$ local linear systems. The number of subdomains $n_{d o m s}$ should be chosen appropriately to increase the granularity and to fully exploit the multicore processors and the GPUs of each workstation. Since the MPM preconditioning scheme has improved convergence behavior as the number of subdomains is increased, the proposed method is suitable for large-scale distributed memory systems (Moutafis et al., 2017a). The increased number of subdomains leads to smaller local linear systems, allowing for a more evenly balanced workload. Denoting by $n c_{j}$ the number of fine components in subdomain $Z_{j}$ , and presuming that the METIS partitioning algorithm divides the graph corresponding to the coefficient matrix A into equally sized partitions, the following can be assumed

n c_{j} ≃ \frac{n}{n_{d o m s}}

where n is the size of the coefficient matrix A, and consequently, the total number of components in Z_j is given by

n_{l o c a l} ≃ \frac{n}{n_{d o m s}} + (n_{d o m s} - 1)

Furthermore, let us assume that the architecture of the GPU cluster has one GPU for each available multicore processor and the number of processors of the cluster is equal to the number of GPUs ( $n_{p r o c s} = n_{G P U s}$ ). Aiming to increase the concurrent computations in each workstation, the computations regarding each subdomain are mapped one-to-one to the available cores. However, one core from each processor does not participate in the solution process of the local linear systems, since it is responsible for controlling the corresponding GPU.

The number of subdomains assigned to each GPU, $w \in ℕ^{*}$ , should be chosen appropriately to increase concurrency. The GPU time required for solving w local linear systems ( $w \cdot T_{G P U}$ ) should be (almost) equal to the time required by each core ( $T_{c o r e}$ ) for solving one local linear system. The number of subdomains is then chosen according to the following expression

n_{d o m s} = n_{p r o c s} (n_{c o r e s} - 1 + w)
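As a worked check of the two expressions above, the following sketch evaluates them for 32 processors with 10 cores each and $n = 16, 777, 216$ , the configuration considered later in this section. The value $w = 5$ is chosen for illustration, and the function names are hypothetical.

```python
def number_of_subdomains(n_procs, n_cores, w):
    """n_doms = n_procs * (n_cores - 1 + w): one core per processor drives the GPU."""
    return n_procs * (n_cores - 1 + w)


def local_size(n, n_doms):
    """Approximate size of a local system: fine part plus (n_doms - 1) aggregates."""
    return n / n_doms + (n_doms - 1)


n_doms = number_of_subdomains(n_procs=32, n_cores=10, w=5)
n_local = local_size(16_777_216, n_doms)
```

With these values `n_doms` equals 448, and each local system has roughly 37,896 unknowns, of which 447 are coarse components.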

In Figure 2, the workload distribution in a compute node, having two multicore processors and two GPUs, is schematically presented. For designing a load-balancing algorithm, a benchmark test should be chosen to estimate $T_{G P U}$ and $T_{c o r e}$ . The preconditioning step of MPM requires solving the local linear systems by the BiCGSTAB method enhanced with an MGenFAspI preconditioner or the CG method enhanced with an SFASI preconditioner. The preconditioned Krylov subspace methods are composed of vector–vector and sparse matrix–vector operations. The sparse matrix–vector multiplication is the most computationally demanding operation of the preconditioned Krylov subspace methods. Therefore, the execution time of a sparse matrix–vector product was chosen to be the benchmark for comparing the performance of the GPU against the performance of one CPU core. The size of the local sparse matrices depends on the total number of subdomains, $n_{d o m s}$ , and the number of subdomains, assigned to the GPU, w, equations (13) and (14), which affect both $T_{G P U}$ and $T_{c o r e}$ . The benchmark test matrix should have size $n_{l o c a l}$ ; therefore, for every possible value of w, a different sparse matrix should be used as benchmark. The sparsity pattern of these matrices affects the performance of the sparse matrix–vector multiplication on CPU and GPU. A random sparsity pattern, based on the properties of the sparsity pattern corresponding to the initial sparse matrix A, is constructed. The nonzero ratio in every row and the bandwidth-to-size ratio of the system matrix A are preserved

\frac{n n z (A_{l o c a l})}{n_{l o c a l}} = \frac{n n z (A)}{n}

and

\frac{b a n d w i d t h (A_{l o c a l})}{n_{l o c a l}} = \frac{b a n d w i d t h (A)}{n}

Figure 2.

The distributed workload in a single node of the distributed memory system, having two multicore processors and two GPUs, is depicted. GPU: graphics processing unit.

It should be mentioned that the diagonal elements of the random matrix were chosen to be nonzero according to common sparsity patterns of coefficient matrices, derived from finite differences and finite elements methods. In Figure 3, the sparsity pattern of a random matrix with $n_{l o c a l} = 100$ , $\frac{n n z (A_{l o c a l})}{n_{l o c a l}} = 4$ , and $b a n d w i d t h (A_{l o c a l}) = 15$ is depicted.

Figure 3.

The sparsity pattern of a random matrix with $n_{l o c a l} = 100$ , $\frac{n n z (A_{l o c a l})}{n_{l o c a l}} = 4$ , and $b a n d w i d t h (A_{l o c a l}) = 15$ is shown.
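A random benchmark pattern with the properties described above can be generated as in the following sketch. It assumes that the bandwidth denotes the largest allowed distance $|i - j|$ from the diagonal, which the text does not specify, and it returns only the column indices of each row. The function name and the fixed seed are illustrative choices.

```python
import random


def random_banded_pattern(n_local, nnz_per_row, bandwidth, seed=0):
    """Column indices per row: diagonal plus random positions inside the band."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_local):
        lo = max(0, i - bandwidth)
        hi = min(n_local - 1, i + bandwidth)
        candidates = [j for j in range(lo, hi + 1) if j != i]
        # The diagonal is always present; the remaining entries are random.
        extra = rng.sample(candidates, min(nnz_per_row - 1, len(candidates)))
        rows.append(sorted([i] + extra))
    return rows


# Parameters of the example matrix: size 100, 4 nonzeros per row, bandwidth 15.
pattern = random_banded_pattern(100, 4, 15)
```

Each row of `pattern` then holds four column indices, one of which is the diagonal, and all of them lie within 15 positions of the diagonal.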

Let us define $w_{b} \in ℝ_{+}^{*}$ as the following ratio

w_{b} (n_{d o m s}) = \frac{T_{c o r e} (n_{d o m s})}{T_{G P U} (n_{d o m s})}

The value of w should not be larger than $w_{b}$ ; otherwise $w \cdot T_{G P U} > T_{c o r e}$ , and the GPU would negatively affect the performance of the preconditioning step. Using $⌊ w_{b} ⌋$ instead of $w_{b}$ ensures that $w \cdot T_{G P U} \leq T_{c o r e}$ . Therefore, a suitable load balancing for the MPM preconditioning scheme is achieved when $n_{d o m s}$ is chosen such that

\underset{n_{d o m s}}{argmin} ‖ w (n_{d o m s}) - ⌊ w_{b} (n_{d o m s}) ⌋ ‖
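The selection rule can be sketched as a search over the candidate values of w. The sketch assumes that the benchmark timings are available as two callables, `t_core` and `t_gpu`, and that an upper bound `max_w` on w is given; these names are hypothetical. The timing models at the bottom are synthetic placeholders invented for illustration, not measurements.

```python
import math
import sys


def choose_w(n_procs, n_cores, max_w, t_core, t_gpu):
    """Return (w, n_doms) minimizing |w - floor(w_b(n_doms))|."""
    best_w, best_gap = None, sys.float_info.max
    for w in range(1, max_w + 1):
        n_doms = n_procs * (n_cores - 1 + w)
        w_b = t_core(n_doms) / t_gpu(n_doms)     # benchmark ratio
        gap = abs(w - math.floor(w_b))
        if gap < best_gap:
            best_w, best_gap = w, gap
    return best_w, n_procs * (n_cores - 1 + best_w)


def synthetic_t_core(n_doms):
    """Placeholder: core time proportional to the local problem size."""
    return 1.0e6 / n_doms


def synthetic_t_gpu(n_doms):
    """Placeholder: ten times faster than a core, plus a fixed overhead."""
    return 1.0e5 / n_doms + 50.0


w, n_doms = choose_w(2, 4, 12, synthetic_t_core, synthetic_t_gpu)
```

For the placeholder models the ratio stays just below 10, so the search settles on `w = 9` and `n_doms = 24`. In practice the two callables would run the sparse matrix–vector benchmark on a core and on the GPU.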

Let us consider a distributed memory system equipped with 32 processors (10 cores each) and 32 GPUs, and a coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite difference method. The corresponding $w (n_{d o m s})$ and $w_{b} (n_{d o m s})$ curves are shown in Figure 4. In this case, the optimal value of w (the intersection point of the two curves) is obtained for $n_{d o m s} = 448$ . The proposed load-balancing scheme can be described by Algorithm 2.

Figure 4.

The $w (n_{d o m s})$ and the $w_{b} (n_{d o m s})$ curves for 32 CPUs (10 cores each), 32 GPUs, and $n = 16, 777, 216$ are shown. GPU: graphics processing unit.

Algorithm 2

Load Balancing algorithm

The $\mathrm{MAX\_DOUBLE\_VALUE}$ is the largest double-precision floating-point number that can be represented by the IEEE 754 technical standard, and $\mathrm{MAX\_w}$ is the maximum number of local linear systems that can be solved in the GPU, based on the memory requirements of the preconditioned BiCGSTAB method. The proposed load-balancing algorithm computes the value of w such that the GPUs and the multicore processors perform computations concurrently. The relative error of w is given by the following expression

η = | \frac{w_{M P M} - w}{w_{M P M}} |

where $w_{M P M} = ⌊ \frac{T_{C P U}^{M P M}}{T_{G P U}^{M P M}} ⌋$ is the optimal value of w, given by the a posteriori measurements of the performance of the MPM preconditioning scheme. The relative error of w using the random benchmark coefficient matrix for a coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and for the cage14 problem case, obtained from SuiteSparse matrix collection (Davis and Hu, 2011), with various numbers of processes, is depicted in Figure 5. Moreover, the relative error between the elapsed time for the sparse matrix–vector multiplication on the GPU, using the random benchmark coefficient matrix ( $T_{G P U}^{r a n d}$ ), and the average elapsed time from the resulting local coefficient matrices ( $T_{G P U}$ ) for the aforementioned coefficient matrices and various numbers of processes is depicted in Figure 6.

Figure 5.

The relative error of w using the random benchmark coefficient matrix for (a) the coefficient matrix A of dimension $n = 16, 777, 216$ , derived from a finite difference 3-D Poisson problem, and (b) the cage14 coefficient matrix with various numbers of processes, is depicted.

Figure 6.

The relative error between the elapsed time on the GPU for the sparse matrix–vector multiplication, using the random benchmark coefficient matrix, with the average elapsed time from the resulting local coefficient matrices (a) for a coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and (b) for the cage14 problem case, for various numbers of processes, is depicted. GPU: graphics processing unit.

It should be mentioned that the relative error between the elapsed time of the sparse matrix–vector multiplication on the GPU for the actual local coefficient matrices and the elapsed time obtained with the random benchmark coefficient matrix is substantial. This stems from the fact that, in the local coefficient matrix, the rows and the columns corresponding to the aggregated (coarse) components have an increased number of nonzero elements compared to the rows corresponding to the fine components. When some rows of the coefficient matrix have a large number of nonzero elements, the parallelization and, thus, the performance of the sparse matrix–vector multiplication on the GPU are affected negatively, due to the unbalanced workload of the GPU threads. An example of the sparsity pattern of a local coefficient matrix derived from the discretization of the 2-D Poisson problem with the finite difference method, partitioned into 16 subdomains, is shown in Figure 7.

Figure 7.

An example of the sparsity pattern of a local coefficient matrix derived from the discretization of the 2-D Poisson problem with the finite difference method, partitioned into 16 subdomains, is shown.

In order to improve the model with respect to the sparsity pattern of the local coefficient matrix of Figure 7, as well as the elapsed time required to compute the sparse matrix–vector multiplication, an alternative technique for forming the benchmark coefficient matrix has been used. For each different value of $n_{d o m s}$ and $n_{l o c a l}$ , an arbitrary restriction operator $V^{T} \in ℝ^{n_{l o c a l} \times n} : Ω \to Z$ , equation (4), is formed in order to restrict A into the corresponding arbitrary subdomain Z and consequently to form the benchmark coefficient matrix. This process ensures that the benchmark coefficient matrix has a size, number of nonzero elements, and sparsity pattern similar to those of a potential local coefficient matrix. Specifically, the arbitrary restriction operator is computed by partitioning the vertices of the graph corresponding to the coefficient matrix A into $n_{d o m s}$ subdomains of the same size using lexicographic order.
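The effect of the restriction technique on the sparsity pattern can be illustrated with a structural sketch. It assumes a 5-point stencil on a small grid as the coefficient matrix and computes only the pattern of $V^{T} A V$ (numerical cancellation is ignored) for a lexicographic partition. All function names are hypothetical.

```python
def grid_adjacency(nx, ny):
    """Structural nonzeros per row of the 5-point stencil, lexicographic order."""
    adj = []
    for y in range(ny):
        for x in range(nx):
            row = {y * nx + x}
            if x > 0:
                row.add(y * nx + x - 1)
            if x < nx - 1:
                row.add(y * nx + x + 1)
            if y > 0:
                row.add((y - 1) * nx + x)
            if y < ny - 1:
                row.add((y + 1) * nx + x)
            adj.append(row)
    return adj


def lexicographic_partition(n, n_doms):
    """Consecutive blocks of (almost) equal size."""
    size = -(-n // n_doms)              # ceiling division
    return [l // size for l in range(n)]


def benchmark_pattern(adj, part, j):
    """Structural pattern of V_j^T A V_j: fine rows first, then coarse rows."""
    n_doms = max(part) + 1
    fine = [l for l in range(len(part)) if part[l] == j]
    local = {l: i for i, l in enumerate(fine)}
    others = [k for k in range(n_doms) if k != j]
    coarse = {k: len(fine) + i for i, k in enumerate(others)}

    def to_local(l):
        return local[l] if part[l] == j else coarse[part[l]]

    rows = [set() for _ in range(len(fine) + len(others))]
    for l, cols in enumerate(adj):
        for m in cols:
            rows[to_local(l)].add(to_local(m))
    return rows


adj = grid_adjacency(4, 4)
part = lexicographic_partition(16, 4)   # one grid row per subdomain
rows = benchmark_pattern(adj, part, 0)
nnz_per_row = [len(r) for r in rows]
```

In this example the fine rows have at most four nonzero elements, whereas the coarse row of the neighboring subdomain has six, which mirrors the denser aggregated rows and columns discussed above.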

However, load balancing also depends on the number of iterations that the preconditioned Krylov subspace iterative method requires to obtain the solution of each local linear system. In Figure 8, the histogram of the number of required iterations of the preconditioned CG method for solving the Poisson 3-D problem ( $n = 16, 777, 216$ ) is shown. The relatively small variance shown in the histogram enables the hypothesis that the solution of each local linear system requires almost the same number of iterations for convergence to approximately the prescribed tolerance. However, to further improve the load-balancing technique, the maximum number of allowed inner iterations has been limited. The limit is set by measuring the numbers of required inner iterations for each local linear system in the first outer iteration and then computing the rounded mean value. In the rest of the outer iterations, the maximum number of iterations for the preconditioned Krylov method is set to this mean value. Therefore, any delay that can be caused by local linear systems that require more iterations than the mean value is prevented. Furthermore, the convergence behavior of the outer flexible Krylov subspace method is not affected significantly as the relative residual of the inner iterative methods is usually close to the desired tolerance prescribed in the stopping criterion.
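The iteration limit described above amounts to a rounded mean. A minimal sketch follows; the iteration counts are made-up illustrative values, and the function name is hypothetical.

```python
import math


def inner_iteration_cap(first_outer_counts):
    """Rounded mean of the inner iteration counts of the first outer iteration."""
    mean = sum(first_outer_counts) / len(first_outer_counts)
    return max(1, math.floor(mean + 0.5))   # round half up, at least one


# Illustrative counts for six local linear systems (not measured data).
counts = [18, 20, 19, 21, 20, 19]
cap = inner_iteration_cap(counts)
capped = [min(c, cap) for c in counts]
```

Here the mean is 19.5, so the limit is 20 and only the local system that needed 21 iterations is cut short in the subsequent outer iterations. Rounding half up is used explicitly because Python's built-in `round` rounds halves to the nearest even integer.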

Figure 8.

The histogram of the number of the required iterations for convergence of the preconditioned CG for solving the Poisson 3-D problem ( $n = 16, 777, 216$ ) is shown. CG: conjugate gradient.

Moreover, it should be noted that the METIS graph partitioning algorithm partitions the graph corresponding to the coefficient matrix A with respect to the minimization of the edge-cut set. Consequently, the number of nonzero elements in the rows and the columns corresponding to the aggregated (coarse) components of each local coefficient matrix depends on the partitioning. The relative error of w, using the restriction technique to form the benchmark coefficient matrix, for the coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and for the cage14 coefficient matrix for various numbers of processes, is depicted in Figure 9. The relative error of the elapsed time for the sparse matrix–vector multiplication on GPU using the benchmark coefficient matrix, derived with the restriction technique, and using the average elapsed time of the resulting local coefficient matrices for the coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and for the cage14 problem case, with various numbers of processes, is shown in Figure 10.

Figure 9.

The relative error of w, using the restriction technique to form the benchmark coefficient matrix, for (a) the coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and (b) the cage14 coefficient matrix for various numbers of processes, is depicted.

Figure 10.

The relative error of the elapsed time for the sparse matrix–vector multiplication on GPU using the benchmark coefficient matrix, derived with the restriction technique, and using the average elapsed time of the resulting local coefficient matrices for (a) the coefficient matrix A of dimension $n = 16, 777, 216$ , derived from the discretization of the 3-D Poisson problem with the finite differences method, and (b) the cage14 problem case, with various numbers of processes, is depicted. GPU: graphics processing unit.

It should be stated that the $η$ values and the relative error of T_GPU decrease significantly when the restriction technique is used instead of the random benchmark matrix technique. The proposed load-balancing scheme can be used along with any domain decomposition method by choosing a different restriction matrix V^T with respect to the respective subdomains.

In Figure 11, the performance of the IPCG, enhanced with the hybrid MPM preconditioning scheme for the Poisson 3-D problem ( $n = 16, 777, 216$ ) with fixed number of workstations (8× CPUs and GPUs) and different number of subdomains is shown, in order to demonstrate the improved performance of the proposed method, with respect to the appropriate selection of the number of subdomains assigned to the GPUs. It should be mentioned that the load-balancing algorithm chooses $w = 5$ , that is, five subdomains per GPU, which is the optimal value in this case.

Figure 11.

The performance of the proposed scheme for the Poisson 3-D problem ( $n = 16, 777, 216$ ) with fixed number of workstations (8× CPUs and GPUs) and different number of subdomains is shown. GPUs: graphics processing units.

The proposed load-balancing scheme is expected to evenly distribute the workload of the MPM among the CPUs and the GPUs. However, it should be noted that the following assumptions have been made: (a) the coefficient matrix of each local linear system should have similar size and sparsity pattern; (b) the local linear solver should require similar number of required iterations for convergence to the prescribed tolerance for every local linear system; and (c) the ratio $\frac{T_{c o r e} (n_{d o m s})}{T_{G P U} (n_{d o m s})}$ should be similar for both the sparse matrix–vector multiplication of the benchmark matrix and the iteration time of the local linear solver for every local linear system.

4. Implementation details

The flexible Krylov subspace methods have been implemented using neighboring communication. The linear system (1) is reordered according to the partitioning vector computed by the METIS partitioning algorithm (Karypis and Kumar, 1998). The reordered linear system has a limited number of edges (nonzero elements) in the off-diagonal blocks, because the METIS partitioning minimizes the edge-cut. Therefore, the neighbor-wise communications among the compute nodes of the distributed memory system, required for the computation of the sparse matrix–vector product, are reduced. Moreover, the Gram–Schmidt process of the FGMRES(m), applied to the linear system (1), is implemented block-wise, so that it requires only local inner products, local daxpys, and one all-reduce communication operation per inner iteration. The Givens rotation process, which reduces the Hessenberg matrix from the Arnoldi process to a triangular matrix, is computed redundantly on each workstation.
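The block-wise orthogonalization step can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the list-based stand-in for the all-reduce operation, and the choice to obtain the norm of the orthogonalized vector from the same reduction (through the classical Gram–Schmidt identity $‖w‖^{2} - \sum_{i} h_{i}^{2}$, which may lose accuracy in finite precision) are assumptions made for illustration only.

```python
import math


def allreduce_sum(per_rank_values):
    """Stand-in for an MPI all-reduce: element-wise sum over all ranks."""
    return [sum(vals) for vals in zip(*per_rank_values)]


def blockwise_gram_schmidt_step(basis_blocks, w_blocks):
    """Orthogonalize w against the basis with a single reduction.

    basis_blocks[r][i] is the slice of basis vector i owned by rank r,
    w_blocks[r] is the slice of the new vector w owned by rank r.
    """
    # Local inner products <v_i, w>, plus <w, w>, packed into one message.
    local = []
    for v_r, w_r in zip(basis_blocks, w_blocks):
        dots = [sum(a * b for a, b in zip(v, w_r)) for v in v_r]
        dots.append(sum(a * a for a in w_r))
        local.append(dots)

    # The only global communication of the step.
    reduced = allreduce_sum(local)
    h, ww = reduced[:-1], reduced[-1]

    # Local daxpys: w <- w - h_i * v_i on every rank.
    new_blocks = []
    for v_r, w_r in zip(basis_blocks, w_blocks):
        w_new = list(w_r)
        for h_i, v in zip(h, v_r):
            w_new = [x - h_i * y for x, y in zip(w_new, v)]
        new_blocks.append(w_new)

    # Norm of the orthogonalized vector without a second reduction (assumption).
    norm = math.sqrt(max(ww - sum(x * x for x in h), 0.0))
    return h, norm, new_blocks
```

In this sketch each rank touches only its own slices, and the packed list of inner products is the only quantity exchanged globally.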

It should be noted that the MAGMA 2.2.0 GPU library (Anzt et al., 2015b) has been used for implementing the proposed method on GPU. The MAGMA built-in function of the preconditioned BiCGSTAB method (Anzt et al., 2017; Gravvanis et al., 2012) has been combined with the MGenFAspI matrix to solve the unsymmetric local linear systems, whereas for the SPD cases, the preconditioned CG method, enhanced with the SFASI matrix, has been used. In the MPM preprocessing phase, the MGenFAspI-SFASI matrices of the local linear coefficient matrices A_j (5) are computed, concurrently by each core. The A_j and the MGenFAspI-SFASI matrices, required for solving the local linear systems of the corresponding GPU, $(\in Z_{j}^{G P U})$ , are offloaded to the GPU memory only once in the preprocessing phase.

It should be mentioned that the required wall-clock time of the preprocessing phase is not significant compared to the wall-clock time required for the solution phase; therefore, the numerical results focus on the performance of the latter. Consequently, a more advanced load-balancing scheme for the preprocessing phase would not be expected to affect the overall performance of the method noticeably.

Furthermore, a pipeline technique, based on overlapping communication operations between the CPU and the GPU memory, is utilized for increasing the concurrency of the preconditioning step. In the MPM step, the communication concerning the local solution vectors is overlapped by the computations corresponding to the right-hand-side vector of the next GPU subdomain. Algorithm 3 describes the GPU overlapping communication of the MPM preconditioning step.

Algorithm 3

GPU Overlapping Communication MPM Preconditioning step

It should be mentioned that $b_{j} = V_{j}^{T} b$ is computed in two phases. First, for each subdomain, the corresponding aggregate component $a_{j} = \sum_{k \in Ω_{j}} \frac{b (k)}{n c_{j}}$ is computed and sent to all workstations. Afterward, b_j is assembled as follows

b_{j} = {[b_{F_{j}}, a_{0}, \dots, a_{j - 1}, a_{j + 1}, \dots, a_{n_{doms} - 1}]}^{T}
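The two-phase computation of b_j can be sketched in a few lines of Python. The sketch assumes that $n c_{j}$ is the number of components of subdomain $Ω_{j}$ and that $b_{F_{j}}$ collects the entries of b belonging to $Ω_{j}$; the function names and the list-of-index-lists representation of the subdomains are hypothetical, and the all-to-all exchange of the aggregate components is replaced by a plain local computation.

```python
def aggregate_components(b, subdomains):
    """Phase 1: one averaged (coarse) component a_j per subdomain.

    In a distributed run each workstation would compute its own a_j and
    send it to all others; here everything is computed locally.
    """
    return [sum(b[k] for k in omega) / len(omega) for omega in subdomains]


def local_rhs(b, subdomains, j):
    """Phase 2: fine components of subdomain j followed by the aggregate
    components of all other subdomains."""
    a = aggregate_components(b, subdomains)
    fine = [b[k] for k in subdomains[j]]
    coarse = [a[i] for i in range(len(subdomains)) if i != j]
    return fine + coarse
```

For instance, with `b = [1, 2, 3, 4, 5, 6]` and three subdomains of two components each, the local right-hand side of the middle subdomain holds its two fine entries followed by the averages of the first and the last subdomain.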

In the case of badly scaled problems, the termination criterion based on the actual residual vector may not be satisfied, even though the norm corresponding to the preconditioned residual satisfies the termination criterion (Saad, 2003). An a priori equilibration of the linear system should be used to acquire a solution that satisfies both termination criteria (Hoemmen, 2010; Sluis, 1969). The most cost-effective equilibration is row scaling (Hoemmen, 2010)

D_{r} A x = D_{r} b

where D_r is a diagonal row scaling matrix, which normalizes each row of the coefficient matrix A using the ${‖ \cdot ‖}_{\infty}$ norm. The new termination criterion is based on the equilibrated preconditioned residual $\tilde{z} = M D_{r} (b - A x)$; therefore, the convergence behavior of the FGMRES(m) is expected to be affected (Hoemmen, 2010; Sluis, 1969).
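A minimal sketch of the row scaling is given below, assuming a small dense matrix stored as a list of rows and no zero rows. The function name and data layout are illustrative assumptions; a sparse implementation would scale the stored nonzero elements of each row in the same way.

```python
def row_scale(A, b):
    """Scale every row of A (and the matching entry of b) by the inverse
    of the row's infinity norm, i.e. form D_r A and D_r b."""
    # Diagonal of D_r: reciprocal of the largest absolute value in each row.
    d = [1.0 / max(abs(x) for x in row) for row in A]
    A_scaled = [[d_i * x for x in row] for d_i, row in zip(d, A)]
    b_scaled = [d_i * b_i for d_i, b_i in zip(d, b)]
    return d, A_scaled, b_scaled
```

After the scaling, the largest absolute value in every row of the scaled matrix equals one, while the solution vector x is unchanged.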

5. Numerical results

Numerical experiments were carried out on the ARIS High Performance Computing infrastructure in order to examine the convergence behavior, the performance, and the scalability of the proposed method. Each compute node is equipped with 2× Haswell-Intel Xeon E5-2660v3 (10 cores each), 64 GB of RAM, and 2× NVIDIA Tesla K40 (12 GB). The number of nonzero elements of the coefficient matrix A is denoted by $nnz(A)$. The restart parameter of FGMRES(m) is chosen to be 20, and the maximum number of iterations is set to 100. The maximum number of iterations for the IPCG method is set to 1000. The execution times are given in seconds. The termination criterion of the flexible Krylov subspace methods is selected to be ${‖ r ‖}_{2} < 10^{-8} {‖ b ‖}_{2}$. The number of required iterations for convergence to the prescribed tolerance for the FGMRES(m) method is given as outer(inner), where the inner iterations denote the number of iterations after the last restart of the FGMRES(m) method, and the outer iterations denote the number of times the FGMRES(m) method was restarted. The right-hand-side vectors are computed as the product of the coefficient matrix A with the solution vector set to ${[0, 1, \dots, n - 1]}^{T}$. The parameters for the computation of the local preconditioners (MGenFAspI and SFASI) are set to “$lfill$” = 1 and “$droptol$” = 0. The maximum number of iterations for the preconditioned BiCGSTAB-CG method is 500 for the first outer iteration and is afterward computed as described in Section 3. The termination criterion of the preconditioned Krylov subspace methods for the local linear systems is ${‖ r ‖}_{2} < 10^{-4} {‖ b ‖}_{2}$, unless otherwise stated. The performance and the scalability of the proposed hybrid method (MPM-hybrid), which utilizes both CPUs and GPUs for solving the local linear systems, have been compared to the cases of using only the CPUs (MPM-CPU) and only the GPUs (MPM-GPU).
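To make the outer(inner) notation and the relative residual criterion concrete, a short Python sketch follows. It assumes that every completed restart cycle performs exactly m iterations, so that the total count is outer × m + inner; the helper names are hypothetical and are not part of any solver library.

```python
import math
import re


def total_iterations(entry, m=20):
    """Convert an 'outer(inner)' entry into a total iteration count,
    assuming each completed restart cycle consists of m iterations."""
    match = re.fullmatch(r"(\d+)\((\d+)\)", entry.strip())
    if match is None:
        raise ValueError("expected the form outer(inner), got %r" % entry)
    outer, inner = int(match.group(1)), int(match.group(2))
    return outer * m + inner


def has_converged(r, b, tol=1e-8):
    """Relative residual test: ||r||_2 < tol * ||b||_2."""
    def norm2(v):
        return math.sqrt(sum(x * x for x in v))
    return norm2(r) < tol * norm2(b)
```

With the restart parameter set to 20, an entry such as 1(17) corresponds to 37 iterations in total, and an entry such as 0(9) corresponds to 9 iterations without any restart.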

5.1. 2-D Poisson problem

The Poisson equation in two space variables has been discretized by the five-point stencil of the finite differences method. In Figure 12, the performance of the IPCG method, enhanced with the MPM preconditioning scheme (CPU-only case and hybrid case), is shown for three different pairs of the number of CPUs and the number of subdomains. In the first case, presented in Figure 12(a), the number of CPUs (i.e. number of processes) is proportionate to the number of subdomains, where each subdomain is mapped to one thread of the CPU. In the second case, depicted in Figure 12(b), the number of subdomains is chosen according to the load-balancing algorithm of the hybrid (GPU + CPU) MPM preconditioning scheme, whereas the number of CPUs increases according to the powers of two. In the third case, shown in Figure 12(c), more compute nodes, and consequently more CPUs, have been used, so that the subdomains selected by the load-balancing algorithm can be mapped one-to-one to CPU threads. Moreover, the convergence behavior of the IPCG method, enhanced with the MPM preconditioning scheme, for the different numbers of subdomains is given in Table 1. It should be mentioned that the performance of the hybrid (GPU + CPU) MPM preconditioning scheme, Figure 12(b), is similar to the performance of the CPU MPM preconditioning scheme on a larger distributed system without GPUs, Figure 12(c). Comparing the performance of the CPU MPM preconditioning scheme in Figure 12(a) and (b), in the cases where the increase in the number of subdomains improves the convergence behavior of the method, a corresponding improvement in the performance is noticed. In the remaining cases, where the number of iterations for convergence to the prescribed tolerance remains the same, the performance of the one-to-one mapping is slightly better.
In the following numerical results, the number of subdomains has been chosen the same for all the MPM preconditioning scheme implementations, in order to maintain the same number of required floating point operations in each solution process and therefore to focus on examining the efficiency of the proposed method using a GPU cluster. The optimal parameters with respect to the hybrid (GPU + CPU) MPM scheme are used.

Figure 12.

The performance of the IPCG method, enhanced with the MPM preconditioning scheme (CPU-only case and hybrid case), for three different pairs of the number of CPUs and the number of subdomains is shown. (a) The number of CPUs (i.e. number of processes) is proportionate to the number of subdomains, where each subdomain is mapped to one thread of the CPU. (b) The number of subdomains is chosen according to the load-balancing algorithm of the hybrid (GPU + CPU) MPM preconditioning scheme, whereas the number of CPUs increases according to the powers of two. (c) More compute nodes, and consequently more CPUs, are used, so that the subdomains selected by the load-balancing algorithm can be mapped one-to-one to CPU threads. GPU: graphics processing unit; IPCG: inexact preconditioned CG; MPM: multi-projection method.

Table 1.

The convergence behavior of the IPCG enhanced with the MPM preconditioning scheme of the 2-D Poisson problem ( $n = 4, 194, 304$ ) for different number of subdomains.

Subdomains	Iterations	Subdomains	Iterations
28	231	20	234
56	209	40	230
104	185	80	199
208	166	160	166
384	141	320	141

IPCG: inexact preconditioned CG; MPM: multi-projection method.

The strong scalability diagram (logarithmic scale) of the proposed scheme for the 2-D Poisson problem ( $n = 4, 194, 304$ ) is depicted in Figure 13. The convergence behavior of the IPCG, enhanced with the MPM preconditioning scheme, for the strong scalability diagram is given in Table 2. The performance (logarithmic scale) of the MPM preconditioning step for the 2-D Poisson problem ( $n = 4, 194, 304$ ) is presented in Figure 14. It should be noted that the number of required iterations for convergence to the prescribed tolerance decreases as the number of subdomains is increased. The improved convergence behavior of the IPCG, enhanced with the MPM preconditioning scheme, is a result of the increased number of aggregated components, corresponding to the subdomains, in the local linear systems.

Figure 13.

The strong scalability diagram (logarithmic scale) of the IPCG enhanced with the MPM preconditioning scheme for the 2-D Poisson problem ( $n = 4, 194, 304$ ) is shown. IPCG: inexact preconditioned CG; MPM: multi-projection method.

Table 2.

The convergence behavior of the IPCG enhanced with the MPM preconditioning scheme for the strong scalability diagram of the 2-D Poisson problem ( $n = 4, 194, 304$ ).

2-D Poisson strong scalability convergence behavior
Processes	2	4	8	16	32
Subdomains	28	56	104	208	384
Iterations	231	209	185	166	141

IPCG: inexact preconditioned CG; MPM: multi-projection method.

Figure 14.

The performance (logarithmic scale) of the MPM preconditioning step for the 2-D Poisson problem ( $n = 4, 194, 304$ ) is presented. MPM: multi-projection method.

The strong scalability of the IPCG method, enhanced with the hybrid (GPU + CPU) MPM preconditioning scheme, is improved compared to the CPU-only case. The improved performance of the proposed method relies on accelerating the MPM preconditioning step by utilizing the GPUs according to the proposed load-balancing algorithm (Algorithm 2). The performance of the MPM preconditioning step, for the hybrid CPU/GPU case, is improved by a factor of up to 1.7×. In the case of a small number of processes, the GPU-only case performs better than the hybrid scheme. This occurs because the load-balancing algorithm was designed to ensure that the GPU would not be slower than one core of the CPU, whereas in these cases $(n_{cores} - 1 + w) T_{GPU} < T_{CPU}$. More specifically, the execution time required by the GPU to solve $n_{cores} - 1 + w$ local linear systems is less than the execution time required by a core of the CPU to solve one linear system. The assumptions made for the load-balancing algorithm are not met in this case; hence, the hybrid scheme does not perform better than the GPU-only case. The speedup of the hybrid MPM preconditioning scheme is superlinear, since the number of required iterations for convergence to the prescribed tolerance is reduced as the number of subdomains/CPUs/GPUs increases. Additionally, the MGenFAspI-SFASI local matrices G_j and H_j have fewer nonzero elements as the size of the local coefficient matrices decreases; therefore, fewer floating point operations are required in each sparse matrix–vector product of the preconditioning step.
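The condition under which the GPU-only case is expected to outperform the hybrid scheme can be checked directly from measured timings. The sketch below only evaluates the inequality stated above; the function name and the timing values in the usage note are hypothetical and do not correspond to measurements reported here.

```python
def gpu_outpaces_single_core(n_cores, w, t_gpu, t_cpu):
    """Return True when the GPU solves n_cores - 1 + w local linear systems
    faster than one CPU core solves a single local linear system.

    t_gpu: time for the GPU to solve one local linear system
    t_cpu: time for one CPU core to solve one local linear system
    """
    return (n_cores - 1 + w) * t_gpu < t_cpu
```

For example, with 10 cores, w = 5, and hypothetical timings of 0.01 s per system on the GPU and 0.2 s per system on a core, the inequality holds (0.14 s < 0.2 s), so the assumptions of the load-balancing algorithm would not be met.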

The time required for transferring the right-hand-side vectors from the CPU to the GPU in each MPM preconditioning step for the 2-D Poisson problem ( $n = 4,194,304$ ) is given in Table 3. The presented timings were computed as the mean value of the time measurements over the whole solution process. The data transfer time is insignificant compared to the total time of the preconditioning step. The time for transferring the solution vectors from the GPU to the CPU is not reported in the numerical results, since it is overlapped by CPU computations, as described in Algorithm 3.

Table 3.

The time required for transferring the right-hand-side vectors from the CPU to the GPU in each MPM preconditioning step for the 2-D Poisson problem ( $n = 4, 194, 304$ ).

MPM-hybrid (GPU + CPU)
Processes	Subdomains	w	$T_{p r e c}$	$T_{t r a n s f e r}$	Percentage
2	28	5	2.566	9.23E-03	0.36
4	56	5	1.21	8.22E-03	0.68
8	104	4	0.47	5.52E-03	1.17
16	208	4	0.1781	1.06E-03	0.60
32	384	3	0.06636	3.20E-04	0.48

MPM: multi-projection method; GPU: graphics processing unit.
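The percentage column of Table 3 is the ratio of the transfer time to the preconditioning time, expressed in percent. The following sketch recomputes it from the tabulated timings; the variable and function names are illustrative.

```python
# (subdomains, T_prec, T_transfer) taken from Table 3.
TABLE_3 = [
    (28, 2.566, 9.23e-03),
    (56, 1.21, 8.22e-03),
    (104, 0.47, 5.52e-03),
    (208, 0.1781, 1.06e-03),
    (384, 0.06636, 3.20e-04),
]


def transfer_percentage(t_prec, t_transfer):
    """Share of the preconditioning time spent on the CPU-to-GPU transfer."""
    return 100.0 * t_transfer / t_prec


for subdomains, t_prec, t_transfer in TABLE_3:
    print(subdomains, round(transfer_percentage(t_prec, t_transfer), 2))
```

In every configuration the transfer accounts for at most about 1.2% of the preconditioning step.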

The weak scaling of the proposed scheme has been examined by solving the 2-D Poisson problem with $400,000$ unknowns per process. The sizes of the coefficient matrices A and the convergence behavior of the proposed scheme for the weak scalability diagram of the 2-D Poisson problem are given in Table 4. The IPCG method, enhanced with the MPM preconditioning scheme, retains an almost constant number of iterations for convergence to the prescribed tolerance, despite the increase in the problem size required for the weak scalability diagram of the 2-D Poisson problem. In Figure 15, the weak scalability diagram (logarithmic scale) of the IPCG method, in conjunction with the MPM preconditioning scheme, is presented. The hybrid MPM scheme is almost 1.5× faster than the CPU case, even for large problem sizes. It should be stated that the proposed hybrid method requires only 0.34× more time to solve a 16× larger problem. Moreover, it should be noted that the GPU case is slower than the CPU case for the following reasons: (a) the required communication operations concerning the transfer of the right-hand-side vectors from the CPU memory to the GPU memory; (b) the fact that the GPU solves the assigned local linear systems one after the other using fine-grained parallelism, that is, row-wise parallelism in the SpMV operation; and (c) the asymmetric workload on the GPU threads, due to the different number of nonzero elements in the rows of the local coefficient matrices and the local MGenFAspI-SFASI matrices.

Table 4.

The sizes of the coefficient matrices A and the convergence behavior of the IPCG method, enhanced with the MPM preconditioning scheme, for the weak scalability diagram of the 2-D Poisson problem.

2-D Poisson weak scalability convergence behavior
Processes	2	4	8	16	32
Size	799,236	1,597,696	3,196,944	6,395,841	12,794,929
Subdomains	26	52	104	208	416
Iterations	165	177	171	176	177

IPCG: inexact preconditioned CG; MPM: multi-projection method.

Figure 15.

The weak scalability diagram (logarithmic scale) of the IPCG enhanced with the MPM preconditioning scheme for the 2-D Poisson problem ( $400, 000$ unknowns per process) is shown. IPCG: inexact preconditioned CG; MPM: multi-projection method.

5.2. 3-D Poisson problem

In order to further examine the convergence behavior and the parallel performance of the proposed method, the Poisson equation in three space variables discretized with the finite differences method has been considered. In Figure 16, the strong scalability diagram (logarithmic scale) of the IPCG method, enhanced with the MPM preconditioning scheme, for the 3-D Poisson problem ( $n = 16, 777, 216$ ) is depicted. It should be stated that in the GPU-only scheme, the performance for two processes is missing due to excessive memory requirements for GPU memory. The convergence behavior of the proposed scheme for the 3-D Poisson strong scalability diagram is given in Table 5. The corresponding performance of the MPM preconditioning step for the 3-D Poisson problem ( $n = 16, 777, 216$ ) is presented in Figure 17. For the 3-D Poisson problem, as in the 2-D case, the proposed method has improved convergence behavior, as the number of subdomains is increased. The strong scaling of the IPCG, enhanced with the MPM preconditioning scheme, for the 3-D Poisson problem is improved, compared to the CPU-only case, by exploiting the GPUs of each workstation. The load-balancing algorithm leads to improved performance since it allows the computations to be performed concurrently in CPUs and GPUs. Furthermore, the proposed method presents superlinear speedup, since the number of iterations for convergence to the prescribed tolerance decreases (Table 5). As the number of subdomains increases along with the number of processes, the size of the local linear systems is reduced, leading to significantly less required computational work for solving the local linear systems with the preconditioned Krylov subspace method (CG or BiCGSTAB).

Figure 16.

The strong scalability diagram (logarithmic scale) of the IPCG enhanced with the MPM preconditioning scheme for the 3-D Poisson problem ( $n = 16, 777, 216$ ) is shown. IPCG: inexact preconditioned CG; MPM: multi-projection method.

Table 5.

The convergence behavior of the IPCG method, enhanced with the MPM preconditioning scheme, for the strong scalability diagram of the 3-D Poisson problem ( $n = 16, 777, 216$ ).

3-D Poisson strong scalability convergence behavior
Processes	2	4	8	16	32
Subdomains	32	60	112	224	416
Iterations	121	102	100	97	87

IPCG: inexact preconditioned CG; MPM: multi-projection method.

Figure 17.

The performance (logarithmic scale) of the MPM preconditioning step for the 3-D Poisson problem ( $n = 16, 777, 216$ ) is presented. MPM: multi-projection method.

Moreover, the weak scalability diagram (logarithmic scale) of the IPCG method, enhanced with the MPM preconditioning scheme, is presented in Figure 18. For each process added, the linear system is augmented by one million unknowns. The sizes of the coefficient matrices A and the convergence behavior of the proposed method for the weak scalability diagram of the 3-D Poisson problem are given in Table 6. It should be noted that the size of the linear system slightly affects the convergence behavior of the IPCG method, enhanced with the MPM preconditioning scheme. The time required for solving a linear system with 32 million unknowns is only 0.65× more than the time required for solving a linear system with 2 million unknowns, indicating the satisfactory weak scaling of the proposed method. In some cases, for instance in Figure 18, the hybrid (GPU + CPU) MPM preconditioning scheme is slightly more than twice as fast as the GPU case, because the GPU case involves an increased number of offload operations, which cause a further delay. Additionally, the local linear systems assigned to the GPUs are solved one after the other using fine-grained parallelism, which results in an asymmetric workload on the GPU threads.

Figure 18.

The weak scalability diagram (logarithmic scale) of the IPCG enhanced with the MPM preconditioning scheme for the 3-D Poisson problem (1 million unknowns per process) is shown. IPCG: inexact preconditioned CG; MPM: multi-projection method.

Table 6.

The sizes of the coefficient matrices A and the convergence behavior of the IPCG method, enhanced with the MPM preconditioning scheme for the weak scalability diagram of the 3-D Poisson problem.

3-D Poisson weak scalability convergence behavior
Processes	2	4	8	16	32
Size	1,953,125	3,944,312	7,880,599	15,813,251	31,855,013
Subdomains	28	56	112	224	448
Iterations	77	84	94	98	93

IPCG: inexact preconditioned CG; MPM: multi-projection method.

5.3. SuiteSparse collection matrices

Numerical experiments were performed with various sparse coefficient matrices, obtained from the SuiteSparse matrix collection (Davis and Hu, 2011), in order to examine the applicability, the convergence behavior, and the performance of the proposed method. The sizes and the numbers of nonzero elements of the SuiteSparse coefficient matrices are given in Table 7. The cage14, the cage15, and the circuit5M_dc coefficient matrices are unsymmetric and are therefore solved by the FGMRES(m) method enhanced with the MPM preconditioning scheme, whereas the thermal2 coefficient matrix is symmetric and is solved by the IPCG method along with the MPM preconditioning scheme.

Table 7.

The sizes and the nonzero elements of the SuiteSparse coefficient matrices.

Name	Size	nnz(A)
cage14	1,505,785	27,130,349
cage15	5,154,859	99,199,551
circuit5M_dc	3,523,317	14,865,409
thermal2	1,228,045	8,580,313

The convergence behavior of the FGMRES(m)-IPCG method, enhanced with the MPM preconditioning scheme, for the SuiteSparse matrices is presented in Table 8. The strong scalability diagrams (logarithmic scale) of the proposed method for the SuiteSparse matrices, given in Table 7, are depicted in Figure 19. The performance (logarithmic scale) of the MPM preconditioning step for the SuiteSparse matrices, given in Table 7, is presented in Figure 20. The thermal2 problem case has been solved using up to 16 processes instead of 32, since its size and number of nonzero elements are substantially smaller. In the cases of the cage14, cage15, and circuit5M_dc problems, the number of required iterations for convergence to the prescribed tolerance does not decrease as the number of subdomains increases, whereas it decreases for the thermal2 problem case. The performance and the strong scalability of the proposed method are improved using the hybrid CPU/GPU MPM preconditioning scheme, compared to the CPU-only and the GPU-only schemes, for every problem case and number of processes. Therefore, the load-balancing algorithm is suitable for general matrices and provides satisfactory speedup, especially for a large number of processes.

Table 8.

The convergence behavior of the FGMRES(m)-IPCG method in conjunction with the MPM preconditioning scheme for the SuiteSparse matrices, as given in Table 7.

cage14		cage15
Subdomains	Iterations	Subdomains	Iterations
32	0(9)	48	0(9)
60	0(10)	72	0(10)
104	0(11)	120	0(11)
208	0(10)	224	0(10)
384	0(11)	416	0(10)
circuit5M_dc		thermal2
Subdomains	Iterations	Subdomains	Iterations
34	1(2)	26	258
60	1(3)	52	178
112	1(4)	96	169
208	1(16)	192	169
384	1(17)	—	—

FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; IPCG: inexact preconditioned CG; MPM: multi-projection method.

Figure 19.

The strong scalability diagram (logarithmic scale) of the FGMRES(m)-IPCG enhanced with the MPM preconditioning scheme for the SuiteSparse matrices, as given in Table 7, is depicted: (a) cage14, (b) cage15, (c) circuit5M_dc, and (d) thermal2. IPCG: inexact preconditioned CG; FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method.

Figure 20.

The performance (logarithmic scale) of the MPM preconditioning step for the SuiteSparse matrices, as given in Table 7, is presented: (a) cage14, (b) cage15, (c) circuit5M_dc, and (d) thermal2. MPM: multi-projection method.

5.4. Comparison with local direct solver

The MPM can be implemented either by solving the local linear systems with a preconditioned Krylov subspace method or by using a direct method, such as the PARDISO solver (Schenk and Gärtner, 2004). Numerical results concerning the latter case are presented in Moutafis et al. (2017a); a comparison with the proposed hybrid (GPU + CPU) MPM method is given in this subsection. It is well known that direct methods require excessive memory and substantial computational work, especially for large coefficient matrices derived from 3-D problems. The proposed hybrid (GPU + CPU) MPM method has lower memory requirements, allowing the solution of larger linear systems. The number of subdomains for the direct variant of the MPM preconditioning scheme has been selected such that each subdomain is mapped to one core of the distributed memory system, whereas for the hybrid scheme it is selected according to the load-balancing algorithm. These choices have been made in order to present a fairer comparison between the two methods. For optimal performance, the hybrid (GPU + CPU) MPM scheme should be used with the number of subdomains that results from the load-balancing scheme, whereas the classical MPM scheme, with a direct solver for the local linear systems, performs better by mapping one subdomain to one core of each processor. Assigning the same number of subdomains in both cases would degrade the performance of the classical MPM scheme, since some cores would factorize (preprocessing phase) and solve (preconditioning step) two linear systems instead of one. Therefore, significantly more floating-point operations would be performed by some cores, leading to an unbalanced workload among the CPU cores.
In Figure 21, the strong scalability diagram (logarithmic scale) of the total time for solving the 3-D Poisson problem ( $n = 9, 261, 000$ ) using the IPCG method, enhanced with the hybrid (GPU + CPU) MPM preconditioning scheme, and the IPCG method, enhanced with the direct MPM preconditioning scheme, is shown. The convergence behavior of the IPCG method, enhanced with the hybrid and the direct MPM preconditioning schemes, for the 3-D Poisson problem ( $n = 9, 261, 000$ ) is presented in Table 9. It should be noted that the hybrid MPM preconditioning scheme has shown slightly better convergence behavior than the direct MPM preconditioning scheme, since it uses more subdomains and therefore more aggregated components are involved in the local linear systems. Moreover, the hybrid MPM preconditioning scheme is more efficient in solving the linear system than the direct MPM preconditioning scheme, especially in the cases of larger local linear systems.

Figure 21.

The strong scalability diagram (logarithmic scale) of the total time for solving the 3-D Poisson problem ( $n = 9, 261, 000$ ) using the IPCG method, enhanced with the hybrid (GPU + CPU) MPM preconditioning scheme, and the IPCG method, enhanced with the direct MPM preconditioning scheme, is presented. GPU: graphics processing unit. IPCG: inexact preconditioned CG; MPM: multi-projection method.

Table 9.

The convergence behavior of the IPCG method, enhanced with the hybrid and the direct MPM preconditioning schemes, for the 3-D Poisson problem ( $n = 9, 261, 000$ ).

	MPM-hybrid (GPU + CPU)		MPM-direct
Processes	Subdomains	Iterations	Subdomains	Iterations
4	52	95	40	98
8	96	90	80	90
16	192	88	160	91
32	384	77	320	82

IPCG: inexact preconditioned CG; MPM: multi-projection method; GPU: graphics processing unit.

5.5. Comparison with RAS method

The proposed hybrid (GPU + CPU) MPM preconditioning scheme is compared to the RAS preconditioning scheme (Cai and Sarkis, 1999) in order to further examine the convergence behavior, the performance, and the scalability. The RAS scheme is implemented for hybrid (GPU + CPU) workstations and is combined with the FGMRES(m) method for solving a two-dimensional heterogeneous convection–diffusion problem (2-D hconvdiff) and a three-dimensional heterogeneous convection–diffusion problem (3-D hconvdiff). The level of overlap is chosen to be either one (RAS-1) or two (RAS-2). The tolerance of the local preconditioned BiCGSTAB method is set to $10^{-6}$.
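For reference, one application of a restricted additive Schwarz preconditioner can be sketched as follows: each subdomain is extended by the chosen overlap, the local system is solved on the extended index set, and only the values belonging to the original (non-overlapping) part are kept. The sketch is a simplified illustration, not the implementation used in the experiments: it assumes a small dense matrix, contiguous one-dimensional index ranges so that the overlap can be added by shifting the range ends, and a dense direct local solve in place of the local preconditioned BiCGSTAB method. All function names are hypothetical.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting (small dense systems only)."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ras_apply(A, r, owned, overlap):
    """Apply z = sum_i R~_i^T A_i^{-1} R_i r for contiguous index sets.

    owned:   list of non-overlapping, contiguous index lists
    overlap: number of indices added on each side of every subdomain
    """
    n = len(r)
    z = [0.0] * n
    for own in owned:
        lo = max(min(own) - overlap, 0)
        hi = min(max(own) + overlap, n - 1)
        idx = list(range(lo, hi + 1))  # overlapping index set
        A_loc = [[A[i][j] for j in idx] for i in idx]
        y = solve_dense(A_loc, [r[i] for i in idx])
        for pos, i in enumerate(idx):
            if i in own:  # restricted prolongation: keep owned values only
                z[i] = y[pos]
    return z
```

With a single subdomain covering all unknowns, the sketch reduces to an exact solve, which provides a simple sanity check.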

The 2-D hconvdiff problem is derived from the following PDE

- ε \nabla (K \cdot \nabla u) + \vec{v} \cdot \nabla u = f \quad \text{in } Ω \equiv {[0, 1]}^{2}

u = 0 \quad \text{on } \partial Ω

where $ε$ is a scalar, K is a positive definite bounded tensor, and $\vec{v}$ is a velocity field defined on $Ω$ . The diffusion coefficients in the x- and y-direction are described by the diagonal matrix K, which has piecewise constant function entries, as follows

a (x, y) = b (x, y) = \begin{cases} 1 & \text{in } Ω_{1} \cup Ω_{3} \cup Ω_{5} \\ 10^{3} & \text{in } Ω_{2} \cup Ω_{4} \cup Ω_{6} \end{cases}

It should be noted that the domains $Ω_{i}, i \in [1, 6]$ are partitions of the domain $Ω$ and are defined by the following expressions

\begin{matrix} Ω_{1} & \equiv & [0, \frac{1}{3}) \times [0, \frac{1}{2}) \\ Ω_{2} & \equiv & [\frac{1}{3}, \frac{2}{3}) \times [0, \frac{1}{2}) \\ Ω_{3} & \equiv & [\frac{2}{3},1] \times [0, \frac{1}{2}) \\ Ω_{4} & \equiv & [0, \frac{1}{3}) \times [\frac{1}{2},1] \\ Ω_{5} & \equiv & [\frac{1}{3}, \frac{2}{3}) \times [\frac{1}{2},1] \\ Ω_{6} & \equiv & [\frac{2}{3},1] \times [\frac{1}{2},1] \end{matrix}

The v_x and v_y components of the vector $\vec{v}$ are given by the following equations

v_{x} = (x - x^{2}) (2 y - 1)

v_{y} = (y - y^{2}) (2 x - 1)

defining a convection term that is governed by a circular flow in the xy-direction. The five-point stencil of the finite difference method is used as discretization method. The 2-D hconvdiff problem $(n = 4, 194, 304)$ has been solved for $ε {= 10}^{- 1}$ and $ε {= 10}^{- 2}$ . In Table 10, the convergence behavior of the FGMRES(m) method, enhanced with the MPM, the RAS-1, and the RAS-2 scheme, for both the cases of 2-D hconvdiff problem $(n = 4, 194, 304)$ with $ε {= 10}^{- 1}$ and $ε {= 10}^{- 2}$ , is presented. The strong scalability diagrams (logarithmic scale) of the FGMRES(m) along with the MPM, the RAS-1, and the RAS-2 scheme, for solving the 2-D hconvdiff problem $(n = 4, 194, 304)$ with $ε {= 10}^{- 1}$ and $ε {= 10}^{- 2}$ , are depicted in Figure 22(a) and (b), respectively.
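The coefficients of the 2-D hconvdiff problem can be evaluated pointwise as in the following sketch. It assumes that the diffusion coefficient equals 1 in the subdomains $Ω_{1}$, $Ω_{3}$, and $Ω_{5}$ and $10^{3}$ in $Ω_{2}$, $Ω_{4}$, and $Ω_{6}$ (the subdomains are disjoint, so each group is read as a union); the function names are illustrative.

```python
def region(x, y):
    """Index i (1..6) of the subdomain Omega_i containing (x, y) in [0, 1]^2."""
    col = 0 if x < 1.0 / 3.0 else (1 if x < 2.0 / 3.0 else 2)
    row = 0 if y < 0.5 else 1
    return 3 * row + col + 1


def diffusion(x, y):
    """Piecewise constant coefficient a(x, y) = b(x, y)."""
    return 1.0 if region(x, y) % 2 == 1 else 1.0e3


def velocity(x, y):
    """Components (v_x, v_y) of the convection field."""
    v_x = (x - x * x) * (2.0 * y - 1.0)
    v_y = (y - y * y) * (2.0 * x - 1.0)
    return v_x, v_y
```

The velocity field vanishes at the center of the domain, and the diffusion coefficient alternates between neighboring subdomains in a checkerboard pattern.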

Table 10.

The convergence behavior of the FGMRES(m) method, enhanced with the MPM, the RAS-1, and the RAS-2 scheme, for both cases of the 2-D hconvdiff problem $(n = 4{,}194{,}304)$ with $ε = 10^{-1}$ and $ε = 10^{-2}$.

	2-D hconvdiff Strong scalability convergence behavior
	Processes	2	4	8	16	32
	Subdomains	30	48	96	192	384
$ε = 10^{-1}$	MPM	25(7)	19(16)	32(13)	10(19)	8(10)
	RAS-1	29(15)	34(2)	40(12)	46(3)	66(16)
	RAS-2	18(7)	21(3)	28(6)	34(15)	44(5)
$ε = 10^{-2}$	MPM	27(3)	44(9)	15(14)	11(8)	8(16)
	RAS-1	24(6)	21(3)	40(19)	43(3)	52(15)
	RAS-2	14(7)	14(11)	23(6)	31(15)	39(11)

FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; 2-D hconvdiff: two-dimensional heterogeneous convection–diffusion problem; MPM: multi-projection method.

Figure 22.

The strong scalability diagrams (logarithmic scale) of the FGMRES(m) method along with the MPM, the RAS-1, and the RAS-2 scheme, for solving the heterogeneous convection–diffusion 2-D problem $(n = 4{,}194{,}304)$ with (a) $ε = 10^{-1}$ and (b) $ε = 10^{-2}$. FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method.

For larger numbers of subdomains, the MPM preconditioning scheme presents improved convergence behavior for solving the 2-D hconvdiff problem with both $ε = 10^{-1}$ and $ε = 10^{-2}$, compared to the RAS-1 and RAS-2 schemes. The increase in the number of subdomains augments the local linear systems of the MPM scheme with more aggregated (coarse) components, leading to a better approximation of the exact solution vector. On the contrary, the convergence behavior of the overlapping RAS schemes degrades as the number of subdomains increases. The convergence behavior of the preconditioning schemes is reflected in the performance, where the MPM scheme achieves significantly better execution times than the RAS schemes for large numbers of subdomains. The decrease in the number of iterations required for convergence to the prescribed tolerance is responsible for the superlinear speedup shown by the MPM scheme. Moreover, comparative weak scaling numerical experiments concerning the MPM, the RAS-1, and the RAS-2 preconditioning scheme have been conducted for the 2-D hconvdiff problem. The size of the problem is increased by approximately $400{,}000$ unknowns per process. The convergence behavior of the FGMRES(m) method, in conjunction with the MPM, the RAS-1, and the RAS-2 scheme, for the weak scaling experiment of the 2-D hconvdiff problem with $ε = 10^{-1}$ and $ε = 10^{-2}$, is given in Table 11. The weak scalability diagrams (logarithmic scale) of the FGMRES(m) method, enhanced with the MPM, the RAS-1, and the RAS-2 scheme, for solving the 2-D hconvdiff problem with $ε = 10^{-1}$ and $ε = 10^{-2}$, are depicted in Figure 23(a) and (b), respectively.

Table 11.

The convergence behavior of the FGMRES(m) method, in conjunction with the MPM, the RAS-1, and the RAS-2 scheme, for the weak scaling experiment of the 2-D hconvdiff problem with $ε = 10^{-1}$ and $ε = 10^{-2}$.

	2-D hconvdiff Weak scalability convergence behavior
	Processes	2	4	8	16	32
	Subdomains	30	48	96	192	384
	Size	799,236	1,597,696	3,196,944	6,395,841	12,794,929
$ε = 10^{-1}$	MPM	10(5)	11(4)	11(4)	14(13)	36(15)
	RAS-1	14(3)	24(5)	40(14)	72(12)	>100
	RAS-2	8(13)	16(3)	25(6)	37(7)	63(12)
$ε = 10^{-2}$	MPM	10(17)	12(12)	16(6)	14(5)	>100
	RAS-1	10(6)	16(8)	34(11)	47(16)	>100
	RAS-2	8(18)	14(7)	21(20)	32(3)	72(9)

FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; 2-D hconvdiff: two-dimensional heterogeneous convection–diffusion problem; MPM: multi-projection method.
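The problem sizes of the 2-D weak scaling experiment can be cross-checked with a short sketch. It takes the sizes and process counts from Table 11, tests whether each size is a perfect square (that is, a square grid of unknowns) and reports the unknowns per process. The helper name `grid_points_per_direction` is hypothetical, and interpreting the sizes as square grids is an inference from the numbers, not a statement made in the text.

```python
processes = [2, 4, 8, 16, 32]
sizes_2d = [799_236, 1_597_696, 3_196_944, 6_395_841, 12_794_929]


def grid_points_per_direction(n, dim):
    """Return m with m**dim == n, or None if n is not a perfect dim-th power."""
    guess = round(n ** (1.0 / dim))
    # Check the neighbours of the floating-point guess with exact integer arithmetic.
    for cand in (guess - 1, guess, guess + 1):
        if cand ** dim == n:
            return cand
    return None


# One tuple per experiment: (processes, grid points per direction, unknowns per process).
summary = [
    (p, grid_points_per_direction(n, 2), n / p)
    for p, n in zip(processes, sizes_2d)
]

for p, m, per_process in summary:
    print(f"{p:2d} processes: {m} x {m} grid, {per_process:.1f} unknowns per process")
```

Each size turns out to be a perfect square, and the load per process stays slightly below 400,000 unknowns in all five experiments.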

Figure 23.

The weak scalability diagrams (logarithmic scale) of the FGMRES(m) method, enhanced with the MPM, the RAS-1, and the RAS-2 scheme, for solving the heterogeneous convection–diffusion 2-D problem with (a) $ε = 10^{-1}$ and (b) $ε = 10^{-2}$. FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method.

In the 2-D hconvdiff weak scaling experiment, the convergence behavior of the proposed method is improved compared to the RAS schemes in most cases. In the case of 32 processes (384 subdomains) and $ε = 10^{-2}$, the MPM scheme exceeds the maximum number of iterations without converging to the prescribed tolerance. A smaller tolerance in the local solvers or a larger number of subdomains is required to avoid this failure of convergence. The RAS schemes show degrading convergence behavior as the size of the problem and the number of subdomains increase, whereas the MPM scheme shows a more moderate increase in the number of required iterations up to 16 processes. The weak scaling of the proposed method is satisfactory and considerably better than that of the RAS-1 and RAS-2 schemes.

Furthermore, the 3-D hconvdiff problem is derived from the following PDE

- ε \nabla (K \cdot \nabla u) + \vec{v} \cdot \nabla u = f in Ω \equiv {[0, 1]}^{3}

u = 0 on \partial Ω

where $ε$ is a scalar, $K$ is a positive definite bounded tensor, and $\vec{v}$ is a velocity field defined on $Ω$.

The matrix K is diagonal with piecewise constant function entries, described by the following set of diffusion coefficients in the x-, y-, and z-direction

a (x, y, z) = b (x, y, z) = c (x, y, z) = \begin{cases} 1 & in Ω_{1} \cup Ω_{3} \cup Ω_{5} \\ 10^{3} & in Ω_{2} \cup Ω_{4} \cup Ω_{6} \end{cases}

The subdomains $Ω_{i}$, $i = 1, \ldots, 6$, partition the domain $Ω$ and are defined by the following expressions

\begin{matrix} Ω_{1} & \equiv & [0, \frac{1}{3}) \times [0, \frac{1}{2}) \times [0, 1] \\ Ω_{2} & \equiv & [\frac{1}{3}, \frac{2}{3}) \times [0, \frac{1}{2}) \times [0, 1] \\ Ω_{3} & \equiv & [\frac{2}{3},1] \times [0, \frac{1}{2}) \times [0, 1] \\ Ω_{4} & \equiv & [0, \frac{1}{3}) \times [\frac{1}{2},1] \times [0, 1] \\ Ω_{5} & \equiv & [\frac{1}{3}, \frac{2}{3}) \times [\frac{1}{2},1] \times [0, 1] \\ Ω_{6} & \equiv & [\frac{2}{3},1] \times [\frac{1}{2},1] \times [0, 1] \end{matrix}

The $v_{x}$, $v_{y}$, and $v_{z}$ components of the vector $\vec{v}$ are given as follows

v_{x} = (x - x^{2}) (2 y - 1)

v_{y} = (y - y^{2}) (2 x - 1)

v_{z} = \sin (π z)

defining a convection term that is governed by a circular flow in the xy-plane and a sinusoidal flow in the z-direction. The seven-point stencil of the finite difference method is used for the discretization. The 3-D hconvdiff problem $(n = 16{,}777{,}216)$ has been solved for $ε = 10^{-1}$ and $ε = 10^{-2}$ with the FGMRES(m) method, enhanced with the MPM, the RAS-1, and the RAS-2 scheme. The convergence behavior of the FGMRES(m) method, along with the MPM, the RAS-1, and the RAS-2 scheme, for the 3-D hconvdiff problem $(n = 16{,}777{,}216)$ with $ε = 10^{-1}$ and $ε = 10^{-2}$, is given in Table 12. The strong scalability diagrams (logarithmic scale) of the FGMRES(m) method along with the MPM, the RAS-1, and the RAS-2 scheme, for solving the 3-D hconvdiff problem $(n = 16{,}777{,}216)$ with $ε = 10^{-1}$ and $ε = 10^{-2}$, are depicted in Figure 24(a) and (b), respectively.
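As an illustration of the 3-D setting, the sketch below evaluates the velocity field defined above and lists the matrix columns touched by one row of the seven-point stencil. The lexicographic ordering of the unknowns (x fastest) and the grid of m = 256 nodes per direction (256³ = 16,777,216) are assumptions chosen to be consistent with the reported problem size; the names `velocity_3d` and `stencil_columns` are hypothetical.

```python
import math


def velocity_3d(x, y, z):
    """Velocity components (v_x, v_y, v_z) as defined in the text."""
    return (
        (x - x * x) * (2.0 * y - 1.0),
        (y - y * y) * (2.0 * x - 1.0),
        math.sin(math.pi * z),
    )


# Offsets of the seven-point stencil: the node itself and its six axis neighbours.
OFFSETS = [
    (0, 0, 0),
    (-1, 0, 0), (1, 0, 0),
    (0, -1, 0), (0, 1, 0),
    (0, 0, -1), (0, 0, 1),
]


def stencil_columns(i, j, k, m):
    """Global column indices touched by the row of node (i, j, k), 0 <= i, j, k < m.

    Assumes lexicographic ordering with x running fastest. Neighbours that fall
    on the Dirichlet boundary carry u = 0 and are therefore dropped.
    """
    cols = []
    for di, dj, dk in OFFSETS:
        a, b, c = i + di, j + dj, k + dk
        if 0 <= a < m and 0 <= b < m and 0 <= c < m:
            cols.append(a + m * (b + m * c))
    return cols


m = 256         # assumed number of unknowns per direction
n = m ** 3      # total number of unknowns
```

Under these assumptions an interior row has seven nonzero entries, while rows of nodes next to the boundary have fewer.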

Table 12.

The convergence behavior of the FGMRES(m) method, along with the MPM, the RAS-1, and the RAS-2 scheme, for the 3-D hconvdiff problem $(n = 16{,}777{,}216)$ with $ε = 10^{-1}$ and $ε = 10^{-2}$.

	3-D hconvdiff Strong scalability convergence behavior
	Processes	2	4	8	16	32
	Subdomains	30	56	112	208	384
$ε = 10^{-1}$	MPM	6(5)	5(14)	5(9)	5(3)	4(9)
	RAS-1	4(5)	4(18)	5(8)	5(12)	6(8)
	RAS-2	3(4)	3(14)	4(3)	4(8)	4(11)
$ε = 10^{-2}$	MPM	8(4)	8(18)	7(3)	7(10)	6(18)
	RAS-1	3(16)	4(2)	4(15)	5(16)	6(3)
	RAS-2	3(1)	3(2)	3(12)	3(20)	4(5)

FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; 3-D hconvdiff: three-dimensional heterogeneous convection–diffusion problem; MPM: multi-projection method.

Figure 24.

The strong scalability diagrams (logarithmic scale) of the FGMRES(m) method along with the MPM, the RAS-1, and the RAS-2 scheme, for solving the heterogeneous convection–diffusion 3-D problem $(n = 16{,}777{,}216)$ with (a) $ε = 10^{-1}$ and (b) $ε = 10^{-2}$. FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method.

In the case of $ε = 10^{-1}$, the MPM scheme requires fewer iterations for convergence to the prescribed tolerance than the RAS-1 scheme from 16 processes onward and fewer than the RAS-2 scheme for 32 processes. In the case of $ε = 10^{-2}$, the RAS schemes have better convergence behavior than the MPM scheme for the 3-D hconvdiff problem. However, according to the trend of the required iterations as the number of subdomains increases, better convergence behavior of the MPM scheme is expected for more than 32 processes. The MPM scheme presents satisfactory performance compared to the RAS schemes, especially for larger numbers of subdomains.

The corresponding weak scalability numerical experiment has been conducted for the 3-D hconvdiff problem, where the linear system is augmented by approximately 1,000,000 unknowns per process. In Table 13, the convergence behavior of the FGMRES(m) method, along with the MPM, the RAS-1, and the RAS-2 scheme, concerning the 3-D hconvdiff problem with $ε = 10^{-1}$ and $ε = 10^{-2}$, is presented. The weak scalability diagrams (logarithmic scale) of the FGMRES(m) method, in conjunction with the MPM, the RAS-1, and the RAS-2 scheme, for solving the 3-D hconvdiff problem for $ε = 10^{-1}$ and $ε = 10^{-2}$, are illustrated in Figure 25(a) and (b), respectively.

Table 13.

The convergence behavior of the FGMRES(m) method, along with the MPM, the RAS-1, and the RAS-2 scheme, concerning the weak scaling experiment of the 3-D hconvdiff problem with $ε = 10^{-1}$ and $ε = 10^{-2}$.

	3-D hconvdiff Weak scalability convergence behavior
	Processes	2	4	8	16	32
	Subdomains	24	48	96	208	448
	Size	1,953,125	3,944,312	7,880,599	15,813,251	31,855,013
$ε = 10^{-1}$	MPM	4(5)	4(5)	4(19)	4(19)	4(19)
	RAS-1	2(19)	3(15)	4(12)	5(19)	7(16)
	RAS-2	2(2)	2(10)	3(10)	4(8)	5(2)
$ε = 10^{-2}$	MPM	4(12)	5(6)	6(14)	7(9)	8(8)
	RAS-1	2(10)	3(3)	4(2)	5(10)	7(2)
	RAS-2	1(15)	2(5)	2(18)	3(19)	4(19)

FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method; 3-D hconvdiff: three-dimensional heterogeneous convection–diffusion problem.

Figure 25.

The weak scalability diagrams (logarithmic scale) of the FGMRES(m) method, in conjunction with the MPM, the RAS-1, and the RAS-2 scheme, for solving the heterogeneous convection–diffusion 3-D problem for (a) $ε = 10^{-1}$ and (b) $ε = 10^{-2}$. FGMRES(m): flexible GMRES(m): Generalized Minimum RESidual restarted; MPM: multi-projection method.

The MPM scheme requires an almost constant number of iterations for convergence to the prescribed tolerance, especially in the $ε = 10^{-1}$ case, whereas the convergence behavior of the RAS schemes worsens as the size of the problem increases. The performance of the MPM scheme in the weak scaling experiment is closer to the ideal weak scalability curve, whereas the performance of the RAS schemes degrades for larger linear systems. It should be stated that the MPM scheme is expected to perform even better on a larger GPU cluster. Numerical results of the classical MPM scheme, that is, the CPU case with a direct solver for the local linear systems, on a larger scale supercomputing system are given in Moutafis et al. (2017a).

The comparison with the RAS preconditioning schemes indicates that the proposed method is suitable for large-scale GPU clusters, since its convergence behavior improves as the number of subdomains increases and it presents satisfactory weak scaling.

6. Conclusion

The proposed hybrid MPM enhanced with generic factored approximate sparse inverse matrices has been shown to have satisfactory performance and convergence behavior for various model problems, retaining an almost constant number of iterations for convergence to the prescribed tolerance for increasing numbers of subdomains. The hybrid MPM scheme, along with the proposed load-balancing algorithm, exploits the CPUs and GPUs of the distributed memory system appropriately, leading to speedups of up to 1.7× over the CPU-only case and up to 2× over the GPU-only case, especially for large numbers of subdomains. The BiCGSTAB-CG method, enhanced with the MGenFAspI-SFASI preconditioner, has been used for solving the local linear systems, since it can be parallelized efficiently in both the CPU and GPU environments. Furthermore, the use of coprocessors, such as GPUs, enables the concurrent processing of a larger number of subdomains, leading to improved convergence behavior of the hybrid MPM scheme. Future work will concentrate on further improving the convergence behavior of the proposed method using different basis vectors in the semi-aggregation procedure. Moreover, advanced techniques based on communication avoidance and asynchronous communication will be combined with mixed precision computations on GPUs, with the aim of improving the performance of the FGMRES(m)-IPCG method along with the MPM preconditioning scheme. Future work will also focus on the study of an analytical performance model for balancing the workload among the workstations.

Acknowledgment

The authors acknowledge the Greek Research and Technology Network (GRNET) for the provision of the National HPC facility ARIS under project PR004033-ScaleSciCompII and the project PR006053-ScaleSciCompIII.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research work of Byron E Moutafis, as a PhD candidate, was funded by the General Secretariat for Research and Technology (GSRT) and the Hellenic Foundation for Research and Innovation (HFRI) (Grant-Code: 1609).

ORCID iD

George A Gravvanis

References

Anzt, Chow, Dongarra (2015a) Iterative sparse triangular solves for preconditioning. In: Träff, Hunold, Versaci (eds) European conference on parallel processing, Vienna, Austria, 24-28 August 2015, pp. 650–661. Berlin, Heidelberg: Springer.

Anzt, Gates, Dongarra, et al. (2017) Preconditioned Krylov solvers on GPUs. Parallel Computing 68: 32–44.

Anzt, Tomov, Luszczek, et al. (2015b) Acceleration of GPU-based Krylov solvers via data transfer reduction. The International Journal of High Performance Computing Applications 29(3): 366–383.

Babich, Clark, Joó, et al. (2011) Scaling lattice QCD beyond 100 GPUs. In: Proceedings of 2011 international conference for high performance computing, networking, storage and analysis, SC '11. New York, NY, USA, pp. 70:1–70:11. ACM. DOI: 10.1145/2063384.2063478.

Bank, Falgout, Jones, et al. (2015) Algebraic multigrid domain and range decomposition (AMG-DD/AMG-RD). SIAM Journal on Scientific Computing 37(5): S113–S136.

Bank, Smith (2002) An algebraic multilevel multigraph algorithm. SIAM Journal on Scientific Computing 23(5): 1572–1592.

Bergamaschi, Martínez (2012) Banded target matrices and recursive FSAI for parallel preconditioning. Numerical Algorithms 61(2): 223–241.

Cai, Sarkis (1999) A restricted additive Schwarz preconditioner for general sparse linear systems. SIAM Journal on Scientific Computing 21(2): 792–797.

Chan, Mathew (1994) Domain decomposition algorithms. Acta Numerica 3: 61–143.

Chow (2000) A priori sparsity patterns for parallel sparse approximate inverse preconditioners. SIAM Journal on Scientific Computing 21(5): 1804–1822.

Chow (2001) Parallel implementation and practical use of sparse approximate inverse preconditioners with a priori sparsity patterns. The International Journal of High Performance Computing Applications 15(1): 56–74.

Davis (2011) The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38(1): 1.

Farhat, Lesoinne, Pierson (2000) A scalable dual-primal domain decomposition method. Numerical Linear Algebra With Applications 7(7-8): 687–714.

Ferronato, Janna, Pini (2014) A generalized Block FSAI preconditioner for nonsymmetric linear systems. Journal of Computational and Applied Mathematics 256: 230–241.

Filelis-Papadopoulos, Gravvanis (2016) A class of generic factored and multi-level recursive approximate inverse techniques for solving general sparse systems. Engineering Computations 33(1): 74–99.

Golub (1999) Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM Journal on Scientific Computing 21(4): 1305–1320.

Gravvanis, Filelis-Papadopoulos, Giannoutakis (2012) Solving finite difference linear systems on GPUs: CUDA based parallel explicit preconditioned biconjugate conjugate gradient type methods. The Journal of Supercomputing 61(3): 590–604.

Hestenes, Stiefel (1952) Methods of Conjugate Gradients for Solving Linear Systems, Vol. 49. Washington, DC: NBS.

Hoemmen (2010) Communication-Avoiding Krylov Subspace Methods. Berkeley: University of California.

Karypis, Kumar (1998) A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20(1): 359–392.

Kolotilina, Yeremin (1993) Factorized sparse approximate inverse preconditionings I. Theory. SIAM Journal on Matrix Analysis and Applications 14(1): 45–58.

Kyziropoulos, Filelis-Papadopoulos, Gravvanis (2018) A class of symmetric factored approximate inverses and hybrid two-level solver. International Journal of Computational Methods 15(6): 1850050.

Luo, Yang, Zhao, et al. (2011) A scalable hybrid algorithm based on domain decomposition and algebraic multigrid for solving partial differential equations on a cluster of CPU/GPUs. In: Mehofer, Schordan, Quinlan, et al. (eds) 2nd International workshop on GPUs and scientific applications (GPUSCA 2011), Galveston Island, Texas, USA, 10 October 2011, pp. 45–50. Austria: University of Vienna.

Manguoglu (2012) Parallel solution of sparse linear systems. In: Berry, Gallivan, Gallopoulos, Grama, Philippe, Saad, Saied (eds) High-Performance Scientific Computing. Berlin: Springer, pp. 171–184.

Mitchell (1997) A parallel multigrid method using the full domain partition. Electronic Transactions on Numerical Analysis 6: 224–233.

Moutafis, Filelis-Papadopoulos, Gravvanis (2017a) Parallel multi-projection preconditioned methods based on semi-aggregation techniques. Journal of Computational Science 22: 45–53.

Moutafis, Filelis-Papadopoulos, Gravvanis (2017b) Parallel multiprojection preconditioned methods based on subspace compression. Mathematical Problems in Engineering 2017: Article ID 2580820. DOI: 10.1155/2017/2580820.

Notay (2006) Aggregation-based algebraic multilevel preconditioning. SIAM Journal on Matrix Analysis and Applications 27(4): 998–1018.

Notay (2010) An aggregation-based algebraic multigrid method. Electronic Transactions on Numerical Analysis 37(6): 123–146.

Osaki, Ishikawa (2010) Domain decomposition method on GPU cluster. In: Proceedings of the XXVIII international symposium on lattice field theory, PoS LATTICE2010 036, arXiv preprint arXiv:1011.3318.

Papadrakakis, Stavroulakis, Karatarakis (2011) A new era in scientific computing: domain decomposition methods in hybrid CPU–GPU architectures. Computer Methods in Applied Mechanics and Engineering 200(13): 1490–1508.

Saad (1993) A flexible inner-outer preconditioned GMRES algorithm. SIAM Journal on Scientific Computing 14(2): 461–469.

Saad (2003) Iterative Methods for Sparse Linear Systems, 2nd edn. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0898715342.

Schenk, Gärtner (2004) Solving unsymmetric sparse systems of linear equations with pardiso. Future Generation Computer Systems 20(3): 475–487.

Sluis (1969) Condition numbers and equilibration of matrices. Numerische Mathematik 14(1): 14–23.

Smith, Bjorstad, Gropp, et al. (2004) Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge: Cambridge University Press.

Toselli, Widlund (2005) Domain Decomposition Methods: Algorithms and Theory, Vol. 34. Berlin: Springer.

Van der Vorst (1992) Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 13(2): 631–644.

Yamazaki, Anzt, Tomov, et al. (2014a) Improving the performance of CA-GMRES on multicores with multiple GPUs. In: 2014 IEEE 28th international parallel and distributed processing symposium, Phoenix, AZ, USA, 19-23 May 2014, pp. 382–391. USA: IEEE Computer Society.

Yamazaki, Rajamanickam, Boman, et al. (2014b) Domain decomposition preconditioners for communication-avoiding Krylov methods on a hybrid CPU/GPU cluster. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, New Orleans, LA, USA, 16-21 November 2014, pp. 933–944. USA: IEEE Computer Society.

Zhu, Sameh (2017) PSPIKE+: a family of parallel hybrid sparse linear system solvers. Journal of Computational and Applied Mathematics 311: 682–703.