Abstract
In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in
Introduction
High-order finite element methods provide highly accurate solutions for problems with complex geometry using unstructured meshes, while simultaneously achieving efficiency and scalability on modern supercomputing platforms, in particular those with accelerator- and GPU-based architectures (Kolev et al., 2021b; Brown et al., 2021; Hutchinson et al., 2016). While the efficient evaluation of high-order finite element operators using matrix-free methods and sum factorization (cf. Fischer et al., 2020; Kronbichler and Kormann, 2019; Orszag, 1980) is well-studied, the solution of the resulting large linear systems remains challenging, in large part because the cost of assembling the system matrix of the high-order operator is prohibitive, both in terms of memory requirements and computation time. In this work, we describe the GPU acceleration and high-performance implementation of low-order-refined (LOR) preconditioning for high-order finite element problems posed in H1,
There are a number of other approaches for the matrix-free preconditioning of high-order finite element problems. One common approach is p-multigrid, which involves the construction of a hierarchy of p-coarsened spaces (Helenbrook and Atkins, 2006; Sundar et al., 2015). The coarsest space is typically a lowest-order space, which can be treated by means of a standard preconditioner for low-order finite element problems, for example, algebraic multigrid methods. On meshes that permit h-coarsening (e.g., meshes that result from the successive refinement of an initial coarse mesh), geometric multigrid methods can also be used for matrix-free preconditioning. These two approaches can be combined in a relatively flexible manner to obtain hp-coarsened hierarchies. In all of these methods, a smoother is required at each level of the hierarchy; typically, Jacobi or Chebyshev smoothing is performed (where the diagonal of the matrix is obtained without full matrix assembly using sum factorization, cf., Rønquist and Patera, 1987); more sophisticated smoothers such as overlapping Schwarz may also be used (Fischer 1997). Combinations of these techniques with low-order-refined preconditioning are also possible (see Pazner, 2020). Detailed performance comparisons of matrix-free methods for high-order discretizations are available in Kronbichler and Wall (2018); Kronbichler and Ljungkvist (2019); Phillips et al. (2022); and Kolev et al. (2021a), among others.
In this work, we develop algorithms and software implementations to effectively leverage GPU-based computing architectures for the entire low-order-refined preconditioning algorithm. This includes the parallel assembly of the auxiliary low-order system matrices, the construction and application of algebraic multigrid preconditioners, and the matrix-free application of the high-order operator in the context of a preconditioned conjugate gradient iteration. For
In this article, we provide comprehensive performance studies of the solver algorithms, measuring runtimes, kernel throughput, and strong and weak parallel scalability of all components of the solution algorithm. Additionally, we compare the relative weightings of the algorithmic components on GPU and CPU, where our results show that the computational bottlenecks for GPU-accelerated solvers are significantly different than for CPU solvers. In particular, while on CPU, the solve phase (specifically the preconditioner application) dominates the overall runtime, on GPU, the setup phase represents the majority of the computational time. The construction of algebraic multigrid preconditioners is less amenable to GPU acceleration than either the application of the AMG V-cycle or the computation of the action of the high-order finite element operator. The LOR matrix assembly algorithms proposed in the present work ensure that the assembly of the low-order-refined matrix represents only a small fraction of the total computation time on both GPU or CPU. Through the development of GPU algorithms for all components of the solution procedure, we avoid expensive host–device data transfers, leading to high-throughput, scalable solvers for large-scale high-order finite element problems.
The first work on LOR preconditioning dates to 1980, when Orszag proposed using second-order finite difference methods to precondition certain spectral collocation methods (Orszag 1980). Since then, there has been significant work extending this concept, making use of low-order finite element discretizations to precondition spectral methods, spectral element methods, and finite element methods (Canuto, 1994; Canuto et al., 2010; Fischer, 1997; Deville and Mund, 1985). The low-order discretization is often in-turn preconditioned using multigrid, in particular algebraic or semi-structured multigrid methods, resulting in scalable solvers that deliver robust convergence under both h- and p-refinement (Bello-Maldonado and Fischer, 2019; Pazner, 2020). Although these multigrid-based methods typically perform well, an attractive feature of LOR preconditioning is that any effective preconditioner for the low-order system can be used to precondition the high-order system. LOR preconditioning has been successfully applied to spectral element and high-order finite element discretizations of the incompressible Navier–Stokes equations (Fischer and Lottes 2005; Franco et al., 2020), and has been extended to other discretizations, including discontinuous Galerkin methods with hp-refinement (Pazner and Kolev, 2021). Recently, spectrally equivalent LOR preconditioners for Maxwell problems in
A novel aspect of this work is the use of a macro-element batching strategy for the assembly of the low-order-refined system matrices, and the associated algorithms and data structures. There is significant existing work in the literature on GPU-accelerated algorithms for the matrix-free evaluation of high-order operators (Ljungkvist, 2017; Abdelfattah et al., 2021; Franco et al., 2020); these algorithms have been a primary focus of the CEED project targeting exascale architectures (Kolev et al., 2021a). Similarly, GPU-accelerated algebraic multigrid algorithms have been developed as part of the hypre (Falgout and Yang 2002; Falgout et al., 2021) and AmgX (Naumov et al., 2015) libraries, among others. However, the efficient assembly of the low-order-refined matrices on GPUs is a topic that has not been addressed in these previous works. The algorithms and GPU implementations described presently are shown to significantly outperform more generic matrix assembly approaches that do not take advantage of the macro-element-level regular structure and topology of the low-order-refined discretizations.
Low-order-refined preconditioning
In this section, we give a brief overview of low-order-refined preconditioning for high-order finite element problems. We begin by defining the low-order-refined mesh. Then, low-order preconditioning for H1-conforming discretizations of the Poisson problem is described, followed by a discussion of preconditioning problems in
The low-order-refined mesh
Let Top row: examples of high-order (coarse) meshes with p = 9 Gauss–Lobatto nodes. Bottom row: corresponding low-order-refined meshes.
Poisson problem
Consider the model Poisson problem
To discretize this problem, define the H1-conforming degree-p finite element space V
p
,
Because of the prohibitive assembly costs, the matrix • Since the high-order and low-order-refined degrees of freedom coincide, • The low-order matrix •
Since
Vector finite elements
While the approach described above can be used to obtain spectrally equivalent low-order-refined preconditioners using nodal H1 discretizations for Poisson problems, the situation is more delicate for Maxwell and grad-div problems in
Instead, an interpolation–histopolation basis for the high-order
Adaptive mesh refinement and hanging node constraints
In the case of meshes with hanging nodes (e.g., those resulting from nonconforming adaptive mesh refinement), the construction of the appropriate low-order-refined discretization is less obvious. In this case, the previously described refinement procedure to obtain the low-order-refined mesh results in non-matching interfaces that do not directly lend themselves to the construction of spectrally equivalent low-order discretizations. See Figure 2 for an illustration of non-matching interfaces resulting from the refinement of meshes with hanging nodes. In such cases, the procedure to assemble a spectrally equivalent low-order directly on the refined mesh is unclear. Illustration of nonconforming meshes with hanging nodes and their low-order-refined counterparts.
Pazner and Kolev (2021) proposed an approach for constructing low-order-refined preconditioners for discretizations posed on meshes with hanging nodes based on variational restriction. In this approach, the high-order system is written as
The low-order-refined counterpart
GPU algorithms
The LOR-based solution procedure can broadly be divided into two phases. The setup phase, performed before the beginning of the conjugate gradient iteration, consists of problem setup, matrix assembly, and construction of the algebraic multigrid preconditioner. The solve phase consists of the preconditioned conjugate gradient iteration, which at every step requires the matrix-free application of the high-order operator and the application of the algebraic multigrid V-cycle. In this section, we describe the approach for GPU-accelerated algorithms for each of these components. We summarize the algorithmic steps that make up the solution algorithm: S1. Problem setup: S1.1. High-order operator setup; S1.2. Low-order-refined matrix assembly; S1.3. Algebraic multigrid setup. S2. Preconditioned conjugate gradient, including at every iteration: S2.1. High-order operator evaluation; S2.2. Algebraic multigrid V-cycle.
Steps S1.1 and S2.1 are the main algorithmic components for the fast evaluation of matrix-free high-order operators, as described in detail in Anderson et al. (2020); Kolev et al. (2021a); Abdelfattah et al. (2021); Brown et al. (2021); and Kronbichler and Kormann (2019), among others; efficient GPU-accelerated algorithms and implementations for these operations have been extensively discussed in the literature. Likewise, Steps S1.3 and S2.2 make up the two phases of algebraic multigrid methods, and efficient GPU strategies have been discussed in Haase et al. (2010); Falgout et al. (2021); Bell et al. (2012); and Naumov et al. (2015). Although we will briefly summarize the important aspects of these algorithms in the following sections, the primary contribution of this work is the development of efficient, GPU-suitable algorithms for Step S1.2. Prior to the development of efficient macro-element strategies for the LOR assembly, this step represented the bottleneck in the solution algorithm. Using the algorithms described in the following sections, this component is reduced to a small fraction of the overall runtime (see Figure 5).
Step S1.1. High-order operator setup
To achieve good performance with high-order discretizations, especially on GPUs, the high-order operator is evaluated matrix-free; the discretization matrix associated with the operator is never assembled. Instead, we make use of a matrix-free approach termed partial assembly. The high-order operator, denoted A
p
, is decomposed as a nested product of matrices Schematic of the high-order operator decomposition used in the partial assembly approach in the MFEM and libCEED software libraries.
In the partial assembly framework, the product in (4) is not formed, and instead the individual factors are applied sequentially. The factors are themselves typically not stored as sparse matrices. Instead, the matrices P and G make use of degree of freedom index mappings that are precomputed and stored; these mappings depend only on the mesh and finite element space, and not on the specific operator or physics of the problem. In the case of nonconforming adaptive mesh refinement, the P matrix also incorporates the AMR constraints, and is represented as a parallel CSR matrix; when the mesh is conforming, P is represented matrix-free. The basis operator B is identical between all elements of the same type (geometry and polynomial degree). In this work, we consider meshes consisting of tensor-product elements with constant polynomial degree, in which case one element-local B matrix can be stored and used for all elements in the mesh. Additionally, the tensor-product structure of quadrilateral and hexahedral elements allows for B to be expressed as a Kronecker product, requiring only the storage of small one-dimensional factors. The approach is closely related to sum factorization, which has been widely adopted in the areas of spectral and spectral element methods (Orszag, 1980). Finally, the D matrix consists of either scalar values or small matrices (d × d, where d is the spatial dimension) at each quadrature point. These values are precomputed and stored as part of the operator setup.
Partial assembly refers to the precomputation of the degree of freedom index mappings required for P and G, the one-dimensional interpolation and derivative matrices required for B, and the construction of the D matrix. Given the geometric factors at quadrature points, all of these precomputations can be performed in constant time per degree of freedom (i.e., linear time in the problem size, regardless of the polynomial degree of the finite element space). The evaluation of the geometric factors itself can be performed using sum factorization, resulting in a scaling of
Step S1.2. Low-order-refined matrix assembly
Although the action of the high-order operator is computed matrix-free, in order to construct the low-order-refined algebraic multigrid preconditioners, it is required to assemble the system matrix associated with the low-order-refined discretization. This matrix has the same dimensions as that of the high-order operator, but is significantly sparser: while number of nonzeros per row of the high-order matrix scales like
While the asymptotic complexity of the matrix assembly is known to be optimal, the achieved throughput of these operations on GPU-based architectures depends greatly on the structure of the assembly algorithms, the choice of data structures, and the threading strategies. In particular, it has been widely shown that high-order finite element operations often achieve greater throughput for similarly sized problems when compared with their low-order counterparts, despite requiring more arithmetic operations (Kolev et al., 2021a; Franco et al., 2020). In large part, this is because typical finite element operations are memory-bound, and the cost of performing arithmetic operations is negligible compared with the cost of memory transfer and access. High-order operators lend themselves to more structured and efficient memory access patterns than low-order operators, leading to greater performance. This presents a challenge for the assembly of low-order-refined matrices, which inherently use lowest-order finite elements.
The approach taken in this work is to structure the assembly of the low-order-refined matrices around macro-elements. In the context of low-order-refined methods, each element of the coarse, high-order mesh An example of a high-order mesh 
The overall matrix assembly algorithm proceeds according to the following steps, which are described in detail in the following sections. A1. Assembly of sparse matrices for each macro-element. A2. Assembly of processor-local sparse matrix. A3. Assembly of global (parallel) sparse matrix. A4. Elimination of boundary conditions.
Step A1. Macro-element sparse matrix assembly
Traditional algorithms for finite element assembly proceed by constructing small, independent element-local matrices (in “unassembled form”), and then placing them in the system matrix. This can be expressed algebraically as
The main work of Step A1 is to fill in the entries of the
In order to assemble the subelement matrices, geometric factors are required at the subelement vertices. In the completely unstructured approach, this would first require the computation of the mesh coordinates at each low-order-refined vertex in “broken” element-wise format (i.e., as an LOR E-vector), resulting in unnecessary duplication and inefficient access patterns. Instead, in the macro-element approach, the mesh coordinates are represented as high-order E-vectors, where all the coordinates corresponding to a macro-element are stored contiguously, with no duplication at the subelement level. This is more efficient both in terms of memory usage and access patterns, and also allows the reuse of the high-order G operator, as described in Step S1.1.
Step A2. Processor-local assembly
Once the macro-element sparse matrices
Once
Step A3. Global matrix assembly
When running on a single MPI rank (i.e., on a single GPU), this step can be skipped entirely. When running in parallel, the construction of the global (parallel) CSR matrix is required. In this format, the degrees of freedom are partitioned by MPI rank, and each rank owns the matrix rows associated with its degrees of freedom. Each rank has a diagonal block, which contains the nonzeros for which both the row and column indices are owned by itself, and an off-diagonal block, which contains the nonzeros for which the column indices are owned by another rank. The construction of this parallel CSR matrix is performed using a triple-product P T AP. Because this operation (or RAP more generally) is important for the construction of coarse operators in algebraic multigrid, the hypre library provides optimized GPU kernels for computing triple products. This operation can further be optimized using the knowledge that in the case of conforming meshes, the P matrix is Boolean. For details on the algorithms used for this operation, see Falgout et al. (2021).
Step A4. Elimination of boundary conditions
Once the global matrix has been formed, the matrix must be modified to take into account essential boundary conditions. The rows and columns corresponding to each essential degree of freedom must be eliminated, and the corresponding diagonal entry is typically replaced by value 1. The elimination of columns in parallel requires communication, because a column corresponding to a degree of freedom owned by one rank may have nonzeros in rows owned by other ranks. The first step of the elimination procedure is to communicate to each rank which columns it must eliminate. This communication is performed using non-blocking, device-aware MPI, whenever available. The column indices to eliminate are obtained in the form of a marker array, that is, a Boolean array with 1 in the indices that must be eliminated, and 0 elsewhere.
Once the communication has begun, the rows and columns in the diagonal block, and the rows in the off-diagonal block can be eliminated, thus allowing for the overlap of communication and computation. The elimination is embarrassingly parallel, and is threaded over the number of essential degrees of freedom. After this procedure is complete, we wait for the communication to finish, and finally eliminate the columns in the off-diagonal block. Since this data is available in the format of a marker array, each nonzero in the off-diagonal block is simply scaled by one or zero depending on the marker value, allowing parallelization over the number of rows in the off-diagonal.
Auxiliary computations
The above sections describe the main algorithmic components of the low-order-refined matrix assembly. In the case of vector finite element problems in
The LOR mesh vertex coordinate vectors can be computed efficiently using geometric information that is already available from the LOR matrix assembly in Step S1.2. The matrix assembly required the mesh coordinates in the form of a high-order E-vector; this vector contains the mesh coordinates required for the construction of the interpolation matrices, with duplications because of the E-vector format. Deduplicating this vector is performed efficiently using the element restriction degree of freedom index mappings (similarly to the action of G T in Step S2.1); this operation can be performed in parallel with one thread per deduplicated DOF without requiring any MPI communication.
Similarly, the discrete curl matrix maps from LOR edges to faces, such that the value 1 or −1 is placed at the (i, j) entry of the matrix, where the LOR edge j is one of the edges of the LOR face i. The sign is determined by the orientation of the mesh edge relative to that of the mesh face. The same consideration as in the case of the discrete gradient matrix also applies here; a local version of the mapping is computed on a reference macro-element, and reused for the entire mesh. The index mapping information from the element restriction operator is used to assemble the global operator.
Step S1.3. Algebraic multigrid setup
After Step S1.2 has been performed, the parallel CSR matrix can be passed to algebraic multigrid library to perform the AMG setup. In this work, we use hypre’s AMG implementation, but any GPU-capable algebraic multigrid library could equally well be used, for example, AmgX. Because of the sparsity of the low-order-refined matrix, the construction of the AMG hierarchy is generally efficient, and the resulting operator complexities are not prohibitively high. One of the main components of the algebraic multigrid setup is the construction of the coarse-grid operators, which is performed using triple-product (P T AP) operations. This is the same operation that is used for the global (parallel) LOR matrix assembly in Step A3, and so both the algebraic multigrid setup and low-order-refined assembly benefit from improvements to the triple-product algorithm. Details on the development and porting of the AMG setup algorithms to GPU-based architectures are described in Falgout et al. (2021).
Step S2.1. High-order operator evaluation
At each conjugate gradient iteration, the action of the high-order operator must be applied. As described in Step S1.1, we use the partial assembly approach for the high-order operator. Given the decomposition (4), the action of the high-order operator A
p
can be computed by successively applying the operators P, G, B, and D, and the transposes B
T
, G
T
, and P
T
. In the case of a conforming mesh, the matrices P and G are Boolean matrices, which are stored as mappings between local and distributed degree of freedom indices; the case of nonconforming adaptive refinement results in more complicated prolongation operators that are treated differently. The action of P is computed with MPI communication using these index mappings. When possible, device-aware MPI is used for these communication routines in order to reduce memory transfer between host and device. In MFEM, this is achieved using the
The action of the triple-product B
T
DB is implemented as a single kernel that operates on the E-vector level. In MFEM, this action is performed by an operator-specific class derived from
The most common threading strategy for the products B T DB is to use one block of threads per element, with threads corresponding to the quadrature points. The basis function values (or gradients) at quadrature points are stored in shared memory, together with the intermediate vectors required for the sum factorized computation of the matrix–vector products. The diagonal (or block-diagonal) matrix D is precomputed in Step S1.1, and its application trivially parallelizes over all quadrature points. The transposed basis operator B T returns from quadrature points to (dual) degrees of freedom, and its application essentially follows to reverse order as that of B.
Step S2.2. Algebraic multigrid V-cycle
The application of the algebraic multigrid V-cycle (i.e., the AMG solve phase) is more amenable to straightforward GPU implementation than the setup phase. Each level in the multigrid hierarchy involves residual computation, point smoothing, and the application of restriction and prolongation matrices. These operations can entirely be cast as standard linear algebraic operations, that is, sparse matrix–vector (
Results and numerical experiments
In this section, we present several numerical results studying the performance of the GPU algorithms described above. These algorithms are implemented in the MFEM finite element software library (Anderson et al., 2020). The low-order-refined discretizations are formed using the class
Algorithmic components
First we consider the relative weighting of the algorithmic components of the solver described above. We solve a definite Helmholtz (i.e., diffusion–reaction) problem on a Cartesian grid in 2D and 3D using polynomial degree p = 6. In 2D, the mesh consists of 262,144 elements, resulting in a total of 9,443,329 degrees of freedom. In 3D, the mesh consists of 32,768 elements, resulting in 7,189,057 degrees of freedom. Because in both cases the mesh is topologically Cartesian, the low-order-refined system matrix has 9 nonzeros per row in 2D, and 27 nonzeros per row in 3D (with the exception of degrees of freedom at the domain boundary), resulting in 84,953,089 nonzeros in 2D and 192,100,033 nonzeros in 3D. The linear system is solved using a relative tolerance of 10−12. In 2D, this requires 32 CG iterations, and in 3D convergence is attained after 40 CG iterations. These problems are solved on a single V100 GPU. We also compare the GPU timings with the runtimes on one Power 9 CPU core. We emphasize that the purpose of this comparison is to study the difference in relative weightings of the algorithmic components on CPU and GPU, not to compare directly GPU to CPU performance. The runtimes for the algorithmic components of the solution procedure are shown in Figure 5. Wall-clock runtimes for the algorithmic components enumerated above, solving a definite Helmholtz problem in H1 with polynomial degree p = 6. Top row: GPU timings on one V100 GPU. Bottom row: CPU timings on one Power 9 core. Left column: 2D test case with 262,144 elements and 9,443,329 degrees of freedom, 33 CG iterations. Right column: 3D test case with 32,768 elements at 7,189,057 degrees of freedom, 40 CG iterations.
On the GPU, in both the 2D and 3D cases, around 75% of the time to solution was spent in the setup phase, with over 50% of the time to solution in AMG setup. The time spent assembling the LOR system was approximately 3% and 10% of the total runtime, in 2D and 3D, respectively. Note that the timings for the setup of the high-order operator also includes the construction of the element restriction operator (see Step S1.1), which, because of the macro-element strategy described in Step S1.2, is reused for the LOR matrix assembly. Within the solve phase, over 80% of the runtime was spent in the AMG V-cycle in both cases, and only 16–18% in the high-order operator evaluation. These results suggest that the largest potential for further performance gains are possible by optimizing the preconditioner construction and application. Interestingly, on CPU, the relative weight of the AMG setup is significantly reduced, and in contrast to the GPU runs, the large majority of the time is spent in the solve phase.
These results highlight one advantage of LOR-based methods: any improvements or optimizations to the algebraic multigrid implementation (or in fact any other preconditioner suitable for the low-order problem) can automatically benefit the high-order solvers. We further note that in many practical time-dependent problems, the mesh and problem coefficients remain fixed, either for the duration of the entire simulation, or for numerous time steps (see, e.g., Franco et al., 2020). In this context, the setup time is amortized over the many repeated applications of the solve phase. As a result, for these problems, the relatively expensive AMG setup may constitute a small fraction of the overall runtime. However, for problems which require performing the preconditioner setup every time step (e.g., problems incorporating mesh motion or time-dependent variable coefficients), the setup will remain a significant portion of the time to solution.
Kernel throughput
We study the dependence of the kernel throughput on polynomial degree and problem size. Working on a sequence of 3D Cartesian meshes, we measure the throughput (degrees of freedom per second) of the main computational kernels. First, we consider high-order operator setup and application. The MFEM library supports multiple backends that can be used for GPU evaluation. We compare MFEM’s native GPU backend to libCEED’s Throughput (millions of degrees of freedom per second) for high-order operator and setup for the definite Helmholtz problem on one V100 GPU. Left: high-order operator setup (partial assembly). Center: high-order operator evaluation using MFEM’s native kernels. Right: high-order operator evaluation using MFEM’s libCEED backend.
Next, we consider the throughput of the low-order-refined assembly algorithms. We compare the macro-element assembly strategy described in Step S1.2 to a more generic, fully unstructured matrix assembly algorithm that does not take advantage of the structure of the low-order-refined mesh. The more general, unstructured algorithm is suitable for assembling matrices corresponding to any polynomial degree, but does not contain any optimizations specific to the case of low-order-refined discretizations. This algorithm uses one block of threads per element, with one thread per degree of freedom, constructing small dense matrices for each element of the low-order-refined mesh, and then directly assembling the global sparse matrix (skipping the macro-element sparse matrix assembly in the LOR approach). The throughput plots comparing these two approaches are shown in the left and center columns of Figure 7. Throughput (millions of degrees of freedom per second) for high-order operator and setup for the definite Helmholtz problem on one V100 GPU. Left: LOR matrix assembly using the macro-element approach. Center: LOR matrix assembly using the unstructured (legacy) approach. Right: algebraic multigrid setup.
The fully unstructured assembly algorithm shows no variation in throughput between polynomial degrees of the high-order space; this is expected because for equal problem size, the low-order-refined meshes corresponding different high-order polynomial degrees are topologically identically, and differ only in the mesh distortion, which has no algorithmic impact on the matrix assembly. On the other hand, the macro-element assembly algorithm exhibits increasing throughput with high-order polynomial degree. As the degree of the high-order space increases, the size of each macro-element increases, and the matrix assembly becomes more structured, allowing for increased fine-grained parallelism, leading to higher performance. In the lowest-order case, the macro-element strategy performs worse than the unstructured algorithm. This is because in the case of p = 1, the macro elements each only contain a single element, and so each block consists only of one thread. On the other hand, the unstructured algorithm threads over the E-vector degrees of freedom, resulting in eight threads per element. However, at p = 2, the performance of the macro-element strategy is roughly equal to that of the unstructured algorithm, and for p > 2, we observe meaningful performance improvements. For the highest orders, the macro-element approach is more than twice as fast as the unstructured approach. It is also important to note that this factor does not include the savings that are obtained by reusing the high-order mesh topology and element restriction operator. In the fully unstructured case, the overhead associated with constructing the low-order-refined mesh and element restriction operators can dominate the time spent in the matrix assembly, often by large factors.
The right column of Figure 7 shows the throughput of the algebraic multigrid setup as a function of problem size. As in the case of the unstructured LOR matrix assembly, there is essentially no dependence of the performance of these kernels on the polynomial degree of the high-order space. The resulting system matrices have the same sparsity pattern, independent of p. Differences in the values of the matrix entries can lead to different choices in the AMG coarsening, and hence will result in different AMG hierarchies, which has the potential to have a marginal impact on performance. In practice, these differences are found to be negligible. The results obtained here corroborate those shown previously. The bottleneck in the problem setup consists of the AMG setup phase, which reaches a peak throughput of about 1.5 million degrees of freedom for the largest problem sizes. Finally we consider the AMG V-cycle, the throughput of which is shown in the left column of Figure 8. As in the case of AMG setup, the performance of these operations does not display any dependence on the polynomial degree. This is because the sparsity patterns for LOR systems of a given size are independent of the polynomial degree of the high-order space. The BoomerAMG V-cycle reaches a peak throughput of about 300 million degrees of freedom per second. Throughput (millions of degrees of freedom per second) for algebraic multigrid preconditioner application on one V100. Left: one BoomerAMG V-cycle. Center: application of the AMS preconditioner for 
Nédélec and Raviart–Thomas elements
In this section, we consider the throughput of the kernels required to construct low-order-refined preconditioners for problems in Throughput (millions of degrees of freedom per second) for Nédélec and Raviart–Thomas problems. Left column: assembly of LOR system matrix. Center column: construction of discrete gradient and curl matrices. Right column: construction of coordinate vectors needed for discrete interpolation.
As described in the preceding sections, the AMS and ADS algebraic auxiliary space solvers for
Parallel scaling
In this section, we examine the parallel scalability of these algorithms. The construction of the process-local sparse matrices is performed in independently on each MPI rank. These matrices are then placed in a global block-diagonal matrix, with one block per rank. The global system matrix is computed as A = P T A L P, where A L is the block-diagonal matrix described above and P is the parallel prolongation matrix. For this triple product, we use the sparse matrix–matrix product GPU implementation available in hypre, the development of which is described in Falgout et al. (2021).
Once the global (parallel) sparse matrix has been constructed, the remaining parallel operations are AMG setup, AMG V-cycle, and high-order operator evaluation. For details on the parallelization of the BoomerAMG preconditioner, see Henson and Yang (2002). The approach taken to parallelize the high-order operator is described in Step item S2.1, using the operator decomposition (4). Writing A
p
= P
T
G
T
B
T
DBGP, the process-local operator is given by A
L
= G
T
B
T
DBG. The action of this linear operator is completely local, and requires no MPI communication. Therefore, the only operations that require parallel communication are the application of P and its transpose. For conforming meshes, the operator P is represented in MFEM as an object of type
To measure the parallel scalability of our algorithms, we perform a strong and weak scaling study, using between 4 and 1024 GPUs. We consider eight problem configurations, with sizes increasing by factors of two from 8.4 × 106 degrees of freedom to 1.1 × 109 degrees of freedom. Each problem configuration is run on three sets of GPUs to obtain the strong scaling results. Collecting the timings across different problem configurations gives the weak scaling results. The different algorithmic components as outlined in the previous sections are instrumented separately. The scaling results are shown in Figure 10. Comparisons with ideal strong and weak scaling curves indicate excellent parallel scalability and efficiency for the high-order operator setup, low-order-refined matrix assembly, and high-order operator evaluation. The algebraic multigrid setup is the component of the algorithm that is most challenging for parallel scalability. The AMG V-cycle is less efficient for smaller problem sizes, but demonstrates good weak scalability for larger problem sizes. The overall weak scaling parallel efficiency for the 1.1 × 109 problem on 256 GPUs was 83% for the setup phase (HO set, LOR assembly, and AMG setup) and 87% for the solve phase (HO apply and AMG V-cycle). Parallel scaling results for setup and solve phases from 4 to 1024 Tesla V100-SXM2 NVIDIA GPUs. Strong scaling results are given by the solid lines, and weak scaling results are given by the dashed lines. The ideal strong scaling curve is shown for reference. Ideal weak scaling corresponds to horizontal lines. The results have been done with CUDA 11.7.0 and hypre 2.25.0, with optimized sparse matrix–matrix multiplications (SpGEMM).
Adaptive mesh refinement
To illustrate the performance of the low-order-refined solvers on a problem with adaptive mesh refinement, we consider a Poisson problem whose solution exhibits an inner layer with a sharp gradient. Let u = arctan(α(r − r0)), where r is the radial distance and α is a sharpness parameter. We choose α = 20 and r = 0.725. The right-hand side f and Dirichlet condition g
D
are chosen such that u is the solution to the Poisson problem (1). We solve this problem on a mesh of the Fichera corner, which has been adaptively refined around the inner layer. The refined mesh is 1-irregular, and consists of 93,940 elements with 23,562 nonconforming interfaces. The solution and mesh are shown in Figure 11. Mesh and solution of inner layer problem run, showing nonconforming adaptive mesh refinement run on 12 V100 GPUs. Left: Fichera corner mesh. Right: mesh cut by plane normal to (1, 1, 1) passing through the point (0.2, 0.2, 0.2).
We solve this problem using polynomial degrees p = 1 through p = 7, using three nodes of Lassen with 12 V100 GPUs. The relative GPU runtimes of the algorithmic components are shown in Figure 12. For comparison, we also include the relative CPU runtimes. As in the previous test cases, the AMG setup represents the dominant portion of the total time to solution on the GPU. In this case, the setup is even more expensive because of the increased fill-in. Furthermore, the triple product P
T
AP (where the matrix P represents both the parallel decomposition and the nonconforming constraints) represents an increasing fraction of the total runtime as the constraint matrix becomes increasingly coupled at higher orders. The high-order operator setup, low-order-refined matrix assembly (not including constraints enforced through the triple-product), and high-order operator application (totaled over all the CG iterations) represent 10% or less of the time to solution for p ≥4. On the other hand, on the CPU, the relative cost of the AMG V-cycle and high-order operator evaluation are more significant. For p >2, the AMG V-cycle represents the majority of the runtime on CPU, and the high-order operator evaluation represents roughly 15% of the CPU runtime. In contrast to the GPU runtimes, on CPU, the AMG setup and P
T
AP operation each represent less than 10% of the total runtime. Results for diffusion problem on mesh with nonconforming adaptive refinement. Relative runtime of algorithmic components: left: GPU relative runtimes (12 V100 GPUs) and right: CPU relative runtimes (one Power 9 core).
Results for diffusion problem on mesh with nonconforming adaptive refinement. Problem size, LOR system sparsity, CG iterations required to converge to a relative residual of 10−12, and wall-clock runtime (12 V100 GPUs on three nodes of Lassen).
Large-scale electromagnetic diffusion
In this section, we use the GPU-accelerated solvers described above to solve a large-scale magnetic diffusion problem posed on a realistic geometry. This problem illustrates the use of low-order-refined BoomerAMG and AMS preconditioners in H1 and
The coil intersects the domain boundary at two terminals, Γin and Γout, such that ∂Ω = Γin ∪ Γout ∪ Γbox.
The current running through the copper coil is driven by a potential difference enforced as boundary conditions at the two terminals, Γin and Γout. The electric scalar potential ϕ is obtained as the solution to the Poisson problem
For this problem, we take ϕin = 0 and ϕout = 1. After obtaining the electric scalar potential, the magnetic vector potential A is given as the solution to the curl–curl problem with homogeneous tangential Dirichlet conditions at all domain boundaries
The magnetic field B is given by the curl of the vector potential, B = ∇ × A.
The domain is discretized using a hexahedral mesh with element boundaries fitted to the air-coil interface. To obtain this mesh, first an unstructured tetrahedral mesh of the geometry was generated, and then each tetrahedron was subdivided into four hexahedra to obtain an all-hexahedral mesh. While this tet-to-hex strategy often results in large meshes with poorly shaped elements, it has been used successfully in challenging high-order and spectral element applications (see, e.g., Yuan et al., 2020). The resulting mesh consists of 1,532,116 hexahedral elements. We solve this problem with polynomial degree p = 4. The high-order H1 finite element space has approximate 108 degrees of freedom, and the Nédélec space has about 2.9 × 108 degrees of freedom.
The first step of the solution procedure requires the solution of the Poisson problem (6) to compute the electric scalar potential ϕ. The resulting LOR system for the H1 system has 27 nonzeros per row, leading to a total of 2.6 × 109 nonzeros. Relative and absolute tolerances of 10−8 were used as stopping criteria for the conjugate gradient solver, resulting in convergence after 45 iterations. After the potential ϕ has been found, the right-hand side for (7) must be computed. First, ∇ϕ is computed by applying the discrete gradient operator, obtaining a vector field in
Given the right-hand side computed above, the curl–curl problem (7) is solved to obtain the magnetic vector potential A. The LOR system used to construct the AMS preconditioner has approximately 33 nonzeros per row, resulting in 9.7 × 109 nonzero entries. The same stopping criteria were used as in the H1 problem. The CG iteration converged after 22 iterations. Once the magnetic vector potential has been computed, the magnetic field is represented as a function in Top: illustration of a coarser version of the mesh for the coil problem, cut with a plane passing through the origin. Bottom: magnetic field streamlines colored by electric scalar potential.
While this example has illustrated the use of GPU-accelerated LOR preconditioners for magnetic diffusion problems in the A–ϕ formulation, the solvers are immediately applicable to a range of other problems posed in H1,
Open-source code availability
All of the algorithms and implementations described in this article are available in the open-source MFEM software library under the permissive BSD license at https://mfem.org and https://github.com/mfem/mfem (see also Anderson et al., 2020). The construction and use of matrix-free low-order-refined solvers is illustrated in the included Illustration of MFEM’s API for the construction and use of the low-order-refined solvers. Top: forming the low-order-refined discretization and assembling the matrix in parallel CSR format. Bottom: creating low-order-refined hypre AMG, AMS, and ADS preconditioners given the high-order bilinear form.
Conclusions
In this article, we have described algorithms and implementations for the GPU-accelerated solution of high-order finite element problems using low-order-refined preconditioning. This approach combines matrix-free operator evaluation using the partial assembly approach with algebraic multigrid preconditioners constructed using assembled low-order-refined matrices. The end-to-end GPU acceleration of LOR preconditioning allows for the efficient and scalable solution to high-order finite element problems on GPU-based supercomputing architectures, avoiding the memory-intensive and costly assembly of the high-order system matrices. New LOR matrix assembly algorithms based on macro-element batching were introduced, achieving significant speedup when compared with fully unstructured low-order matrix assembly. A detailed study of kernel throughput for the algorithmic components of the solution procedure was presented. The performance of preconditioners on problems with nonconforming adaptive mesh refinement is considered, showing fast problem setup and convergence even at high orders. The scalability of the solvers was demonstrated on problems with over 1 billion degrees of freedom on 1024 GPUs. Finally, we demonstrated the capability of these solvers on a challenging large-scale electromagnetic diffusion problem with complex geometry using 320 GPUs.
Footnotes
Acknowledgements
The authors thank V. Dobrev for the unfailingly insightful comments and suggestions, and M. Stowell for help with the electromagnetic diffusion problem formulation. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-841564). W. Pazner and Tz. Kolev were partially supported by the LLNL-LDRD Program under Project No. 20-ERD-002. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Lawrence Livermore National Laboratory (LDRD 20-ERD-002) and U.S. Department of Energy (Exascale Computing Project (17-SC-20-SC)).
Author biographies
Will Pazner is an assistant professor in the Fariborz Maseeh Department of Mathematics and Statistics at Portland State University, where his research focuses on high-order finite elements, efficient solvers and preconditioners, and high-performance computing. Previously, he was the Sidney Fernbach Postdoctoral Fellow at Lawrence Livermore National Laboratory’s Center for Applied Scientific Computing.
Tzanio Kolev is a computational mathematician at the Center for Applied Scientific Computing (CASC) in Lawrence Livermore National Laboratory, where he works on finite element meshing, discretizations and solvers for problems arising in computational electromagnetics, elasticity, and compressible shock hydrodynamics. He won R&D 100 award in 2007 as a member of the hypre project and was an R&D 100 finalist for MFEM in 2020. Tzanio is leading the finite element R&D efforts in the MFEM and BLAST projects in CASC and is the director of the co-design Center for Efficient Exascale Discretizations (CEED) in DOE’s Exascale Computing Project (ECP). He won an LLNL mid-career recognition award in 2019.
John Camier is a research scientist in the numerical analysis and simulations group at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His current research focuses on parallel computer architectures, performance portability, and computing with high-order finite element methods.
