End-to-end GPU acceleration of low-order-refined preconditioning for high-order finite element discretizations

Abstract

In this article, we present algorithms and implementations for the end-to-end GPU acceleration of matrix-free low-order-refined preconditioning of high-order finite element problems. The methods described here allow for the construction of effective preconditioners for high-order problems with optimal memory usage and computational complexity. The preconditioners are based on the construction of a spectrally equivalent low-order discretization on a refined mesh, which is then amenable to, for example, algebraic multigrid preconditioning. The constants of equivalence are independent of mesh size and polynomial degree. For vector finite element problems in H (curl) and H (div) (e.g., for electromagnetic or radiation diffusion problems), a specially constructed interpolation–histopolation basis is used to ensure fast convergence. Detailed performance studies are carried out to analyze the efficiency of the GPU algorithms. The kernel throughput of each of the main algorithmic components is measured, and the strong and weak parallel scalability of the methods is demonstrated. The different relative weighting and significance of the algorithmic components on GPUs and CPUs is discussed. Results on problems involving adaptively refined nonconforming meshes are shown, and the use of the preconditioners on a large-scale magnetic diffusion problem using all spaces of the finite element de Rham complex is illustrated.

Keywords

High-order finite elements, GPU, matrix-free preconditioning, low-order-refined

Introduction

High-order finite element methods provide highly accurate solutions for problems with complex geometry using unstructured meshes, while simultaneously achieving efficiency and scalability on modern supercomputing platforms, in particular those with accelerator- and GPU-based architectures (Kolev et al., 2021b; Brown et al., 2021; Hutchinson et al., 2016). While the efficient evaluation of high-order finite element operators using matrix-free methods and sum factorization (cf. Fischer et al., 2020; Kronbichler and Kormann, 2019; Orszag, 1980) is well-studied, the solution of the resulting large linear systems remains challenging, in large part because the cost of assembling the system matrix of the high-order operator is prohibitive, both in terms of memory requirements and computation time. In this work, we describe the GPU acceleration and high-performance implementation of low-order-refined (LOR) preconditioning for high-order finite element problems posed in H¹, H (curl), and H (div) spaces. These preconditioners, also known as “FEM–SEM preconditioners” (cf. Orszag, 1980; Casarin, 1997; Canuto et al., 2010), are based on the idea of constructing an auxiliary, spectrally equivalent low-order discretization, and then applying effective preconditioners (for instance, algebraic multigrid methods) constructed using the low-order matrix, directly to the high-order system.

There are a number of other approaches for the matrix-free preconditioning of high-order finite element problems. One common approach is p-multigrid, which involves the construction of a hierarchy of p-coarsened spaces (Helenbrook and Atkins, 2006; Sundar et al., 2015). The coarsest space is typically a lowest-order space, which can be treated by means of a standard preconditioner for low-order finite element problems, for example, algebraic multigrid methods. On meshes that permit h-coarsening (e.g., meshes that result from the successive refinement of an initial coarse mesh), geometric multigrid methods can also be used for matrix-free preconditioning. These two approaches can be combined in a relatively flexible manner to obtain hp-coarsened hierarchies. In all of these methods, a smoother is required at each level of the hierarchy; typically, Jacobi or Chebyshev smoothing is performed (where the diagonal of the matrix is obtained without full matrix assembly using sum factorization, cf., Rønquist and Patera, 1987); more sophisticated smoothers such as overlapping Schwarz may also be used (Fischer 1997). Combinations of these techniques with low-order-refined preconditioning are also possible (see Pazner, 2020). Detailed performance comparisons of matrix-free methods for high-order discretizations are available in Kronbichler and Wall (2018); Kronbichler and Ljungkvist (2019); Phillips et al. (2022); and Kolev et al. (2021a), among others.

In this work, we develop algorithms and software implementations to effectively leverage GPU-based computing architectures for the entire low-order-refined preconditioning algorithm. This includes the parallel assembly of the auxiliary low-order system matrices, the construction and application of algebraic multigrid preconditioners, and the matrix-free application of the high-order operator in the context of a preconditioned conjugate gradient iteration. For H (curl) and H (div) problems, we use the auxiliary Maxwell solver (AMS) (Kolev and Vassilevski, 2009) and auxiliary divergence solver (ADS) (Kolev and Vassilevski, 2012) preconditioners, respectively, whose construction requires additional problem-specific inputs including mesh coordinates and discrete differential operators in matrix form; these inputs are also constructed on-device. The AMS and ADS preconditioners are based on subspace decompositions that guarantee rapid convergence for large-scale problems. In this work, we illustrate the relevance of GPU-accelerated versions of these solvers by solving a challenging, large-scale, high-order electromagnetic diffusion test problem posed on complex geometry in under 30 s on 320 GPUs.

In this article, we provide comprehensive performance studies of the solver algorithms, measuring runtimes, kernel throughput, and strong and weak parallel scalability of all components of the solution algorithm. Additionally, we compare the relative weightings of the algorithmic components on GPU and CPU; our results show that the computational bottlenecks for GPU-accelerated solvers differ significantly from those for CPU solvers. In particular, on CPU the solve phase (specifically the preconditioner application) dominates the overall runtime, whereas on GPU the setup phase accounts for the majority of the computational time. The construction of algebraic multigrid preconditioners is less amenable to GPU acceleration than either the application of the AMG V-cycle or the computation of the action of the high-order finite element operator. The LOR matrix assembly algorithms proposed in the present work ensure that the assembly of the low-order-refined matrix represents only a small fraction of the total computation time on both GPU and CPU. Through the development of GPU algorithms for all components of the solution procedure, we avoid expensive host–device data transfers, leading to high-throughput, scalable solvers for large-scale high-order finite element problems.

The first work on LOR preconditioning dates to 1980, when Orszag proposed using second-order finite difference methods to precondition certain spectral collocation methods (Orszag, 1980). Since then, there has been significant work extending this concept, making use of low-order finite element discretizations to precondition spectral methods, spectral element methods, and finite element methods (Canuto, 1994; Canuto et al., 2010; Fischer, 1997; Deville and Mund, 1985). The low-order discretization is often in turn preconditioned using multigrid, in particular algebraic or semi-structured multigrid methods, resulting in scalable solvers that deliver robust convergence under both h- and p-refinement (Bello-Maldonado and Fischer, 2019; Pazner, 2020). Although these multigrid-based methods typically perform well, an attractive feature of LOR preconditioning is that any effective preconditioner for the low-order system can be used to precondition the high-order system. LOR preconditioning has been successfully applied to spectral element and high-order finite element discretizations of the incompressible Navier–Stokes equations (Fischer and Lottes, 2005; Franco et al., 2020), and has been extended to other discretizations, including discontinuous Galerkin methods with hp-refinement (Pazner and Kolev, 2021). Recently, spectrally equivalent LOR preconditioners for Maxwell problems in H (curl), grad-div problems in H (div), and interior penalty discontinuous Galerkin diffusion problems were developed (Dohrmann, 2021; Pazner et al., 2022).

A novel aspect of this work is the use of a macro-element batching strategy for the assembly of the low-order-refined system matrices, and the associated algorithms and data structures. There is significant existing work in the literature on GPU-accelerated algorithms for the matrix-free evaluation of high-order operators (Ljungkvist, 2017; Abdelfattah et al., 2021; Franco et al., 2020); these algorithms have been a primary focus of the CEED project targeting exascale architectures (Kolev et al., 2021a). Similarly, GPU-accelerated algebraic multigrid algorithms have been developed as part of the hypre (Falgout and Yang 2002; Falgout et al., 2021) and AmgX (Naumov et al., 2015) libraries, among others. However, the efficient assembly of the low-order-refined matrices on GPUs is a topic that has not been addressed in these previous works. The algorithms and GPU implementations described presently are shown to significantly outperform more generic matrix assembly approaches that do not take advantage of the macro-element-level regular structure and topology of the low-order-refined discretizations.

Low-order-refined preconditioning

In this section, we give a brief overview of low-order-refined preconditioning for high-order finite element problems. We begin by defining the low-order-refined mesh. Then, low-order preconditioning for H¹-conforming discretizations of the Poisson problem is described, followed by a discussion of preconditioning problems in H (curl) and H (div).

The low-order-refined mesh

Let $Ω \subseteq \mathbb{R}^{d}$ denote the spatial domain, d ∈ {1, 2, 3}. The domain is discretized using a mesh $T_{p}$ of tensor-product elements, such that each element $κ \in T_{p}$ is given by the image of the reference element $\hat{κ} = {[- 1,1]}^{d}$ under a (typically isoparametric) transformation, that is, $κ = T_{κ} (\hat{κ})$ . Let ${\hat{x}}_{i} \in [- 1,1]$ denote the p + 1 Gauss–Lobatto points. The d-fold Cartesian product of the points $\{{\hat{x}}_{i}\}$ is used as a set of Lagrange interpolation points in the reference element $\hat{κ}$ . These points are also used to define the low-order-refined mesh $T_{h}$ . Each element $κ \in T_{p}$ is subdivided into p^d topologically Cartesian subelements, whose vertices are given by the image of the Gauss–Lobatto points under the element transformation T_κ. Some examples of high-order (coarse) meshes $T_{p}$ and the corresponding low-order-refined meshes $T_{h}$ are given in Figure 1.
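The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this work: the function names (`gauss_lobatto_points`, `lor_subelements`, `lor_vertices`) and the lexicographic vertex numbering are assumptions made here, and the interior Gauss–Lobatto points are found by Newton iteration on the derivative of the Legendre polynomial, which is one standard way to compute them.

```python
import math

def legendre_pair(n, x):
    """Return (P_n(x), P_{n-1}(x)) via the three-term recurrence, n >= 1."""
    p_prev, p_curr = 1.0, x
    for k in range(2, n + 1):
        p_prev, p_curr = p_curr, ((2 * k - 1) * x * p_curr - (k - 1) * p_prev) / k
    return p_curr, p_prev

def gauss_lobatto_points(p):
    """Return the p + 1 Gauss-Lobatto points in [-1, 1] for degree p >= 1."""
    pts = [-1.0]
    for i in range(1, p):
        x = -math.cos(math.pi * i / p)  # Chebyshev-Lobatto initial guess
        for _ in range(100):
            pn, pnm1 = legendre_pair(p, x)
            d1 = p * (pnm1 - x * pn) / (1.0 - x * x)               # P_p'(x)
            d2 = (2.0 * x * d1 - p * (p + 1) * pn) / (1.0 - x * x)  # P_p''(x)
            dx = d1 / d2
            x -= dx                                                 # Newton step on P_p'
            if abs(dx) < 1e-15:
                break
        pts.append(x)
    pts.append(1.0)
    return pts

def lor_subelements(p):
    """Vertex indices of the p*p subelements of one 2D element.

    Vertices are numbered lexicographically on the (p+1) x (p+1) lattice
    (an assumption of this sketch); each quad is listed counterclockwise.
    """
    n = p + 1
    return [(j * n + i, j * n + i + 1, (j + 1) * n + i + 1, (j + 1) * n + i)
            for j in range(p) for i in range(p)]

def lor_vertices(p, transform):
    """Map the tensor-product Gauss-Lobatto lattice through an element transformation."""
    pts = gauss_lobatto_points(p)
    return [transform(x, y) for y in pts for x in pts]

# Hypothetical affine element transformation onto the unit square.
vertices = lor_vertices(3, lambda x, y: (0.5 * (x + 1.0), 0.5 * (y + 1.0)))
quads = lor_subelements(3)
```

Because the subelement connectivity depends only on p, the same index list can be reused for every element of the coarse mesh; only the vertex coordinates change with the transformation.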

Figure 1.

Top row: examples of high-order (coarse) meshes with p = 9 Gauss–Lobatto nodes. Bottom row: corresponding low-order-refined meshes.

Poisson problem

Consider the model Poisson problem

\begin{aligned} - Δ u &= f && \text{in } Ω, \\ u &= g_{D} && \text{on } \partial Ω . \end{aligned}

(1)

To discretize this problem, define the H¹-conforming degree-p finite element space V_p,

V_{p} = \{ v \in H^{1} (Ω) : v |_{κ} \circ T_{κ} \in Q_{p} (\hat{κ}) \},

where $Q_{p}$ is the space of multivariate polynomials of at most degree p in each variable. A nodal basis for $Q_{p}$ is given by the Lagrange interpolating polynomials defined at the Cartesian product of the Gauss–Lobatto quadrature points, such that the degrees of freedom of the space V_p are point values at the Gauss–Lobatto nodes. The standard bilinear form $A_{V_{p}} = (\nabla u, \nabla v)$ for u, v ∈ V_p gives rise to the high-order stiffness matrix $A_{V_{p}}$ . In general, the total number of nonzeros in this matrix will scale like $O (n_{e} p^{2 d})$ , where n_e is the number of elements in the mesh, and the number of operations required to assemble this matrix ranges from $O (n_{e} p^{2 d + 1})$ to $O (n_{e} p^{3 d})$ , depending on the algorithm used (Melenk et al., 2001). These memory and computational costs are prohibitive for moderate to large values of p, and so instead a matrix-free approach is adopted, where the action of $A_{V_{p}}$ is computed without constructing the matrix. The action can be computed using sum factorization with optimal $O (n_{e} p^{d})$ memory usage and $O (n_{e} p^{d + 1})$ operations (Orszag, 1980); this matrix-free operator evaluation approach is the key component of the high-performance finite element software library libCEED (Brown et al., 2021) and also provides the foundation for MFEM’s partial assembly algorithm (Anderson et al., 2020).

Because of the prohibitive assembly costs, the matrix $A_{V_{p}}$ is unavailable for preconditioner construction. Instead, we construct an auxiliary low-order discretization matrix that is spectrally equivalent to the high-order matrix. The low-order-refined space V_h is defined to be the p = 1 H¹-conforming space on the low-order-refined mesh $T_{h}$ , whose construction is described in the preceding section. The degrees of freedom of V_h coincide with the degrees of freedom of the high-order space V_p. In both cases, the degrees of freedom represent point values at the Gauss–Lobatto points; in the high-order case, the associated basis functions are degree-p polynomials defined on the elements of the coarse mesh $T_{p}$ , and in the low-order case, the associated basis functions are the standard “hat functions” defined on the refined mesh $T_{h}$ . Given the low-order space V_h, it is possible to assemble the low-order-refined stiffness matrix $A_{V_{h}}$ . We briefly summarize some important properties of the matrices $A_{V_{h}}$ and $A_{V_{p}}$ .

• Since the high-order and low-order-refined degrees of freedom coincide, $A_{V_{h}}$ and $A_{V_{p}}$ are matrices of the same size.

• The low-order matrix $A_{V_{h}}$ has $O (1)$ nonzeros per row. On the other hand, the high-order matrix $A_{V_{p}}$ has $O (p^{d})$ nonzeros per row.

• $A_{V_{h}}$ and $A_{V_{p}}$ are spectrally equivalent, independent of p. This spectral equivalence is often known as FEM–SEM equivalence, and was proven by Canuto (1994) and Parter and Rothman (1995).

Since $A_{V_{h}}$ and $A_{V_{p}}$ are spectrally equivalent, any uniform preconditioner B_h of $A_{V_{h}}$ (i.e., such that the condition number of $B_{h} A_{V_{h}}$ is bounded independently of problem size and other discretization parameters) will also be an effective preconditioner for $A_{V_{p}}$ . In this work, we take B_h to be one V-cycle of algebraic multigrid (AMG) constructed using the assembled matrix $A_{V_{h}}$ . Other choices for B_h are possible, including geometric or semi-structured multigrid, domain decomposition, or even sparse direct solvers, but in this work, we restrict ourselves to AMG methods. Note that the same considerations regarding sparsity also apply to the high-order and low-order-refined mass matrices (denoted $M_{V_{p}}$ and $M_{V_{h}}$ , respectively). These mass matrices are spectrally equivalent (with constants of equivalence independent of polynomial degree p), and so low-order-refined preconditioning can also be applied to problems involving non-negative linear combinations of the mass and stiffness matrices.
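The way spectral equivalence transfers a preconditioner from the low-order system to the high-order system can be made explicit. The following is the standard Rayleigh-quotient argument, sketched under the assumption that B_h is symmetric positive definite; the constant names c and C are introduced here for illustration only.

```latex
% Spectral equivalence, with constants 0 < c \le C independent of h and p:
c \, x^{T} A_{V_{h}} x \;\le\; x^{T} A_{V_{p}} x \;\le\; C \, x^{T} A_{V_{h}} x
\qquad \text{for all } x .

% The eigenvalues of B_h A are Rayleigh quotients x^T A x / x^T B_h^{-1} x, and
% \frac{x^T A_{V_p} x}{x^T B_h^{-1} x}
%   = \frac{x^T A_{V_p} x}{x^T A_{V_h} x} \cdot \frac{x^T A_{V_h} x}{x^T B_h^{-1} x},
% so that
c \, \lambda_{\min} (B_{h} A_{V_{h}}) \le \lambda_{\min} (B_{h} A_{V_{p}}),
\qquad
\lambda_{\max} (B_{h} A_{V_{p}}) \le C \, \lambda_{\max} (B_{h} A_{V_{h}}),

% and therefore
\kappa (B_{h} A_{V_{p}}) \;\le\; \frac{C}{c} \, \kappa (B_{h} A_{V_{h}}) .
```

In other words, the condition number of the preconditioned high-order system exceeds that of the preconditioned low-order system by at most the factor C/c, which does not depend on mesh size or polynomial degree.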

Vector finite elements

While the approach described above can be used to obtain spectrally equivalent low-order-refined preconditioners using nodal H¹ discretizations for Poisson problems, the situation is more delicate for Maxwell and grad-div problems in H (curl) and H (div). In these spaces, nodal bases using Gauss–Lobatto or Gauss–Legendre points do not give rise to spectrally equivalent low-order discretizations, and as a result, the condition number of the preconditioned system grows rapidly, leading to large iteration counts.

Instead, an interpolation–histopolation basis for the high-order H (curl) and H (div) spaces must be used (Dohrmann, 2021; Pazner et al., 2022). The resulting basis functions are piecewise polynomials that take on prescribed mean values over subcell edges (in the case of H (curl)) or subcell faces (in the case of H (div)). These basis functions were introduced by Kreeft et al. (2011) in the context of mimetic methods. When this basis is used, spectral equivalence of the high-order and low-order-refined stiffness matrices (and mass matrices) is recovered for vector finite element spaces. The construction of low-order-refined preconditioners for this case is discussed at length by Pazner et al. (2022). We note that the lowest-order case of the interpolation–histopolation bases reduces exactly to the standard lowest-order Nédélec and Raviart–Thomas elements. As in the case of the Poisson problem, any effective preconditioner constructed using the low-order-refined system matrix can be used to effectively precondition the high-order system. However, even in the low-order case, the construction of preconditioners for these problems is more challenging. Typically, multigrid methods with specialized smoothers (Arnold et al., 2000) or auxiliary space methods (Hiptmair and Xu, 2007) are required to achieve good convergence. Analogous to the use of algebraic multigrid for the Poisson problem, in this work we focus on the use of auxiliary space AMG methods for these problems; in particular, we use the AMG-based auxiliary Maxwell solver (AMS) for H (curl) problems (Kolev and Vassilevski, 2009), and the auxiliary divergence solver (ADS) for H (div) problems (Kolev and Vassilevski, 2012). These solvers are based on combining auxiliary space methods with algebraic multigrid, and are designed to work as black-box methods, requiring relatively little discretization information beyond the assembled system matrix. The AMS H (curl) solver requires two additional inputs: a vector of mesh vertex coordinates (used to construct the interpolation operator mapping the vector H¹ space to the Nédélec space; this interpolation matrix can also be provided directly) and a discrete gradient matrix. In addition to the discrete gradient and interpolation, the ADS H (div) solver requires a discrete curl matrix, mapping from H (curl) into H (div). The construction of these discrete matrices is also discussed in the following sections.

Adaptive mesh refinement and hanging node constraints

In the case of meshes with hanging nodes (e.g., those resulting from nonconforming adaptive mesh refinement), the construction of the appropriate low-order-refined discretization is less obvious. In this case, the previously described refinement procedure to obtain the low-order-refined mesh results in non-matching interfaces that do not directly lend themselves to the construction of spectrally equivalent low-order discretizations. See Figure 2 for an illustration of non-matching interfaces resulting from the refinement of meshes with hanging nodes. In such cases, it is unclear how to assemble a spectrally equivalent low-order discretization directly on the refined mesh.

Figure 2.

Illustration of nonconforming meshes with hanging nodes and their low-order-refined counterparts.

Pazner and Kolev (2021) proposed an approach for constructing low-order-refined preconditioners for discretizations posed on meshes with hanging nodes based on variational restriction. In this approach, the high-order system is written as

A_{p} = Λ^{T} {\hat{A}}_{p} Λ,

(2)

where Λ is the matrix that enforces the constraints at element interfaces and ${\hat{A}}_{p}$ is a block-diagonal matrix acting on vectors of unconstrained (duplicated) degrees of freedom with element matrices ${\hat{A}}_{p, i}$ on the diagonal. The matrix Λ is often known as an assembly matrix; in the case where the mesh contains no hanging nodes (and without p-refinement), the matrix Λ is a Boolean matrix encoding the local (element-wise) to global DOF mapping. When the mesh contains hanging nodes, constraints must be enforced at the nonconforming interfaces, resulting in a more complicated Λ matrix; see also Červený et al. (2019) for more details.

The low-order-refined counterpart ${\hat{A}}_{h, i}$ of each element matrix ${\hat{A}}_{p, i}$ can be assembled to obtain a block-diagonal matrix ${\hat{A}}_{h}$ . The diagonal blocks of ${\hat{A}}_{h}$ are sparse, whereas the blocks of ${\hat{A}}_{p}$ are in general dense. Then, a low-order-refined preconditioner can be formed by

A_{h} = Λ^{T} {\hat{A}}_{h} Λ

(3)

where the same matrix Λ is used in (2) and (3). In the case of conforming meshes, the matrix A_h defined by (3) is identical to that obtained through the process described in the preceding section. When the mesh is nonconforming, a simple variational argument shows that A_h is spectrally equivalent to A_p (independent of polynomial degree), and therefore can be used as a preconditioner. The implementation of such preconditioners requires only the construction of the constraint matrix Λ, and the computation of the triple product in (3). Then, AMG or other algebraic preconditioners can be constructed using the assembled matrix A_h, and used to precondition the high-order system defined by A_p. Note that this algebraic approach can be used to construct spectrally equivalent low-order preconditioners whenever the high-order operator is posed as a variational restriction with a constraint matrix, taking the form of (2). For example, the same approach could be used in the case of variable polynomial degrees or nonconforming hp-refinement.
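The triple product in (3) can be illustrated on a toy example. The sketch below uses dense lists of lists and two one-dimensional linear elements purely for readability; the helper names (`block_diag`, `variational_restriction`) and the constrained matrix `Lam_constrained` are hypothetical and are not taken from any library. A practical implementation would use sparse matrices.

```python
def matmul(A, B):
    """Dense matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def block_diag(blocks):
    """Assemble the block-diagonal matrix A_hat from element-local blocks."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[offset + i][offset + j] = v
        offset += len(b)
    return out

def variational_restriction(Lam, blocks):
    """Return Lam^T * blockdiag(blocks) * Lam."""
    return matmul(transpose(Lam), matmul(block_diag(blocks), Lam))

# Local stiffness matrix of a 1D linear element of unit length.
K = [[1.0, -1.0], [-1.0, 1.0]]

# Conforming case: Lam is Boolean and maps 3 global DOFs to 4 duplicated DOFs.
Lam = [[1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 0, 1]]
A_h = variational_restriction(Lam, [K, K])

# Hypothetical constrained case: the third duplicated DOF is interpolated as
# the average of two global DOFs, mimicking a hanging-node constraint row.
Lam_constrained = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, 0.0, 1.0]]
A_c = variational_restriction(Lam_constrained, [K, K])
```

In the conforming case the result is the familiar tridiagonal stiffness matrix. In the constrained case the product remains symmetric and still annihilates constant vectors, because each row of the constraint matrix sums to one.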

GPU algorithms

The LOR-based solution procedure can broadly be divided into two phases. The setup phase, performed before the beginning of the conjugate gradient iteration, consists of problem setup, matrix assembly, and construction of the algebraic multigrid preconditioner. The solve phase consists of the preconditioned conjugate gradient iteration, which at every step requires the matrix-free application of the high-order operator and the application of the algebraic multigrid V-cycle. In this section, we describe the approach for GPU-accelerated algorithms for each of these components. We summarize the algorithmic steps that make up the solution algorithm:

S1. Problem setup:

S1.1. High-order operator setup;

S1.2. Low-order-refined matrix assembly;

S1.3. Algebraic multigrid setup.

S2. Preconditioned conjugate gradient, including at every iteration:

S2.1. High-order operator evaluation;

S2.2. Algebraic multigrid V-cycle.

Steps S1.1 and S2.1 are the main algorithmic components for the fast evaluation of matrix-free high-order operators, as described in detail in Anderson et al. (2020); Kolev et al. (2021a); Abdelfattah et al. (2021); Brown et al. (2021); and Kronbichler and Kormann (2019), among others; efficient GPU-accelerated algorithms and implementations for these operations have been extensively discussed in the literature. Likewise, Steps S1.3 and S2.2 make up the two phases of algebraic multigrid methods, and efficient GPU strategies have been discussed in Haase et al. (2010); Falgout et al. (2021); Bell et al. (2012); and Naumov et al. (2015). Although we will briefly summarize the important aspects of these algorithms in the following sections, the primary contribution of this work is the development of efficient, GPU-suitable algorithms for Step S1.2. Prior to the development of efficient macro-element strategies for the LOR assembly, this step represented the bottleneck in the solution algorithm. Using the algorithms described in the following sections, this component is reduced to a small fraction of the overall runtime (see Figure 5).
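The structure of Step S2 can be sketched as a generic preconditioned conjugate gradient loop in which both the operator and the preconditioner are supplied as callbacks. This is a minimal serial illustration, not the implementation used in this work: the names `pcg`, `apply_A`, and `apply_B` are hypothetical, a one-dimensional Laplacian stands in for the matrix-free high-order operator, and a Jacobi scaling stands in for the AMG V-cycle.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def pcg(apply_A, apply_B, b, rel_tol=1e-10, max_iter=200):
    """Preconditioned CG for an SPD operator; returns (solution, iterations)."""
    x = [0.0] * len(b)
    r = list(b)                       # residual for the zero initial guess
    z = apply_B(r)
    d = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        Ad = apply_A(d)               # Step S2.1: operator evaluation
        alpha = rz / dot(d, Ad)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ad)]
        if math.sqrt(dot(r, r)) <= rel_tol * b_norm:
            return x, it
        z = apply_B(r)                # Step S2.2: preconditioner application
        rz_new = dot(r, z)
        d = [zi + (rz_new / rz) * di for zi, di in zip(z, d)]
        rz = rz_new
    return x, max_iter

def laplacian_1d(v):
    """Matrix-free action of tridiag(-1, 2, -1); stand-in for the high-order operator."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0) for i in range(n)]

def jacobi(r):
    """Diagonal scaling; stand-in for one AMG V-cycle."""
    return [0.5 * ri for ri in r]

rhs = [1.0] * 10
solution, iterations = pcg(laplacian_1d, jacobi, rhs)
```

Since the loop only ever calls the two callbacks, the high-order matrix never needs to be formed, and the preconditioner can be built from a different (low-order-refined) matrix.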

Step S1.1. High-order operator setup

To achieve good performance with high-order discretizations, especially on GPUs, the high-order operator is evaluated matrix-free; the discretization matrix associated with the operator is never assembled. Instead, we make use of a matrix-free approach termed partial assembly. The high-order operator, denoted A_p, is decomposed as a nested product of matrices

A_{p} = P^{T} G^{T} B^{T} D B G P,

(4)

(see also Figure 3). This decomposition is used heavily in libCEED (Brown et al., 2021), MFEM (Anderson et al., 2020), and associated software. In the above, P denotes the parallel prolongation operator (resulting from the parallel decomposition of the spatial domain), which maps global parallel vectors (referred to as T-vectors) to processor-local vectors (L-vectors); G denotes the restriction from an L-vector of conforming degrees of freedom to a (“broken”) vector of element-wise degrees of freedom (referred to as an E-vector); B is the basis operator, which evaluates the shape functions (and their derivatives) at quadrature points; and D is a pointwise (diagonal or block-diagonal) matrix that contains the geometric factors of the mesh and point values of problem-specific variable coefficients.

Figure 3.

Schematic of the high-order operator decomposition used in the partial assembly approach in the MFEM and libCEED software libraries.

In the partial assembly framework, the product in (4) is not formed, and instead the individual factors are applied sequentially. The factors are themselves typically not stored as sparse matrices. Instead, the matrices P and G make use of degree of freedom index mappings that are precomputed and stored; these mappings depend only on the mesh and finite element space, and not on the specific operator or physics of the problem. In the case of nonconforming adaptive mesh refinement, the P matrix also incorporates the AMR constraints, and is represented as a parallel CSR matrix; when the mesh is conforming, P is represented matrix-free. The basis operator B is identical between all elements of the same type (geometry and polynomial degree). In this work, we consider meshes consisting of tensor-product elements with constant polynomial degree, in which case one element-local B matrix can be stored and used for all elements in the mesh. Additionally, the tensor-product structure of quadrilateral and hexahedral elements allows for B to be expressed as a Kronecker product, requiring only the storage of small one-dimensional factors. The approach is closely related to sum factorization, which has been widely adopted in the areas of spectral and spectral element methods (Orszag, 1980). Finally, the D matrix consists of either scalar values or small matrices (d × d, where d is the spatial dimension) at each quadrature point. These values are precomputed and stored as part of the operator setup.
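The benefit of the Kronecker-product structure of B can be seen in a small two-dimensional sketch. The function names below are hypothetical, and the one-dimensional matrix `B1d` is arbitrary test data rather than an actual basis matrix. The first function applies the full Kronecker product entry by entry; the second obtains the same result with two successive one-dimensional contractions, which is the idea behind sum factorization.

```python
def apply_kron_explicit(B, u):
    """Apply (B kron B) to u, forming every entry of the Kronecker product.

    B has q rows (quadrature points) and n columns (1D DOFs); u has length
    n*n with lexicographic index j*n + i. Cost: O(q^2 n^2).
    """
    q, n = len(B), len(B[0])
    return [sum(B[b][j] * B[a][i] * u[j * n + i]
                for j in range(n) for i in range(n))
            for b in range(q) for a in range(q)]

def apply_kron_sumfact(B, u):
    """Same result via two 1D contractions. Cost: O(q n^2 + q^2 n)."""
    q, n = len(B), len(B[0])
    # Contract the x index: w[j][a] = sum_i B[a][i] * u[j*n + i]
    w = [[sum(B[a][i] * u[j * n + i] for i in range(n)) for a in range(q)]
         for j in range(n)]
    # Contract the y index: v[b*q + a] = sum_j B[b][j] * w[j][a]
    return [sum(B[b][j] * w[j][a] for j in range(n))
            for b in range(q) for a in range(q)]

B1d = [[1, 2], [3, 4], [5, 6]]   # arbitrary 3 x 2 test matrix
u = [1, 2, 3, 4]
v_explicit = apply_kron_explicit(B1d, u)
v_sumfact = apply_kron_sumfact(B1d, u)
```

Only the small matrix `B1d` needs to be stored, and in d dimensions the contraction count grows like p^(d+1) per element rather than p^(2d) when the number of quadrature points is proportional to p.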

Partial assembly refers to the precomputation of the degree of freedom index mappings required for P and G, the one-dimensional interpolation and derivative matrices required for B, and the construction of the D matrix. Given the geometric factors at quadrature points, all of these precomputations can be performed in constant time per degree of freedom (i.e., linear time in the problem size, regardless of the polynomial degree of the finite element space). The evaluation of the geometric factors itself can be performed using sum factorization, resulting in a scaling of $O (p_{g}^{d + 1})$ , where p_g is the polynomial degree associated with the element transformation mapping (p_g = 1 for straight-sided meshes, p_g = p for isoparametric curved meshes).
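As a concrete illustration of the decomposition in (4), the following sketch applies a one-dimensional diffusion operator using only the precomputed pieces. It assumes a single process (so P is the identity), linear elements on a uniform mesh, and two-point Gauss quadrature; the function names `setup_partial_assembly` and `apply_partial_assembly` are hypothetical and do not correspond to the MFEM or libCEED interfaces.

```python
def setup_partial_assembly(ne, length=1.0):
    """Precompute the index map for G, the 1D matrix B, and the diagonal D."""
    h = length / ne
    gather = [(e, e + 1) for e in range(ne)]   # G: L-vector -> E-vector indices
    B = [[-0.5, 0.5], [-0.5, 0.5]]             # d(shape)/d(xi) at 2 quadrature points
    weights = [1.0, 1.0]                       # 2-point Gauss weights on [-1, 1]
    # D at each point: weight * (d xi / d x)^2 * det(J) = w * (2/h)^2 * (h/2)
    D = [[w * 2.0 / h for w in weights] for _ in range(ne)]
    return gather, B, D

def apply_partial_assembly(gather, B, D, x):
    """Compute y = G^T B^T D B G x without forming any matrix."""
    y = [0.0] * len(x)
    for e, dofs in enumerate(gather):
        xe = [x[i] for i in dofs]                                      # G
        g = [sum(B[q][a] * xe[a] for a in range(2)) for q in range(2)]  # B
        g = [D[e][q] * g[q] for q in range(2)]                         # D
        ye = [sum(B[q][a] * g[q] for q in range(2)) for a in range(2)]  # B^T
        for a, i in enumerate(dofs):                                   # G^T
            y[i] += ye[a]
    return y

gather, B, D = setup_partial_assembly(4)
y = apply_partial_assembly(gather, B, D, [0.0, 1.0, 4.0, 9.0, 16.0])
```

Here the setup stores one small B matrix, one index pair per element, and one value of D per quadrature point, which mirrors the constant storage per degree of freedom described above. On a GPU the scatter in the last step would require atomic additions or a coloring strategy; the serial loop sidesteps that question.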

Step S1.2. Low-order-refined matrix assembly

Although the action of the high-order operator is computed matrix-free, in order to construct the low-order-refined algebraic multigrid preconditioners, it is required to assemble the system matrix associated with the low-order-refined discretization. This matrix has the same dimensions as that of the high-order operator, but is significantly sparser: while the number of nonzeros per row of the high-order matrix scales like $O (p^{d})$ , the number of nonzeros per row in the low-order-refined matrix remains bounded, independent of p. As a result, the assembly cost and memory usage of the low-order-refined matrix are linear in the problem size, which is optimal.
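The difference in sparsity is easy to quantify for a degree of freedom in the interior of an element. The short sketch below uses hypothetical helper names and counts couplings for the scalar H¹ case only: an element-interior high-order basis function couples to every basis function of its element, while a low-order-refined vertex couples only to its immediate lattice neighbors.

```python
def high_order_row_nnz(p, d):
    """Row length for an element-interior DOF: all (p+1)^d DOFs of the element."""
    return (p + 1) ** d

def lor_row_nnz(d):
    """Row length for an interior LOR vertex: the 3^d neighboring vertices."""
    return 3 ** d

# Compare the two counts in 3D for a few polynomial degrees.
for p in (2, 4, 8):
    print(p, high_order_row_nnz(p, 3), lor_row_nnz(3))
```

The low-order-refined count is the same for every p, which is why the storage for the assembled matrix grows only linearly with the number of degrees of freedom.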

While the asymptotic complexity of the matrix assembly is known to be optimal, the achieved throughput of these operations on GPU-based architectures depends greatly on the structure of the assembly algorithms, the choice of data structures, and the threading strategies. In particular, it has been widely shown that high-order finite element operations often achieve greater throughput for similarly sized problems when compared with their low-order counterparts, despite requiring more arithmetic operations (Kolev et al., 2021a; Franco et al., 2020). In large part, this is because typical finite element operations are memory-bound, and the cost of performing arithmetic operations is negligible compared with the cost of memory transfer and access. High-order operators lend themselves to more structured and efficient memory access patterns than low-order operators, leading to greater performance. This presents a challenge for the assembly of low-order-refined matrices, which inherently use lowest-order finite elements.

The approach taken in this work is to structure the assembly of the low-order-refined matrices around macro-elements. In the context of low-order-refined methods, each element of the coarse, high-order mesh $T_{p}$ is refined into p^d subelements to obtain the refined mesh $T_{h}$ . The collection of p^d refined elements in $T_{h}$ corresponding to a single high-order element is referred to as a macro-element. An illustration of a macro-element is shown in Figure 4. The advantage of working at the level of macro-elements is that within each macro-element, the local mesh topology is that of a structured Cartesian mesh. Furthermore, this local mesh topology is identical between all macro-elements, repeated across the mesh. This structure allows for application of algorithmic components similar to those of high-order finite elements to the case of low-order-refined assembly.

Figure 4.

An example of a high-order mesh $T_{p}$ with a macro-element of the LOR mesh $T_{h}$ overlaid.

The overall matrix assembly algorithm proceeds according to the following steps, which are described in detail in the following sections.

A1. Assembly of sparse matrices for each macro-element.

A2. Assembly of processor-local sparse matrix.

A3. Assembly of global (parallel) sparse matrix.

A4. Elimination of boundary conditions.

Step A1. Macro-element sparse matrix assembly

Traditional algorithms for finite element assembly proceed by constructing small, independent element-local matrices (in “unassembled form”), and then placing them in the system matrix. This can be expressed algebraically as

A = \Lambda^{T} \hat{A} \Lambda,

where $\hat{A}$ denotes a block-diagonal matrix, where the ith diagonal block ${\hat{A}}_{i}$ consists of the local matrix of the ith element, and Λ is the assembly matrix introduced in the section on adaptive mesh refinement. In the case of low-order-refined methods, it is inefficient to work at the level of individual elements, because this would result in fully unstructured memory access, and additional duplication of matrix entries corresponding to shared degrees of freedom. Instead, we construct blocks ${\hat{A}}_{i}$ that are local to each macro-element. To take advantage of the sparsity of the LOR discretization, the macro-element matrices are stored in sparse rather than dense format. Because of the uniform Cartesian structure within each macro-element, the sparsity pattern for each block ${\hat{A}}_{i}$ is identical, and so the sparse matrix graph needs only be stored once for the entire problem.

Each block ${\hat{A}}_{i}$ is stored in a modified CSR format. In this format, the row array (denoted $\hat{I}$) is implicit, since as a result of the Cartesian structure, most rows have the same number of nonzeros (e.g., the local matrices of the low-order-refined diffusion operator have 9 nonzeros per row in 2D and 27 nonzeros per row in 3D, with the exception of rows corresponding to vertices that lie on the boundary of the macro-element, which have fewer nonzeros). We will use nnz_per_row to denote this bound on the number of nonzeros per row. The column pointer array (denoted $\hat{J}$) is computed once and shared between all macro-elements. It has shape nnz_per_row × ndof_per_el, where ndof_per_el is the number of degrees of freedom per macro-element. If a row has fewer than nnz_per_row nonzeros (i.e., the row corresponds to a degree of freedom lying on the macro-element boundary), then $\hat{J}$ is padded with −1, which represents an invalid index. The array of CSR nonzeros (denoted $\hat{A}$) is stored as a tensor of shape nnz_per_row × ndof_per_el × nel, where nel is the number of macro-elements in the mesh $T_{h}$ (i.e., the number of coarse elements in the mesh $T_{p}$).
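To make the padded column array concrete, the following minimal sketch builds $\hat{J}$ for a single 2D macro-element, assuming lexicographic vertex numbering and a 3 × 3 vertex stencil (so nnz_per_row = 9). The function name build_j_hat and the ordering of the stencil slots are illustrative choices, not taken from the implementation described here.

```python
NNZ_PER_ROW = 9  # 3 x 3 vertex stencil of the 2D low-order-refined diffusion operator


def build_j_hat(p):
    """Column-index array J_hat of shape nnz_per_row x ndof_per_el (2D, degree p).

    Entry j_hat[k][i] is the local column index stored in stencil slot k of
    local row i, or -1 if that neighbor lies outside the macro-element.
    """
    n1d = p + 1                      # LOR vertices per direction
    ndof_per_el = n1d * n1d
    j_hat = [[-1] * ndof_per_el for _ in range(NNZ_PER_ROW)]
    for i in range(ndof_per_el):
        ix, iy = i % n1d, i // n1d   # lexicographic vertex position
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                jx, jy = ix + dx, iy + dy
                if 0 <= jx < n1d and 0 <= jy < n1d:
                    k = 3 * (dy + 1) + (dx + 1)   # hypothetical slot ordering
                    j_hat[k][i] = jx + n1d * jy
    return j_hat


j_hat = build_j_hat(3)               # 4 x 4 = 16 vertices per macro-element
interior_row = [j_hat[k][5] for k in range(NNZ_PER_ROW)]   # vertex (1, 1)
corner_row = [j_hat[k][0] for k in range(NNZ_PER_ROW)]     # vertex (0, 0)
```

For an interior vertex all nine slots are valid, whereas the corner vertex keeps only four valid indices and the remaining slots hold the invalid index −1. Because the array depends only on p, it can be shared by every macro-element.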

The main work of Step A1 is to fill in the entries of the $\hat{A}$ array, since $\hat{I}$ is omitted, and $\hat{J}$ is independent of the number of elements in the mesh. This procedure is carried out on the GPU using one block of threads per macro-element. In 2D, the block of threads is of size p × p, corresponding to the p² subelements in each macro-element. In 3D, the block of threads is of size $p \times p \times \tilde{p}$ , where the last dimension $\tilde{p}$ may be lowered to avoid register spilling. The matrix associated with each subelement is assembled by one thread, and then placed in the $\hat{A}$ array, which is in global memory. Since we are presently concerned with symmetric problems, only one of the upper or lower triangular parts of the local matrix needs to be assembled. The placement into the $\hat{A}$ array requires atomic operations, because subelements sharing a common degree of freedom may be assembled by different threads.
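The accumulation pattern of Step A1 can be sketched serially as follows. The sketch assumes a 2D macro-element made of undistorted square subelements, for which the bilinear (Q1) Laplacian stiffness matrix is known in closed form; a real kernel would instead compute each subelement matrix from the geometric factors. Unlike the symmetric variant described above, the full local matrix is accumulated for simplicity, and the names K_LOC and assemble_macro_element are hypothetical.

```python
from fractions import Fraction

# Q1 Laplacian stiffness matrix on a square subelement, with the four vertices
# in lexicographic order (0,0), (1,0), (0,1), (1,1). Exact rationals keep the
# example free of rounding error.
K_LOC = [[Fraction(v, 6) for v in row] for row in (
    (4, -1, -1, -2),
    (-1, 4, -2, -1),
    (-1, -2, 4, -1),
    (-2, -1, -1, 4),
)]


def assemble_macro_element(p):
    """Fill a_hat[k][i] (nnz_per_row x ndof_per_el) for one 2D macro-element."""
    n1d = p + 1
    a_hat = [[Fraction(0)] * (n1d * n1d) for _ in range(9)]
    # One "thread" per subelement (sx, sy). On a GPU these run concurrently,
    # so the += below would have to be an atomic addition.
    for sy in range(p):
        for sx in range(p):
            verts = [(sx + a, sy + b) for b in (0, 1) for a in (0, 1)]
            for li, (ix, iy) in enumerate(verts):
                i = ix + n1d * iy                      # local row index
                for lj, (jx, jy) in enumerate(verts):
                    # Stencil slot of column (jx, jy) relative to row (ix, iy)
                    k = 3 * (jy - iy + 1) + (jx - ix + 1)
                    a_hat[k][i] += K_LOC[li][lj]
    return a_hat


a_hat = assemble_macro_element(3)
```

With this layout, the row of an interior vertex reproduces the familiar nine-point stencil (8/3 on the diagonal and −1/3 for each neighbor), while padded slots of boundary rows are never written and remain zero.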

In order to assemble the subelement matrices, geometric factors are required at the subelement vertices. In the completely unstructured approach, this would first require the computation of the mesh coordinates at each low-order-refined vertex in “broken” element-wise format (i.e., as an LOR E-vector), resulting in unnecessary duplication and inefficient access patterns. Instead, in the macro-element approach, the mesh coordinates are represented as high-order E-vectors, where all the coordinates corresponding to a macro-element are stored contiguously, with no duplication at the subelement level. This is more efficient both in terms of memory usage and access patterns, and also allows the reuse of the high-order G operator, as described in Step S1.1.

Step A2. Processor-local assembly

Once the macro-element sparse matrices ${\hat{A}}_{i}$ have been constructed, the next step is to assemble a processor-local matrix in CSR format. To determine the number of nonzeros for each degree of freedom, the nonzero indices in each macro-element sparse block are counted, taking care not to double-count nonzero indices corresponding to degrees of freedom that lie on macro-element interfaces. This deduplication is achieved by counting only those nonzero indices that correspond to the minimal macro-element index among all macro-elements containing the specified degree of freedom. This is performed using ndof_per_el × nel threads and, as before, atomic operations must be used to avoid conflicts. A scan operation is then performed to construct the processor-local $I$ array.

Once $I$ has been constructed, $J$ and $A$ are computed using the sparse blocks ${\hat{A}}_{i}$. Again, ndof_per_el × nel threads are used, and each thread loops over the nonzeros of the corresponding row of ${\hat{A}}_{i}$. The responsibility to add nonzero values to the CSR arrays is left to the thread belonging to the macro-element with the minimal index among all macro-elements with the given nonzero index. The column pointer array $\hat{J}$ is used to identify the column index for each nonzero. For vector finite elements in H (curl) and H (div), the orientation of the vector-valued basis functions is encoded in the sign of the degrees of freedom. At the macro-element level, because of the Cartesian structure, lexicographic ordering and orientation are used. At the processor-local level, the orientation is inherited from that of the mesh. This sparse matrix construction procedure is also responsible for ensuring consistent orientations during assembly.
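The two passes of Step A2 (counting with deduplication, then filling) can be illustrated by a serial sketch on a one-dimensional analogue, where each macro-element has p + 1 vertices, nnz_per_row = 3, and neighboring macro-elements share a single vertex. The 1D setting, the function names, and the way the owning thread sums the contributions of its neighbors are simplifying assumptions made for this example only.

```python
def macro_element_blocks(p, nel):
    """1D analogue of Step A1: padded J_hat (3 x ndof_per_el) and identical blocks."""
    ndof_el = p + 1
    j_hat = [[i + d if 0 <= i + d < ndof_el else -1 for i in range(ndof_el)]
             for d in (-1, 0, 1)]
    block = [[0.0] * ndof_el for _ in range(3)]
    for s in range(p):               # subelement s couples vertices s and s + 1
        block[1][s] += 1.0
        block[2][s] -= 1.0
        block[1][s + 1] += 1.0
        block[0][s + 1] -= 1.0
    return j_hat, [block] * nel


def assemble_local_csr(p, nel):
    ndof_el, n = p + 1, nel * p + 1
    j_hat, a_hat = macro_element_blocks(p, nel)
    elems = [set() for _ in range(n)]     # macro-elements containing each DOF
    for e in range(nel):
        for i in range(ndof_el):
            elems[e * p + i].add(e)

    def owned_entries(e, i):
        """Nonzeros of local row i that macro-element e is responsible for."""
        gi = e * p + i
        for k in range(3):
            j = j_hat[k][i]
            if j < 0:
                continue                  # padding entry
            gj = e * p + j
            sharers = elems[gi] & elems[gj]
            if e == min(sharers):         # minimal macro-element index owns it
                yield gj, sum(a_hat[s][k][gi - s * p] for s in sharers)

    counts = [0] * n                      # pass 1 (atomic increments on a GPU)
    for e in range(nel):
        for i in range(ndof_el):
            counts[e * p + i] += sum(1 for _ in owned_entries(e, i))
    I = [0] * (n + 1)                     # exclusive scan of the counts
    for r in range(n):
        I[r + 1] = I[r] + counts[r]
    J, A = [-1] * I[n], [0.0] * I[n]
    offset = list(I[:n])                  # pass 2: next free slot in each row
    for e in range(nel):
        for i in range(ndof_el):
            for gj, val in owned_entries(e, i):
                pos = offset[e * p + i]
                offset[e * p + i] += 1
                J[pos], A[pos] = gj, val
    return I, J, A


I, J, A = assemble_local_csr(2, 2)        # two macro-elements of degree 2
```

In this example the interface vertex (global index 2) receives exactly three nonzeros: the diagonal entry is counted once, by macro-element 0, and its value is the sum of the contributions from both macro-elements.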

Step A3. Global matrix assembly

When running on a single MPI rank (i.e., on a single GPU), this step can be skipped entirely. When running in parallel, the construction of the global (parallel) CSR matrix is required. In this format, the degrees of freedom are partitioned by MPI rank, and each rank owns the matrix rows associated with its degrees of freedom. Each rank has a diagonal block, which contains the nonzeros for which both the row and column indices are owned by itself, and an off-diagonal block, which contains the nonzeros for which the column indices are owned by another rank. The construction of this parallel CSR matrix is performed using a triple-product P^TAP. Because this operation (or RAP more generally) is important for the construction of coarse operators in algebraic multigrid, the hypre library provides optimized GPU kernels for computing triple products. This operation can further be optimized using the knowledge that in the case of conforming meshes, the P matrix is Boolean. For details on the algorithms used for this operation, see Falgout et al. (2021).
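For a Boolean P, the triple product reduces to renumbering and summing entries, which the following serial sketch illustrates for two ranks that share one vertex of a 1D mesh. The dictionary-based storage and the function names are assumptions made for readability; this is not hypre's algorithm, and in a real run the contribution of rank 1 to the shared row would have to be communicated.

```python
from collections import defaultdict


def boolean_rap(a_local, local_to_true):
    """Compute P^T A_L P when P is Boolean: entry (i, j) moves to (P(i), P(j))."""
    a = defaultdict(float)
    for (i, j), v in a_local.items():
        a[(local_to_true[i], local_to_true[j])] += v
    return dict(a)


def split_blocks(a, row_start, row_end):
    """Split the rows owned by one rank into diagonal and off-diagonal blocks."""
    diag, offd = {}, {}
    for (i, j), v in a.items():
        if row_start <= i < row_end:
            target = diag if row_start <= j < row_end else offd
            target[(i, j)] = v
    return diag, offd


def rank_block(offset):
    """Rank-local 1D Laplacian on two elements (three local DOFs)."""
    k = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
    return {(offset + i, offset + j): k[i][j]
            for i in range(3) for j in range(3) if k[i][j] != 0.0}


# Block-diagonal A_L: local DOFs 0-2 on rank 0 and 3-5 on rank 1.
a_local = {**rank_block(0), **rank_block(3)}
# Local DOFs 2 and 3 are the same vertex; rank 0 owns true DOFs 0-2.
local_to_true = [0, 1, 2, 2, 3, 4]
a_global = boolean_rap(a_local, local_to_true)
diag0, offd0 = split_blocks(a_global, 0, 3)
```

The shared vertex ends up with the diagonal value 2, and the only entry in the off-diagonal block of rank 0 is the coupling between true DOF 2 and true DOF 3, which is owned by rank 1.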

Step A4. Elimination of boundary conditions

Once the global matrix has been formed, the matrix must be modified to take into account essential boundary conditions. The rows and columns corresponding to each essential degree of freedom must be eliminated, and the corresponding diagonal entry is typically replaced by the value 1. The elimination of columns in parallel requires communication, because a column corresponding to a degree of freedom owned by one rank may have nonzeros in rows owned by other ranks. The first step of the elimination procedure is to communicate to each rank which columns it must eliminate. This communication is performed using non-blocking, device-aware MPI, whenever available. The column indices to eliminate are obtained in the form of a marker array, that is, a Boolean array with 1 in the indices that must be eliminated, and 0 elsewhere.

Once the communication has begun, the rows and columns in the diagonal block and the rows in the off-diagonal block can be eliminated, thus allowing for the overlap of communication and computation. The elimination is embarrassingly parallel, and is threaded over the number of essential degrees of freedom. After this procedure is complete, we wait for the communication to finish, and finally eliminate the columns in the off-diagonal block. Since this data is available in the format of a marker array, each nonzero in the off-diagonal block is simply scaled by one or zero depending on the marker value, allowing parallelization over the number of rows in the off-diagonal block.
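The three elimination stages can be sketched on plain CSR arrays as follows. The sketch is serial, omits the MPI exchange of the marker array, and assumes a symmetric sparsity pattern so that the entries of an eliminated column can be located from the entries of the corresponding row; all function names are hypothetical.

```python
def eliminate_diag_block(I, J, A, ess_dofs):
    """Eliminate rows and columns of essential DOFs in the diagonal block."""
    for r in ess_dofs:                       # one GPU thread per essential DOF
        for idx in range(I[r], I[r + 1]):
            c = J[idx]
            A[idx] = 1.0 if c == r else 0.0  # row r becomes a unit row
            if c != r:
                # Symmetric pattern: entry (c, r) exists whenever (r, c) does.
                for idx2 in range(I[c], I[c + 1]):
                    if J[idx2] == r:
                        A[idx2] = 0.0


def eliminate_offd_rows(I_offd, A_offd, ess_dofs):
    """Zero the off-diagonal rows of essential DOFs (needs no communication)."""
    for r in ess_dofs:
        for idx in range(I_offd[r], I_offd[r + 1]):
            A_offd[idx] = 0.0


def eliminate_offd_columns(I_offd, J_offd, A_offd, col_marker):
    """Scale each off-diagonal nonzero by 0 or 1 according to the marker array."""
    for r in range(len(I_offd) - 1):         # one GPU thread per row
        for idx in range(I_offd[r], I_offd[r + 1]):
            A_offd[idx] *= 0.0 if col_marker[J_offd[idx]] else 1.0


# Diagonal block: 3 x 3 tridiagonal matrix; DOF 0 is essential.
I, J = [0, 2, 5, 7], [0, 1, 0, 1, 2, 1, 2]
A = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
# Off-diagonal block with two external columns; external column 0 is essential.
I_offd, J_offd, A_offd = [0, 1, 2, 3], [0, 1, 0], [5.0, 6.0, 7.0]

eliminate_diag_block(I, J, A, [0])
eliminate_offd_rows(I_offd, A_offd, [0])
eliminate_offd_columns(I_offd, J_offd, A_offd, [1, 0])   # after the marker arrives
```

The first two calls correspond to the work that can overlap with communication, whereas the last call can only run once the marker array for the external columns has been received.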

Auxiliary computations

The above sections describe the main algorithmic components of the low-order-refined matrix assembly. In the case of vector finite element problems in H (curl) and H (div), certain additional discretization information is required by the AMS and ADS algebraic solvers. In particular, both solvers require vectors of the low-order-refined mesh coordinates, from which hypre can build interpolation matrices that are used in the auxiliary space preconditioning method. Additionally, AMS requires a discrete gradient matrix, that maps functions in the low-order-refined H¹ finite element space (defined as values at the LOR mesh vertices) to their gradients in the corresponding H (curl) space (defined on edges of the LOR mesh). Likewise, ADS requires a discrete curl matrix, that maps functions in the low-order-refined H (curl) space to their curls in the corresponding H (div) space (defined on faces of the LOR mesh). Because of the construction of the interpolation–histopolation basis as described in the section on vector finite elements, the high-order and low-order-refined versions of these matrices exactly coincide, and can be constructed using purely topological information. That is, the matrices depend only on the topology of the high-order (coarse) mesh and the polynomial degree, and do not depend on the mesh geometry or other problem-specific parameters.

The LOR mesh vertex coordinate vectors can be computed efficiently using geometric information that is already available from the LOR matrix assembly in Step S1.2. The matrix assembly required the mesh coordinates in the form of a high-order E-vector; this vector contains the mesh coordinates required for the construction of the interpolation matrices, with duplications because of the E-vector format. Deduplicating this vector is performed efficiently using the element restriction degree of freedom index mappings (similarly to the action of G^T in Step S2.1); this operation can be performed in parallel with one thread per deduplicated DOF without requiring any MPI communication.

The discrete gradient matrix maps from LOR vertices to edges, such that the value 1 or −1 is placed at the (i, j) entry of the matrix, where the LOR vertex j is one of the endpoints of the edge i, and the sign is determined by the orientation of the edge. Because each LOR macro-element has a uniform Cartesian structure, a local vertex-to-edge mapping can be computed once, and reused for the entire mesh. The values are inserted into the correct locations using the index mappings from the element restriction operator. Note that grad-div problems in 2D can also use the AMS solver, but with a modified gradient matrix that computes the rotated gradient ∇^⊥ = (−∂_y, ∂_x). The construction of this operator largely follows the same structure. Pseudocode for the kernel used to construct the discrete gradient matrix in CSR format is shown in Algorithm 1. In this algorithm, the DOF maps element_map, hcurl_global_to_local, and h1_local_to_global encode the information used to compute the action of the high-order G operator (see Step S1.1). In the low-order-refined context, these are interpreted as macro-element maps. Finally, edge_to_vertex gives the indices of the two vertices corresponding to an (oriented) low-order-refined edge. Because each macro-element has identical Cartesian structure, this one mapping is used for all DOFs.

Algorithm 1. Construction of discrete gradient matrix in CSR format

N ← number of Nédélec DOFs
parallel for i ∈ {0, …, N − 1} do
    e ← element_map[i]
    i_loc ← hcurl_global_to_local[i]
    σ ← orientation[i]
    j_{0,loc} ← edge_to_vertex[0, i_loc]
    j_{1,loc} ← edge_to_vertex[1, i_loc]
    I[i] ← 2i
    J[2i] ← h1_local_to_global[e, j_{0,loc}]
    J[2i + 1] ← h1_local_to_global[e, j_{1,loc}]
    A[2i] ← −σ
    A[2i + 1] ← σ
end
I[N] ← 2N
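Algorithm 1 can be transcribed almost line by line into a serial Python sketch. To keep the example self-contained, it is exercised on a single 2D macro-element, so that the maps element_map, hcurl_global_to_local, and h1_local_to_global are trivial and all orientations are +1. The helper reference_edge_to_vertex and its edge numbering (x-directed edges first, then y-directed edges) are assumptions made for illustration.

```python
def reference_edge_to_vertex(p):
    """Tail and head vertex of every edge of one 2D macro-element of degree p."""
    n1d = p + 1
    tail, head = [], []
    for iy in range(n1d):                 # x-directed edges
        for ix in range(p):
            tail.append(ix + n1d * iy)
            head.append(ix + 1 + n1d * iy)
    for iy in range(p):                   # y-directed edges
        for ix in range(n1d):
            tail.append(ix + n1d * iy)
            head.append(ix + n1d * (iy + 1))
    return [tail, head]


def discrete_gradient(element_map, hcurl_global_to_local, orientation,
                      edge_to_vertex, h1_local_to_global):
    n = len(element_map)                  # number of Nedelec DOFs
    I, J, A = [0] * (n + 1), [0] * (2 * n), [0.0] * (2 * n)
    for i in range(n):                    # "parallel for": iterations are independent
        e = element_map[i]
        i_loc = hcurl_global_to_local[i]
        sigma = orientation[i]
        j0 = edge_to_vertex[0][i_loc]
        j1 = edge_to_vertex[1][i_loc]
        I[i] = 2 * i
        J[2 * i] = h1_local_to_global[e][j0]
        J[2 * i + 1] = h1_local_to_global[e][j1]
        A[2 * i] = -sigma
        A[2 * i + 1] = sigma
    I[n] = 2 * n
    return I, J, A


def apply_csr(I, J, A, x):
    return [sum(A[k] * x[J[k]] for k in range(I[r], I[r + 1]))
            for r in range(len(I) - 1)]


p = 2
e2v = reference_edge_to_vertex(p)
n_edges, n_verts = len(e2v[0]), (p + 1) ** 2
I, J, A = discrete_gradient([0] * n_edges, list(range(n_edges)),
                            [1.0] * n_edges, e2v, [list(range(n_verts))])
x_coord = [float(v % (p + 1)) for v in range(n_verts)]   # x-coordinate of each vertex
grad_x = apply_csr(I, J, A, x_coord)
```

Applied to the vertex x-coordinates, the resulting matrix returns 1 on every x-directed edge and 0 on every y-directed edge, and it annihilates constants, as expected of a discrete gradient built from topology alone.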

Similarly, the discrete curl matrix maps from LOR edges to faces, such that the value 1 or −1 is placed at the (i, j) entry of the matrix, where the LOR edge j is one of the edges of the LOR face i. The sign is determined by the orientation of the mesh edge relative to that of the mesh face. The same consideration as in the case of the discrete gradient matrix also applies here; a local version of the mapping is computed on a reference macro-element, and reused for the entire mesh. The index mapping information from the element restriction operator is used to assemble the global operator.

Step S1.3. Algebraic multigrid setup

After Step S1.2 has been performed, the parallel CSR matrix can be passed to an algebraic multigrid library to perform the AMG setup. In this work, we use hypre’s AMG implementation, but any GPU-capable algebraic multigrid library could equally well be used, for example, AmgX. Because of the sparsity of the low-order-refined matrix, the construction of the AMG hierarchy is generally efficient, and the resulting operator complexities are not prohibitively high. One of the main components of the algebraic multigrid setup is the construction of the coarse-grid operators, which is performed using triple-product (P^TAP) operations. This is the same operation that is used for the global (parallel) LOR matrix assembly in Step A3, and so both the algebraic multigrid setup and low-order-refined assembly benefit from improvements to the triple-product algorithm. Details on the development and porting of the AMG setup algorithms to GPU-based architectures are described in Falgout et al. (2021).

Step S2.1. High-order operator evaluation

At each conjugate gradient iteration, the action of the high-order operator must be applied. As described in Step S1.1, we use the partial assembly approach for the high-order operator. Given the decomposition (4), the action of the high-order operator A_p can be computed by successively applying the operators P, G, B, and D, and the transposes B^T, G^T, and P^T. In the case of a conforming mesh, the matrices P and G are Boolean matrices, which are stored as mappings between local and distributed degree of freedom indices; the case of nonconforming adaptive refinement results in more complicated prolongation operators that are treated differently. The action of P is computed with MPI communication using these index mappings. When possible, device-aware MPI is used for these communication routines in order to reduce memory transfer between host and device. In MFEM, this is achieved using the ConformingProlongationOperator class. The element restriction operator G, represented in MFEM with the class ElementRestriction, maps from rank-local vectors (referred to as L-vectors in CEED terminology) to element-wise vectors (referred to as E-vectors). This is a type of gather operation that can be performed on device using a gather map, threaded over the number of E-vector degrees of freedom. The action of the transpose G^T is a scatter-type operation, that similarly can be performed efficiently on-device using the same threading strategy.
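The gather and scatter operations associated with a Boolean G can be sketched in a few lines. The example below assumes two 1D quadratic elements that share one degree of freedom; the function names are illustrative and do not correspond to MFEM's interface.

```python
def gather(gather_map, x_l):
    """Action of G: copy L-vector entries into the element-wise E-vector."""
    return [x_l[g] for g in gather_map]      # one thread per E-vector DOF


def scatter_add(gather_map, y_e, n_ldofs):
    """Action of G^T: sum E-vector entries back into the L-vector."""
    x_l = [0] * n_ldofs
    for k, g in enumerate(gather_map):
        x_l[g] += y_e[k]                     # atomic add if threaded over E-DOFs
    return x_l


# Two quadratic elements on a line: L-DOFs 0..4, DOF 2 shared by both elements.
gather_map = [0, 1, 2, 2, 3, 4]
x_l = [1, 2, 3, 4, 5]
y_e = [1, 1, 1, 1, 1, 1]

x_e = gather(gather_map, x_l)
y_l = scatter_add(gather_map, y_e, len(x_l))
lhs = sum(a * b for a, b in zip(x_e, y_e))   # (G x, y)
rhs = sum(a * b for a, b in zip(x_l, y_l))   # (x, G^T y)
```

The two inner products agree, which confirms that the scatter-add is the transpose of the gather; the shared degree of freedom is duplicated by G and summed by G^T.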

The action of the triple-product B^TDB is implemented as a single kernel that operates on the E-vector level. In MFEM, this action is performed by an operator-specific class derived from BilinearFormIntegrator. The specific form of the basis operator B depends on the finite element bilinear form being evaluated. For example, the mass matrix requires the evaluation of basis functions at quadrature points, for which B takes the form of a Kronecker product

\begin{array}{ll} B = B_{1D} \otimes B_{1D} & \text{in 2D}, \\ B = B_{1D} \otimes B_{1D} \otimes B_{1D} & \text{in 3D}, \end{array}

(5)

where B_1D is the one-dimensional basis evaluation operator. On the other hand, diffusion-type operators require the computation of the gradient of basis functions at quadrature points, for which B can be written as

\begin{array}{ll} B = \begin{pmatrix} B_{1D} \otimes \partial_{1D} \\ \partial_{1D} \otimes B_{1D} \end{pmatrix} & \text{in 2D}, \\ B = \begin{pmatrix} B_{1D} \otimes B_{1D} \otimes \partial_{1D} \\ B_{1D} \otimes \partial_{1D} \otimes B_{1D} \\ \partial_{1D} \otimes B_{1D} \otimes B_{1D} \end{pmatrix} & \text{in 3D}, \end{array}

where ∂_1D is the one-dimensional basis differentiation operator. Owing to the tensor-product structure of the B operator, the action of B and B^T can be computed efficiently using sum factorization algorithms (Orszag, 1980; Van Loan, 2000). Only the one-dimensional matrices B_1D and ∂_1D are formed explicitly, and the action of the Kronecker product matrix–vector products is computed on the fly.
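The idea behind sum factorization can be checked on a small 2D example: rather than forming the Kronecker product B_1D ⊗ B_1D, the one-dimensional matrix is applied along each tensor direction in turn. The sketch below uses plain Python lists and arbitrary integer data; the function names and the row-major ordering of the unknowns are assumptions made for this illustration.

```python
def kron(a, b):
    """Dense Kronecker product of two matrices stored as lists of rows."""
    return [[a[i1][j1] * b[i2][j2]
             for j1 in range(len(a[0])) for j2 in range(len(b[0]))]
            for i1 in range(len(a)) for i2 in range(len(b))]


def matvec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]


def apply_sum_factorized_2d(b1d, u):
    """Apply (b1d kron b1d) to u without forming the Kronecker product."""
    q, n = len(b1d), len(b1d[0])   # quadrature points and DOFs per direction
    # Contract the second (fastest) index: t[j1][i2] = sum_j2 b1d[i2][j2] u[j1, j2]
    t = [[sum(b1d[i2][j2] * u[j1 * n + j2] for j2 in range(n))
          for i2 in range(q)] for j1 in range(n)]
    # Contract the first index: v[i1, i2] = sum_j1 b1d[i1][j1] t[j1][i2]
    return [sum(b1d[i1][j1] * t[j1][i2] for j1 in range(n))
            for i1 in range(q) for i2 in range(q)]


b1d = [[1, 2], [3, 4], [5, 6]]     # 3 quadrature points, 2 DOFs (arbitrary data)
u = [1, 2, 3, 4]                   # n * n = 4 coefficients
dense = matvec(kron(b1d, b1d), u)
factorized = apply_sum_factorized_2d(b1d, u)
```

Both evaluations give the same vector, but the factorized version performs on the order of qn² + q²n multiply-adds instead of q²n² for the dense product, a gap that widens with polynomial degree and spatial dimension.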

The most common threading strategy for the products B^TDB is to use one block of threads per element, with threads corresponding to the quadrature points. The basis function values (or gradients) at quadrature points are stored in shared memory, together with the intermediate vectors required for the sum factorized computation of the matrix–vector products. The diagonal (or block-diagonal) matrix D is precomputed in Step S1.1, and its application trivially parallelizes over all quadrature points. The transposed basis operator B^T returns from quadrature points to (dual) degrees of freedom, and its application essentially follows the same steps as that of B in reverse order.

Step S2.2. Algebraic multigrid V-cycle

The application of the algebraic multigrid V-cycle (i.e., the AMG solve phase) is more amenable to straightforward GPU implementation than the setup phase. Each level in the multigrid hierarchy involves residual computation, point smoothing, and the application of restriction and prolongation matrices. These operations can entirely be cast as standard linear algebraic operations, that is, sparse matrix–vector (spmv) products and vector operations such as axpy. In hypre’s GPU AMG implementation, these sparse matrix–vector products are computed using vendor-provided libraries such as cuSPARSE and rocSPARSE. At the coarsest level, the system is small enough that it can be efficiently solved directly. For more details on the GPU acceleration of AMG methods, see Haase et al. (2010); Naumov et al. (2015); and Falgout et al. (2021).

Results and numerical experiments

In this section, we present several numerical results studying the performance of the GPU algorithms described above. These algorithms are implemented in the MFEM finite element software library (Anderson et al., 2020). The low-order-refined discretizations are formed using the class LORDiscretization (and its parallel counterpart ParLORDiscretization), and the GPU-accelerated matrix assembly is performed by the class BatchedLORAssembly. The class template LORSolver is used to create preconditioners using the LOR discretizations. The specializations LORSolver<HypreAMS> and LORSolver<HypreADS> handle the construction of the coordinate vectors, and discrete gradient and discrete curl matrices required by the AMS and ADS solvers. The algebraic multigrid solvers are provided by the hypre library. All results were obtained on LLNL’s Lassen supercomputer, each node of which has 4 NVIDIA V100 GPUs and 44 Power 9 CPU cores.

Algorithmic components

First we consider the relative weighting of the algorithmic components of the solver described above. We solve a definite Helmholtz (i.e., diffusion–reaction) problem on a Cartesian grid in 2D and 3D using polynomial degree p = 6. In 2D, the mesh consists of 262,144 elements, resulting in a total of 9,443,329 degrees of freedom. In 3D, the mesh consists of 32,768 elements, resulting in 7,189,057 degrees of freedom. Because in both cases the mesh is topologically Cartesian, the low-order-refined system matrix has 9 nonzeros per row in 2D, and 27 nonzeros per row in 3D (with the exception of degrees of freedom at the domain boundary), resulting in 84,953,089 nonzeros in 2D and 192,100,033 nonzeros in 3D. The linear system is solved using a relative tolerance of 10⁻¹². In 2D, this requires 32 CG iterations, and in 3D convergence is attained after 40 CG iterations. These problems are solved on a single V100 GPU. We also compare the GPU timings with the runtimes on one Power 9 CPU core. We emphasize that the purpose of this comparison is to study the difference in relative weightings of the algorithmic components on CPU and GPU, not to compare directly GPU to CPU performance. The runtimes for the algorithmic components of the solution procedure are shown in Figure 5.
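The problem sizes quoted above can be reproduced with a short calculation. The sketch assumes that the Cartesian grids have the same number of elements in each direction (512 per direction in 2D and 32 per direction in 3D, which matches the stated element counts) and that the matrix inherits the tensor-product structure of a 1D tridiagonal stencil; the function names are hypothetical.

```python
def h1_dofs(n_el_1d, p, dim):
    """Number of H1 DOFs on a Cartesian grid with n_el_1d elements per direction."""
    return (n_el_1d * p + 1) ** dim


def lor_nnz(n_el_1d, p, dim):
    """Nonzeros of the LOR matrix: a 1D tridiagonal pattern with n rows has
    3n - 2 nonzeros, and the tensor-product stencil multiplies the counts."""
    n = n_el_1d * p + 1
    return (3 * n - 2) ** dim


p = 6
sizes_2d = (512 ** 2, h1_dofs(512, p, 2), lor_nnz(512, p, 2))
sizes_3d = (32 ** 3, h1_dofs(32, p, 3), lor_nnz(32, p, 3))
```

Under these assumptions the element, degree of freedom, and nonzero counts agree exactly with the figures given in the text, including the reduced stencil size at the domain boundary.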

Figure 5.

Wall-clock runtimes for the algorithmic components enumerated above, solving a definite Helmholtz problem in H¹ with polynomial degree p = 6. Top row: GPU timings on one V100 GPU. Bottom row: CPU timings on one Power 9 core. Left column: 2D test case with 262,144 elements and 9,443,329 degrees of freedom, 33 CG iterations. Right column: 3D test case with 32,768 elements and 7,189,057 degrees of freedom, 40 CG iterations.

On the GPU, in both the 2D and 3D cases, around 75% of the time to solution was spent in the setup phase, with over 50% of the time to solution in AMG setup. The time spent assembling the LOR system was approximately 3% and 10% of the total runtime, in 2D and 3D, respectively. Note that the timings for the setup of the high-order operator also include the construction of the element restriction operator (see Step S1.1), which, because of the macro-element strategy described in Step S1.2, is reused for the LOR matrix assembly. Within the solve phase, over 80% of the runtime was spent in the AMG V-cycle in both cases, and only 16–18% in the high-order operator evaluation. These results suggest that the largest potential for further performance gains lies in optimizing the preconditioner construction and application. Interestingly, on CPU, the relative weight of the AMG setup is significantly reduced, and in contrast to the GPU runs, the large majority of the time is spent in the solve phase.

These results highlight one advantage of LOR-based methods: any improvements or optimizations to the algebraic multigrid implementation (or in fact any other preconditioner suitable for the low-order problem) can automatically benefit the high-order solvers. We further note that in many practical time-dependent problems, the mesh and problem coefficients remain fixed, either for the duration of the entire simulation, or for numerous time steps (see, e.g., Franco et al., 2020). In this context, the setup time is amortized over the many repeated applications of the solve phase. As a result, for these problems, the relatively expensive AMG setup may constitute a small fraction of the overall runtime. However, for problems which require performing the preconditioner setup every time step (e.g., problems incorporating mesh motion or time-dependent variable coefficients), the setup will remain a significant portion of the time to solution.

Kernel throughput

We study the dependence of the kernel throughput on polynomial degree and problem size. Working on a sequence of 3D Cartesian meshes, we measure the throughput (degrees of freedom per second) of the main computational kernels. First, we consider high-order operator setup and application. The MFEM library supports multiple backends that can be used for GPU evaluation. We compare MFEM’s native GPU backend to libCEED’s cuda-gen backend, which uses code generation to fuse the processor-local products G^TB^TDBG (cf., equation (4) and Brown et al., 2021). The results are shown in Figure 6. These high-order operations are independent of the choice of preconditioner, and have previously been studied in detail (see Kolev et al., 2021a,b; Fischer et al., 2020; Franco et al., 2020). As is typical with high-order finite elements, greater throughput is achieved for higher polynomial degrees. For p = 7, the operator setup (i.e., the construction of the D matrix described in Step S1.1) reaches a peak throughput of about 700 million degrees of freedom per second, and MFEM’s native operator application (using the algorithm described in Step S2.1) reaches a peak throughput of 1.6 billion degrees of freedom per second, while libCEED’s operator evaluation reaches a peak of 3.2 billion degrees of freedom per second.

Figure 6.

Throughput (millions of degrees of freedom per second) for high-order operator setup and evaluation for the definite Helmholtz problem on one V100 GPU. Left: high-order operator setup (partial assembly). Center: high-order operator evaluation using MFEM’s native kernels. Right: high-order operator evaluation using MFEM’s libCEED backend.

Next, we consider the throughput of the low-order-refined assembly algorithms. We compare the macro-element assembly strategy described in Step S1.2 to a more generic, fully unstructured matrix assembly algorithm that does not take advantage of the structure of the low-order-refined mesh. The more general, unstructured algorithm is suitable for assembling matrices corresponding to any polynomial degree, but does not contain any optimizations specific to the case of low-order-refined discretizations. This algorithm uses one block of threads per element, with one thread per degree of freedom, constructing small dense matrices for each element of the low-order-refined mesh, and then directly assembling the global sparse matrix (skipping the macro-element sparse matrix assembly in the LOR approach). The throughput plots comparing these two approaches are shown in the left and center columns of Figure 7.

Figure 7.

Throughput (millions of degrees of freedom per second) for low-order-refined matrix assembly and algebraic multigrid setup for the definite Helmholtz problem on one V100 GPU. Left: LOR matrix assembly using the macro-element approach. Center: LOR matrix assembly using the unstructured (legacy) approach. Right: algebraic multigrid setup.

The fully unstructured assembly algorithm shows no variation in throughput between polynomial degrees of the high-order space; this is expected because for equal problem size, the low-order-refined meshes corresponding to different high-order polynomial degrees are topologically identical, and differ only in the mesh distortion, which has no algorithmic impact on the matrix assembly. On the other hand, the macro-element assembly algorithm exhibits increasing throughput with high-order polynomial degree. As the degree of the high-order space increases, the size of each macro-element increases, and the matrix assembly becomes more structured, allowing for increased fine-grained parallelism, leading to higher performance. In the lowest-order case, the macro-element strategy performs worse than the unstructured algorithm. This is because in the case of p = 1, the macro-elements each only contain a single element, and so each block consists only of one thread. On the other hand, the unstructured algorithm threads over the E-vector degrees of freedom, resulting in eight threads per element. However, at p = 2, the performance of the macro-element strategy is roughly equal to that of the unstructured algorithm, and for p > 2, we observe meaningful performance improvements. For the highest orders, the macro-element approach is more than twice as fast as the unstructured approach. It is also important to note that this factor does not include the savings that are obtained by reusing the high-order mesh topology and element restriction operator. In the fully unstructured case, the overhead associated with constructing the low-order-refined mesh and element restriction operators can dominate the time spent in the matrix assembly, often by large factors.

The right column of Figure 7 shows the throughput of the algebraic multigrid setup as a function of problem size. As in the case of the unstructured LOR matrix assembly, there is essentially no dependence of the performance of these kernels on the polynomial degree of the high-order space. The resulting system matrices have the same sparsity pattern, independent of p. Differences in the values of the matrix entries can lead to different choices in the AMG coarsening, and hence will result in different AMG hierarchies, which has the potential to have a marginal impact on performance. In practice, these differences are found to be negligible. The results obtained here corroborate those shown previously. The bottleneck in the problem setup consists of the AMG setup phase, which reaches a peak throughput of about 1.5 million degrees of freedom per second for the largest problem sizes. Finally, we consider the AMG V-cycle, the throughput of which is shown in the left column of Figure 8. As in the case of AMG setup, the performance of these operations does not display any dependence on the polynomial degree. This is because the sparsity patterns for LOR systems of a given size are independent of the polynomial degree of the high-order space. The BoomerAMG V-cycle reaches a peak throughput of about 300 million degrees of freedom per second.

Figure 8.

Throughput (millions of degrees of freedom per second) for algebraic multigrid preconditioner application on one V100. Left: one BoomerAMG V-cycle. Center: application of the AMS preconditioner for H (curl). Right: application of the ADS preconditioner for H (div).

Nédélec and Raviart–Thomas elements

In this section, we consider the throughput of the kernels required to construct low-order-refined preconditioners for problems in H (curl) and H (div) using Nédélec and Raviart–Thomas elements. First, we consider the assembly of the LOR system matrix, for which throughput plots are shown in the left column of Figure 9. The assembly of the LOR Nédélec matrices performs roughly the same as the H¹ assembly (compare with Figure 7). However, the Raviart–Thomas assembly enjoys significantly higher throughput. This is largely because each Raviart–Thomas degree of freedom corresponds to a face of the low-order-refined mesh, whereas each Nédélec degree of freedom corresponds to an edge, and each H¹ degree of freedom corresponds to a vertex. Therefore, each RT DOF is shared by at most two elements, whereas in the case of a Cartesian mesh, Nédélec and H¹ DOFs are shared by four and eight elements, respectively. As a result, the RT system matrix enjoys greater sparsity, and the sparse matrix construction requires fewer atomic additions during assembly.

Figure 9.

Throughput (millions of degrees of freedom per second) for Nédélec and Raviart–Thomas problems. Left column: assembly of LOR system matrix. Center column: construction of discrete gradient and curl matrices. Right column: construction of coordinate vectors needed for discrete interpolation.

As described in the preceding sections, the AMS and ADS algebraic auxiliary space solvers for H (curl) and H (div) problems require the construction of discrete interpolation, gradient, and curl matrices. The interpolation matrix is constructed using vectors of the mesh vertex coordinates, and the discrete gradient and curl are constructed using only mesh topology information. The throughput for the construction of the discrete gradient and curl is shown in the center column of Figure 9. The peak throughput for the construction of the discrete gradient matrix is about 12 billion degrees of freedom per second, and the peak throughput for the construction of the discrete curl matrix is 4.5 billion degrees of freedom per second. To construct the coordinate vectors needed for the interpolation operator, first the high-order (coarse) mesh vertices are interpolated to the low-order-refined (Gauss–Lobatto) mesh vertices in E-vector format. This operation is essentially an element-wise tensor contraction using the tensor-product basis operator given in (5). The throughput for this operation is shown in the top-right plot of Figure 9. Once the coordinate vector has been obtained in E-vector format, it must be converted into T-vector (global, parallel) format. Conceptually, this operation corresponds to applying the element restriction operator to convert from E-vector to L-vector, and then applying the parallel restriction operator to convert from L-vector to T-vector. In practice, this can be done in a single step without any parallel communication. The throughput for this operation is shown in the bottom-right plot of Figure 9. Both coordinate vector operations achieve a peak throughput of over 22 billion degrees of freedom per second. The throughput for the application of the AMS and ADS preconditioners is shown in the center and right columns of Figure 8. The AMS and ADS preconditioners have lower throughput than one application of BoomerAMG, since each application of AMS and ADS requires multiple AMG V-cycles, in addition to applications of the discrete gradient, curl, and interpolation matrices (see Kolev and Vassilevski 2009, 2012). The peak throughput of AMS is 67 million degrees of freedom per second, and that of ADS is 41 million degrees of freedom per second.

Parallel scaling

In this section, we examine the parallel scalability of these algorithms. The construction of the process-local sparse matrices is performed independently on each MPI rank. These matrices are then placed in a global block-diagonal matrix, with one block per rank. The global system matrix is computed as A = P^TA_LP, where A_L is the block-diagonal matrix described above and P is the parallel prolongation matrix. For this triple product, we use the sparse matrix–matrix product GPU implementation available in hypre, the development of which is described in Falgout et al. (2021).
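The effect of the triple product A = P^TA_LP can be seen on a very small example. The sketch below uses dense lists instead of hypre's distributed sparse matrices; the two-rank setup, the unit 1D element stiffness matrix, and the helper names are hypothetical and chosen only to show how contributions to a shared degree of freedom are summed.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def block_diag(blocks):
    """Place square blocks on the diagonal of a larger zero matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, v in enumerate(row):
                out[offset + i][offset + j] = v
        offset += len(b)
    return out


# Hypothetical setup: two ranks, each holding one 1D linear element.
local = [[1, -1], [-1, 1]]
A_L = block_diag([local, local])      # one block per rank

# P maps 3 global (true) DOFs to 4 rank-local DOFs; the middle DOF is shared.
P = [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 0],
     [0, 0, 1]]

A = matmul(transpose(P), matmul(A_L, P))
```

In this example the middle row of A combines the contributions of both ranks, giving the familiar (−1, 2, −1) stencil, while the block-diagonal A_L itself contains no coupling between ranks.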

Once the global (parallel) sparse matrix has been constructed, the remaining parallel operations are AMG setup, AMG V-cycle, and high-order operator evaluation. For details on the parallelization of the BoomerAMG preconditioner, see Henson and Yang (2002). The approach taken to parallelize the high-order operator is described in Step S2.1, using the operator decomposition (4). Writing A_p = P^TG^TB^TDBGP, the process-local operator is given by A_L = G^TB^TDBG. The action of this linear operator is completely local, and requires no MPI communication. Therefore, the only operations that require parallel communication are the application of P and its transpose. For conforming meshes, the operator P is represented in MFEM as an object of type ConformingProlongationOperator. This class provides an optimized implementation of the action of P and P^T using device-aware MPI. When the mesh is nonconforming, P takes a more complicated form, and is represented as a parallel sparse matrix. In this case, the action of P and P^T is computed as parallel sparse matrix–vector products using the hypre library.
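The structure of the process-local operator G^TB^TDBG can be sketched for the simplest possible case. The example below is an illustrative assumption rather than the paper's kernels: it applies the 1D Poisson operator with linear elements on a uniform mesh of [0, 1], using one quadrature point per element, and treats P as the identity (a single process). The function name is hypothetical.

```python
def apply_local_operator(x, n):
    """Matrix-free y = G^T B^T D B G x for 1D Poisson with n linear elements."""
    h = 1.0 / n
    y = [0.0] * (n + 1)
    for e in range(n):
        # G: gather the element's two values from the L-vector (E-vector entry).
        x_left, x_right = x[e], x[e + 1]
        # B: reference-space gradient of the linear basis at the quadrature point.
        grad = -x_left + x_right
        # D: quadrature weight times geometric factors, 1 * h * (1/h)^2 = 1/h.
        flux = grad / h
        # B^T followed by G^T: scatter-add back into the L-vector.
        y[e] += -flux
        y[e + 1] += flux
    return y


y = apply_local_operator([0.0, 1.0, 4.0, 9.0, 16.0], n=4)
```

Every step inside the loop touches only data belonging to one element, so in a parallel setting the loop needs no communication; only the surrounding applications of P and P^T would.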

To measure the parallel scalability of our algorithms, we perform strong and weak scaling studies, using between 4 and 1024 GPUs. We consider eight problem configurations, with sizes increasing by factors of two from 8.4 × 10⁶ degrees of freedom to 1.1 × 10⁹ degrees of freedom. Each problem configuration is run on three sets of GPUs to obtain the strong scaling results. Collecting the timings across different problem configurations gives the weak scaling results. The different algorithmic components as outlined in the previous sections are instrumented separately. The scaling results are shown in Figure 10. Comparisons with ideal strong and weak scaling curves indicate excellent parallel scalability and efficiency for the high-order operator setup, low-order-refined matrix assembly, and high-order operator evaluation. The algebraic multigrid setup is the component of the algorithm that is most challenging for parallel scalability. The AMG V-cycle is less efficient for smaller problem sizes, but demonstrates good weak scalability for larger problem sizes. The overall weak scaling parallel efficiency for the 1.1 × 10⁹ problem on 256 GPUs was 83% for the setup phase (HO setup, LOR assembly, and AMG setup) and 87% for the solve phase (HO apply and AMG V-cycle).
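For reference, the problem sizes and the usual definitions of strong and weak scaling efficiency can be written down as follows. This is a sketch under standard definitions; the function names are assumptions, and the timings used in the example call are hypothetical placeholders, not measurements from this study.

```python
def problem_sizes(first=8.4e6, count=8):
    """Eight configurations, doubling in size from 8.4e6 degrees of freedom."""
    return [first * 2**k for k in range(count)]


def strong_scaling_efficiency(t_ref, gpus_ref, t, gpus):
    """Fixed problem size: the ideal runtime falls in proportion to gpus_ref / gpus."""
    return (t_ref * gpus_ref) / (t * gpus)


def weak_scaling_efficiency(t_ref, t):
    """Fixed work per GPU: the ideal runtime stays constant."""
    return t_ref / t


sizes = problem_sizes()
# Hypothetical timings: 8.0 s on 4 GPUs and 2.5 s on 16 GPUs for the same problem.
eff = strong_scaling_efficiency(8.0, 4, 2.5, 16)
```

Seven doublings of 8.4 × 10⁶ give about 1.08 × 10⁹, which matches the largest problem size quoted above after rounding.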

Figure 10.

Parallel scaling results for setup and solve phases from 4 to 1024 Tesla V100-SXM2 NVIDIA GPUs. Strong scaling results are given by the solid lines, and weak scaling results are given by the dashed lines. The ideal strong scaling curve is shown for reference. Ideal weak scaling corresponds to horizontal lines. The results were obtained with CUDA 11.7.0 and hypre 2.25.0, with optimized sparse matrix–matrix multiplications (SpGEMM).

Adaptive mesh refinement

To illustrate the performance of the low-order-refined solvers on a problem with adaptive mesh refinement, we consider a Poisson problem whose solution exhibits an inner layer with a sharp gradient. Let u = arctan(α(r − r₀)), where r is the radial distance and α is a sharpness parameter. We choose α = 20 and r₀ = 0.725. The right-hand side f and Dirichlet condition g_D are chosen such that u is the solution to the Poisson problem (1). We solve this problem on a mesh of the Fichera corner, which has been adaptively refined around the inner layer. The refined mesh is 1-irregular, and consists of 93,940 elements with 23,562 nonconforming interfaces. The solution and mesh are shown in Figure 11.
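The manufactured right-hand side can be derived in closed form. The sketch below makes two assumptions that are not stated in this section: that the Poisson problem (1) has the form −Δu = f with unit coefficient, and that r is the distance from the origin in three dimensions, so that Δu = u″ + 2u′/r for a radially symmetric u. The function names are hypothetical.

```python
import math

ALPHA = 20.0   # sharpness parameter
R0 = 0.725     # radius of the inner layer


def u(r):
    return math.atan(ALPHA * (r - R0))


def du(r):
    """First radial derivative of u."""
    s = ALPHA * (r - R0)
    return ALPHA / (1 + s * s)


def d2u(r):
    """Second radial derivative of u."""
    s = ALPHA * (r - R0)
    return -2 * ALPHA**2 * s / (1 + s * s) ** 2


def rhs(r):
    """f = -Laplace(u) for a radially symmetric function in 3D (assumed form)."""
    return -(d2u(r) + 2 * du(r) / r)
```

The radial derivative peaks at r = r₀, where it equals α, so increasing α sharpens the layer that the adaptive refinement has to resolve.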

Figure 11.

Mesh and solution of the inner layer problem with nonconforming adaptive mesh refinement, run on 12 V100 GPUs. Left: Fichera corner mesh. Right: mesh cut by plane normal to (1, 1, 1) passing through the point (0.2, 0.2, 0.2).

We solve this problem using polynomial degrees p = 1 through p = 7, using three nodes of Lassen with 12 V100 GPUs. The relative GPU runtimes of the algorithmic components are shown in Figure 12. For comparison, we also include the relative CPU runtimes. As in the previous test cases, the AMG setup represents the dominant portion of the total time to solution on the GPU. In this case, the setup is even more expensive because of the increased fill-in. Furthermore, the triple product P^TAP (where the matrix P represents both the parallel decomposition and the nonconforming constraints) represents an increasing fraction of the total runtime as the constraint matrix becomes increasingly coupled at higher orders. The high-order operator setup, low-order-refined matrix assembly (not including constraints enforced through the triple product), and high-order operator application (totaled over all the CG iterations) represent 10% or less of the time to solution for p ≥ 4. On the CPU, on the other hand, the relative costs of the AMG V-cycle and high-order operator evaluation are more significant. For p > 2, the AMG V-cycle represents the majority of the runtime on CPU, and the high-order operator evaluation represents roughly 15% of the CPU runtime. In contrast to the GPU runtimes, on CPU, the AMG setup and P^TAP operation each represent less than 10% of the total runtime.

Figure 12.

Results for diffusion problem on mesh with nonconforming adaptive refinement. Relative runtime of algorithmic components: left: GPU relative runtimes (12 V100 GPUs) and right: CPU relative runtimes (one Power 9 core).

The nonconforming refinement results in nontrivial constraints that are then incorporated into the prolongation operator Λ, resulting in greater fill-in, and a less sparse LOR matrix. The constraints at a nonconforming interface will generally couple all the high-order degrees of freedom lying at the adjacent faces, resulting in decreased sparsity with increasing polynomial degree (see Table 1). However, we emphasize that despite this modest decrease in sparsity, for p = 7, the LOR matrix A_h is still roughly an order of magnitude sparser than the corresponding high-order matrix A_p. The number of conjugate gradient iterations required to converge to a relative tolerance of 10⁻¹² remains roughly constant for p > 1, corroborating the spectral equivalence results for meshes with adaptive refinement.

Table 1.

Results for diffusion problem on mesh with nonconforming adaptive refinement. Problem size, LOR system sparsity, CG iterations required to converge to a relative residual of 10⁻¹², and wall-clock runtime (12 V100 GPUs on three nodes of Lassen).

p	DOFs	NNZ	NNZ per row	Its.	GPU runtime (s)
1	6.0 × 10⁴	1.7 × 10⁶	28	28	0.4
2	6.1 × 10⁵	2.2 × 10⁷	36	43	0.7
3	2.2 × 10⁶	8.8 × 10⁷	40	42	1.1
4	5.5 × 10⁶	2.3 × 10⁸	42	44	2.0
5	1.1 × 10⁷	5.0 × 10⁸	45	45	3.3
6	1.9 × 10⁷	9.2 × 10⁸	48	46	5.7
7	3.1 × 10⁷	1.6 × 10⁹	52	47	9.9
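The sparsity column of Table 1 can be cross-checked against the other two columns, since the average number of nonzeros per row is NNZ divided by the number of DOFs. The sketch below does this with the rounded table values, so agreement is only expected after rounding. The comparison with the high-order matrix rests on an assumption that is not taken from the table: an element-interior high-order DOF couples to all (p + 1)³ DOFs of its hexahedron, which gives a rough estimate of a typical row of A_p (rows for face, edge, and vertex DOFs are denser).

```python
# (p, DOFs, NNZ, reported NNZ per row), copied from Table 1.
TABLE_1 = [
    (1, 6.0e4, 1.7e6, 28),
    (2, 6.1e5, 2.2e7, 36),
    (3, 2.2e6, 8.8e7, 40),
    (4, 5.5e6, 2.3e8, 42),
    (5, 1.1e7, 5.0e8, 45),
    (6, 1.9e7, 9.2e8, 48),
    (7, 3.1e7, 1.6e9, 52),
]


def nnz_per_row(dofs, nnz):
    """Average number of nonzeros per row of the LOR matrix."""
    return nnz / dofs


def interior_row_nnz_high_order(p):
    """Assumed row length for an element-interior DOF of the high-order matrix."""
    return (p + 1) ** 3


recomputed = [round(nnz_per_row(dofs, nnz)) for _, dofs, nnz, _ in TABLE_1]
ratio_p7 = interior_row_nnz_high_order(7) / TABLE_1[-1][3]
```

Under this assumption, the ratio for p = 7 is 512/52, or a little under 10, which is in line with the statement that the LOR matrix is roughly an order of magnitude sparser than the high-order matrix.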

Large-scale electromagnetic diffusion

In this section, we use the GPU-accelerated solvers described above to solve a large-scale magnetic diffusion problem posed on a realistic geometry. This problem illustrates the use of low-order-refined BoomerAMG and AMS preconditioners in H¹ and H (curl), and the representation of vector fields in H (div), using the sparse discrete differential operators resulting from the interpolation–histopolation basis. We consider a charged copper coil in air, and solve for the resulting magnetic field using the A–ϕ potential formulation of magnetic diffusion (cf. Rieben and White, 2006). The domain Ω is the box $[-\frac{1}{2}, \frac{1}{2}] \times [-\frac{1}{2}, \frac{1}{2}] \times [-\frac{3}{4}, \frac{3}{4}]$, partitioned into non-overlapping regions representing the coil and air, Ω = Ω_coil ∪ Ω_air. The piecewise constant conductivity coefficient β is defined as

$$\beta = \begin{cases} 1 & \text{in } \Omega_{\text{coil}}, \\ 10^{-6} & \text{in } \Omega_{\text{air}}. \end{cases}$$

The coil intersects the domain boundary at two terminals, Γ_in and Γ_out, such that ∂Ω = Γ_in ∪ Γ_out ∪ Γ_box.

The current running through the copper coil is driven by a potential difference enforced as boundary conditions at the two terminals, Γ_in and Γ_out. The electric scalar potential ϕ is obtained as the solution to the Poisson problem

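$$\begin{aligned} \nabla \cdot \beta \nabla \phi &= 0 && \text{in } \Omega, \\ \phi &= \phi_{\text{in}} && \text{on } \Gamma_{\text{in}}, \\ \phi &= \phi_{\text{out}} && \text{on } \Gamma_{\text{out}}, \\ \frac{\partial \phi}{\partial n} &= 0 && \text{on } \Gamma_{\text{box}}. \end{aligned}$$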

(6)

For this problem, we take ϕ_in = 0 and ϕ_out = 1. After obtaining the electric scalar potential, the magnetic vector potential A is given as the solution to the curl–curl problem with homogeneous tangential Dirichlet conditions at all domain boundaries

$$\begin{aligned} \nabla \times \nabla \times A + \beta A &= -\beta \nabla \phi && \text{in } \Omega, \\ n \times A &= 0 && \text{on } \partial\Omega. \end{aligned}$$

(7)

The magnetic field B is given by the curl of the vector potential, B = ∇ × A.

The domain is discretized using a hexahedral mesh with element boundaries fitted to the air-coil interface. To obtain this mesh, first an unstructured tetrahedral mesh of the geometry was generated, and then each tetrahedron was subdivided into four hexahedra to obtain an all-hexahedral mesh. While this tet-to-hex strategy often results in large meshes with poorly shaped elements, it has been used successfully in challenging high-order and spectral element applications (see, e.g., Yuan et al., 2020). The resulting mesh consists of 1,532,116 hexahedral elements. We solve this problem with polynomial degree p = 4. The high-order H¹ finite element space has approximately 10⁸ degrees of freedom, and the Nédélec space has about 2.9 × 10⁸ degrees of freedom.
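The quoted problem sizes are consistent with a simple leading-order count. On a structured hexahedral grid, counting vertex, edge, face, and interior degrees of freedom gives about p³ unknowns per element for the H¹ space and about 3p³ per element for the Nédélec space. The sketch below applies these counts to the coil mesh; this is an estimate under the structured-grid assumption, since the vertex and edge valences of an unstructured tet-to-hex mesh differ, and the function names are hypothetical.

```python
def h1_dofs_estimate(num_elements, p):
    """Leading-order H1 count: 1 + 3(p-1) + 3(p-1)^2 + (p-1)^3 = p^3 per element."""
    return num_elements * p**3


def nedelec_dofs_estimate(num_elements, p):
    """Leading-order Nedelec count: 3p + 6p(p-1) + 3p(p-1)^2 = 3p^3 per element."""
    return 3 * num_elements * p**3


NUM_ELEMENTS = 1_532_116
P = 4
h1 = h1_dofs_estimate(NUM_ELEMENTS, P)
nd = nedelec_dofs_estimate(NUM_ELEMENTS, P)
```

With 1,532,116 elements and p = 4, the estimates are about 9.8 × 10⁷ and 2.9 × 10⁸, in agreement with the sizes stated above.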

The first step of the solution procedure requires the solution of the Poisson problem (6) to compute the electric scalar potential ϕ. The resulting LOR matrix for the H¹ problem has 27 nonzeros per row, leading to a total of 2.6 × 10⁹ nonzeros. Relative and absolute tolerances of 10⁻⁸ were used as stopping criteria for the conjugate gradient solver, resulting in convergence after 45 iterations. After the potential ϕ has been found, the right-hand side for (7) must be computed. First, ∇ϕ is computed by applying the discrete gradient operator, obtaining a vector field in H (curl). This operation is performed as a sparse matrix–vector product with the discrete gradient matrix whose assembly is described in Step S1.2; for equal problem size, the number of nonzeros in this matrix is independent of the polynomial degree. After having computed ∇ϕ, the right-hand side is obtained by applying the H (curl) mass matrix weighted by the conductivity coefficient β. The mass matrix is applied matrix-free using partial assembly (see Step S2.1).
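A discrete gradient built from topology alone can be sketched as a signed vertex-to-edge incidence matrix: each edge row has an entry −1 at its start vertex and +1 at its end vertex. The example below illustrates that idea on a hypothetical three-edge mesh; it is not the assembly of Step S1.2, and the function names and the row-wise sparse storage are assumptions.

```python
def build_discrete_gradient(edges):
    """One sparse row per edge (a, b): entry -1 in column a and +1 in column b."""
    return [((a, -1.0), (b, 1.0)) for a, b in edges]


def apply_gradient(G, phi):
    """Sparse matrix-vector product: potential differences along each edge."""
    return [sum(v * phi[j] for j, v in row) for row in G]


# Hypothetical mesh: a single triangle with oriented edges.
edges = [(0, 1), (1, 2), (0, 2)]
G = build_discrete_gradient(edges)
grad_phi = apply_gradient(G, [0.0, 1.0, 3.0])
```

Each row holds exactly two nonzeros regardless of any geometric data, which illustrates why such a matrix can be built from connectivity only and why its sparsity depends on the number of edges rather than on how they arose.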

Given the right-hand side computed above, the curl–curl problem (7) is solved to obtain the magnetic vector potential A. The LOR system used to construct the AMS preconditioner has approximately 33 nonzeros per row, resulting in 9.7 × 10⁹ nonzero entries. The same stopping criteria were used as in the H¹ problem. The CG iteration converged after 22 iterations. Once the magnetic vector potential has been computed, the magnetic field is represented as a function in H (div) by applying the discrete curl operator. As in the case of the discrete gradient operator, this operation is performed as a sparse matrix–vector product using the approach described in Step S1.2. Streamlines of the magnetic field are shown in Figure 13. This problem was run on 80 nodes of Lassen using 320 V100 GPUs. The wall-clock runtime for the entire solution procedure was 26 s.
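The discrete curl can be sketched in the same topological way as a signed edge-to-face incidence matrix, and the defining property of the de Rham complex, that the curl of a gradient vanishes, then holds exactly at the discrete level. The example below checks this on a hypothetical patch of two quadrilateral faces sharing an edge; the mesh, the function names, and the dictionary-based sparse rows are assumptions for illustration only.

```python
def build_discrete_gradient(edges):
    """Row per edge (a, b): -1 at vertex a, +1 at vertex b."""
    return [{a: -1, b: 1} for a, b in edges]


def build_discrete_curl(faces, edges):
    """Row per face: +1 or -1 per boundary edge, by orientation along the face loop."""
    index = {e: k for k, e in enumerate(edges)}
    rows = []
    for face in faces:
        row = {}
        for u, v in zip(face, face[1:] + face[:1]):   # walk the face boundary
            if (u, v) in index:
                row[index[(u, v)]] = 1
            else:
                row[index[(v, u)]] = -1
        rows.append(row)
    return rows


def matvec(rows, x):
    return [sum(v * x[j] for j, v in row.items()) for row in rows]


# Vertices are numbered 0 1 2 along the bottom and 3 4 5 along the top.
edges = [(0, 1), (1, 2), (3, 4), (4, 5), (0, 3), (1, 4), (2, 5)]
faces = [(0, 1, 4, 3), (1, 2, 5, 4)]
G = build_discrete_gradient(edges)
C = build_discrete_curl(faces, edges)
curl_of_grad = matvec(C, matvec(G, [2, -1, 7, 0, 4, 9]))
```

Because the boundary of each face is a closed loop, the potential differences along it telescope to zero, so the product of the two incidence matrices is the zero matrix for any orientation of the edges.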

Figure 13.

Top: illustration of a coarser version of the mesh for the coil problem, cut with a plane passing through the origin. Bottom: magnetic field streamlines colored by electric scalar potential.

While this example has illustrated the use of GPU-accelerated LOR preconditioners for magnetic diffusion problems in the A–ϕ formulation, the solvers are immediately applicable to a range of other problems posed in H¹, H (curl), and H (div). For example, any of the formulations of electromagnetic diffusion that result in positive-definite curl–curl problems can be handled using the LOR–AMS solver (Rieben and White, 2006). Likewise, radiation diffusion problems posed in H (div) can be handled by the LOR–ADS solver; these solvers can also be adapted for use on Darcy-type problems in porous media flow. The framework proposed by Pazner et al. (2022) naturally enables the use of LOR preconditioning for coercive problems posed in all spaces of the L² de Rham complex.

Open-source code availability

All of the algorithms and implementations described in this article are available in the open-source MFEM software library under the permissive BSD license at https://mfem.org and https://github.com/mfem/mfem (see also Anderson et al., 2021). The construction and use of matrix-free low-order-refined solvers is illustrated in the included lor_solvers mini-application and its parallel counterpart plor_solvers. Generally, the construction of LOR solvers can be performed in one or two lines of code, and GPU acceleration and the high-performance macro-element batching strategies are automatically enabled. Example code illustrating the construction of the low-order-refined discretization, assembling the matrix in parallel CSR format, and creating AMG preconditioners using the hypre library is shown in Figure 14.

Figure 14.

Illustration of MFEM’s API for the construction and use of the low-order-refined solvers. Top: forming the low-order-refined discretization and assembling the matrix in parallel CSR format. Bottom: creating low-order-refined hypre AMG, AMS, and ADS preconditioners given the high-order bilinear form.

Conclusions

In this article, we have described algorithms and implementations for the GPU-accelerated solution of high-order finite element problems using low-order-refined preconditioning. This approach combines matrix-free operator evaluation using the partial assembly approach with algebraic multigrid preconditioners constructed using assembled low-order-refined matrices. The end-to-end GPU acceleration of LOR preconditioning allows for the efficient and scalable solution of high-order finite element problems on GPU-based supercomputing architectures, avoiding the memory-intensive and costly assembly of the high-order system matrices. New LOR matrix assembly algorithms based on macro-element batching were introduced, achieving significant speedup when compared with fully unstructured low-order matrix assembly. A detailed study of kernel throughput for the algorithmic components of the solution procedure was presented. The performance of the preconditioners on problems with nonconforming adaptive mesh refinement was examined, showing fast problem setup and convergence even at high orders. The scalability of the solvers was demonstrated on problems with over 1 billion degrees of freedom on 1024 GPUs. Finally, we demonstrated the capability of these solvers on a challenging large-scale electromagnetic diffusion problem with complex geometry using 320 GPUs.

Footnotes

Acknowledgements

The authors thank V. Dobrev for the unfailingly insightful comments and suggestions, and M. Stowell for help with the electromagnetic diffusion problem formulation. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-841564). W. Pazner and Tz. Kolev were partially supported by the LLNL-LDRD Program under Project No. 20-ERD-002. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Lawrence Livermore National Laboratory (LDRD 20-ERD-002) and U.S. Department of Energy (Exascale Computing Project (17-SC-20-SC)).

ORCID iDs

Will Pazner

Tzanio Kolev

Author biographies

Will Pazner is an assistant professor in the Fariborz Maseeh Department of Mathematics and Statistics at Portland State University, where his research focuses on high-order finite elements, efficient solvers and preconditioners, and high-performance computing. Previously, he was the Sidney Fernbach Postdoctoral Fellow at Lawrence Livermore National Laboratory’s Center for Applied Scientific Computing.

Tzanio Kolev is a computational mathematician at the Center for Applied Scientific Computing (CASC) in Lawrence Livermore National Laboratory, where he works on finite element meshing, discretizations and solvers for problems arising in computational electromagnetics, elasticity, and compressible shock hydrodynamics. He won an R&D 100 award in 2007 as a member of the hypre project and was an R&D 100 finalist for MFEM in 2020. Tzanio is leading the finite element R&D efforts in the MFEM and BLAST projects in CASC and is the director of the co-design Center for Efficient Exascale Discretizations (CEED) in DOE’s Exascale Computing Project (ECP). He won an LLNL mid-career recognition award in 2019.

John Camier is a research scientist in the numerical analysis and simulations group at the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His current research focuses on parallel computer architectures, performance portability, and computing with high-order finite element methods.

References

Abdelfattah, Barra, Beams, et al. (2021) GPU algorithms for efficient exascale discretizations. Parallel Computing 108: 102841. DOI: 10.1016/j.parco.2021.102841.

Anderson, Andrej, Barker, et al. (2021) MFEM: a modular finite element methods library. Computers & Mathematics with Applications 81: 42–74. DOI: 10.1016/j.camwa.2020.06.009.

Arnold, Falk and Winther (2000) Multigrid in H(div) and H(curl). Numerische Mathematik 85(2): 197–217. DOI: 10.1007/pl00005386.

Bell, Dalton and Olson (2012) Exposing fine-grained parallelism in algebraic multigrid methods. SIAM Journal on Scientific Computing 34(4): C123–C152. DOI: 10.1137/110838844.

Bello-Maldonado and Fischer (2019) Scalable low-order finite element preconditioners for high-order spectral element Poisson solvers. SIAM Journal on Scientific Computing 41(5): S2–S18. DOI: 10.1137/18M1194997.

Brown, Abdelfattah, Barra, et al. (2021) libCEED: fast algebra for high-order element-based discretizations. Journal of Open Source Software 6(63): 2945. DOI: 10.21105/joss.02945.

Canuto (1994) Stabilization of spectral methods by finite element bubble functions. Computer Methods in Applied Mechanics and Engineering 116(1–4): 13–26. DOI: 10.1016/s0045-7825(94)80004-9.

Canuto, Gervasio and Quarteroni (2010) Finite-element preconditioning of G-NI spectral methods. SIAM Journal on Scientific Computing 31(6): 4422–4451. DOI: 10.1137/090746367.

Casarin (1997) Quasi-optimal Schwarz methods for the conforming spectral element discretization. SIAM Journal on Numerical Analysis 34(6): 2482–2502. DOI: 10.1137/s0036142995292281.

Červený, Dobrev and Kolev (2019) Nonconforming mesh refinement for high-order finite elements. SIAM Journal on Scientific Computing 41(4): C367–C392.

Deville and Mund (1985) Chebyshev pseudospectral solution of second-order elliptic equations with finite element preconditioning. Journal of Computational Physics 60(3): 517–533. DOI: 10.1016/0021-9991(85)90034-8.

Dohrmann (2021) Spectral equivalence of low-order discretizations for high-order H(curl) and H(div) spaces. SIAM Journal on Scientific Computing 43(6): A3992–A4014. DOI: 10.1137/21m1392115.

Falgout, Sjögreen, et al. (2021) Porting hypre to heterogeneous computer architectures: strategies and experiences. Parallel Computing 108: 102840. DOI: 10.1016/j.parco.2021.102840.

Falgout and Yang (2002) hypre: a library of high performance preconditioners. In: Sloot PMA, Hoekstra, Tan CJK, et al. (eds), Computational Science — ICCS 2002, Lecture Notes in Computer Science. Berlin, Germany: Springer, pp. 632–641. DOI: 10.1007/3-540-47789-6_66.

Fischer, Min, Rathnayake, et al. (2020) Scalability of high-performance PDE solvers. The International Journal of High Performance Computing Applications 34(5): 562–586. DOI: 10.1177/1094342020915762.

Fischer (1997) An overlapping Schwarz method for spectral element solution of the incompressible Navier–Stokes equations. Journal of Computational Physics 133(1): 84–101. DOI: 10.1006/jcph.1997.5651.

Fischer and Lottes (2005) Hybrid Schwarz-multigrid methods for the spectral element method: extensions to Navier-Stokes. Lecture Notes in Computational Science and Engineering 35(49): 35–49. DOI: 10.1007/3-540-26825-1_3.

Franco, Camier, Andrej, et al. (2020) High-order matrix-free incompressible flow solvers with GPU acceleration and low-order refined preconditioners. Computers & Fluids 203: 104541. DOI: 10.1016/j.compfluid.2020.104541.

Haase, Liebmann, Douglas, et al. (2010) A parallel algebraic multigrid solver on graphics processing units. High Performance Computing and Applications. Berlin, Germany: Springer, pp. 38–47. DOI: 10.1007/978-3-642-11842-5_5.

Helenbrook and Atkins (2006) Application of p-multigrid to discontinuous Galerkin formulations of the Poisson equation. AIAA Journal 44(3): 566–575. DOI: 10.2514/1.15497.

Henson and Yang (2002) BoomerAMG: a parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics 41(1): 155–177. DOI: 10.1016/s0168-9274(01)00115-5.

Hiptmair (2007) Nodal auxiliary space preconditioning in H(div) and H(curl) spaces. SIAM Journal on Numerical Analysis 45(6): 2483–2509. DOI: 10.1137/060660588.

Hutchinson, Heinecke, Pabst, et al. (2016) Efficiency of high order spectral element methods on petascale architectures. Lecture Notes in Computer Science 449: 449–466. DOI: 10.1007/978-3-319-41321-1_23.

Kolev, Fischer, Austin, et al. (2021a) CEED ECP Milestone Report: High-Order Algorithmic Developments and Optimizations for Large-Scale GPU-Accelerated Simulations. Washington, DC: U.S. Department of Energy. Technical Report CEED-MS36.

Kolev, Fischer, Min, et al. (2021b) Efficient exascale discretizations: high-order finite element methods. The International Journal of High Performance Computing Applications 35(6). DOI: 10.1177/10943420211020803.

Kolev and Vassilevski (2009) Parallel auxiliary space AMG for H(curl) problems. Journal of Computational Mathematics 27(5): 604–623. DOI: 10.4208/jcm.2009.27.5.013.

Kolev and Vassilevski (2012) Parallel auxiliary space AMG solver for H(div) problems. SIAM Journal on Scientific Computing 34(6): A3079–A3098. DOI: 10.1137/110859361.

Kreeft, Palha and Gerritsma (2011) Mimetic framework on curvilinear quadrilaterals of arbitrary order. arXiv:1111.4304.

Kronbichler and Kormann (2019) Fast matrix-free evaluation of discontinuous Galerkin finite element operators. ACM Transactions on Mathematical Software 45(3): 1–40. DOI: 10.1145/3325864.

Kronbichler and Ljungkvist (2019) Multigrid for matrix-free high-order finite element computations on graphics processors. ACM Transactions on Parallel Computing 6(1): 1–32. DOI: 10.1145/3322813.

Kronbichler and Wall (2018) A performance comparison of continuous and discontinuous Galerkin methods with fast multigrid solvers. SIAM Journal on Scientific Computing 40(5): A3423–A3448. DOI: 10.1137/16m110455x.

Ljungkvist (2017) Matrix-free finite-element computations on graphics processors with adaptively refined unstructured meshes. Proceedings of the 25th high performance computing symposium, HPC ’17. San Diego, CA, 23-26 April 2017.

Melenk, Gerdes and Schwab (2001) Fully discrete hp-finite elements: fast quadrature. Computer Methods in Applied Mechanics and Engineering 190(32–33): 4339–4364. DOI: 10.1016/s0045-7825(00)00322-4.

Naumov, Arsaev, Castonguay, et al. (2015) AmgX: a library for GPU accelerated algebraic multigrid and preconditioned iterative methods. SIAM Journal on Scientific Computing 37(5): S602–S626. DOI: 10.1137/140980260.

Orszag (1980) Spectral methods for problems in complex geometries. Journal of Computational Physics 37(1): 70–92. DOI: 10.1016/0021-9991(80)90005-4.

Parter and Rothman (1995) Preconditioning Legendre spectral collocation approximations to elliptic problems. SIAM Journal on Numerical Analysis 32(2): 333–385. DOI: 10.1137/0732015.

Pazner (2020) Efficient low-order refined preconditioners for high-order matrix-free continuous and discontinuous Galerkin methods. SIAM Journal on Scientific Computing 42(5): A3055–A3083. DOI: 10.1137/19m1282052.

Pazner and Kolev (2021) Uniform subspace correction preconditioners for discontinuous Galerkin methods with hp-refinement. Communications on Applied Mathematics and Computation 4: 697–727. DOI: 10.1007/s42967-021-00136-3.

Pazner, Kolev and Dohrmann (2022) Low-order preconditioning for the high-order de Rham complex. arXiv eprint 2203.02465. (Submitted for publication).

Phillips, Kerkemeier and Fischer (2022) Tuning spectral element preconditioners for parallel scalability on GPUs. In: Proceedings of the 2022 SIAM conference on parallel processing for scientific computing (PP), Seattle, WA, 23–26 February 2022. DOI: 10.1137/1.9781611977141.4.

Rieben and White (2006) Verification of high-order mixed finite-element solution of transient magnetic diffusion problems. IEEE Transactions on Magnetics 42(1): 25–39. DOI: 10.1109/tmag.2005.860127.

Rønquist and Patera (1987) Spectral element multigrid. I. Formulation and numerical results. Journal of Scientific Computing 2(4): 389–406. DOI: 10.1007/bf01061297.

Sundar, Stadler and Biros (2015) Comparison of multigrid algorithms for high-order continuous finite element discretizations. Numerical Linear Algebra with Applications 22(4): 664–680. DOI: 10.1002/nla.1979.

Van Loan (2000) The ubiquitous Kronecker product. Journal of Computational and Applied Mathematics 123(1–2): 85–100. DOI: 10.1016/s0377-0427(00)00393-9.

Yuan, Yildiz, Merzari, et al. (2020) Spectral element applications in complex nuclear reactor geometries: tet-to-hex meshing. Nuclear Engineering and Design 357: 110422. DOI: 10.1016/j.nucengdes.2019.110422.