Energy efficiency of nonlinear domain decomposition methods

Abstract

A nonlinear domain decomposition (DD) solver is considered with respect to improved energy efficiency. In this method, nonlinear problems are solved using Newton’s method on the subdomains in parallel and in asynchronous iterations. The method is compared to the more standard Newton-Krylov approach, where a linear domain decomposition solver is applied to the overall nonlinear problem after linearization using Newton’s method. It is found that in the nonlinear domain decomposition method, making use of the asynchronicity, some processor cores can be set to sleep to save energy and to allow better use of the power and thermal budget. Average energy savings per socket of up to 77% (as reported by the RAPL hardware counters) are observed compared to the more traditional Newton-Krylov approach, which is synchronous by design, using up to 5120 Intel Broadwell (Xeon E5-2630v4) cores. The total time to solution is not affected. On the contrary, the remaining cores of the same processor may be able to enter turbo mode, thus reducing the total time to solution slightly. Finally, we consider the same strategy for the ASPIN (Additive Schwarz Preconditioned Inexact Newton) nonlinear domain decomposition method and observe a similar potential to save energy.

Keywords

Nonlinear FETI-DP, nonlinear domain decomposition, nonlinear elimination, Newton’s method, nonlinear problems, parallel computing, energy efficiency, LIKWID

1. Introduction

In recent years, many nonlinear domain decomposition approaches have been introduced and their superiority over the classical combination of a nonlinear solver, e.g. Newton’s method, with a linear domain decomposition approach has been shown for many model problems (Cai and Keyes, 2002; Cai et al., 2002a, 2002b; Dolean et al., 2016; Groß, 2009; Groß and Krause, 2011; Klawonn et al., 2014, 2015, 2016, 2017, 2018; Liu and Keyes, 2015; Liu et al., 2018; Marcinkowski and Cai, 2005; Negrello et al., 2018). Nonlinear domain decomposition methods are solution approaches for nonlinear problems that apply the concepts and ideas of linear domain decomposition methods, e.g. Overlapping Schwarz (Cai and Keyes, 2002; Cai et al., 2002a, 2002b; Dolean et al., 2016; Groß, 2009; Groß and Krause, 2011; Liu and Keyes, 2015; Liu et al., 2018; Marcinkowski and Cai, 2005), FETI (Finite Element Tearing and Interconnecting) (Negrello et al., 2018), FETI-DP (Finite Element Tearing and Interconnecting - Dual-Primal) (Klawonn et al., 2014, 2015, 2017), and BDDC (Balancing Domain Decomposition by Constraints) (Klawonn et al., 2014, 2017, 2018), directly to a nonlinear problem before a nonlinear solver is applied. Although all these methods behave quite differently in their linear as well as nonlinear variants, they have some common features and properties. We will show that one of these properties makes nonlinear domain decomposition methods typically more energy efficient than their linear relatives embedded in some nonlinear solver. This effect is caused not only by a shorter runtime; the power consumption is also typically lower during most of the computation. In this paper, we will show and explain this effect in detail for the example of nonlinear FETI-DP methods. First, we will provide a more general description of the concept and show that it is also applicable to other methods such as ASPIN (Additive Schwarz Preconditioned Inexact Newton) (Cai and Keyes, 2002).

Given is a discrete nonlinear problem

A (x) = 0,

which has been obtained by a discretization of a nonlinear partial differential equation. In contrast to classical approaches, e.g. Newton-Krylov-DD (Domain Decomposition) methods, where eq. (1) is linearized by Newton’s method and the tangential system is solved iteratively using a domain decomposition approach, nonlinear domain decomposition methods replace the discrete nonlinear problem eq. (1) by an alternative formulation

\mathcal{A} (x) = 0.

Here, $\mathcal{A}$ is obtained by applying domain decomposition strategies, i.e. by decomposing eq. (1) into local nonlinear problems on subdomains. These local nonlinear problems are decoupled or, for numerical scalability, weakly coupled.

Of course, it has to be guaranteed that eq. (1) and eq. (2) have the same solution. Then, instead of eq. (1), eq. (2) is solved by a nonlinear solver. If $\mathcal{A}$ is designed properly, the nonlinear solver takes fewer steps to solve the problem and therefore the time to solution is reduced. Another advantage often observed is increased robustness (Cai and Keyes, 2002; Klawonn et al., 2018; Liu et al., 2018). In general, current methods to form $\mathcal{A}$ can be interpreted as a nonlinear right-preconditioning approach

\mathcal{A} (x) := A (M (x))

or as a nonlinear left-preconditioning approach

\mathcal{A} (x) := G (A (x)) .

Examples for the latter case are ASPIN (Cai and Keyes, 2002) and RASPEN (Dolean et al., 2016), while Nonlinear-FETI-DP and Nonlinear-BDDC can be interpreted as nonlinearly right-preconditioned methods; see Klawonn et al. (2017). To be efficient, applications of the preconditioners M and G should be cheap compared with an application of A. Both should move the initial value closer to the solution, and, to obtain the correct solution, one has to ensure that $G (0) = 0$ . Let us remark that in the case of nonlinear right preconditioning, the solution of eq. (1) is given by $M (x)$ instead of x. In most methods, the functions M or G are not given explicitly, but only implicitly by defining local nonlinear problems on subdomains. These local problems are decoupled or, sometimes, coupled through a small coarse space. When solving eq. (2) with Newton’s method, the function G or M has to be evaluated in each Newton step, and therefore all the local nonlinear problems have to be solved, e.g. by local Newton iterations. However, these local iterations are easily parallelizable and can often be performed desynchronized among the processors. A rough sketch of a nonlinear domain decomposition method is given in Figure 1.

Figure 1.

Sketch of a nonlinear domain decomposition method using Newton’s method.
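The nested structure of Figure 1 can be illustrated on a toy problem. The following Python sketch is not taken from the implementation discussed in this paper; the two-variable system, the function names `inner_solve` and `outer_solve`, and all tolerances are assumptions chosen for illustration. The nonlinear right preconditioner eliminates the variable u by an inner Newton iteration, and the outer Newton iteration acts only on the remaining variable v.

```python
# Toy system (hypothetical, for illustration only):
#   A_1(u, v) = u**3 + u - v = 0   (nonlinear, eliminated by the preconditioner)
#   A_2(u, v) = u + v - 2    = 0   (remaining outer equation)

def inner_solve(v, u0=0.0, tol=1e-12, max_it=50):
    """Evaluate the preconditioner: solve u**3 + u - v = 0 for u by Newton."""
    u = u0
    for it in range(max_it):
        residual = u**3 + u - v
        if abs(residual) < tol:
            return u, it
        u -= residual / (3.0 * u**2 + 1.0)
    raise RuntimeError("inner Newton iteration did not converge")

def outer_solve(v0=0.0, tol=1e-10, max_it=50):
    """Newton's method on the reduced problem A_2(g(v), v) = 0."""
    v, u = v0, 0.0
    for it in range(max_it):
        u, _ = inner_solve(v, u0=u)          # application of the preconditioner
        residual = u + v - 2.0
        if abs(residual) < tol:
            return u, v, it
        dg_dv = 1.0 / (3.0 * u**2 + 1.0)     # implicit function theorem
        v -= residual / (dg_dv + 1.0)
    raise RuntimeError("outer Newton iteration did not converge")

u, v, outer_steps = outer_solve()
print(f"u = {u:.6f}, v = {v:.6f}, outer Newton steps = {outer_steps}")
```

In this sketch the inner solve plays the role of the local nonlinear problems, while the outer update corresponds to the global Newton step on the preconditioned problem.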

However, some of the local nonlinear problems in G or M can converge fast or even in a single step (nearly linear behavior), while others may take many Newton steps to converge. Under the assumption that each computational core solves exactly one local nonlinear problem, problem-dependent load imbalances can thus arise. This can be tolerated as long as the time to solution is shorter than with classical approaches with proper load balance, but it is nonetheless a topic of current research. Removing the load imbalance completely without losing convergence properties is difficult, since a redistribution of the work, e.g. by resizing subdomains, changes the nonlinear problem $\mathcal{A}$ of eq. (2) completely. However, computational approaches such as the superman strategy (Huber et al., 2016) can help.

In this paper, we investigate a different approach and set cores to sleep (test-and-sleep) once they have finished solving their local nonlinear problem, until all other cores have caught up. This can save energy compared to classical approaches and is feasible because the desynchronization of the local Newton iterations in nonlinear DD methods results in potential sleep times on the order of seconds; see Section 4. To demonstrate this effect, we compare the classical Newton-Krylov-FETI-DP approach with Nonlinear-FETI-DP-3 with respect to scalability, runtime, load balancing, energy consumption, and power efficiency. Let us remark here that Nonlinear-FETI-DP-3 is a good testing prototype, since completely decoupled local nonlinear problems are solved in each outer Newton step. Nonetheless, all introduced concepts can be carried over to different nonlinear DD methods and, even more generally, to many other applications with a severe load imbalance. Note that, here, we do not propose test-and-sleep as a universal method to save energy. Rather, it is used to show the energy saving potential of new algorithms, i.e. nonlinear domain decomposition.

The remainder of the paper is organized as follows. In Section 2, in order to provide a self-contained paper, we give a detailed description of Newton-Krylov-FETI-DP and Nonlinear-FETI-DP-3 with a focus on the local solves and the load imbalance in the latter. We describe the investigated model problems in Section 3 and our approach to save energy as well as the corresponding measurements in Section 4. Finally, in Section 5, we provide a model to interpret the measurements from Section 4.

2. An introduction to nonlinear FETI-DP methods

In this section, we briefly introduce the nonlinear FETI-DP (Finite Element Tearing and Interconnecting—Dual Primal) framework (see Klawonn et al., 2014, 2017 for a detailed description) and derive Nonlinear-FETI-DP-3, the variant we use in this paper to represent nonlinear domain decomposition methods. As already mentioned above, instead of solving eq. (1) directly by applying Newton’s method, in nonlinear FETI-DP methods eq. (1) is replaced by the equivalent formulation

\mathcal{A} (x) := A_{N L} (M (\tilde{u}, λ)) = 0,

which is then solved by a Newton-Krylov approach. Here, $A_{N L}$ is a nonlinear saddle point formulation and M a nonlinear preconditioner, both derived by transferring the ideas of the linear FETI-DP domain decomposition approach (Farhat et al., 2000, 2001; Klawonn and Rheinbach, 2007, 2010; Klawonn and Widlund, 2006; Klawonn et al., 2002) to the nonlinear problem eq. (1). If M is constructed properly and has certain properties (see Klawonn et al., 2017), this approach can reduce the time to solution. Additionally, the application of M, which has to be performed in each Newton step, consists largely of local work and is easily parallelizable. On the other hand, an application of M can introduce a strong load imbalance; see subsections 2.3 and 2.4 for details.

Let us first summarize linear FETI-DP and introduce the classical Newton-Krylov-FETI-DP method in subsection 2.1 before we proceed to describe the nonlinear variant in the following subsections.

2.1. Newton-Krylov-FETI-DP

In general, we assume that eq. (1) is obtained by a finite element discretization of a partial differential equation defined on a computational domain $Ω \subset ℝ^{d}, d = 2, 3$ . We denote our finite element space, which discretizes Ω, by V^h and can obviously rewrite eq. (1) as

\hat{K} (\hat{u}) = \hat{f} \Leftrightarrow A (\hat{u}) : = \hat{K} (\hat{u}) - \hat{f} = 0,

where $\hat{K} : V^{h} \to V^{h}$ is nonlinear in $\hat{u} \in V^{h}$ and $\hat{f} \in V^{h}$ is independent of $\hat{u}$ ; see also section 3 for an example considering a model problem. We replaced x from eq. (1) by $\hat{u}$ to match with the notation of nonlinear FETI-DP in the literature. In Newton-Krylov-FETI-DP, eq. (6) is solved using Newton’s method, i.e. by the iteration ${\hat{u}}^{(k + 1)} = {\hat{u}}^{(k)} - δ {\hat{u}}^{(k)}$ with some initial value ${\hat{u}}^{(0)}$ and the updates defined by

D \hat{K} ({\hat{u}}^{(k)}) δ {\hat{u}}^{(k)} = \hat{K} ({\hat{u}}^{(k)}) - \hat{f} .

Here, $D \hat{K} ({\hat{u}}^{(k)})$ is the Jacobian matrix of $\hat{K}$ evaluated at ${\hat{u}}^{(k)}$ . Finally, in Newton-Krylov-FETI-DP, a linear FETI-DP domain decomposition approach is chosen to solve the linearized system eq. (7), which occurs in each Newton step, iteratively. This automatically classifies Newton-Krylov-FETI-DP as an inexact Newton method.
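The Newton iteration with the sign convention of eq. (7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the solver used in this paper: the small operator `K` (a tridiagonal matrix plus a componentwise cubic term), the right-hand side `f`, and the tolerance are hypothetical, and the linearized system is solved directly with NumPy instead of a FETI-DP method.

```python
import numpy as np

n = 5
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # hypothetical linear part
f = np.ones(n)                                          # hypothetical right-hand side

def K(u):
    """Hypothetical nonlinear operator standing in for K-hat."""
    return A @ u + u**3

def DK(u):
    """Jacobian of K evaluated at u."""
    return A + np.diag(3.0 * u**2)

def newton(u0, tol=1e-10, max_it=50):
    u = u0.copy()
    for k in range(max_it):
        residual = K(u) - f                       # right-hand side of eq. (7)
        if np.linalg.norm(residual) < tol:
            return u, k
        delta = np.linalg.solve(DK(u), residual)  # DK(u) delta = K(u) - f
        u = u - delta                             # u^(k+1) = u^(k) - delta u^(k)
    raise RuntimeError("Newton iteration did not converge")

u, steps = newton(np.zeros(n))
print("Newton steps:", steps, "residual norm:", np.linalg.norm(K(u) - f))
```

In Newton-Krylov-FETI-DP, the call to `np.linalg.solve` would be replaced by the iterative FETI-DP solve described next.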

Let us briefly describe linear FETI-DP and introduce some necessary notation. We assume a decomposition of Ω into $N \in ℕ$ nonoverlapping subdomains $Ω_{i}, i = 1, ..., N$ , i.e. $Ω = \cup_{i = 1}^{N} Ω_{i}$ . Each $Ω_{i}$ is a union of finite elements; see also Figure 2 (left). The finite element subspaces associated with $Ω_{i}, i = 1, ..., N$ , are denoted by $W_{i}, i = 1, ..., N$ . We obtain local nonlinear finite element problems $K_{i} (u_{i}) - f_{i} = 0$ with $K_{i} : W_{i} \to W_{i}$ and $f_{i} \in W_{i}$ by restricting the considered differential equation to $Ω_{i}$ and discretizing its variational formulation using the finite element spaces W_i ; see also section 3 for an example using a model problem. The local Jacobian matrices belonging to $K_{i} (\cdot)$ are denoted by $D K_{i} (\cdot)$ . With restrictions $R_{i} : V^{h} \to W_{i}, i = 1, ..., N$ , $R^{T} := (R_{1}^{T},..., R_{N}^{T})$ , $u^{T} := (u_{1}^{T},..., u_{N}^{T})$ , $K {(u)}^{T} := (K_{1} {(u_{1})}^{T},..., K_{N} {(u_{N})}^{T})$ , and $D K (u) = \operatorname{diag} (D K_{1} (u_{1}), ..., D K_{N} (u_{N}))$ , we have the identities

\hat{K} (\hat{u}) = R^{T} K (R \hat{u})

and

D \hat{K} (\hat{u}) = R^{T} D K (R \hat{u}) R .

Figure 2.

Left: Decomposition of discretized domain Ω into four subdomains $Ω_{i}, i = 1, ..., 4$ and corresponding finite element spaces $W_{i}, i = 1, ..., 4$ . Right: Discretized domain Ω, corresponding finite element space V^h , and interface Γ. The operator R^T acts as a finite element assembly operator on the interface.

The application of R^T in eq. (8) and eq. (9) thus has the effect of a finite element assembly of local finite element functions on the interface $Γ := (\cup_{i = 1}^{N} \partial Ω_{i}) \setminus \partial Ω$ ; see Figure 2.
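The assembly identity eq. (8) can be checked on a small one-dimensional example. The sketch below is an illustration under assumptions that are not part of this paper: five global nodes, two subdomains with three nodes each sharing the middle node, unit element length, and a p-Laplace-type element flux with $p = 4$ and $α = β = 1$.

```python
import numpy as np

def assemble(u, n_nodes, elements):
    """Assemble the nonlinear residual for 1D two-node elements of unit length."""
    out = np.zeros(n_nodes)
    for a, b in elements:
        grad = u[b] - u[a]
        flux = abs(grad) ** 2 * grad + grad  # p = 4 with alpha = beta = 1
        out[a] -= flux
        out[b] += flux
    return out

global_elements = [(0, 1), (1, 2), (2, 3), (3, 4)]
local_elements = [(0, 1), (1, 2)]            # same local numbering on both subdomains

identity = np.eye(5)
R_1 = identity[[0, 1, 2], :]                 # restriction to subdomain 1
R_2 = identity[[2, 3, 4], :]                 # restriction to subdomain 2
R = np.vstack([R_1, R_2])

def K_hat(u_hat):
    """Globally assembled operator."""
    return assemble(u_hat, 5, global_elements)

def K(u):
    """Decoupled local operators, stacked subdomain by subdomain."""
    return np.concatenate([assemble(u[:3], 3, local_elements),
                           assemble(u[3:], 3, local_elements)])

u_hat = np.array([0.0, 0.4, -0.3, 1.1, 0.2])
lhs = K_hat(u_hat)
rhs = R.T @ K(R @ u_hat)
print(np.allclose(lhs, rhs))
```

The only row in which $R^{T}$ actually sums two local contributions is the one belonging to the shared interface node.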

Let us assume that we have sorted and decomposed a solution u from the decoupled space $W = W_{1} \times ... \times W_{N}$ into interface variables $u_{Γ}$ and all remaining interior variables u_I , i.e. $u^{T} = (u_{I}^{T}, u_{Γ}^{T})$ . In FETI-DP, we further subdivide the degrees of freedom on the interface $u_{Γ}$ into primal variables $u_{Π}$ and dual variables $u_{Δ}$ . Let us remark that all these variables are still local to the subdomains, e.g. we have $u_{Π}^{T} = (u_{1_{Π}}^{T},..., u_{N_{Π}}^{T})$ . Here, $u_{i_{Π}}, i = 1, ..., N$ , is the vector of primal solutions on subdomain $Ω_{i}, i = 1, ..., N$ . We now introduce another assembly operator $R_{Π}^{T}$ , similar to R^T , which assembles only in the primal variables. We denote the corresponding primally assembled finite element space by $\tilde{W}$ ; see Figure 3 (right) for an illustration of $\tilde{W}$ , where subdomain vertices are chosen to be primal. Therefore, we have $R_{Π} : \tilde{W} \to W$ and any $\tilde{u} \in \tilde{W}$ has the structure ${\tilde{u}}^{T} = (u_{I}^{T}, u_{Δ}^{T}, {\tilde{u}}_{Π}^{T})$ , where ${\tilde{u}}_{Π}$ is now a vector of global variables and will constitute our global coarse problem or second level problem. We define the primally coupled operators by

\tilde{K} (\tilde{u}) = R_{Π}^{T} K (R_{Π} \tilde{u})

and

D \tilde{K} (\tilde{u}) = R_{Π}^{T} D K (R_{Π} \tilde{u}) R_{Π} .

Figure 3.

Left: Decomposition of a discretized domain Ω into four subdomains $Ω_{i}, i = 1, ..., 4$ and corresponding finite element spaces $W_{i}, i = 1, ..., 4$ . Right: Visualization of the primally coupled/assembled space $\tilde{W}$ . The subdomains are strongly coupled in the primal constraints (here vertices; red dots; global variables; index Π) and still uncoupled in the dual variables (blue squares; local variables; index Δ). All remaining variables are considered as inner variables (local variables; index I). The operator $R_{Π}^{T}$ acts as a finite element assembly operator in the primal variables.

Enforcing continuity in the dual variables is done by enforcing $B \tilde{u} = 0$ , using a linear jump operator B (see Klawonn and Rheinbach, 2007 for a detailed definition of B) and Lagrange multipliers. We obtain the equation system

(\begin{matrix} D \tilde{K} ({\tilde{u}}^{(k)}) & B^{T} \\ B & 0 \end{matrix}) (\begin{matrix} δ {\tilde{u}}^{(k)} \\ λ \end{matrix}) = (\begin{matrix} \tilde{K} ({\tilde{u}}^{(k)}) - \tilde{f} \\ 0 \end{matrix})

which is equivalent to eq. (7) and where λ is the vector of the Lagrange multipliers. Of course, several dual variables always belong to a common physical node on the interface and deliver more than a single entry in $\tilde{u}$ . Nevertheless, after solving eq. (12), continuity is guaranteed on the interface since $B \tilde{u} = 0$ is enforced and all the dual variables belonging to the same physical node hold the same value. Therefore, the solution $δ {\hat{u}}^{(k)}$ of eq. (7) can be easily obtained from $δ {\tilde{u}}^{(k)}$ using the restriction $R_{D} : V^{h} \to W$ , which is a weighted variant of R and the weights are defined as the inverses of the multiplicities of the dual variables. We then have $δ {\hat{u}}^{(k)} = R_{D}^{T} R_{Π} δ {\tilde{u}}^{(k)}$ .

By a block elimination in eq. (12) we derive the system

F ({\tilde{u}}^{(k)}) λ = d ({\tilde{u}}^{(k)})

with $F ({\tilde{u}}^{(k)}) = - B {(D \tilde{K} ({\tilde{u}}^{(k)}))}^{- 1} B^{T}$ and $d ({\tilde{u}}^{(k)}) = - B {(D \tilde{K} ({\tilde{u}}^{(k)}))}^{- 1} (\tilde{K} ({\tilde{u}}^{(k)}) - \tilde{f})$ . Finally, eq. (13) is solved iteratively with a CG or GMRES approach using an additional Dirichlet preconditioner $M_{D}^{- 1} ({\tilde{u}}^{(k)})$ . The Newton-update $δ {\tilde{u}}^{(k)}$ is then obtained by solving

D \tilde{K} ({\tilde{u}}^{(k)}) δ {\tilde{u}}^{(k)} = \tilde{K} ({\tilde{u}}^{(k)}) - \tilde{f} - B^{T} λ .

The complete Newton-Krylov-FETI-DP algorithm is also presented in Figure 5 (left).
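The block elimination leading from eq. (12) to eq. (13) and eq. (14) can be verified numerically. In the following sketch all data are assumptions: `DK` is a random symmetric positive definite stand-in for $D \tilde{K}$, `B` is a generic full-rank matrix rather than an actual jump operator, and the reduced system is solved directly instead of with preconditioned CG or GMRES.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                               # hypothetical numbers of unknowns and constraints

X = rng.standard_normal((n, n))
DK = X @ X.T + n * np.eye(n)              # SPD stand-in for the Jacobian
B = rng.standard_normal((m, n))           # stand-in for the jump operator
r = rng.standard_normal(n)                # stand-in for K(u) - f

# Reduced system for the Lagrange multipliers, eq. (13)
F = -B @ np.linalg.solve(DK, B.T)
d = -B @ np.linalg.solve(DK, r)
lam = np.linalg.solve(F, d)

# Back substitution for the Newton update, eq. (14)
delta_u = np.linalg.solve(DK, r - B.T @ lam)

# Reference: direct solve of the saddle point system, eq. (12)
saddle = np.block([[DK, B.T], [B, np.zeros((m, m))]])
reference = np.linalg.solve(saddle, np.concatenate([r, np.zeros(m)]))

print(np.allclose(delta_u, reference[:n]), np.allclose(lam, reference[n:]))
```

The update obtained in this way satisfies the constraint $B δ \tilde{u} = 0$ up to rounding errors.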

Figure 4.

Left: Domain decomposition of a computational domain Ω into 16 subdomains $Ω_{i}, i = 1, \dots,16$ . Middle: Typical local Neumann problem $D K_{B B}^{- 1}$ occurring on each subdomain (Neumann boundary condition everywhere except in the coarse nodes). Right: Typical local Dirichlet problem $D K_{I I}^{- 1}$ occurring in the Dirichlet preconditioner $M_{D}^{- 1}$ .

Figure 5.

Left: Newton-Krylov-FETI-DP. Right: Nonlinear-FETI-DP—the part of the code in red, the evaluation of M, collapses in Nonlinear-FETI-DP-3 into local Newton iterations and the load can be imbalanced; see section 2.4.

2.2. The parallel application of F and the Dirichlet preconditioner

When solving eq. (13) iteratively, the matrix $F ({\tilde{u}}^{(k)}) = - B {(D \tilde{K} ({\tilde{u}}^{(k)}))}^{- 1} B^{T}$ has to be applied to a vector in each iteration. This is indeed a parallelizable and highly scalable routine. To see this, we have to consider the structure of the matrix $D \tilde{K}$ in more detail. We omit the evaluation point ${\tilde{u}}^{(k)}$ here and in the following lines for better readability. Reordering inner and dual variables and introducing the index set $B := [I, Δ]$ , we can write

D \tilde{K} = (\begin{matrix} D K_{1_{B B}} & & & D {\tilde{K}}_{1_{Π B}}^{T} \\ & \ddots & & \vdots \\ & & D K_{N_{B B}} & D {\tilde{K}}_{N_{Π B}}^{T} \\ D {\tilde{K}}_{1_{Π B}} & \dots & D {\tilde{K}}_{N_{Π B}} & D {\tilde{K}}_{Π Π} \end{matrix}) =: (\begin{matrix} D K_{B B} & D {\tilde{K}}_{Π B}^{T} \\ D {\tilde{K}}_{Π B} & D {\tilde{K}}_{Π Π} \end{matrix}) .

Since $D K_{B B}$ is block diagonal and the blocks $D K_{i_{B B}}$ , $i = 1, ..., N$ , can be stored as sequential matrices, e.g. one on each compute core, sequential sparse direct solves can be used to handle applications of $D K_{B B}^{- 1}$ to a vector in parallel. Additionally, to perform an application of $F ({\tilde{u}}^{(k)})$ to a vector, a sparse direct solver has to be used to solve systems involving the coarse matrix $D {\tilde{S}}_{Π Π} : = D {\tilde{K}}_{Π Π} - D {\tilde{K}}_{Π B} D K_{B B}^{- 1} D {\tilde{K}}_{Π B}^{T}$ , which is a Schur complement in the primal variables of eq. (15). This problem is global and becomes—with a rising size of the coarse space—a parallelization bottleneck. This can be overcome by an inexact coarse solve (Klawonn and Rheinbach, 2010; Klawonn et al., 2017). For more details on the parallel implementation of an application of $F ({\tilde{u}}^{(k)})$ and the block factorization of $D \tilde{K}$ , see Klawonn et al. (2017). In this paper, the considered coarse problems are not too large and we thus always use a sparse direct solver to factorize $D {\tilde{S}}_{Π Π}$ . A visualization of $D K_{B B}$ and the coarse space $D {\tilde{S}}_{Π Π}$ can be found in Figure 4 (left, middle).
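How the block structure of eq. (15) separates local solves from the coarse solve can be sketched with dense NumPy matrices. The block sizes, the random local matrices, and the use of `np.linalg.solve` in place of sparse direct solvers are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, nB, nPi = 3, 4, 2                      # hypothetical: subdomains, local and primal unknowns

def random_spd(size):
    X = rng.standard_normal((size, size))
    return X @ X.T + size * np.eye(size)

# Primally assembled matrix built from local SPD contributions
local = [random_spd(nB + nPi) for _ in range(N)]
K_BB = [L[:nB, :nB] for L in local]       # diagonal blocks, one per subdomain
K_PiB = [L[nB:, :nB] for L in local]      # coupling blocks to the primal variables
K_PiPi = sum(L[nB:, nB:] for L in local)  # assembled primal block

b_B = [rng.standard_normal(nB) for _ in range(N)]
b_Pi = rng.standard_normal(nPi)

# Coarse Schur complement and coarse right-hand side from independent local solves
S = K_PiPi.copy()
g = b_Pi.copy()
for K_i, C_i, b_i in zip(K_BB, K_PiB, b_B):
    S -= C_i @ np.linalg.solve(K_i, C_i.T)
    g -= C_i @ np.linalg.solve(K_i, b_i)

x_Pi = np.linalg.solve(S, g)              # global coarse solve
x_B = [np.linalg.solve(K_i, b_i - C_i.T @ x_Pi)   # local back substitution
       for K_i, C_i, b_i in zip(K_BB, K_PiB, b_B)]

# Reference: solve with the fully assembled matrix
K = np.zeros((N * nB + nPi, N * nB + nPi))
for i in range(N):
    s = slice(i * nB, (i + 1) * nB)
    K[s, s] = K_BB[i]
    K[N * nB:, s] = K_PiB[i]
    K[s, N * nB:] = K_PiB[i].T
K[N * nB:, N * nB:] = K_PiPi
reference = np.linalg.solve(K, np.concatenate(b_B + [b_Pi]))

print(np.allclose(np.concatenate(x_B + [x_Pi]), reference))
```

Only the solve with `S` couples the subdomains; every other solve in the sketch involves a single diagonal block and could run on its own core.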

For the construction of the Dirichlet preconditioner $M_{D}^{- 1}$ we only consider the restriction of the local matrices $D K_{i}, i = 1, ..., N$ to the inner variables I, i.e. the matrices $D K_{i_{I I}}, i = 1, ..., N$ . A visual representation can again be found in Figure 4 (right). Sparse direct solvers can again be used locally to apply the inverse $D K_{i_{I I}}^{- 1}$ to a vector. The preconditioner $M_{D}^{- 1}$ is then a weighted sum of the $D K_{i_{I I}}^{- 1}, i = 1, ..., N$ . The weights are usually derived from the coefficient functions of the considered partial differential equation.

2.3. Nonlinear-FETI-DP

In the nonlinear FETI-DP approach, we first replace the nonlinear eq. (6) by a nonlinear saddle point formulation using ideas and operators from linear FETI-DP. With the nonlinear system from eq. (10), coupled in the primal variables, and enforcing the linear jump constraint $B \tilde{u} = 0$ with Lagrange multipliers, we obtain the nonlinear Lagrange function

A_{N L} (\tilde{u}, λ) : = (\begin{array}{l} \tilde{K} (\tilde{u}) + B^{T} λ - \tilde{f} \\ B \tilde{u} \end{array}) = (\begin{matrix} 0 \\ 0 \end{matrix}) .

After applying a nonlinear right-preconditioner $M : \tilde{W} \times V \to \tilde{W} \times V$ , with $V := \operatorname{range} (B)$ , the preconditioned system eq. (5), i.e.

A_{N L} (M (\tilde{u}, λ)) = 0,

is solved by a Newton-Krylov method. We therefore obtain the solution by the iteration

[\begin{matrix} {\tilde{u}}^{(k + 1)} \\ λ^{(k + 1)} \end{matrix}] : = [\begin{matrix} {\tilde{u}}^{(k)} \\ λ^{(k)} \end{matrix}] - α^{(k)} [\begin{matrix} δ {\tilde{u}}^{(k)} \\ δ λ^{(k)} \end{matrix}]

with the update defined by the linearized system

(D A_{N L} (M ({\tilde{u}}^{(k)}, λ^{(k)})) \cdot D M ({\tilde{u}}^{(k)}, λ^{(k)})) [\begin{matrix} δ {\tilde{u}}^{(k)} \\ δ λ^{(k)} \end{matrix}] = A_{N L} (M ({\tilde{u}}^{(k)}, λ^{(k)})) .

Different possible definitions of M are discussed in Klawonn et al. (2017), all of which are based on a partial nonlinear elimination of variables. It is shown in Klawonn et al. (2017) that eq. (18) can be solved using any linear FETI-DP approach, and thus the linear solves in Newton-Krylov-FETI-DP and Nonlinear-FETI-DP differ only in the right-hand side. Nonetheless, the preconditioner M has to be applied to $({\tilde{u}}^{(k)}, λ)$ in each iteration, which can make a huge difference. In this paper, for simplicity, we concentrate on Nonlinear-FETI-DP-3, which is a specific choice of M.

2.4. Nonlinear-FETI-DP-3

Using the index set $B : = [I, Δ]$ , we rewrite the nonlinear problem eq. (16) as

(\begin{array}{l} K_{B} (u_{B}, {\tilde{u}}_{Π}) + B_{B}^{T} λ - f_{B} \\ {\tilde{K}}_{Π} (u_{B}, {\tilde{u}}_{Π}) - {\tilde{f}}_{Π} \\ B_{B} u_{B} \end{array}) = (\begin{matrix} 0 \\ 0 \\ 0 \end{matrix}) .

We now decide that the nonlinear preconditioner M is linear in ${\tilde{u}}_{Π}$ and λ and should eliminate the variables u_B from eq. (19). Therefore, we have $M {(\tilde{u}, λ)}^{T} : = (M_{u_{B}} {(u_{B}, \tilde{u}, λ)}^{T}, {\tilde{u}}_{Π}^{T}, λ^{T})$ and $M_{u_{B}} (u_{B}, \tilde{u}, λ)$ solves the equation

K_{B} (M_{u_{B}} (u_{B}, \tilde{u}, λ), {\tilde{u}}_{Π}) + B_{B}^{T} λ - f_{B} = 0

To evaluate the preconditioner, eq. (20) has to be solved by Newton’s method. Due to the completely decoupled block structure of K_B , the solution of eq. (20) collapses to local Newton iterations, one on each subdomain. The different Newton iterations on different subdomains are independent of each other. This process thus exclusively consists of local work. The nonlinear problems act on the red area marked in Figure 4 (middle) and the linearized systems with the tangential matrices $D K_{i_{B B}}, i = 1, ..., N$ , are solved again by sequential sparse direct solvers. Let us note that the local Newton methods on the individual subdomains might need different numbers of iterations to converge. This introduces the load imbalance mentioned above. With the simpler notation

K_{B} (g_{B}, {\tilde{u}}_{Π}) + B_{B}^{T} λ - f_{B} = 0

and $g = (g_{B}, {\tilde{u}}_{Π})$ we have $(g, λ) = M (\tilde{u}, λ)$ and the evaluation of the preconditioner can be found in the algorithmic description in Figure 5 (right).
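The origin of the load imbalance can be made visible with a scalar caricature of the local problems. In the sketch below each subdomain is reduced to a single equation $α u^{3} + β u = b$ ; this reduction, the value of `b`, the initial guess, and the tolerance are assumptions chosen for illustration and do not reproduce the finite element problems of this paper.

```python
def local_newton_steps(alpha, beta, b, u0=0.0, tol=1e-10, max_it=100):
    """Count the Newton updates needed to solve alpha*u**3 + beta*u = b."""
    u = u0
    for steps in range(max_it):
        residual = alpha * u**3 + beta * u - b
        if abs(residual) < tol:
            return steps
        u -= residual / (3.0 * alpha * u**2 + beta)
    raise RuntimeError("local Newton iteration did not converge")

# Three linear subdomains (alpha = 0) and one nonlinear subdomain (alpha = 1)
subdomains = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
steps = [local_newton_steps(alpha, beta, b=10.0) for alpha, beta in subdomains]

# Number of local Newton steps each core waits for the slowest one
idle_steps = [max(steps) - s for s in steps]
print("local Newton steps:", steps)
print("idle steps:", idle_steps)
```

The linear subdomains finish after a single update, whereas the nonlinear one needs several, so the cores assigned to the linear subdomains are idle for most of the evaluation of the preconditioner.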

Finally, solving the linear system eq. (18) is equivalent to solving

(\begin{matrix} D \tilde{K} (g^{(k)}) & B^{T} \\ B & 0 \end{matrix}) (\begin{matrix} δ {\tilde{u}}^{(k)} \\ δ λ^{(k)} \end{matrix}) = (\begin{matrix} \mathrm{rhs}_{N L} \\ B g^{(k)} \end{matrix})

with

\mathrm{rhs}_{N L} := (\begin{matrix} 0 \\ {\tilde{K}}_{Π} (g^{(k)}) - {\tilde{f}}_{Π} \end{matrix}) .

A proof can be found in Klawonn et al. (2017). Since the left-hand sides in eq. (22) and eq. (12) have the same structure, any linear FETI-DP method can be used. As already described for Newton-Krylov-FETI-DP, we reduce the system to the Lagrange multipliers

F (g^{(k)}) δ λ^{(k)} = d_{N L} (g^{(k)})

with $d_{N L} (g^{(k)}) = B g^{(k)} - B {(D \tilde{K} (g^{(k)}))}^{- 1} \cdot \mathrm{rhs}_{N L}$ . Eq. (23) is solved iteratively with a CG or GMRES approach using a linear Dirichlet preconditioner $M_{D}^{- 1} (g^{(k)})$ as before. The Newton update $δ {\tilde{u}}^{(k)}$ is obtained by solving

D \tilde{K} (g^{(k)}) δ {\tilde{u}}^{(k)} = \mathrm{rhs}_{N L} - B^{T} δ λ^{(k)} .

An overview of the algorithm can be found in Figure 5 (right).

3. Nonlinear model problems

We choose nonlinear partial differential equations based on the p-Laplace operator with $p \geq 2$ as model problems. These problems suit our purpose well, since we can easily create large load imbalances, e.g. by enforcing a linear behavior on certain subdomains by choosing $p = 2$ . We concentrate on two different setups: the first one has a single nonlinearity in a single subdomain, and the second one exhibits a more equal distribution of nonlinear effects. The first one is designed to emphasize the potential to save energy, whereas the second one is closer to a real application. To describe both model problems in detail, let us first define the scaled p-Laplace operator for $p \geq 2$ by

α Δ_{p} u : = div (α | \nabla u |^{p - 2} \nabla u) .

Our nonlinear model problem is now defined as

\begin{array}{rl} - α Δ_{p} u - β Δ_{2} u = b & \text{in } Ω \\ u = 0 & \text{on } \partial Ω, \end{array}

where $α, β : Ω \to ℝ$ are coefficient functions. By variation, applying Green’s formula, and a discretization using the finite element space V^h we obtain

\int_{Ω} α | \nabla u |^{p - 2} \nabla u^{T} \nabla v + β \nabla u^{T} \nabla v d x = \int_{Ω} b v d x, \forall v \in V^{h} .

A restriction to a subdomain $Ω_{i}, i = 1, ..., N$ , and the corresponding finite element space W_i yields

\int_{Ω_{i}} α | \nabla u_{i} |^{p - 2} \nabla u_{i}^{T} \nabla v_{i} + β \nabla u_{i}^{T} \nabla v_{i} d x = \int_{Ω_{i}} b v_{i} d x, \forall v_{i} \in W_{i} .

Therefore, given a finite element basis $\{ψ_{1}, \dots, ψ_{n}\}$ of V^h , the operators of the global nonlinear problem (as described in eq. (6)) for the given model problem are defined by

\hat{K} (\hat{u}) := {(\int_{Ω} α | \nabla \hat{u} |^{p - 2} \nabla {\hat{u}}^{T} \nabla ψ_{1} + β \nabla {\hat{u}}^{T} \nabla ψ_{1} d x, ..., \int_{Ω} α | \nabla \hat{u} |^{p - 2} \nabla {\hat{u}}^{T} \nabla ψ_{n} + β \nabla {\hat{u}}^{T} \nabla ψ_{n} d x)}^{T}

and

\hat{f} : = {(\int_{Ω} b ψ_{1} d x,..., \int_{Ω} b ψ_{n} d x)}^{T} .

Restriction to a local finite element space W_i , given a finite element basis $\{ϕ_{1}, \dots, ϕ_{N_{i}}\} \subset \{ψ_{1}, \dots, ψ_{n}\}$ , yields

K_{i} (u_{i}) := {(\int_{Ω_{i}} α | \nabla u_{i} |^{p - 2} \nabla u_{i}^{T} \nabla ϕ_{1} + β \nabla u_{i}^{T} \nabla ϕ_{1} d x, ..., \int_{Ω_{i}} α | \nabla u_{i} |^{p - 2} \nabla u_{i}^{T} \nabla ϕ_{N_{i}} + β \nabla u_{i}^{T} \nabla ϕ_{N_{i}} d x)}^{T}

and

f_{i} : = {(\int_{Ω_{i}} b ϕ_{1} d x,..., \int_{Ω_{i}} b ϕ_{N_{i}} d x)}^{T} .

For the entries of the tangential matrices $D K_{i} (u_{i})$ , we obtain

\begin{array}{l} {(D K_{i} (u_{i}))}_{j, k} : = \int_{Ω_{i}} α | \nabla u_{i} |^{p - 2} \nabla ϕ_{j}^{T} \nabla ϕ_{k} d x \\ + \int_{Ω_{i}} β \nabla ϕ_{j}^{T} \nabla ϕ_{k} d x \\ + (p - 2) \int_{Ω_{i}} α | \nabla u_{i} |^{p - 4} (\nabla u_{i}^{T} \nabla ϕ_{j}) (\nabla u_{i}^{T} \nabla ϕ_{k}) d x \end{array}

by a direct computation.
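The three terms of the tangential matrix can be checked pointwise. The integrand of $K_{i}$ contains the flux $q (g) = α | g |^{p - 2} g + β g$ with $g = \nabla u_{i}$ , and the integrand of $D K_{i}$ is $\nabla ϕ_{j}^{T} D q (g) \nabla ϕ_{k}$ . The following sketch compares the analytic derivative $D q (g)$ with central finite differences; the evaluation point, the coefficients, and the step size are hypothetical choices.

```python
import numpy as np

def flux(g, alpha, beta, p):
    """Pointwise flux alpha*|g|^(p-2)*g + beta*g."""
    norm = np.linalg.norm(g)
    return alpha * norm ** (p - 2) * g + beta * g

def flux_jacobian(g, alpha, beta, p):
    """Analytic derivative of the flux with respect to g (valid for g != 0)."""
    norm = np.linalg.norm(g)
    eye = np.eye(g.size)
    return (alpha * norm ** (p - 2) * eye
            + beta * eye
            + (p - 2) * alpha * norm ** (p - 4) * np.outer(g, g))

def finite_difference_jacobian(g, alpha, beta, p, h=1e-6):
    """Central finite difference approximation of the flux derivative."""
    J = np.zeros((g.size, g.size))
    for k in range(g.size):
        e = np.zeros(g.size)
        e[k] = h
        J[:, k] = (flux(g + e, alpha, beta, p) - flux(g - e, alpha, beta, p)) / (2.0 * h)
    return J

g = np.array([0.3, -1.2])                 # hypothetical gradient at one quadrature point
J_exact = flux_jacobian(g, alpha=1.0, beta=1.0, p=4)
J_fd = finite_difference_jacobian(g, alpha=1.0, beta=1.0, p=4)
print(np.allclose(J_exact, J_fd, atol=1e-6))
```

The rank-one term `np.outer(g, g)` corresponds to the last integral in the formula above and vanishes for $p = 2$ , where the problem is linear.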

For the first configuration with a single nonlinearity, we consider a single inclusion of p-Laplace in a single subdomain and linear 2-Laplace in the remaining domain, i.e. $α = 1$ and $β = 0$ in the inclusion and $α = 0$ and $β = 1$ in the remaining domain. For the second configuration, where the nonlinearities are distributed equally, we have an equivalent inclusion in each subdomain. In the following, we refer to the first model problem as the problem with a single inclusion and to the second one as the problem with inclusions or many inclusions. For a detailed visualization of α and β in both cases, see Figure 6. Let us remark that we divide each subdomain into $800 \times 800$ smaller squares and each of these into two triangular finite elements with linear basis functions. The inclusions always measure $400 \times 400$ smaller squares, i.e. they fill a quarter of a subdomain. We always use $p = 4$ throughout this paper.

Figure 6.

Left: Model problem single inclusion with a single square inclusion of p-Laplace in linear 2-Laplace; Right: Model problem many inclusions with square inclusions of p-Laplace in each subdomain.

4. Reducing the energy consumption of nonlinear-FETI-DP

4.1. Related work and contribution

Reducing the energy consumption of HPC applications has attracted substantial attention in the last decade. In the context of MPI, several algorithms and runtime systems have been proposed (Hsu and Feng, 2005; Kappiah et al., 2005; Kerbyson et al., 2011; Lavrijsen et al., 2016; Li et al., 2010; Lim et al., 2011; Rountree et al., 2009; Vishnu et al., 2013), which typically follow the same idea: processes exhibiting a high slack time, i.e. time during which they are blocked waiting for other processes or communicating, can be clocked down to save energy. This approach is used in Yang and Yang (2008) for an energy-optimized barrier. Furthermore, they propose a second optimization where the core waiting in the barrier is shut down for a certain amount of time. However, they give no details on how this is achieved or how they obtain the energy measurements. A more fine-grained approach is presented in Chen and Dong (2012), where certain algorithms for barrier synchronization are coupled with dynamic voltage and frequency scaling (DVFS) to save energy during the idle time.

Basically, we combine the idea of shutting down the core in the barrier with speeding up busy cores by accessing the power budget of the “barrier core” to scale up their frequencies. We propose a simple way to shut down cores entering barriers which are known to have sufficiently long wait times. The corresponding energy gains are measured for different nonlinear decomposition methods and the most important contributors to power consumption are investigated.

4.2. Energy reduction approach

In the following, we divide the cores into two classes. Cores on which the inner local Newton iteration, i.e. the evaluation of the nonlinear preconditioner M, converges fast are called speeders, whereas cores which must perform more inner Newton iterations are called laggers. All cores are synchronized with a barrier. Speeder cores arrive early and must wait for the laggers to arrive. The residence time of the speeders in the barrier is on the order of seconds for our nonlinear DD method and is visualized in Figure 7 for Newton-Krylov-FETI-DP (top row) and Nonlinear-FETI-DP-3 (bottom row). As expected, the load imbalance for Newton-Krylov-FETI-DP is insignificant for both model problems and there is no potential for a reduction of the energy consumption. In contrast, for Nonlinear-FETI-DP-3, the load imbalance is present for both model problems. Considering a single inclusion, there is, as expected, exactly one lagger; see Figure 7 (bottom right).

Figure 7.

Time (given in seconds) of each core/subdomain spent in an MPI_Barrier. Example with 20 cores and subdomains on a single node. Top left: Many inclusions problem; solved by Newton-Krylov-FETI-DP. Top right: Single inclusion problem; solved by Newton-Krylov-FETI-DP. Bottom left: Many inclusions problem; solved by Nonlinear-FETI-DP-3. Bottom right: Single inclusion problem; solved by Nonlinear-FETI-DP-3.

As already mentioned, dynamic load balancing is currently not feasible. However, we can exploit the load imbalance to reduce the energy consumption of Nonlinear-FETI-DP-3. The typical approach for MPI applications is to clock cores down when they are either not on the critical path, i.e. their execution time does not affect the total runtime, or while they remain inside a barrier. This involves either a runtime system determining the critical path or programmatically adjusting a core’s frequency. Neither option is practical, as in most production environments on compute clusters users are not allowed to adjust the frequencies of selected cores dynamically. Alternatively, one may rely on the Linux operating system, which can adjust a core’s frequency automatically depending on the load. To do so, a CPU frequency governor must be enabled. This governor is load driven and gracefully adjusts the frequency accordingly. However, we found that cores executing a barrier remained at their highest frequency and were not throttled by the governor. This behavior is caused by the MPI progress engine polling the fabric for new messages, which leads to high instruction issue rates (see barrier benchmark below) and prevents the governor from reducing the clock speed. As the polling strategy decreases message latencies, it is often the default setting of MPI libraries and should only be changed in non-standard execution modes; see, e.g. the discussion for the Intel MPI library¹ and OpenMPI implementations.² Note that changing this default behavior would impact all MPI communication in the full application, leading to potential performance degradation.

Instead of reducing a speeder’s core frequency, a more effective strategy is to leverage a core’s (deep) sleep states. The operating system (OS) uses these states, e.g. to save energy when cores are idle. With current Linux kernels and a suitable processor, this behavior can be triggered implicitly by calling functions like sleep, usleep, or nanosleep. While a process is inside one of these functions, the OS can put the core into a deep sleep state as long as no other processes are scheduled on this core. These functions (we choose usleep) can be combined with a non-blocking MPI barrier (MPI_Ibarrier) as shown in Listing 1 to retain the functionality of the standard barrier (MPI_Barrier) while forcing the core into a sleep state. This method also works in situations where the clock frequency of the cores is fixed, since DVFS and C-states are orthogonal concepts.

Listing 1.

Non-blocking barrier with test-and-sleep loop as replacement for MPI_Barrier.

The correct choice of the duration for the sleep (sleep_duration) is crucial for several reasons. If the MPI implementation does not rely on progress threads, progress only occurs when an MPI function is called. This is also the case for a barrier, when communication must be performed. With a large sleep duration, this can extend the time a core spends inside the barrier and increase the global cost of the barrier. On the other hand, if the sleep duration is too small, deep sleep states are not entered. Entering a deep sleep state comes with a certain overhead, which increases with the depth of the sleep state. If the expected overhead is larger than the sleep duration, the desired sleep state is not used. For Nonlinear-FETI-DP-3, our chosen subdomain size, and our model problems, we found that a sleep duration of 10 ms is short enough not to extend the waiting time in the barrier and long enough to still enter the deepest sleep state. With shorter sleep times, e.g. 5 ms, the deepest sleep state was no longer used exclusively. Note that this behavior might depend on several factors like the application, the application’s communication pattern, the job size, the underlying MPI library, the OS, and the processor used.

An important side effect of sleeping cores is that laggers on the same chip have access to a higher fraction of the shared power budget of the chip. If DVFS is enabled, they can increase their clock speed and reduce their runtime. The more speeders are in deep sleep states, the higher the laggers can be clocked, so that they can potentially finish earlier.

4.3. Implementation remarks and testbed

All methods are implemented in PETSc (Balay et al., 2016a, 2016b) based on our nonlinear DD software (Klawonn et al., 2015) and using the same basic building blocks. Therefore, the timings and scaling results are comparable. We used PETSc version 3.6.4 and MKL-Pardiso from the MKL (Math Kernel Library) for all sparse direct solves. We refer to Klawonn et al. (2017) for a detailed description of our FETI-DP implementation.

We perform our experiments on the Linux-based Meggie cluster of the RRZE in Erlangen, Germany. All nodes are connected via Intel’s high speed 100 GBit Omni-Path network. One compute node consists of two Intel Xeon E5-2630 v4 processors with 10 cores each operating at a base frequency of 2.2 GHz. See Table 1 for further details on the compute node. Each processor has a thermal design power (TDP) of 85 W maximum (Intel Corp, 2016). However, the processor supports several features adjusting its power consumption which are briefly described in the following paragraphs.

Table 1.

Specifications of a Meggie compute node.

Processor name		Intel Xeon E5-2630 v4
Micro architecture		Broadwell
TDP	[W]	85
Frequency
Min	[GHz]	1.2
Base	[GHz]	2.2
Cores		10
ISA		AVX2
Sockets		2
L1 cache	[KiB]	32
L2 cache	[KiB]	256
L3 cache	[MiB]	25

The processor supports DVFS, which allows for dynamically adjusting each core’s frequency individually. A core’s frequency can be as low as 1.2 GHz.

Furthermore, Intel’s Turbo Boost is supported and enabled. This technique allows for dynamically overclocking a core as long as the chip stays inside its power envelope and does not exceed certain thermal constraints. The maximum clock speed depends on several factors like the temperature of the processor, the current workload, and the number of active cores.

Table 2 lists the maximum Turbo Boost frequencies depending on the number of active cores (Intel Corp, 2016). Note that these frequencies specify only an upper limit and might be lower because of the previously named reasons.

Table 2.

Maximum turbo frequency in GHz depending on the number of active cores for the Intel Xeon E5-2630 v4 processor according to Intel Corp (2016). Reported frequencies specify an upper limit and can be lower depending on e.g. current power consumption or temperature.

Active cores	1	2	3	4	5	6	7	8	9	10
Maximum Turbo frequency	3.1	3.1	2.9	2.8	2.7	2.6	2.5	2.4	2.4	2.4
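Table 2 can be read as a lookup from the number of active cores to an upper limit of the clock speed. The following sketch encodes the table and returns this limit for a given number of sleeping cores on one socket. It assumes that a core in a deep sleep state does not count as active, and the function name `turbo_limit` is introduced here for illustration only.

```python
# Maximum Turbo Boost frequency in GHz by number of active cores (Table 2).
MAX_TURBO_GHZ = {1: 3.1, 2: 3.1, 3: 2.9, 4: 2.8, 5: 2.7,
                 6: 2.6, 7: 2.5, 8: 2.4, 9: 2.4, 10: 2.4}
CORES_PER_SOCKET = 10


def turbo_limit(sleeping_cores):
    """Upper limit (GHz) of the clock speed of the cores that stay active."""
    active = CORES_PER_SOCKET - sleeping_cores
    if not 1 <= active <= CORES_PER_SOCKET:
        raise ValueError("need between 1 and 10 active cores")
    return MAX_TURBO_GHZ[active]


if __name__ == "__main__":
    for sleeping in (0, 5, 9):
        print(sleeping, "sleeping cores ->", turbo_limit(sleeping), "GHz at most")
```

With nine of the ten cores asleep, the remaining core may reach up to 3.1 GHz, compared with 2.4 GHz on a fully loaded socket. The frequency actually reached can be lower, as noted above.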

A core executing AVX workloads, more precisely 256 bit AVX instructions, gets its frequency dynamically reduced to the so-called AVX frequency. If no 256 bit AVX instruction is encountered for a certain amount of time, the core’s frequency is raised back to the non-AVX frequency. For the processor model used, it seems that Intel does not provide detailed information about the AVX frequencies. Only Microway (2018) lists 1.8 GHz as the base AVX frequency and 3.1 GHz as the highest AVX Turbo Boost frequency.

For advanced energy saving, each core supports (deep) sleep states, also known as C-states. A state higher than the normal operating state C0 marks a core as inactive, and the core thus requires less power. The power consumption decreases as the sleep state level increases. The processor used supports four such states, namely C1, C1E, C3, and C6, where the last one is the deepest.

The cluster runs CentOS with Linux kernel 3.10-862. The CPU frequency governor which adapts the clock frequencies is in “conservative mode” on all Meggie nodes. The Intel C/C++ compiler “17.0 update 1” and the Intel MPI library version “2017 update 1” were used. We used the likwid tool suite (Treibig et al., 2010) to measure performance, power and clock speeds. To determine power and energy consumption, likwid uses the Intel running average power limit (RAPL) interface, which delivers data of high quality (Hackenberg et al., 2015). All energy numbers and power consumption measurements include the contributions of the processor chip(s) and the main memory. No other devices are monitored (e.g. power supply or networks).

At the time of the first installation of Meggie, the true power consumption of the complete cluster was measured. A consumption of 210 kW was observed while performing the LINPACK benchmark, 65 kW when the cluster was idle, and still 7 kW when the cluster was in the power down state, i.e. with active management cards. Therefore, the energy saving potential due to algorithms and software is up to 145 kW for Meggie, i.e. approximately 70% of the power consumption is due to algorithms and software. The discussion on energy savings in this paper refers to this portion of the energy consumption.
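The saving potential quoted above follows from the difference between the power under load and the idle power. As a small check using only the values given in the text (the variable names are chosen here for illustration):

```python
linpack_kw = 210.0  # complete cluster running the LINPACK benchmark
idle_kw = 65.0      # complete cluster idle

# Portion of the power that depends on what the software is doing.
saving_potential_kw = linpack_kw - idle_kw
fraction = saving_potential_kw / linpack_kw

print(f"potential: {saving_potential_kw:.0f} kW ({fraction:.0%} of the LINPACK power)")
# prints: potential: 145 kW (69% of the LINPACK power)
```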

4.4. Single node measurements

Measurements of Newton-Krylov-FETI-DP and Nonlinear-FETI-DP-3 with the standard MPI_Barrier “b” and its replacement by a non-blocking barrier with a test-and-sleep loop “b-ts” for the two model problems are shown in Table 3 for 20 processes on a single Meggie node. We do not separately report the power consumption of main memory as its maximum contribution to overall power was always below 10%. As expected, Nonlinear-FETI-DP-3 has a shorter runtime and a lower energy consumption than Newton-Krylov-FETI-DP for both problems. For the many inclusions scenario with a moderate load imbalance, the choice of barrier implementation has nearly no impact on the runtime of Nonlinear-FETI-DP-3, but for “b-ts” the energy consumption is reduced by 7%. Let us also remark that the Nonlinear-FETI-DP-3 implementation with the standard barrier already reduces the consumed energy compared to Newton-Krylov-FETI-DP. This is a combined result of shorter runtime and lower power consumption (see Table 3). Even while waiting in the standard barrier (MPI_Barrier), a speeder requires less power than a lagger doing computations (see also the power consumption analysis of these operations in section 5). The more pronounced load imbalance of the single inclusion case strongly enhances these effects: energy savings of up to 53% and a runtime reduction of 33% compared with Newton-Krylov-FETI-DP are measured. Replacing the standard barrier by the test-and-sleep barrier reduces the runtime of Nonlinear-FETI-DP-3 by 8%.

Table 3.

Energy to solution, power consumption and runtime of Newton-Krylov-FETI-DP (NK) and Nonlinear-FETI-DP-3 (NL) for the many inclusions and the single inclusion problem on one Meggie compute node with standard MPI_Barrier “b” and the non-blocking barrier with test-and-sleep loop “b-ts”. The δ column shows the improvement of the test-and-sleep loop over the original barrier and the Δ column the improvement over classical Newton-Krylov-FETI-DP.

	Many inclusions					Single inclusion
	NK	NL, b	NL, b-ts	δ	Δ	NK	NL, b	NL, b-ts	δ	Δ
Energy [kJ]	57	44	41	7%	28%	47	29	22	24%	53%
Power [W]	136	126	120	5%	12%	136	114	94	18%	31%
Runtime [s]	419	348	340	2%	19%	347	251	232	8%	33%
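The entries of Table 3 are related by energy = average power × runtime, and the δ and Δ columns are relative reductions. The following sketch recomputes these quantities from the rounded table values. The helper names `energy_kj` and `improvement` are introduced here for illustration, and small deviations from the table are to be expected because the inputs are rounded.

```python
# (power [W], runtime [s], reported energy [kJ]) taken from Table 3.
TABLE3 = {
    "many inclusions": {"NK": (136, 419, 57),
                        "NL, b": (126, 348, 44),
                        "NL, b-ts": (120, 340, 41)},
    "single inclusion": {"NK": (136, 347, 47),
                         "NL, b": (114, 251, 29),
                         "NL, b-ts": (94, 232, 22)},
}


def energy_kj(power_w, runtime_s):
    """Energy to solution in kJ from average power and runtime."""
    return power_w * runtime_s / 1000.0


def improvement(reference, value):
    """Relative reduction of `value` with respect to `reference` in percent."""
    return 100.0 * (reference - value) / reference


if __name__ == "__main__":
    for problem, rows in TABLE3.items():
        for method, (power, runtime, energy) in rows.items():
            computed = energy_kj(power, runtime)
            print(f"{problem:16s} {method:8s} P*t = {computed:5.1f} kJ "
                  f"(reported {energy} kJ)")
        e_nk, e_b, e_ts = (rows[m][2] for m in ("NK", "NL, b", "NL, b-ts"))
        print(f"{problem:16s} delta = {improvement(e_b, e_ts):.0f}%, "
              f"Delta = {improvement(e_nk, e_ts):.0f}%")
```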

For a more detailed analysis of Nonlinear-FETI-DP-3, we performed time-resolved measurements. While executing the code, we measure the average core frequency using likwid (Treibig et al., 2010) with an averaging interval of 1 s, and we determine the fraction of time spent in the deepest sleep state C6 in each interval (using the cpu_idle driver). Figure 8 shows the time-resolved plot for both model problems on a full node of Meggie using 20 cores. Whenever no core is in the C6 state (lower panels in Figure 8), all cores achieve the maximum turbo frequency of 2.4 GHz for fully loaded sockets (see upper panels in Figure 8). However, speeders entering sleep states make room for overclocking the laggers: if some cores’ frequencies go down to 1.2 GHz and their fraction of time spent in the C6 state is close to 100%, the remaining cores get a boost in clock speed. A nice side effect of the measurements presented in Figure 8 is that one can read off the structure of a typical application of a nonlinear domain decomposition method: an inner iteration, where the cores run into some synchronization point one after another, followed by a linear solve, where all cores participate with similar effort. For example, for the many inclusions case, four outer Newton iterations and linear solves are executed. It is also typical that, with convergence of the outer loop, the inner iterations need fewer steps, since the initial values get closer to the solutions.

Figure 8.

Time-resolved execution of the Nonlinear-FETI-DP-3 algorithm with the test-and-sleep barrier for the many inclusions (a) and the single inclusion problem (b) for each of the 20 cores of a Meggie (different colors) compute node. Top row: average frequency during measurement intervals of 1 s; Bottom row: fraction of the 1 s time interval spent in the deepest sleep state C6.

Depending on how many cores are already sleeping, i.e. how many cores have finished the inner loop, the processor core frequency is raised to around 3.0 GHz according to the graphs. Even in the case of only a single inclusion in Figure 8b, where only one lagger is present, it seems that the possible 3.1 GHz Turbo Boost frequency is not reached exactly. Several reasons may account for that slight deviation, e.g. the finite length of the averaging interval and the frequent wake ups of the speeders to check for barrier progress.

The measurements in Figure 8 indicate a frequency of 1.2 GHz for cores spending nearly the complete 1 s averaging interval in a deep sleep state. During that time, cores are effectively halted and reduce their clock speed to (almost) zero. The reported value is an artifact of the averaging interval being much longer than the sleep periods (10 ms). In such a scenario, likwid reports the clock speed of the core when it wakes up for barrier testing, which is done at the lowest possible frequency of 1.2 GHz.

4.5. Multi node measurements

For multi node measurements, we employ weak scaling by doubling the number of subdomains in each spatial dimension. Note that the problem with the single inclusion keeps only a single inclusion in a single subdomain. Figure 9 shows the total energy consumption and runtime of Newton-Krylov-FETI-DP and Nonlinear-FETI-DP-3 whereas Figure 10 shows the average power per core. As already observed for the single node measurements, Nonlinear-FETI-DP-3 outperforms Newton-Krylov-FETI-DP in terms of time to solution and total energy consumption. Additionally, the weak scaling behavior is superior, not only considering the runtime but also with respect to the power per core; see Figure 10. Of course, the weak scalability is not perfect and the time to solution is slightly increasing. This is a numerical effect and caused by slightly growing numbers of inner iterations. Also the size of the global coarse problem grows proportional to the number of subdomains, which is a well known bottleneck of the method; see also section 1.

Figure 9.

Total energy consumption (top panels) and runtime (bottom panels) of Newton-Krylov-FETI-DP and Nonlinear-FETI-DP-3 for both model problems—many inclusions (left) and single inclusion (right)—on the Meggie cluster. “NK” is the traditional Newton-Krylov-FETI-DP method where all cores are active at all times. “NL, b” is the Nonlinear-FETI-DP-3 method with standard MPI barriers. “NL, b-ts” is the new implementation making use of test-and-sleep barriers to save energy and increase the thermal budget of other cores of the same processor.

Figure 10.

Average power consumption per core of Newton-Krylov-FETI-DP and Nonlinear-FETI-DP-3 for two model problems—many inclusions (left) and single inclusion (right)—on the Meggie cluster.

Let us note that the power per core even decreases during the weak scaling study, since the barrier time increases with increasing core counts. This characteristic increase of the share of barrier time in the overall runtime emphasizes the need for power efficient barriers in large scale computations. Thus, in our study, the effect of the test-and-sleep barrier is more significant on 256 nodes than on a single node. Here, the energy per core is reduced by 23% for the many inclusions case and even 37% for the single inclusion example. The runtime for Nonlinear-FETI-DP-3 using the test-and-sleep barrier is also reduced by 6% for the single inclusion problem, since the governor has the opportunity to overclock the single lagger. Let us finally remark that Nonlinear-FETI-DP-3 using a test-and-sleep barrier can save up to 77% of energy on 256 nodes compared to Newton-Krylov-FETI-DP (see Figure 9).

4.6. Other solvers

To illustrate that the concepts introduced here can be carried over to other nonlinear domain decomposition methods besides Nonlinear-FETI-DP-3, we briefly present some single node measurements for ASPIN (Additive Schwarz Preconditioned Inexact Newton) (Cai and Keyes, 2002). We therefore use the ASPIN implementation available in PETSc (Balay et al., 2016a, 2016b) and the p-Laplace example provided by PETSc as ex15.c; see Brune et al. (2015, section 6.5) for details. Let us remark that in ex15 finite differences are used instead of finite elements. Written in our notation from section 3, we have $β = ε$ and $α = 1$ on the whole domain. In contrast to Brune et al. (2015), we use $p = 4$ and an overlap of 10 finite elements.

For a detailed description of ASPIN, we refer to Cai and Keyes (2002). Let us just remark that ASPIN is a nonlinearly left-preconditioned method and that the nonlinear preconditioner reduces to independent nonlinear Dirichlet problems on overlapping subdomains.

The ASPIN solver exhibits a load imbalance as shown in Figure 11 when evaluating the nonlinear preconditioner. In contrast to our FETI-DP implementation, where the time was spent in barriers, here the imbalance is visible through long waiting times in MPI_Waitany (red bars in Figure 11). We applied the same scheme as for FETI-DP, but instead of replacing the barrier by a test-and-sleep loop we did this for MPI_Waitany. Furthermore, we employ the PMPI interface to transparently replace calls to MPI_Waitany by the test-and-sleep loop.

Figure 11.

Trace of ASPIN obtained with Intel Trace Analyzer on one Meggie node. Blue bars denote time spent in the application, whereas red bars denote time spent in MPI; here predominantly MPI_Waitany.

The energy consumption and runtime for the wait and test-and-sleep cases are reported in Table 4. As for Nonlinear-FETI-DP-3, test-and-sleep reduces the energy consumption, here by 12.5%, and has no negative impact on the performance.

Table 4.

Energy consumption and runtime of ASPIN for PETSc example ex15.c on one Meggie compute node with MPI_Waitany “wa” and the test-and-sleep loop “wa-ts”.

		wa	wa-ts	Δ
Energy	[kJ]	16	14	−12.5%
Runtime	[s]	136	134	−1%

5. Analysis of basic power contributions

In a final step we make sense of the power measurements presented in Figure 10 by low level benchmarking and simplistic power modeling. We choose the single inclusion scenario in the limit of large processor counts to identify the relevant power contributions. For the Nonlinear-FETI-DP-3 method this is the extreme case where only one processor is computing throughout while the remaining ones execute the standard barrier or its test-and-sleep replacement most of the time. Thus, overall power consumption will be determined by executing the standard barrier or the baseline power of the processors being in sleep state. For the Newton-Krylov-FETI-DP method time to solution and power consumption will be mostly governed by computation and it can serve as a reference for the power contributions of computations of all methods considered in this paper. To separate these effects we first run several benchmarks on full sockets (10 cores) and report their power consumption and power variation across 500 sockets in Figure 12:

barrier: An artificial micro-application where 10 cores on the first socket are waiting inside an MPI_Barrier for the 1st core on the second socket, which is sleeping for several seconds. Power drawn by the first socket is measured only.

NK: A Newton-Krylov-FETI-DP solver used to solve the problem with many inclusions. This represents a typical instruction mix for an inexact Newton method using a DD approach as linear solver.

Figure 12.

Distribution of power consumption of processors and their associated memory from the Meggie cluster under different workloads: DGEMM, Newton-Krylov (NK), and barrier. Note the different scaling on the x axis.

In addition, we provide measurements for dense matrix-matrix multiplication (using DGEMM from Intel MKL), which is considered to be an upper limit for application power consumption. Note that it is not sufficient to measure a single chip, as there can be a substantial power variation between different chips of the same processor type running at the same clock speed (Inadomi et al., 2015; Rountree et al., 2012). Accordingly, for all benchmarks we find substantial power fluctuations across the processor chips in our compute cluster (see Figure 12). The different power levels drawn by these three corner cases directly relate to their hardware utilization, as confirmed by likwid measurements for typical hardware utilization metrics such as instruction throughput (IPC) and cache/memory bandwidths. The barrier benchmark (IPC ≈ 0.4 inst./cycle³; memory bandwidth < 100 MB/s) and DGEMM (IPC ≈ 3.3 inst./cycle; cache bandwidths > 50 GB/s) represent the extreme cases of hardware utilization and power consumption. The NK benchmark lies in between, drawing a memory bandwidth of approximately 31 GB/s and executing 1.5 instructions per cycle. Extracting the average power per core from Figure 12 and comparing with Figure 10, we find very good agreement between the NK benchmark (approx. 7 Watt/core) and the Newton-Krylov-FETI-DP solver as well as between the barrier benchmark (approx. 4.5–5 Watt/core) and Nonlinear-FETI-DP-3 with a standard barrier from MPI. This is a clear indication that the power consumption of Nonlinear-FETI-DP-3 with prominent load imbalances can be substantially impacted by the MPI barrier.

In a final step we attempt to understand the power level of Nonlinear-FETI-DP-3 with our test-and-sleep barrier implementation (approx. 3.6 Watt/core; single inclusion in Figure 10). We expect that this barrier implementation should have marginal power contributions as the idle cores are in a deep sleep state. Here, the baseline power of the chip $P_{base}$ should be the dominating factor. At constant clock speed, the total power consumption $P_{t}$ of a processor chip with $c$ active cores can be approximated by

$$P_{t}(c) = P_{base} + c \, P_{core}, \qquad (31)$$

where $P_{core}$ is the power required to activate an additional core (see e.g. Hager et al., 2013). The unknown values of $P_{core}$ and $P_{base}$ can be determined by fitting eq. (31) to power measurements on single sockets when running the Newton-Krylov-FETI-DP solver with the number of cores varying from 1 to 10, i.e. $c = 1, \ldots, 10$. Doing these experiments on 500 chips of our Meggie cluster for the single and many inclusions benchmarks, we find $P_{base} \approx 31$ Watt for a full socket (see Figure 13), i.e. approximately 3.1 Watt per core on a fully populated 10-core socket. This is in good agreement with the measured value of 3.6 Watt per core in Figure 10 for the single inclusion benchmark at 5120 cores. The $P_{base}$ value estimated above is a lower limit for any application running on the system, i.e. 3.1 Watt per core is a lower limit in Figure 10. Executing a barrier (Newton-Krylov-FETI-DP) adds approximately 50% (120%) of dynamic power on top of that (cf. Figure 12).
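The fit of the linear power model can be sketched with an ordinary least-squares regression. The data below are synthetic values generated from the reported estimates ($P_{base} \approx 31$ Watt, $P_{core} \approx 3.6$ Watt) and are not measurements from this study. The function name `fit_power_model` is introduced here for illustration, and the fitting procedure actually used is not specified in the text.

```python
def fit_power_model(cores, power):
    """Ordinary least-squares fit of P_t(c) = P_base + c * P_core.

    Returns the tuple (P_base, P_core).
    """
    n = len(cores)
    mean_c = sum(cores) / n
    mean_p = sum(power) / n
    s_cc = sum((c - mean_c) ** 2 for c in cores)
    s_cp = sum((c - mean_c) * (p - mean_p) for c, p in zip(cores, power))
    p_core = s_cp / s_cc
    p_base = mean_p - p_core * mean_c
    return p_base, p_core


if __name__ == "__main__":
    # Synthetic socket power in Watt for c = 1, ..., 10 active cores.
    # These are NOT measured values; they only illustrate the procedure.
    cores = list(range(1, 11))
    power = [31.0 + 3.6 * c for c in cores]

    p_base, p_core = fit_power_model(cores, power)
    print(f"P_base = {p_base:.1f} W, P_core = {p_core:.1f} W")
    print(f"baseline per core on a 10-core socket: {p_base / 10:.1f} W")
```

Dividing the fitted baseline by the 10 cores of a socket gives the lower limit of about 3.1 Watt per core that is compared with Figure 10.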

Figure 13.

Fitted values of base power $P_{base}$ and core power $P_{core}$ from single socket measurements on the Meggie cluster for the problem with many inclusions (left) and a single inclusion (right) solved with the Newton-Krylov-FETI-DP solver. All values shown are medians. We find $P_{base} \approx 31$ Watt for a full socket and $P_{core} \approx 3.6$ Watt per core.

Our results clearly substantiate the potentially high impact of standard barrier implementations on the power consumption of applications exhibiting load imbalances. As future architectures are expected to be more dynamic in terms of power consumption, including lower baseline powers, the need for power efficient implementations of solvers and barriers will increase accordingly.

6. Conclusion

We have shown that nonlinear DD methods can reduce the energy consumption, compared to standard Newton-Krylov-DD approaches, first by a reduction of time to solution, and, additionally, by a better power efficiency.

For nonlinear DD methods the efficiency can be improved further by using a nonblocking barrier and actively setting cores to sleep mode.

For the example of Nonlinear-FETI-DP-3 and different model problems, energy savings up to 77% can be reached, as measured by the RAPL hardware counter, without affecting the runtime. The concepts introduced in this paper can be easily carried over to many nonlinear DD approaches, as, e.g. ASPIN or nonlinear BDDC, and can be combined with approaches to reduce the load imbalance.

Footnotes

Acknowledgements

The authors gratefully acknowledge the compute resources granted on Meggie and support provided by the Erlangen Regional Computing Center (RRZE).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the German Research Foundation (DFG) through the Priority Programme 1648 “Software for Exascale Computing” (SPPEXA) in the DFG projects 230723766 (EXASTEEL II) and 230898523 (ESSEX II).

ORCID iD

Axel Klawonn

Notes

References

Balay

Abhyankar

Adams

, et al. (2016a) PETSc Web page. Available at: http://www.mcs.anl.gov/petsc (acceessed May 2018).

Balay

Abhyankar

Adams

, et al. (2016b) PETSc users manual. Technical Report ANL-95/11, Revision 3.7, Argonne National Laboratory. Available at: http://www.mcs.anl.gov/petsc (acceessed May 2018).

Brune

Knepley

Smith

, et al. (2015) Composing scalable nonlinear algebraic solvers. SIAM Review 57(4): 535–565.

Cai

Keyes

(2002) Nonlinearly preconditioned inexact Newton algorithms. SIAM Journal on Scientific Computing 24(1): 183–200.

Cai

Keyes

Marcinkowski

(2002a) Non-linear additive Schwarz preconditioners and application in computational fluid dynamics. International Journal for Numerical Methods in Fluids 40(12): 1463–1470. LMS Workshop on Domain Decomposition Methods in Fluid Mechanics (London, 2001).

Cai

Keyes

Young

(2002b) A nonlinear additive Schwarz preconditioned inexact Newton method for shocked duct flows. In: Domain Decomposition Methods in Science and Engineering (Lyon, 2000), Theory and Engineering Applications of Computational Methods , Barcelona: International Centre for Numerical Methods in Engineering (CIMNE), pp. 345–352.

Chen

Dong

(2012) Energy optimization of representative barrier algorithms. Journal of Central South University 19(10): 2823–2831.

Dolean

Gander

Kheriji

, et al. (2016) Nonlinear preconditioning: How to use a nonlinear Schwarz method to precondition Newton’s method. SIAM Journal on Scientific Computing 38(6): A3357–A3380.

Farhat

Lesoinne

LeTallec

, et al. (2001) FETI-DP: a dual-primal unified FETI method—part I: a faster alternative to the two-level FETI method. International Journal for Numerical Methods in Engineering 50: 1523–1544.

10.

Farhat

Lesoinne

Pierson

(2000) A scalable dual-primal domain decomposition method. Numerical Linear Algebra with Applications 7: 687–714.

11.

Groß

(2009) A unifying theory for nonlinear additively and multiplicatively preconditioned globalization strategies: convergence results and examples from the field of nonlinear elastostatics and elastodynamics. PhD Thesis, Rheinische Friedrich-Wilhelms-Universität Bonn, Germany.

12.

Groß

Krause

(2011) On the globalization of ASPIN employing trust-region control strategies—convergence analysis and numerical examples. Technical Report 2011-03, Institute Computer Centre, Universita della Svizzera Italiana.

13.

Hackenberg

Schöne

Ilsche

, et al. (2015) An energy efficiency feature survey of the Intel Haswell processor. In: 2015 IEEE international parallel and distributed processing symposium workshop, Hyderabad, India, 25–29 May 2015, pp. 896–904.

14.

Hager

Treibig

Habich

, et al. (2013) Exploring performance and power properties of modern multi-core chips via simple machine models. Concurrency and Computation: Practice and Experience 28(2): 189–210.

15.

Hsu

Feng

(2005) A power-aware run-time system for high-performance computing. In: Proceedings of the 2005 ACM/IEEE conference on supercomputing, SC ’05. Washington, DC, USA: IEEE Computer Society. ISBN 1-59593-061-2. DOI: 10.1109/SC.2005.3.

16.

Huber

Gmeiner

Rüde

, et al. (2016) Resilience for massively parallel multigrid solvers. SIAM Journal on Scientific Computing 38(5): S217–S239.

17.

Inadomi

Patki

Inoue

, et al. (2015) Analyzing and mitigating the impact of manufacturing variability in power-constrained supercomputing. In: Proceedings of the international conference for high performance computing, networking, storage and analysis, SC ’15. New York, NY, USA: ACM, pp. 78:1–78:12. ISBN 978-1-4503-3723-6. DOI: 10.1145/2807591.2807638.

18.

Intel Corp (2016) Intel Xeon processor E5-2600 v4 product family specification update. Version: December 2016.

19.

Kappiah

Freeh

Lowenthal

(2005) Just in time dynamic voltage scaling: exploiting inter-node slack to save energy in MPI programs. In: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 conference, pp. 33–33. DOI: 10.1109/SC.2005.39.

20.

Kerbyson

Vishnu

Barker

(2011) Energy templates: exploiting application information to save energy. In: Proceedings of 2011 IEEE international conference on cluster computing CLUSTER 2011. DOI: 10.1109/CLUSTER.2011.33.

21.

Klawonn

Lanser

Rheinbach

(2014) Nonlinear FETI-DP and BDDC methods. SIAM Journal on Scientific Computing 36(2): A737–A765.

22.

Klawonn

Lanser

Rheinbach

(2015) Toward extremely scalable nonlinear domain decomposition methods for elliptic partial differential equations. SIAM Journal on Scientific Computing 37(6): C667–C696.

23.

Klawonn

Lanser

Rheinbach

(2016) A highly scalable implementation of inexact nonlinear FETI-DP without sparse direct solvers. In: Numerical mathematics and advanced applications—ENUMATH 2015, Lecture Notes in Computational Science and Engineering , vol. 112. Cham: Springer, pp. 255–264.

24.

Klawonn

Lanser

Rheinbach

(2018) Nonlinear BDDC methods with approximate solvers. Electronic Transactions on Numerical Analysis 49: 244–273.

25.

Klawonn

Lanser

Rheinbach

, et al. (2017) Nonlinear FETI-DP and BDDC methods: a unified framework and parallel results. SIAM Journal on Scientific Computing 39(6): C417–C451.

26.

Klawonn

Rheinbach

(2007) Robust FETI-DP methods for heterogeneous three dimensional elasticity problems. Computer Methods in Applied Mechanics and Engineering 196(8): 1400–1414.

27.

Klawonn

Rheinbach

(2010) Highly scalable parallel domain decomposition methods with an application to biomechanics. ZAMM Journal of Applied Mathematics and Mechanics 90(1): 5–32.

28.

Klawonn, Widlund (2006) Dual-primal FETI methods for linear elasticity. Communications on Pure and Applied Mathematics 59: 1523–1572.

29.

Klawonn, Widlund, Dryja (2002) Dual-primal FETI methods for three-dimensional elliptic problems with heterogeneous coefficients. SIAM Journal on Numerical Analysis 40(1): 159–179.

30.

Lavrijsen, Iancu, de Jong, et al. (2016) Exploiting variability for energy optimization of parallel programs. In: Proceedings of the 11th European conference on computer systems, EuroSys ’16. New York, NY, USA: ACM, pp. 9:1–9:16. ISBN 978-1-4503-4240-7.

31.

Li, de Supinski, Schulz, et al. (2010) Hybrid MPI/OpenMP power-aware computing. In: 2010 IEEE international symposium on parallel distributed processing (IPDPS), pp. 1–12. DOI: 10.1109/IPDPS.2010.5470463.

32.

Lim, Freeh, Lowenthal (2011) Adaptive, transparent CPU scaling algorithms leveraging inter-node MPI communication regions. Parallel Computing 37(10): 667–683.

33.

Liu, Keyes (2015) Field-split preconditioned inexact Newton algorithms. SIAM Journal on Scientific Computing 37(3): A1388–A1409.

34.

Liu, Keyes, Krause (2018) A note on adaptive nonlinear preconditioning techniques. SIAM Journal on Scientific Computing 40(2): A1171–A1186.

35.

Marcinkowski, Cai (2005) Parallel performance of some two-level ASPIN algorithms. In: Domain decomposition methods in science and engineering, Lecture Notes in Computational Science and Engineering, vol. 40. Berlin: Springer, pp. 639–646.

36.

Microway (2018) Detailed specifications of the Intel Xeon E5-2600v4 “Broadwell-EP” processors. Available at: https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-intel-xeon-e5-2600v4-broadwell-ep-processors/ (accessed 6 May 2018).

37.

Negrello, Gosselet, Rey (2018) Nonlinearly preconditioned FETI solver for substructured formulations of nonlinear problems. Available at: https://hal.archives-ouvertes.fr/hal-01774589. Working paper or preprint (accessed May 2018).

38.

Rountree, Ahn, de Supinski, et al. (2012) Beyond DVFS: a first look at performance under a hardware-enforced power bound. In: 2012 IEEE 26th international parallel and distributed processing symposium workshops and PhD forum, pp. 947–953. DOI: 10.1109/IPDPSW.2012.116.

39.

Rountree, Lowenthal, de Supinski, et al. (2009) Adagio: making DVS practical for complex HPC applications. In: Proceedings of the 23rd international conference on supercomputing, ICS ’09. New York, NY, USA: ACM, pp. 460–469. ISBN 978-1-60558-498-0. DOI: 10.1145/1542275.1542340.

40.

Treibig, Hager, Wellein (2010) LIKWID: a lightweight performance-oriented tool suite for x86 multicore environments. In: Proceedings of PSTI2010, the first international workshop on parallel software tools and tool infrastructures, San Diego, CA, 13–16 September 2010.

41.

Vishnu, Song, Marquez, et al. (2013) Designing energy efficient communication runtime systems: a view from PGAS models. The Journal of Supercomputing 63(3): 691–709.

42.

Yang (2008) Exploiting energy saving opportunity of barrier operation in MPI programs. In: Proceedings of the 2008 second Asia international conference on modelling simulation (AMS), pp. 144–149. DOI: 10.1109/AMS.2008.80.