Combining lattice Boltzmann and discrete element methods on a graphics processor

Abstract

The current generation of graphics cards allows great flexibility in programming. With the introduction of general purpose programming languages for graphics cards, many fields of scientific application will benefit greatly from adapting to this new programming model. Due to the differences in the memory and execution models, not all algorithms can be applied. However, the lattice Boltzmann method can be used to great effect. It allows the simulation of fluids using basic arithmetic operations with a linear complexity, as will be demonstrated. Additionally, the discrete element method can also be adapted to the new model. After outlining the methods themselves and the integration of these two methods into a single simulation, this article will show a way to implement it on graphics cards using the CUDA platform.

Keywords

GPGPU LBM DEM CUDA lattice Boltzmann

1 Introduction

In the pharmaceutical industry, medicine is often administered in the form of pills. To produce these pills, a combination of agents has to be mixed in a certain predefined ratio. The mixing process must ensure that these ratios are the same in every single pill, avoiding problems like the heavy agents separating into one half of the produced lot and the light agents into the other half.

In order to accomplish this, mixing equipment has to be specifically designed to avoid these issues. Currently, this process is done by using existing experience and doing expensive real-world tests. Further applications in this area are outlined by Radeke et al. (2010).

Mixing machinery designed for this application uses rotating blades and air injectors to mix the agents. Powder can be simulated as particles (simplified to spheres in this article). Usually, millions of particles will have to be used to properly simulate the agents, with different classes of particles (differing in size and mass). The injected air lends itself to a fluid simulation, and the rigid parts can be thought of as boundary conditions for both the particles and the fluid.

Computational simulation research in this area is currently limited by the available computing performance. However, the current generation of graphics cards has the potential to overcome this limitation. This means it can be reasonably expected that three-dimensional computational fluid dynamics simulations can be used in a real-time (meaning in this context that the simulated duration is equal to or larger than the computational duration needed) or near-real-time simulation system. This paper demonstrates such a system.

Since the Compute Unified Device Architecture (CUDA) technology was invented by Nvidia to simplify the development of general-purpose computations on graphics processors (GPGPU), it is used in the implementation explained in this article. Another option for achieving this goal is the OpenCL standard, which is supported by multiple vendors but is otherwise equivalent. All findings of this article also apply to implementations based on that standard.

As explained in the CUDA programmer’s documentation by Nvidia, the major performance bottleneck in graphics card programming is memory access. Thus, a special emphasis in this article is the memory layout. An even larger issue is transferring data from the computer’s main memory to the graphics processor’s memory, or vice versa. Since the visualization can be done on the graphics processor, no memory copies are necessary for this task. On the other hand, a fluid simulation requires input from a rigid- (or soft-) body physics simulation in order to enable a full dynamic simulation that does not require artificial domain manipulation like creating or removing fluid particles at a specific location. Traditional physics engines like Intel Havok, Nvidia PhysX, ODE and bullet are either fully or partially central-processor-based and thus an integration would require copying the rigid- or soft-body information between the two memory areas (Nvidia PhysX uses the graphics hardware for some of the calculations, but the interface to the application remains on the CPU only, and thus here even two-way copies are necessary). There has been some research on implementing a complete rigid-body simulation (Harada, 2007), so another focus of this article is how to integrate the results of these simulations with the fluid calculations.

The following section gives a brief overview of the hardware interface of CUDA in order to establish the mindset needed for developing on such a platform. Section 3 introduces the discrete element method for simulating rigid particles, Section 4 explains the fluid simulation and Section 5 expands on this by introducing solid macroscopic obstacles. Section 6 outlines the program flow used in our implementation and Section 7 evaluates it. Section 8 concludes this paper.

2 CUDA

CUDA is an acronym for ‘Compute Unified Device Architecture’, although it is never referred to by its full name. It was invented by Nvidia for their graphics cards, because they saw a large market in applications in general computing that could be accomplished on their device architecture in a much more efficient way than on traditional central-processing units. Before its existence, researchers had to map their problems to graphics-processing tasks like pixel shaders and blend operations.

CUDA allows the implementation of an algorithm in a language very similar to C++, even having source-level integration (interleaving device- and host-functions, using the same data structures and even compiling a function for both at the same time). However, since the graphics card has differing calling semantics, the language had to be adapted. The differences in architecture and the consequences of these differences will be outlined in this section.

2.1 Processor architecture

Unlike traditional processor architectures, graphics cards are designed for parallelization from the ground up.

On a single card, many multiprocessors operate in concert. Traditionally, a single-instruction multiple-data (SIMD) architecture is used to allow commands like vector addition to operate on multiple elements at the same time. The multiprocessors here use a different concept, called single-instruction multiple-threads (SIMT). In this architecture, multiple threads running the same code with partially differing input values are batched together (into a block) and executed. When an arithmetic instruction or a small memory-fetch operation on consecutive memory addresses is to be executed by multiple threads at the same time, these operations are done in parallel (or, in the second case, combined into a larger memory-fetch operation). Otherwise, the threads are serialized. Unlike SIMD, this concept allows an arbitrary number of parallel operations to be executed by the hardware, without any adaptations in the program code. Furthermore, it is easier to program, since the programmer only has to handle a single processing unit in a regular linear fashion; parallelization is done automatically where possible.

However, a side-effect is that optimization is harder to do than in SIMD architectures, since optimized operations are not directly visible in the source code. Whenever two threads diverge by going into different code-paths due to branching, the operations cannot be parallelized and performance suffers. The control flow of these programs should therefore be as linear as possible. The lattice Boltzmann method has this property, as will be demonstrated in Section 4.

3 Discrete element method

The discrete element method (DEM) is an algorithm for calculating the behavior of a large number of particles (usually in the range of 100 000 to millions) in a confined space. It accounts for the velocity and rotation of these particles and aims to simulate the behavior of fluid-like masses like sand, powders and cereals. It is also possible to integrate linear, rolling and radial friction into the collision-handling algorithm.

Note that these particles behave differently to a fluid. For example, the forces between sand particles flowing through a funnel cause the effect used in an hourglass, which is that the flow is constant over time, irrespective of the amount of sand pushing onto the point of constriction. This is not the case for fluids.

It was introduced by Cundall and Strack (1979) for two dimensions and later extended to three. The particles are simplified to spheres, although the model can be extended to other shapes.

The basic idea behind DEM is to use discrete time-steps in the simulation, and assume that the velocity and acceleration between those steps are constant. When choosing a small enough time-step, this assumption causes a negligible error in the results. When two particles collide in the real world, they deform, exerting a force in the direction of the contact. This deformation is simulated by letting the particles overlap. When the collision between two particles can be adequately simulated (meaning that it is within the error tolerances), a domain with many concurrent collisions can be simulated by handling these collisions sequentially (essentially having an atomic time-step, where the collisions happen at once but independently in the simulation). For the mathematical details on how to calculate the forces resulting from a collision, the reader is referred to Radeke (2006).

3.1 The contact list

The tangential overlap $δ_{t}$ between two colliding particles is affected by previous contacts, which account for forces that have built up previously (like a spring that has expanded over time). This is expressed as:

δ_{t} (t) = \int_{t_{0}}^{t} {\dot{δ}}_{t} (x) d x .

Since the simulation runs in discrete time-steps, the integral has to be replaced by a sum:

δ_{t} (t) = \sum_{x = t_{0}}^{t} {\dot{δ}}_{t} (x) .

Since it is infeasible to record ${\dot{δ}}_{t}$ for every time-step and contact, a separate contact list is kept where the sum of all ${\dot{δ}}_{t}$ is kept for every contact, and zeroed when the contact ceases to exist. Since the number of contacts at a single point in time is variable and the data structures for it cannot be allocated dynamically using CUDA, a theoretical maximum number of contacts has to be determined and space allocated for it. When all particles are equally sized, only 12 contact pairs per particle are possible. When this is not the case, the worst-case scenario has to be calculated depending on the particle parameters first.

This contact list requires a significant amount of space in global memory on the graphics card, which is already used for other aspects of the simulation, like the particles themselves and the fluid simulation. However, since we assume that most contacts only last for a single time-step (as is the case in a sparse particle field with few collisions), ${\dot{δ}}_{t} (x)$ for $x < t$ is considered zero and the contact list can be omitted. This introduces a small error into the simulation in exchange for increased performance and reduced memory requirements. This error is reduced when combining the DEM with a fluid simulation, since the particles are not compressed into a tight package, even when gravity is enabled.

3.2 Locating contacts

Locating the contacts in a single simulation step is also non-trivial and thus has to be considered separately.

In order to verify whether two particles with indices $i$ and $j$ collide, it is necessary to compare the Euclidean distance between their centers to the sum of their radii. When the overlap $δ_{i j}$ (the sum of the radii minus the distance) is zero, the two particles touch each other at a single point. When it is larger than zero, they intersect, and otherwise no collision takes place.

Using this approach, the naïve solution is to check every possible particle pair using this equation. However, this algorithm has a complexity of $O (n^{2})$ and thus there is room for improvement for the task at hand.
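The overlap test and the naive pairwise search can be sketched as follows. This is only an illustration under stated assumptions: the `Particle` record and the function names are hypothetical and not taken from the implementation described in this article, and the code runs sequentially on the host rather than on the graphics processor.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle record; the field names are illustrative only.
struct Particle {
    float x, y, z;  // center position
    float radius;
};

// Overlap delta_ij = (r_i + r_j) - |x_i - x_j|.
// > 0: the spheres intersect, == 0: they touch in one point, < 0: no contact.
inline float overlap(const Particle& a, const Particle& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    const float distance = std::sqrt(dx * dx + dy * dy + dz * dz);
    return (a.radius + b.radius) - distance;
}

// Naive O(n^2) search: every unordered pair is tested exactly once.
inline std::size_t countContactsNaive(const std::vector<Particle>& particles) {
    std::size_t contacts = 0;
    for (std::size_t i = 0; i < particles.size(); ++i) {
        for (std::size_t j = i + 1; j < particles.size(); ++j) {
            if (overlap(particles[i], particles[j]) > 0.0f) {
                ++contacts;
            }
        }
    }
    return contacts;
}
```

The nested loop makes the quadratic cost visible: doubling the number of particles roughly quadruples the number of overlap tests.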

A more efficient approach is to insert the particles into a spatial data structure to eliminate as many potential contact candidates as possible. Since global memory access on GPUs is expensive, the number of read operations has to be kept to a minimum to allow faster operation. The simplest data structure where accessing a location is $O (1)$ is a regular grid. When the grid’s cell edge size is at least the diameter of the largest particle in the simulation, the centers of two colliding particles are less than one cell edge apart, so colliding particles can only have their center in the same cell as the particle itself or in the surrounding cells. The maximum number of particles that can exist in a single cell can be calculated based on the particles’ diameters. Thus, a worst-case size of the grid can be determined and statically allocated.

Inserting the particles into the data structure has to be done on every simulation step, since particles might move to different cells in a step. Determining the cell the particle belongs to can be done by rescaling the domain to the grid using floating-point arithmetic and then using a rounding operation. Since this operation is efficient, no inter-step spatial correlation effects have to be used to optimize it.
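A minimal sketch of this rescale-and-round step is given below. The `GridSpec` structure, the clamping of positions outside the domain and the x-fastest linearization are assumptions made for the example; the article does not prescribe them.

```cpp
#include <cmath>

// Hypothetical description of the uniform grid covering the domain.
struct GridSpec {
    float originX, originY, originZ;  // lower corner of the domain
    float cellSize;                   // edge length of a cubic cell
    unsigned nx, ny, nz;              // number of cells per axis
};

// Round a rescaled coordinate down to a cell coordinate, clamping
// positions that lie outside the domain to the nearest border cell.
inline unsigned clampCell(float scaled, unsigned n) {
    if (scaled < 0.0f) return 0;
    if (scaled >= static_cast<float>(n)) return n - 1;
    return static_cast<unsigned>(std::floor(scaled));
}

// Linear cell index of the cell containing the point (x, y, z).
inline unsigned cellIndex(const GridSpec& g, float x, float y, float z) {
    const unsigned cx = clampCell((x - g.originX) / g.cellSize, g.nx);
    const unsigned cy = clampCell((y - g.originY) / g.cellSize, g.ny);
    const unsigned cz = clampCell((z - g.originZ) / g.cellSize, g.nz);
    return cx + (cy + cz * g.ny) * g.nx;
}
```

The function contains no data-dependent loop and only short clamping branches, which is why it is cheap enough to be recomputed for every particle in every step.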

However, since the GPU operates on the data in parallel, the insertion itself is non-trivial. Every cell contains a certain number of ‘slots’ that have to be filled. When one thread is writing into the next free slot, no other thread must access this location. Since there are no global locking mechanisms, this cannot be implemented in that way.

CUDA-capable graphics cards of the second generation and later support atomic operations on global memory. These allow keeping a usage counter for each cell and doing an atomic increment on it. The thread first increments the counter, then writes the particle to the position the counter previously pointed to.
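The counter-based insertion can be illustrated on the host with `std::atomic`, whose `fetch_add` returns the previous value in the same way as an atomic add on the device. The sketch below is an analogue under assumptions, not the device code: `SlotGrid`, `insertParticle` and the fixed capacity `kSlotsPerCell` are hypothetical names and values.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Assumed worst-case number of particles per cell (illustrative value).
constexpr unsigned kSlotsPerCell = 4;

struct SlotGrid {
    std::vector<std::atomic<unsigned>> counters;  // one usage counter per cell
    std::vector<unsigned> slots;                  // kSlotsPerCell entries per cell

    explicit SlotGrid(std::size_t cells)
        : counters(cells), slots(cells * kSlotsPerCell, 0u) {
        for (auto& counter : counters) counter.store(0u);
    }
};

// Reserve the next free slot of a cell and store the particle index in it.
// Returns false when the cell is already full.
inline bool insertParticle(SlotGrid& grid, unsigned cell, unsigned particleId) {
    // fetch_add returns the value before the increment, so two threads
    // inserting into the same cell can never receive the same slot.
    const unsigned slot = grid.counters[cell].fetch_add(1u);
    if (slot >= kSlotsPerCell) return false;
    grid.slots[static_cast<std::size_t>(cell) * kSlotsPerCell + slot] = particleId;
    return true;
}
```

Because the reservation and the write are separate operations, the slots of a cell may only be read after all insertions of the step have finished.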

However, Nvidia implemented this approach for the CUDA example code called ‘particle’ and came to the conclusion that an alternative algorithm without atomic operations is more efficient. It uses a standard parallelized sorting algorithm (radix sort) provided by the CUDPP library instead.

First, for every particle, the cell it shall belong to is calculated (by using the previously mentioned method, and then converting that location to an integer cell index). Then, the particles are sorted by that cell index. For the third step, a separate list is required with the size of one integer times the number of cells, called the cell array. First, this list is initialized to a value denoting that the cell is empty. Then, the sorted particle list is iterated in parallel. In this list, particles belonging to the same cell are next to each other (due to the sorting). Whenever a boundary between two cells is detected (by comparing the current cell index and the cell index of the next particle), the index in the sorted particle array is stored to the cell array at the cell’s location. This way, looking up the particles in a cell requires a lookup in that cell array, and then this index can be used to get the first particle from the sorted particles list, and all subsequent particles, until the cell index does not match any more.
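The three steps above can be sketched sequentially on the host as follows. Several details are assumptions for the sake of a compact example: `std::sort` stands in for the parallel radix sort, the boundary is detected by comparing each entry with its predecessor rather than its successor, and the names `CellEntry`, `buildCellStart`, `particlesInCell` and `kEmptyCell` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Marker for cells that contain no particle (assumed sentinel value).
constexpr unsigned kEmptyCell = 0xffffffffu;

// (cell index, particle index) pair; sorting orders by cell index first.
using CellEntry = std::pair<unsigned, unsigned>;

// Sort the entries by cell and record, for every non-empty cell,
// the position of its first entry in the sorted list.
inline std::vector<unsigned> buildCellStart(std::vector<CellEntry>& entries,
                                            unsigned numCells) {
    std::sort(entries.begin(), entries.end());
    std::vector<unsigned> cellStart(numCells, kEmptyCell);
    for (std::size_t k = 0; k < entries.size(); ++k) {
        // A new cell begins wherever the cell index differs from the
        // previous entry; each iteration is independent of the others.
        if (k == 0 || entries[k].first != entries[k - 1].first) {
            cellStart[entries[k].first] = static_cast<unsigned>(k);
        }
    }
    return cellStart;
}

// Collect the particle indices stored in one cell.
inline std::vector<unsigned> particlesInCell(const std::vector<CellEntry>& entries,
                                             const std::vector<unsigned>& cellStart,
                                             unsigned cell) {
    std::vector<unsigned> result;
    const unsigned start = cellStart[cell];
    if (start == kEmptyCell) return result;
    for (std::size_t k = start; k < entries.size() && entries[k].first == cell; ++k) {
        result.push_back(entries[k].second);
    }
    return result;
}
```

Since every iteration of the boundary-detection loop writes to a different cell, the loop body maps directly onto one thread per sorted entry without any locking.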

4 Lattice Boltzmann method

The lattice Boltzmann method (LBM) is a numerical algorithm used to solve the Navier–Stokes (NS) equations. It has been demonstrated that it is possible to approximate them directly on a graphics processor using numerical integration schemes (Stam 1999), but in this section it will be demonstrated that a direct mapping of the LBM allows a very efficient and easily comprehensible (and thus, debuggable) implementation, providing a stable basis for a simulation solution.

4.1 The concept behind lattice Boltzmann

The lattice Boltzmann method directly simulates the microscopic fluid particles¹ underlying the macroscopic differential equations of NS. As such, it does not require a solver, but instead, basic iterative mathematical operations.

The LBM can be calculated in both two and three dimensions. The two-dimensional approach is preferable for simulating shallow water in real-time due to the significant reduction of computing and memory requirements. However, since the goal of this article is a physically accurate reproduction of real fluids, the three-dimensional approach is highlighted here.

The domain to be simulated has to be spatially separated into tightly packed cells. Many cell shapes are possible, like the dodecahedron (Tolke and Krafczyk, 2008), but the most basic one is the cube, which has been extensively researched, is easily visualizable and will be the focus of this article. Specifically, the D3Q19 geometry is used, since it is the most-researched variant and is a good balance between performance and accuracy.

For an explanation of the equations used in our implementation of the LBM, the reader is referred to Li (2004).

4.2 Initial state

When starting the simulation, the per-cell particle distribution $f_{i}$ has to be initialized. Given an initial pressure $ρ_{0}$ with the velocity $u_{0} = (0, 0, 0)$ , the equilibrium distribution can be used:

f_{i} = ω_{i} ρ_{0} .

When no further forces are applied, the simulation will stay in this stable configuration.

4.3 Boundaries

Since the simulation domain has to be finite, the boundaries have to be handled. This is problematic on the GPU, since boundaries need special treatment for just the cells at the edge, and branches should be avoided. One way to handle this is to not simulate the cells at the edge, and just set them to constant values. However, this poses issues with byte-alignment on fetching the data (since threads would be off-alignment by one cell with the grid). Our implementation uses the required branches for bounceback boundaries. Since this branch is only executed for a few cells compared to the number of total cells in the whole domain, the thread divergence is not significant and tends towards zero for increasing domain sizes.

4.4 Gravity and external forces

Buick and Greated (2000) outline several methods of varying complexity for integrating external forces like gravity, but the most accurate extends the Bhatnagar-Gross-Krook (BGK) collision operator by another factor:

Ω_{i} = - \frac{1}{τ} (f_{i} (x, t) - f_{i}^{e q} (x, t)) + \frac{2 τ - 1}{2 τ} \frac{3}{ω_{i}} F \cdot e_{i}

where $F$ is the force to be applied. Note that gravity itself is usually not expressed as a force, but as an acceleration (which is a simplification justified by the mass difference between the Earth and the particle it acts on). Using Newton’s second law, $F = m a$, a force can be derived.

This external force representation can also be used for interaction with solid macroparticles, which will be explained in Section 6.1.

5 Complex and moving obstacles

In order to incorporate non-cubic containers and other solids like rotors, complex obstacles have to be supported in the fluid simulator. In the work by Monitzer (2008), multiple approaches like the popular one by Mei et al. (2000) were demonstrated, but the approach by Noble and Torczynski (1998) has unique properties that lend themselves well to a GPU-based implementation. Firstly, the modification to the collision step is branch-free, so the performance impact is negligible (at least for the fluid calculation itself; more on that later in this section), and secondly, since a single voxel-based data structure is required for storing the information about the solids, it can easily be stored along with the fluid cells.

For this algorithm, a weighting function $B$ is defined as in Strack and Cook (2007)

B (ϵ, τ) = \frac{ϵ (τ - \frac{1}{2})}{(1 - ϵ) + (τ - \frac{1}{2})}

where $ϵ$ defines the amount of solid material in a single cell, ranging from $0$ (all fluid) to $1$ (all solid).

The collision step is modified to include a second collision operator² $Ω_{i}^{s}$ :

f_{i}^{n e w} (x, t) - f_{i} (x, t) = (1 - B) Ω_{i} + B Ω_{i}^{s} .

The concept here is that when a solid is in this fluid cell, some fluid particles are unaffected by this solid (as they do not collide with the object), and some are reflected using the no-slip bounceback method. The mixture depends on the amount of solid material in this cell (this was originally designed for porous materials like sand).

Holdych (2003) provides a stable equation (Strack and Cook, 2007) for the bounceback collision operator, which can be simplified to:

Ω_{i}^{s} = f_{- i} (x, t) - f_{i} (x, t) + 6 ω_{i} ρ (e_{i} \cdot u_{s}) .

As explained by Monitzer (2008, 2010), a disadvantage of this approach is that fully solid cells are still handled as fluid cells. This means that the collision and streaming steps are still applied, even when the collision result is multiplied by $0$ and the streaming step just bounces the result between two neighboring nodes. Analytically this does not cause problems, but incoming fluid molecules that cannot escape the solid cause the density in border cells to rise. When the float parameter overflows, the GPU treats the value as infinity, which is not rectified by multiplying by $0$ (which results in not-a-number). Additionally, this can cause problems when a solid moves away from a fluid cell holding high density values. By capping the density value, this can be avoided.

Holdych (2003) proposed a solution to this problem for every cell whose $ϵ > 0.95$ which applies the assumption that the cell density difference between two neighboring cells is negligible.

A fluid fraction $\bar{ϵ}$ is defined as

\bar{ϵ} = \sum_{i} (1 - ϵ_{i})

where $ϵ_{i}$ is the solid fraction $ϵ$ of the neighboring cell in direction $e_{i}$.

The fluid density for these cells is then calculated depending on $\bar{ϵ}$:

ρ = \begin{cases} \frac{1}{\bar{ϵ}} \sum_{i} (1 - ϵ_{i}) ρ_{i} & \text{if } \bar{ϵ} > 0.01 \\ 0 & \text{otherwise} \end{cases}

where $ρ_{i}$ is the density value of the cell in direction $e_{i}$.

5.1 Improving the voxelization process

Generating the voxel-based data structure for $ϵ$ requires extra attention. In the implementation shown by Monitzer (2008), the triangle-based depth-slicing voxelization process outlined by Crane et al. (2007) proves to be the major performance limitation. A different approach has to be considered to limit the performance impact.

One approach is to use parametric representations of the objects immersed in the fluid and to do the inside/outside determination analytically. However, since the algorithm by Noble and Torczynski (1998) requires knowledge about the volume occupied by the solid in only a single cube, this is non-trivial. One example of this is presented in Section 6.1.1.

A more versatile approach was presented by Eisemann and Décoret (2008). It uses a single-render pass using the regular graphics pipeline with custom shaders to generate a voxelization of the objects, including their interior.

In order to determine not only a binary inside/outside classification for every cell but a density value, Eisemann and Décoret (2008) proposed increasing the voxel density two-fold along every axis, and then looking at $2 \times 2 \times 2$ blocks of voxels, summing them up and dividing the sum by eight. This way, eight density steps can be determined. Increasing the resolution even further allows more density steps. In our implementations this has proven to be an adequate solution.
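The block-averaging step can be sketched as follows, assuming that each fluid cell covers a block of eight fine voxels (two per axis) and that the fine voxelization is available as a linear array with x as the fastest-running index. The function name and the array layout are assumptions for the example; the voxelization itself is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a binary voxelization (0 = fluid, 1 = solid) of size
// (2*nx, 2*ny, 2*nz) into per-cell solid fractions of size (nx, ny, nz).
inline std::vector<float> downsampleSolidFraction(
        const std::vector<std::uint8_t>& fine,
        unsigned nx, unsigned ny, unsigned nz) {
    const std::size_t fx = 2u * nx;
    const std::size_t fy = 2u * ny;
    std::vector<float> eps(static_cast<std::size_t>(nx) * ny * nz, 0.0f);
    for (unsigned z = 0; z < nz; ++z) {
        for (unsigned y = 0; y < ny; ++y) {
            for (unsigned x = 0; x < nx; ++x) {
                unsigned solidCount = 0;
                // Sum the eight fine voxels belonging to this coarse cell.
                for (unsigned dz = 0; dz < 2; ++dz)
                    for (unsigned dy = 0; dy < 2; ++dy)
                        for (unsigned dx = 0; dx < 2; ++dx) {
                            const std::size_t idx = (2u * x + dx)
                                + ((2u * y + dy) + (2u * z + dz) * fy) * fx;
                            solidCount += fine[idx];
                        }
                eps[x + (static_cast<std::size_t>(y) + z * ny) * nx] =
                    static_cast<float>(solidCount) / 8.0f;
            }
        }
    }
    return eps;
}
```

An axis-aligned sheet that is one fine voxel thick fills four of the eight voxels in a block and therefore yields a solid fraction of 0.5 for the affected cells.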

Note that thin objects like sheets of paper might cause fluid to trickle through when using sub-voxel precision, since the end result would be $50\%$ solid in those cells (when the solid is an axis-aligned plane), and thus this kind of object should either be avoided or should receive special treatment.

6 Implementation on the GPU using a new data structure layout

Nvidia provides sample code for DEM called ‘particle’ as part of the CUDA SDK, which has been optimized for the GPU and so is used as a basis for the implementation described here. The program has been extended by adding the calculation of rotational velocity for the particles, as outlined by Radeke (2006).

When implementing the LBM on a GPU, the first focus should be to look at the data structures used. The following information is known per cell:

The velocity u as a vector

The pressure $ρ$

The fluid distribution $f_{0} \dots f_{18}$

The solid fraction

All except the velocity can be represented by single floating-point numbers. In traditional scientific simulations, double precision is used, but this is not supported by the graphics architecture available to the authors (the new Fermi architecture does support it). However, Tolke and Krafczyk (2008) explain how to avoid the downsides of single precision, which has the side-effect of allowing simulation domains that are double the size (since the data structures occupy half the memory) and improving performance, since only half the amount of data has to be transferred for every element.

Since the old data is accessed in parallel while the new simulation data is being written, all data structures have to be duplicated, one collection storing the information of the last step and one storing the information of the new step. After finishing the calculation, the two collections are swapped. This technique is known as flip-flopping.
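The swap of the two collections can be illustrated with a small host-side sketch. The `PingPong` structure and the `runStep` helper are hypothetical names; on the graphics card the two buffers would be device allocations whose pointers are exchanged, which is assumed here to behave like the constant-time `std::swap` of two vectors.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Two equally sized buffers: one is only read, the other only written.
struct PingPong {
    std::vector<float> previous;  // state of the last step (read-only)
    std::vector<float> next;      // state of the new step (write-only)

    explicit PingPong(std::size_t n) : previous(n, 0.0f), next(n, 0.0f) {}

    // Exchange the roles of the buffers; no element is copied.
    void swapBuffers() { std::swap(previous, next); }
};

// Run one simulation step and make its output the input of the next one.
// `step` is any callable taking (const input buffer, output buffer).
template <typename Step>
void runStep(PingPong& buffers, Step step) {
    step(buffers.previous, buffers.next);
    buffers.swapBuffers();
}
```

Since the step never writes to the buffer it reads from, the order in which the cells are processed does not affect the result, which is what makes the step safe to execute in parallel.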

However, $u$ and $ρ$ are derived values based on $f_{0} \dots f_{18}$ , which means that they do not have to be stored in permanent memory. For the equations introduced in Section 5 this is not true, but flip-flopping is not required, since they are not read in the same kernel as they are written.

This technique allows handling the whole operation in GPU memory, so no device-to-host or host-to-device copies are necessary, except for recording the results in permanent storage when necessary (e.g. for replaying the simulation afterwards).

Monitzer (2008, 2010) uses the data structures introduced by Li (2004), which split the distribution values into blocks of four into textures containing red, green, blue and alpha channels (RGBA) in a way to allow implementing bounceback accessing only a single texture. However, since we are using CUDA, the order of the $f$ -values is irrelevant, but other considerations are required, which will be outlined later in this section.

The streamlined algorithm introduced in this article can be broken down into the following pieces:

for every (x,y,z) in the grid do
    // load values into local registers
    for i = 0..18 do
        f_i = f[grid2index(x,y,z,i)];

    // calculate the cell’s macroproperties
    rho = sum(f_i);
    velocity = sum(f_i * e_i) / rho; // note that e_i is a 3-component vector!

    // calculate the force transferred from the particles to the fluid
    particleF = (0, 0, 0);
    startIndex = index of the first particle in the cell x,y,z in the sorted DEM array
    if(cell at startIndex not empty) {
        endIndex = index of the last particle in the cell x,y,z in the sorted DEM array
        for every particle between startIndex and endIndex do
            particleF += (velocity - particleVelocity) * particleRadius^2;
    }
    particleF *= 100 * -ParticleFluidCoupling * rho * pi * 6/pi;
    // the value 100 above is an experimental constant described in the text below
    extF = particleF + gravity; // the final external force

    // collide
    for i = 0..18 do
        f_i -= bgk(f_i, velocity, rho, e_i, extF, dt);

    // stream with bounceback boundary and output to global memory
    f[grid2index(x,y,z,0)] = max(0, f_0);
    if(x < dimx - 1)
        f[grid2index(x+1,y, z, 1)] = max(0, f_1);
    else
        f[grid2index(x, y, z, 2)] = max(0, f_1);
    if(x > 0)
        f[grid2index(x-1,y, z, 2)] = max(0, f_2);
    else
        f[grid2index(x, y, z, 1)] = max(0, f_2);
    if(y < dimy - 1)
        f[grid2index(x, y+1,z, 3)] = max(0, f_3);
    else
        f[grid2index(x, y, z, 4)] = max(0, f_3);
    if(y > 0)
        f[grid2index(x, y-1,z, 4)] = max(0, f_4);
    else
        f[grid2index(x, y, z, 3)] = max(0, f_4);
    // etc. for all i=0..18

function bgk(f, velocity, rho, e, extF, dt)
    return dt / tau * (f - feq(velocity, rho, e)) - (1 - 1 / (2*tau)) * 3/omega(e) * <extF, e>;

function feq(velocity, rho, e)
    return omega(e) * rho * (1 - 3/2 * <velocity,velocity> + 3*<e,velocity> + 9/2 * <e,velocity>^2);

Here, $⟨ a, b ⟩$ is the standard inner product of two vectors. The variables f_i, rho and velocity are local variables, to be stored in registers by the compiler. The values dimx, dimy and dimz are the dimensions of the grid. The function omega(e) returns the constant $ω_{i}$ belonging to the direction e = e_i, as defined in the LBM.

In the particle–fluid interaction described above, it is assumed that every particle is exactly in a single cell. In reality, this is not always the case. Thus, a constant 100 (which depends on the grid size compared to the particle size and the time-step size) is introduced, which is used to counteract this discrepancy. A more accurate simulation would have to replace this constant by more accurate calculations, as described in Section 6.1.1.

Note that the outer loop is the basic kernel invocation in CUDA and not an explicit for-loop. Since the streaming and the collision step are integrated into a single kernel, all intermediate values can be kept in processor registers, and so only minimal access to memory is required.

However, GPUs do not employ any caching when reading from or writing to global memory, and so these operations still come with a significant speed penalty. The above algorithm requires a scattered read of 19 float values, which can impact performance in a significant way. As an optimization, when using n arrays each holding one $f_{i}$ with $0 \leq i < n$ per cell, these read operations can be coalesced automatically by CUDA (since the memory addresses are all adjacent), resulting in one 128 bit fetch for every four 32 bit fetches.³ Further optimization is not possible, since all values have to be loaded from global memory in some way.

The final memory layout used in the implementation is shown in Figure 1. It uses a single allocation block to remove unnecessary memory allocation overheads and can easily be addressed using the following function (suitable for both the host and device in CUDA):

Figure 1. The final memory layout used for the implementation presented in this article. The structure is a single linear array of floating-point values, split into blocks here to improve readability. The dashed lines indicate where elements were left out of the visualization.

unsigned grid2index(unsigned x, unsigned y, unsigned z, unsigned i) {
    return x + ((y + z * dy) * dx) + (dx * dy * dz * i);
}

The constants dx, dy and dz indicate the grid size in the corresponding dimension. They were stored in constant memory, since they do not change during the lifetime of the application.
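To make the properties of this layout concrete, the following host-side variant passes the grid size explicitly instead of reading it from constant memory. The `GridDims` structure and the `requiredFloats` helper are hypothetical additions for the example; the index arithmetic is the same as in the function above.

```cpp
// Grid size in cells; stands in for the constants dx, dy and dz.
struct GridDims {
    unsigned dx, dy, dz;
};

// Same arithmetic as grid2index above, with the grid size as a parameter.
inline unsigned grid2index(const GridDims& g, unsigned x, unsigned y,
                           unsigned z, unsigned i) {
    return x + ((y + z * g.dy) * g.dx) + (g.dx * g.dy * g.dz * i);
}

// Number of floats in one copy of the distribution data (q values per cell).
inline unsigned requiredFloats(const GridDims& g, unsigned q = 19) {
    return g.dx * g.dy * g.dz * q;
}
```

For a fixed direction i, cells that are neighbors along x map to consecutive addresses, which is the precondition for coalesced access by threads with consecutive x coordinates. Each direction occupies its own contiguous block of dx * dy * dz floats.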

Note that two of these arrays are necessary for the flip-flopping technique mentioned earlier in this section.

6.1 Rigid object handling variations

Handling solid objects was explained in Section 5. However, the macroscopic particles used by the DEM primarily require the fluid to apply a force to them, which is not implemented by the solid collision operator. The force could be calculated by taking into account how many fluid particles were bounced back via the $Ω^{s}$ operator; however, the form of the particles allows us to take them into account from a macroscopic perspective.

The force of aerodynamic drag is classically defined as

F = \frac{1}{2} ρ v^{2} A C_{d}

where $v$ is the relative velocity of the particle compared to the fluid, $A$ is the area of the particle ($4 π r^{2}$ for a sphere) and $C_{d}$ is the drag coefficient (which depends on the form of the object).
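A vector form of this drag law is sketched below. It follows the text in using $4 π r^{2}$ as the area of a spherical particle and assumes that the force on the particle points along the velocity of the fluid relative to the particle. The `Vec3` type, the function name and the treatment of the drag coefficient as a plain parameter are assumptions for the example.

```cpp
#include <cmath>

// Minimal 3-component vector for the example.
struct Vec3 {
    float x, y, z;
};

// Drag force on a spherical particle: magnitude 1/2 * rho * |v|^2 * A * C_d,
// directed along the relative velocity v = fluidVel - particleVel.
inline Vec3 dragForce(Vec3 fluidVel, Vec3 particleVel, float rho,
                      float radius, float dragCoeff) {
    const float kPi = 3.14159265358979f;
    const Vec3 v{fluidVel.x - particleVel.x,
                 fluidVel.y - particleVel.y,
                 fluidVel.z - particleVel.z};
    const float speed = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    const float area = 4.0f * kPi * radius * radius;  // area as used in the text
    // Multiplying |v| by the vector v gives |v|^2 times the unit direction.
    const float scale = 0.5f * rho * speed * area * dragCoeff;
    return Vec3{scale * v.x, scale * v.y, scale * v.z};
}
```

Negating the returned vector gives the reaction force that the particle exerts on the fluid.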

According to Newton, for every force applied there is an equal force applied in the opposite direction. Thus, the interaction between the particles and the fluid can be calculated by taking the force calculated by Equation (9), inverting it and applying it as an external force in Equation (4).

This approach has certain limitations; it only works in this way when every particle interacts with no more than one fluid cell (the reverse does not have to be the case). However, depending on the lattice size relative to the particle size, it is more likely that a particle overlaps with multiple fluid cells. In this case, $A$ and $C_{d}$ from Equation (9) would have to be adapted. As such, a different approach as reported by Monitzer (2008, 2010) could lead to a simplified implementation.

In experiments it was possible to use this drag approach to create a working simulation. However, since every particle only affects a single fluid cell, the force applied to the fluid dissipates too quickly. A workaround is to multiply the force applied by the particle to the fluid by a constant factor (a factor of $100$ was used in the experiments, but note that a suitable value depends on the ratio between the particle radius and the cell size). This leads to a visually plausible simulation, but has no correlation with real-world effects.

6.1.1 Approaching an exact calculation of subcell solid fills

Ultimately, the approach of Noble and Torczynski (1998) is needed for larger solid objects.

The challenge in using this approach is finding the $ϵ$ value of a cell. For spherical particles, the volume of the intersection of the particle sphere and the fluid cell has to be found. Han et al. (2007) touch on the subject and mention determining the ratio in an implementation for two dimensions. Our approach for three dimensions is to treat the fluid cell as a volume bounded by six planes, which potentially intersect the sphere.

To calculate the volume of this intersected sphere, the traditional equations $A = π r^{2}$ for the area of a disk and $V = \frac{4}{3} π r^{3}$ for the volume of a sphere cannot be used directly. Instead, it is necessary to go one step back to the integrals from which they are derived. The area of a disk with radius $r$ is

A (r) = \int_{- r}^{r} 2 \sqrt{r^{2} - x^{2}} d x

which follows from the Pythagorean theorem: the height of the disk (its chord length) at distance $x$ from the center is $2 \sqrt{r^{2} - x^{2}}$.

When bounding the integral not to the range $[- r, r]$ but to a range limited by the distance of two of the six planes from the center of the sphere, the area of the disk limited by these two planes can be derived.

Since the six planes are orthogonal to each other this is a sufficient first step. In the axis orthogonal to the primary axis on the disk’s plane, two more planes (whose distances to the sphere’s origin are denoted by $p_{1}$ and $p_{2}$ ) might intersect. These can be taken into account by

A (r) = \int_{a}^{b} max (p_{1}, min (p_{2}, 2 \sqrt{r^{2} - x^{2}})) d x

where $a$ and $b$ are the values determined as described in the previous paragraph.

Now the area of the disk (as a slice of the sphere) is known, and four of the six planes are taken into account. The general equation for the volume of a sphere is

V (r) = \int_{- r}^{r} A (\sqrt{r^{2} - x^{2}}) d x

where $A$ is the area of a disk as given in Equation (11). This integral can also be bounded not by $[- r, r]$ but by the final two intersecting planes. When assuming the distances $p_{1} \dots p_{6}$ from the planes to the center of the sphere (assuming that they are ordered in the proper way to act as an upper and lower boundary), the full equation is

\begin{array}{c} i_{1} = min (r, p_{6}) \\ a_{1} = max (- r, p_{5}) \\ i_{2} = min (r, p_{4}) \\ a_{2} = max (- r, p_{3}) \end{array}

V (r) = \int_{a_{1}}^{i_{1}} \int_{a_{2}}^{i_{2}} max (p_{1}, min (p_{2}, 2 \sqrt{r^{2} - x^{2} - y^{2}})) d y d x .

As the integrand is not smooth due to the $min$ and $max$ operations, no implementable formulation of the intersection volume could be found, and further research is necessary. The integral could be approximated by a sum, but this would be an iterative approach and thus inefficient in addition to being inexact.

These intersection volumes would have to be calculated for each particle intersecting the fluid cell, added up and then divided by the volume of the cell ( $δ x^{3}$ ) to get the $ϵ$ -value required for the approach by Noble and Torczynski (1998).
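
Assuming the per-particle intersection volumes are already available, the final step can be sketched as follows. The function name is hypothetical, and clamping the result to $1$ is an assumption of this sketch (slightly overlapping DEM particles could otherwise push the sum above the cell volume).

```c
#include <assert.h>
#include <stddef.h>

/* Solid fraction of one fluid cell: the intersection volumes of all
   particles overlapping the cell are summed and divided by the cell
   volume delta_x^3. */
static double solid_fraction(const double *volumes, size_t count,
                             double delta_x) {
    assert(delta_x > 0.0);
    double total = 0.0;
    for (size_t k = 0; k < count; ++k)
        total += volumes[k];
    double eps = total / (delta_x * delta_x * delta_x);
    return eps > 1.0 ? 1.0 : eps;   /* assumption: never more than full */
}
```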

7 Results

Time measurements were taken using the Linux-specific function clock_gettime. The first 4000 simulation steps are skipped, then 1000 steps are measured and the average is printed out. This is done to avoid taking the overhead of the initial configuration into account, which randomly distributes the DEM particles in the domain, thus causing an unpredictable number of collisions.
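
The measurement scheme can be sketched as follows. The callback `step`, the helper names and the choice of `CLOCK_MONOTONIC` are assumptions made for illustration; the text does not state which clock was used.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <time.h>

/* Milliseconds elapsed between two timestamps. */
static double elapsed_ms(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) * 1e3
         + (end.tv_nsec - start.tv_nsec) * 1e-6;
}

/* Runs `warmup` unmeasured steps, then `measured` timed steps, and
   returns the average time per measured step in milliseconds. `step`
   is a hypothetical callback standing in for one simulation update. */
static double average_step_ms(void (*step)(void *), void *state,
                              int warmup, int measured) {
    assert(step && warmup >= 0 && measured > 0);
    struct timespec start, end;
    for (int s = 0; s < warmup; ++s)
        step(state);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int s = 0; s < measured; ++s)
        step(state);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(start, end) / measured;
}

/* Trivial stand-in step that only counts how often it was called. */
static void count_step(void *state) {
    ++*(int *)state;
}
```

A call such as `average_step_ms(count_step, &calls, 4000, 1000)` mirrors the 4000 skipped and 1000 measured steps. Since CUDA kernel launches return asynchronously, a real step would also have to wait for the device to finish before the clock is read.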

Only the time difference between successive calls of the update method is measured, so the rest of the application, including visualization, is included in these figures.

The testing equipment used was an Intel Core2 Quad running at 2.83 GHz on Ubuntu Linux with an Nvidia GeForce GTX 295 (also running the display) with 896 MB RAM. The application was compiled in 64-bit mode for CUDA 2.3.

Different domain sizes and particle numbers were tested; the results are shown in Table 1. The visual result can be seen in Figures 2 and 3.

Figure 2.

The implementation simulating a $128 \times 128 \times 128$ domain with 120 000 particles.

Figure 3.

The implementation simulating a $128 \times 128 \times 128$ domain with 1000 particles, to show the velocity texture of the fluid simulation.

Table 1.

The measurement results. The top row shows the number of particles, the leftmost column shows the fluid and particle domain size. The results are in milliseconds per simulation step.

Domain size	1	100	1000	5000	10 000	50 000	100 000	200 000
$32 \times 32 \times 32$	3.80	3.97	4.06	4.50	5.04	11.06	18.96	38.68
$64 \times 32 \times 32$	3.87	4.01	4.13	4.52	4.98	10.17	16.44	32.12
$64 \times 64 \times 32$	4.16	4.30	4.41	4.81	5.24	9.99	15.42	29.44
$64 \times 64 \times 64$	4.64	4.79	4.89	5.28	5.71	10.11	15.49	27.74
$128 \times 64 \times 64$	6.20	6.32	6.41	6.84	7.29	11.75	17.36	28.18
$128 \times 128 \times 64$	8.80	8.94	9.04	9.43	9.86	14.18	19.55	31.09
$128 \times 128 \times 128$	13.77	13.86	13.95	14.37	14.82	19.25	24.48	44.04

Since two simulations with different performance characteristics are combined, the measurements are non-monotonic in the domain size. On the one hand, the fluid simulation processes every cell of the domain independently, so the number of cells directly influences the performance. On the other hand, the particle simulation does not process every cell; it only processes every particle. For every particle, all other particles in neighboring cells have to be iterated, which means that a larger number of cells with a smaller cell size (and thus fewer particles per cell) reduces the amount of computation required. To separate the two effects, measurements with a single particle were carried out in order to profile the LBM implementation without influences from the DEM code (since no collisions are possible with a single particle, and no neighbors have to be iterated), while still keeping the memory and inter-simulation overhead.
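
The effect of the cell count on the DEM part can be illustrated with a rough estimate. The sketch assumes that the particles are spread uniformly over the domain and that each particle inspects its own cell plus the 26 surrounding cells; neither assumption is stated in the text, and the function name is hypothetical.

```c
#include <assert.h>

/* Expected number of neighbor candidates per particle when `particles`
   particles are spread uniformly over nx*ny*nz cells and each particle
   inspects a 3 x 3 x 3 block of cells (assumed stencil). */
static double candidates_per_particle(double particles, unsigned nx,
                                      unsigned ny, unsigned nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    double cells = (double)nx * ny * nz;
    return 27.0 * particles / cells;
}
```

For 200 000 particles this gives about 165 candidates per particle in a $32 \times 32 \times 32$ domain, but fewer than 3 in a $128 \times 128 \times 128$ domain. In the actual simulation the particles settle at the bottom of the domain, so the real numbers will differ from this uniform estimate.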

Additionally, the DEM simulation without the LBM simulation was profiled; the results can be seen in Table 2. Here, only the simulation call itself was removed; the memory layout and initialization code remain the same. Note that the two tables cannot be directly compared: without the fluid, the particles fall to the bottom of the domain faster, so there are more contacts than with the fluid simulation enabled.

Table 2.

The measurement results with the fluid simulation disabled. The top row shows the number of particles, the leftmost column shows the fluid and particle domain size. The results are in milliseconds per simulation step.

Domain size	1	100	1000	5000	10 000	50 000	100 000	200 000
$32 \times 32 \times 32$	3.55	3.69	3.84	4.28	4.98	11.84	20.75	172.59
$128 \times 128 \times 128$	4.62	4.72	4.82	5.19	5.63	9.79	14.85	37.96

7.1 Interpretation

The graphical representations for the combined simulation can be seen in Figures 4 and 5. The performance penalty for an increasing number of particles is non-linear, as can be expected (since the number of contacts increases non-linearly, which is a defining criterion for the performance of a DEM implementation). Of special importance is that a small domain is faster for a small number of particles, but this advantage is reversed when the number of particles increases.

Figure 4.

The measurements from Table 1 displayed as a linear plot. The horizontal axis is the number of particles, the vertical axis is the execution time. The cell counts are shown in the legend.

Figure 5.

The measurements from Table 1 displayed as a linear plot (transposed relative to Figure 4). The horizontal axis is the number of cells in the domain, the vertical axis is the execution time. The particle counts are shown in the legend.

As can be seen in Figure 5, the non-linearity becomes more pronounced as the number of particles increases. This stems from the fact that the LBM execution time is linear in the number of cells, so the deviation from a straight line is contributed by the DEM. As before, this figure also shows that a smaller domain with a large number of particles hinders performance.

In terms of scaling to larger domains, Figure 6 shows that the number of cells processed per unit time grows roughly logarithmically with the domain size. However, the memory limit is more important in this regard, as all of the simulation data has to be stored in the graphics memory, and domains larger than $128 \times 128 \times 128$ could not be simulated on the test equipment.
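
The memory demand of the fluid lattice alone can be estimated with a short sketch. The number of distribution functions per cell (`q`) and the size of each value are parameters here, because the values used below (19 distributions, 4-byte floats) are assumptions rather than figures taken from the measurements. The estimate covers only the distribution functions, not the particle data or visualization buffers.

```c
#include <assert.h>

/* Bytes needed for the LBM distribution functions: q values per cell,
   value_size bytes each, stored `copies` times (two copies for the
   flip-flopping technique). */
static unsigned long long lattice_bytes(unsigned nx, unsigned ny,
                                        unsigned nz, unsigned q,
                                        unsigned value_size,
                                        unsigned copies) {
    assert(q > 0 && value_size > 0 && copies > 0);
    return (unsigned long long)nx * ny * nz * q * value_size * copies;
}
```

Under these assumptions, a $128 \times 128 \times 128$ lattice needs 304 MiB, while doubling every dimension to $256 \times 256 \times 256$ would need 2432 MiB, well beyond the 896 MB of the test card.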

Figure 6.

The measurements from Table 1 transformed so that the number of cells per millisecond is shown as a measure of performance. The particle counts are shown in the legend.
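
The transformation behind this figure amounts to dividing the number of cells by the measured time per step. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Throughput measure: lattice cells processed per millisecond. */
static double cells_per_ms(unsigned nx, unsigned ny, unsigned nz,
                           double ms_per_step) {
    assert(ms_per_step > 0.0);
    return (double)nx * ny * nz / ms_per_step;
}
```

Applied to the single-particle column of Table 1, this gives roughly 8 600 cells per millisecond for the $32 \times 32 \times 32$ domain (3.80 ms) and roughly 152 000 cells per millisecond for the $128 \times 128 \times 128$ domain (13.77 ms).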

Figure 7 displays the same information for the simulations with the LBM step turned off. It displays the same trend in a more pronounced fashion, which stems from the fact that the linear time increase of the LBM is removed.

Figure 7.

The measurements from Table 2 displayed as a linear plot. The horizontal axis is the number of particles, the vertical axis is the execution time. The legend shows the number of cells.

8 Conclusion

It has been demonstrated that combining a DEM simulation with an LBM simulation is possible using standard methods from both scientific fields, with a simple interface between them. This allows a full physics simulation on a graphics processor, reducing the data transfer necessary between the host and the graphics card, which is usually the limiting factor for graphics processor performance.

The LBM implementation was improved compared to the implementation by Monitzer (2008, 2010) by leveraging the thread scheduler’s optimization capabilities in the area of memory access patterns. Furthermore, a simplified solid interaction model was implemented, which removes the requirement for an expensive voxelization process. Additional options for integrating more accurate interactions were explored. A full implementation was shown and explained, which demonstrates that the LBM is easily integrated into the CUDA model.

As was demonstrated in Section 7, the LBM simulation time increases linearly with the domain size, while the DEM simulation time exhibits a global minimum whose position depends on the number of particles. Using separate cell sizes for the LBM and the DEM is therefore recommended, in order to achieve optimal performance for both simulations.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

Buick and Greated (2000) Gravity in a lattice Boltzmann model. Physical Review E 61(5): 5307–5320.

Crane, Llamas and Tariq (2007) Real-time simulation and rendering of 3D fluids. In: GPU GEMS 3. New York: Addison Wesley, chapter 30.

Cundall and Strack ODL (1979) A discrete numerical model for granular assemblies. Geotechnique 29(1): 47–65.

Eisemann and Décoret (2008) Single-pass GPU solid voxelization for real-time applications. In: 34th graphics interface conference (GI ’08), Windsor, Canada, 28–30 May 2008, pp. 73–80. Toronto: Canadian Information Processing Society.

Han, Feng and Owen DRJ (2007) Coupled lattice Boltzmann and discrete element modelling of fluid–particle interaction problems. Computers and Structures 85: 1080–1088.

Harada (2007) Real-time rigid body simulation on GPUs. In: GPU GEMS 3. New York: Addison Wesley, chapter 29.

Holdych (2003) Lattice Boltzmann methods for diffuse and mobile interfaces. PhD Thesis, University of Illinois at Urbana, USA.

(2004) Accelerating simulation and visualization on graphics hardware. PhD Thesis, Stony Brook University, USA.

Mei, Shyy and Luo (2000) Lattice Boltzmann method for 3-D flows with curved boundary. Journal of Computational Physics 161: 680–699.

Monitzer (2008) Fluid rendering on the GPU with complex obstacles using the lattice Boltzmann method. MSc Thesis, Vienna University of Technology, Austria.

Monitzer (2010) Fluid simulation with CUDA using the lattice Boltzmann method. In: 1st international workshop on GPUs and scientific applications (GPUScA’10), Vienna, Austria, 11 September 2010, pp. 35–42. New York: ACM Press.

Noble and Torczynski (1998) A lattice-Boltzmann method for partially saturated computational cells. In: 7th international conference on the discrete simulation of fluids, Oxford, UK, 14–18 July 1998.

Radeke (2006) Statistische und mechanische analyse der kräfte und bruchfestigkeit von dicht gepackten granularen medien unter mechanischer belastung. PhD Thesis, TU Bergakademie Freiberg, Germany.

Radeke, Glasser and Khinast (2010) Large-scale mixer simulations using massively parallel GPU architectures. Chemical Engineering Science 65(24): 6435–6442.

Stam (1999) Stable fluids. In: 26th international conference on computer graphics and interactive techniques (SIGGRAPH ’99), Los Angeles, USA, 8–13 August 1999, pp. 121–128. New York: Addison Wesley.

Strack and Cook (2007) Three-dimensional immersed boundary conditions for moving solids in the lattice-Boltzmann method. International Journal for Numerical Methods in Fluids 55: 103–125.

Tolke and Krafczyk (2008) Teraflop computing on a desktop PC with GPUs for 3D CFD. International Journal of Computational Fluid Dynamics 22(7): 443–456.