A framework for GPU accelerated deformable object modeling

Abstract

We have developed a framework that uses multicore CPUs and GPUs found on personal computers to accelerate the computations needed for a class of deformable object modeling algorithms. In recent years there has been a growing interest in using deformable objects in computer applications such as animation, video games, garment CAD, and surgical simulation. Deformable object modeling is quite expensive computationally. However, since most of the related calculations can be parallelized, we have developed a framework that utilizes NVIDIA’s CUDA technology to accelerate a set of deformable object modeling algorithms by transferring their core computations to the GPU. Our results show that frame rates can be improved more than 20 times using GPU compared with using a multicore CPU. In addition, we have developed a method called Local Shape Matching which is an extension to the Shape Matching method. Using this new method we have achieved fast and robust simulations whose implementations have good numerical stability.

Keywords

CUDA; deformable object modeling; GPGPU; GPU programming

1 Introduction

GPUs are becoming a natural platform for computationally demanding tasks in a wide variety of non-graphical, scientific computing application domains. This is due to the increased performance of graphics hardware and to recent improvements in its programmability.

Even though CPUs have evolved so much and their price has declined significantly, commodity GPUs are delivering better performance with respect to cost for parallelizable computations.

Not only is current graphics hardware fast, but its performance is also growing faster than that of CPUs. Semiconductor capability, driven by advances in fabrication technology, is increasing at the same rate for both CPUs and GPUs. The performance of graphics hardware is increasing faster than that of CPUs because of the scaled enhancement given by its higher parallelism. CPUs are optimized for high performance in sequential computing; therefore, many of their transistors are dedicated to supporting non-computational tasks such as branch prediction and caching. On the other hand, the highly parallel nature of graphics computations enables GPUs to use additional transistors for computation, achieving higher arithmetic intensity with the same transistor count (Owens et al. 2007).

Owing to the high parallelism that exists in the physical simulation of deformable bodies, we can use the GPU to reduce the calculation time. GPUs have been used for general-purpose computing (GPGPU) for several years. However, since they were previously designed specifically for rendering and rasterizing, programmers had to use complicated tricks in order to take advantage of their stream processing architecture. Nevertheless, efforts have been made to perform the computations of deformable object modeling on GPUs using shaders. Georgii and Westermann (2005) implemented a mass–spring system on the GPU. Even the nonlinear finite element method has been implemented on the GPU (Taylor et al. 2008). However, since GPU hardware and shaders were not designed for this kind of operation, coding was complicated. NVIDIA’s CUDA¹ (NVIDIA 2008) GPGPU technology is a fundamentally different computing architecture for solving complex computational problems. CUDA supports development in a high-level language, further enhancing its popularity.

In the next section we provide background on deformable object modeling and review the literature on its GPU implementation. In Section 3 we explain the four methods that are implemented in our framework and introduce the Local Shape Matching (LSHM) method. In Section 4 we explain our GPU-based framework for deformable object modeling. In Section 5 we present our results, and we conclude in the last section.

2 Background

In this paper we extend the preliminary study that was presented in Shahingohar and Eagleson (2010) and provide comprehensive results and analysis. We begin by reviewing the literature and motivating our selection of the set of deformable object modeling algorithms that will be implemented comparatively in this paper.

Mass–spring models have been popular in computer animation for over 20 years. At SIGGRAPH 87, John Lasseter shocked the computer graphics industry by presenting Luxo Jr., his first 3D animation, which was produced at Pixar (Lasseter 1987). His ideas allowed animators to extend traditional 2D storyboarding techniques, keyframe animation, ‘inbetweening’, and scan/print to the 3D realm. At the same venue Terzopoulos et al. (1987) presented the first paper on physically based animation. They employed elasticity theory to construct differential equations to model the behavior of non-rigid curves, surfaces, and solids as a function of time. Since then, many researchers have taken advantage of various scientific concepts in the animation of deformable objects, hair, cloth and fluids (Gibson and Mirtich 1997; Nealen et al. 2006).

2.1 Physically based methods

Most physically based methods for modeling deformable objects are based on continuum elasticity. Under the assumptions of continuum mechanics, the behavior of a deformable object can be expressed as follows. Suppose that the rest shape of a continuous deformable object is a subset $Ω$ of $R^{3}$. Here $Ω$ consists of all of the particles with positions $x_{0} = [x, y, z]^{T}$, where $x_{0} \in Ω$ denotes the material coordinate of a point in the rest position. When a force is applied, the shape deforms. We suppose that a point at location $x_{0}$ moves to a new location with a displacement denoted by $u = [u, v, w]^{T}$, such that the new location of particle $x_{0}$ is $x = u + x_{0}$. Usually a measure of deformation (strain) is defined, and at each iteration of the simulation internal forces are calculated such that linear and angular momentum are preserved or the strain energy stored in the object is minimized. Once these forces are calculated, displacements are obtained by integration.

2.2 Non-physically based methods

Although physically based deformable object modeling methods are generally based on simple continuum elasticity, precise solutions to these methods cannot be computed in real time. In addition, approximations can make the simulations either very inaccurate or numerically unstable. Therefore, non-physically based methods that are based on simplifying smoothness constraints are still an attractive choice. The mass–spring method (MSM) and the shape matching technique (Müller et al. 2005) are two examples of non-physically based methods. For a complete review of deformable object modeling, see Nealen et al. (2006).

2.3 NVIDIA CUDA and AMD FireStream

In 2006, ATI launched FireStream as the industry’s first commercially available hardware stream processing solution. At first it was designed as a virtual machine abstraction for GPUs that provided policy-free, low-level access to the hardware designed for high-performance, data-parallel applications (Peercy et al. 2006). In the same year AMD acquired ATI and re-branded the API as the AMD Stream Processor, but the name was changed to AMD FireStream in 2007. NVIDIA launched their own GPU parallel computing architecture, CUDA, in 2007. CUDA gives developers access to the native instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs effectively become open architectures as programmable as CPUs. CUDA and FireStream both provide similar functionality; however, CUDA seems to have been more widely adopted by the scientific community. It has been shown that typical applications such as traffic simulation, thermal simulation, and $k$-means can be accelerated using the GPU, demonstrating as high as 40 times speedup when compared with a CPU implementation (Che et al. 2008). Hence, researchers have used CUDA stream processing for a variety of applications.

Recently, there has been a growing interest toward cross-platform GPU computation. OpenCL^TM is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices.² OpenCL is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL is managed by the non-profit technology consortium Khronos Group and is being developed in collaboration with technical teams at Apple, AMD, IBM, Intel, and NVIDIA. OpenCL is still in its early stages of development but it is highly anticipated that it is going to become the main approach for cross-platform stream processing on the GPU.

CUDA has been used for accelerating some existing deformable object modeling techniques as well. Rasmusson et al. (2008) investigated multiple implementations of volumetric mass–spring–damper systems in CUDA. They compared the performance of CUDA with previous implementations utilizing the GPU through the OpenGL graphics API and showed that performance and optimization strategies differ widely between the OpenGL and CUDA implementations (Rasmusson et al. 2008). The nonlinear finite element method mentioned by Taylor et al. (2008) has been accelerated with CUDA by Comas et al. (2008) within their Sofa Framework.

3 Implemented methods

In this section, we provide an overview of the four methods that are implemented in our framework.

3.1 Weighted MSM

The MSM is the simplest, most intuitive and most common method used in deformable object modeling. Since the early works by Terzopoulos et al. (1987) and Terzopoulos and Fleischer (1988), the mass–spring method has been used for modeling various deformable objects, for example in cloth simulation (Provot 1995; Baraff and Witkin 1998), face animation (Kähler et al. 2001) and soft tissue modeling (Gibson and Mirtich 1997; Zhang et al. 2005).

In this system, the deformable object is considered in a discrete space. This model consists of point masses connected together with massless springs and dampers. In the original mass–spring system no weight function is used but we have added weight so nearest neighbors will have a greater effect. By adding weights and considering constant stiffness coefficient $k_{s}$ the elastic force is obtained by

f_{i}^{s} = \sum_{j \in N (x_{i})} k_{s} w_{i j} x_{i j} \left(1 - \frac{l_{i j}^{0}}{∥ x_{i j} ∥}\right),

where $x_{i j} = x_{j} - x_{i}$, in which $x_{i}$ is the position of node $i$, $l_{i j}^{0}$ is the initial length of the spring between node $i$ and node $j$, $N (x_{i})$ is the set of neighbors of node $i$ and $w_{i j}$ is the weight of the $j$th neighbor of node $i$. By considering weights and a constant damping coefficient $k_{d}$, the damping is approximated by

f_{i}^{d} = \sum_{j \in N (x_{i})} k_{d} w_{i j} \frac{v_{i j}^{T} x_{i j}}{∥ x_{i j} ∥} x_{i j},

where $v_{i}$ is the velocity of node $i$ and $v_{i j} = v_{j} - v_{i}$.
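As an illustration of how the two sums can be evaluated for one node, the following is a minimal Python sketch written directly from the elastic and damping expressions above. The function name `spring_damper_force` and the list-of-lists layout for neighbors, weights and rest lengths are assumptions made for this example and are not necessarily the layout used in the framework.

```python
import math

def spring_damper_force(i, pos, vel, neighbors, weights, rest_len, k_s, k_d):
    """Return the weighted elastic plus damping force on node i as [fx, fy, fz].

    pos, vel      : lists of 3-component positions and velocities
    neighbors[i]  : indices j of the neighbors of node i
    weights[i]    : w_ij for each neighbor, in the same order
    rest_len[i]   : initial spring lengths l_ij^0, in the same order
    """
    force = [0.0, 0.0, 0.0]
    for j, w, l0 in zip(neighbors[i], weights[i], rest_len[i]):
        x_ij = [pos[j][c] - pos[i][c] for c in range(3)]
        v_ij = [vel[j][c] - vel[i][c] for c in range(3)]
        dist = math.sqrt(sum(c * c for c in x_ij))
        if dist == 0.0:
            continue  # coincident nodes: the spring direction is undefined
        # Scalar factors that multiply x_ij in the elastic and damping sums.
        elastic = k_s * w * (1.0 - l0 / dist)
        damping = k_d * w * sum(v * x for v, x in zip(v_ij, x_ij)) / dist
        for c in range(3):
            force[c] += (elastic + damping) * x_ij[c]
    return force
```

In a GPU implementation each node would typically be handled by one thread running the body of this loop; the sketch only fixes the arithmetic, not the parallel mapping.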

3.2 LSHM

Müller et al. (2005) developed a special meshless, non-physically based model for deformable objects that is able to provide a robust simulation. We have developed a method for modeling deformable objects based on this ‘Shape Matching’ method. We have extended the concept of clusters in Shape Matching such that a cluster is defined for each point. An overview of this approach is shown in Figure 1. At the beginning of the simulation, for each node $i$ the center of mass $C_{i}^{0}$ of that node and its neighbors is calculated, and the vector $ν_{i}$ from the center of mass to the node is stored. At each iteration of the simulation, the rotation should be extracted from the deformation. The rotation is approximated using the least squares optimization explained by Müller et al. (2005). As shown in Figure 1(c), extracting the rotation is equivalent to rotating the coordinate system in reverse, $x^{R} = R^{- 1} x$. For each node $i$ in the new rotated coordinate system, the goal position is located at $g_{i}^{R} = ν_{i} + C_{i}^{R}$, where $C_{i}^{R}$ is the new center of mass of node $i$ and its neighbors in the rotated coordinate system. Once the goal positions are found in the new coordinate system, they are transformed to the original coordinate system, $g_{i} = R g_{i}^{R}$. Instead of using the integration scheme proposed by Müller et al. (2005), we introduce a restoring force from the current position to the calculated goal position:

f_{i}^{s} = k_{s} (g_{i} - x_{i}) .

Figure 1.

Local Shape Matching. (a) Node $x_{i}$ and its neighbors at the beginning of the simulation. $C_{i}^{0}$ is the center of mass of node $i$ and its neighbors, and $ν_{i} = x_{i}^{0} - C_{i}^{0}$. (b) Node $x_{i}$ and its neighbors after deformation. (c) The rotation is extracted by rotating the coordinate system. The goal position is then calculated in the rotated coordinate system; it is located at the same vector offset from the center of mass, $g_{i}^{R} = ν_{i} + C_{i}^{R}$. (d) Goal positions are rotated back to the original coordinate system and a force is applied in the direction of $g_{i} - x_{i}$.

While rotation can be approximated for each node, we approximate the rotation for the whole shape and use it for all nodes for simplicity. On the other hand, it turns out that we do not have to transform all of the nodes to the rotated coordinate system. The same results can be achieved if we only rotate $ν_{i}$ . An overview of our algorithm is given in Algorithm 1.

Algorithm 1 Our LSHM algorithm.

{Initialization}

for all nodes $i$ do

$\begin{aligned} C_{i}^{0} = \frac{1}{∥ N (x_{i}) ∥} \sum_{j \in N (x_{i})} x_{j}^{0} \\ ν_{i} = x_{i}^{0} - C_{i}^{0} \end{aligned}$

end for

{At each iteration}

Approximate the rotation $\to R$ .

for all nodes $i$ do

$\begin{aligned} C_{i} = \frac{1}{∥ N (x_{i}) ∥} \sum_{j \in N (x_{i})} x_{j} \\ g_{i} = C_{i} + R ν_{i} \\ f_{i} = k_{s} (g_{i} - x_{i}) \end{aligned}$

end for

{Explicit integration}

for all nodes $i$ do

$\begin{aligned} {\dot{x}}_{i} \leftarrow {\dot{x}}_{i} + Δ t f_{i} / m_{i} \\ x_{i} \leftarrow x_{i} + Δ t {\dot{x}}_{i} \end{aligned}$

end for
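The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the framework's implementation: it works in 2D, where the least squares rotation of the whole shape has a closed form, whereas in 3D the rotation would be extracted as described by Müller et al. (2005). The function names, the tuple-based data layout and the single shared `mass` are hypothetical.

```python
import math

def lshm_init(rest, neighbors):
    """Initialization: nu_i = x_i^0 - C_i^0, with C_i^0 the mean rest
    position over N(x_i), as written in Algorithm 1."""
    nu = []
    for i, nbrs in enumerate(neighbors):
        cx = sum(rest[j][0] for j in nbrs) / len(nbrs)
        cy = sum(rest[j][1] for j in nbrs) / len(nbrs)
        nu.append((rest[i][0] - cx, rest[i][1] - cy))
    return nu

def global_rotation_angle(rest, pos):
    """Least squares rotation angle of the whole shape (2D closed form)."""
    n = len(rest)
    c0 = (sum(p[0] for p in rest) / n, sum(p[1] for p in rest) / n)
    c = (sum(p[0] for p in pos) / n, sum(p[1] for p in pos) / n)
    s = d = 0.0
    for r, p in zip(rest, pos):
        qx, qy = r[0] - c0[0], r[1] - c0[1]
        px, py = p[0] - c[0], p[1] - c[1]
        s += qx * py - qy * px   # accumulates the sine component
        d += qx * px + qy * py   # accumulates the cosine component
    return math.atan2(s, d)

def lshm_step(pos, vel, rest, nu, neighbors, k_s, mass, dt):
    """One iteration: forces toward goal positions, then explicit integration."""
    theta = global_rotation_angle(rest, pos)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    forces = []
    for i, nbrs in enumerate(neighbors):
        cx = sum(pos[j][0] for j in nbrs) / len(nbrs)
        cy = sum(pos[j][1] for j in nbrs) / len(nbrs)
        # g_i = C_i + R nu_i
        gx = cx + cos_t * nu[i][0] - sin_t * nu[i][1]
        gy = cy + sin_t * nu[i][0] + cos_t * nu[i][1]
        forces.append((k_s * (gx - pos[i][0]), k_s * (gy - pos[i][1])))
    for i, (fx, fy) in enumerate(forces):
        vel[i] = (vel[i][0] + dt * fx / mass, vel[i][1] + dt * fy / mass)
        pos[i] = (pos[i][0] + dt * vel[i][0], pos[i][1] + dt * vel[i][1])
    return forces
```

A useful sanity check for such a sketch is that a rigidly rotated and translated copy of the rest shape produces zero forces, since every goal position then coincides with the current position.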

3.3 Meshless finite element method

If the finite element method is applied to a linear problem, it produces a linear system of algebraic equations. Assuming linear strain and an isotropic material, the governing equation of the material ($ρ \ddot{x} = \nabla \cdot σ + f$) can be simplified as follows:

ρ \ddot{x} = μ Δ u + (λ + μ) \nabla (\nabla \cdot u),

where $u_{i} = x_{i} - x_{i}^{0}$ is the displacement of each node, $ρ$ is the density of the material and $λ$ and $μ$ are Lamé’s coefficients. Debunne et al. (1999) used a mesh-free method to solve this equation. Their method is based on calculating the Laplacian ($Δ = \nabla^{2}$) and the gradient of divergence ($\nabla (\nabla \cdot)$) in a discrete fashion. Using their approach these geometric operators are defined based on the neighboring nodes only (not restricted to the grid imposed by a regular mesh; hence the term ‘meshless’), given as follows:

Δ u_{i} = \frac{2}{\sum_{j \in N (x_{i})} l_{i j}} \sum_{j \in N (x_{i})} \frac{u_{j} - u_{i}}{l_{i j}}

\nabla (\nabla \cdot u) = \frac{2}{\sum_{j \in N (x_{i})} l_{i j}} \sum_{j \in N (x_{i})} \frac{\left[(u_{j} - u_{i}) \cdot \frac{\mathbf{l}_{i j}}{l_{i j}}\right] \frac{\mathbf{l}_{i j}}{l_{i j}}}{l_{i j}}

where $\mathbf{l}_{i j} = u_{j} - u_{i}$ and $l_{i j} = ∥ u_{j} - u_{i} ∥$ is the distance between sample points $i$ and $j$.

In the original method all neighbors have the same weight. We have modified this such that closer neighbors have a greater effect, as expected. Furthermore, the original method does not handle significant rotations. In order to fix this, we perform the calculations in the object’s coordinate system, as we did in the previous section. Therefore, at each iteration we approximate the global rotation of the object. To calculate the Laplacian (Equation (5)) and the gradient of divergence (Equation (6)) we need to calculate the displacements ($u_{i} = x_{i} - x_{i}^{0}$). To compensate for the rotation, before calculating the displacements we rotate the original points with the approximated rotation of the object:

{\tilde{x}}_{i}^{0} = R x_{i}^{0},

u_{i} = x_{i} - {\tilde{x}}_{i}^{0} .
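The discrete operators and the rotation compensation can be illustrated with a short Python sketch. It is written under two assumptions that go beyond the text: the vector between sample points $i$ and $j$ is taken from a set of reference positions `ref` (for example the rest positions, rotated by $R$ when rotation compensation is used), following the description of $l_{i j}$ as the distance between sample points, and all neighbors are weighted equally as in the original method. The function names are hypothetical.

```python
import math

def rotated_displacements(pos, rest, R):
    """u_i = x_i - R x_i^0, with R a 3x3 rotation given as nested lists."""
    u = []
    for x, x0 in zip(pos, rest):
        rx0 = [sum(R[r][c] * x0[c] for c in range(3)) for r in range(3)]
        u.append([x[r] - rx0[r] for r in range(3)])
    return u

def debunne_operators(i, u, ref, neighbors):
    """Return (laplacian, grad_div) of the displacement field u at node i.

    ref holds the reference positions used for the inter-node vectors
    (an assumption of this sketch); nodes are assumed to be distinct.
    """
    lap = [0.0, 0.0, 0.0]
    grad_div = [0.0, 0.0, 0.0]
    total_len = 0.0
    for j in neighbors[i]:
        l_vec = [ref[j][c] - ref[i][c] for c in range(3)]
        l = math.sqrt(sum(c * c for c in l_vec))
        du = [u[j][c] - u[i][c] for c in range(3)]
        total_len += l
        # Projection of the displacement difference on the unit direction.
        proj = sum(du[c] * l_vec[c] for c in range(3)) / l
        for c in range(3):
            lap[c] += du[c] / l
            grad_div[c] += proj * (l_vec[c] / l) / l
    scale = 2.0 / total_len
    return [scale * v for v in lap], [scale * v for v in grad_div]
```

For a node at the origin with neighbors at $(\pm 1, 0, 0)$ and the displacement field $u = (x^{2}, 0, 0)$, both operators evaluate to $(2, 0, 0)$, which matches the continuous values of $Δ u$ and $\nabla (\nabla \cdot u)$ for that field.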

3.4 Point-based animation

Point-based graphics is an active research area in computer graphics in which the surface is rendered as point-sampled surface elements (surfels) rather than polygonal surfaces (Pfister et al. 2000). There has been growing interest in combining mesh-free methods and point-based graphics. Müller et al. (2004) proposed point-based animation (PBA), a mesh-free continuum mechanics method for the animation of elastic, plastic and melting objects. In their approach the geometrical strain is approximated around each node based on the deformation of the neighboring nodes. Then the derivative of the elastic energy is approximated in the area around each point (phyxel) to calculate the resulting elastic force. They use moving least squares optimization to approximate strain, stress and strain energy. To calculate the elastic forces they use the same approach to calculate the derivative of the strain energy with respect to displacement. The following is an overview of their method:

\begin{matrix} u_{t} & \to & \nabla u_{t} & \to & ϵ_{t} & \to & σ_{t} \\ & & & & ↘ & & ↙ \\ u_{t + Δ t} & \leftarrow & f_{t} = \nabla_{u} U_{t} & \leftarrow & & U_{t} & \end{matrix}

4 Our GPU-based framework

We have developed a framework for animating deformable objects based on particle-based methods. Our framework was specifically designed for efficient implementation on CUDA GPU architectures, and can also be mapped easily onto multicore CPUs. An overview of our algorithm is shown in Figure 2. We made use of OpenMP (Chapman et al. 2007) to map the kernel loop onto all available CPU cores. Numerically, we use explicit integration and restrict each step to 10 iterations. That is, we enforce the boundary conditions only once per 10 iterations, which reduces the number of transfers between host and device.

Figure 2.

Overview of our simulation framework.

Ideally we want to perform all of the tasks on the GPU; however, some tasks cannot be parallelized and, consequently, they would be even slower on the GPU. Our goal is to utilize the GPU parallel computing capabilities fully by transferring the heavy computation of internal interaction of nodes and numerical integration to the GPU. By doing so, the CPU is reserved for performing sequential tasks such as collision detection.

Transferring data (such as positions of the nodes) between host (CPU side) and device (GPU side) is slow and can easily become a bottleneck for processing. Therefore, we limit these transfers as much as possible.
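The idea of running several integration iterations between boundary-condition updates can be sketched as a simple driver loop. This is a schematic Python illustration only; `compute_forces`, `integrate` and `enforce_boundary` are hypothetical callables standing in for the two device kernels and the host-side step, and the transfer points are indicated in comments rather than implemented.

```python
def simulate_frame(state, compute_forces, integrate, enforce_boundary, substeps=10):
    """Run `substeps` force/integration iterations, then enforce the
    boundary conditions once.

    In a GPU implementation the two calls inside the loop would be kernel
    launches operating on device memory, so no host-device transfer is
    needed until the loop has finished.
    """
    for _ in range(substeps):
        forces = compute_forces(state)   # device: internal forces
        integrate(state, forces)         # device: explicit integration
    # Host side: positions would be copied back here, corrected (for
    # example by collision detection) and sent to the device again.
    enforce_boundary(state)
    return state
```

With `substeps=10`, the number of host–device round trips is one tenth of what it would be if the boundary conditions were enforced after every iteration.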

4.1 Data structure

In our framework, a set of arrays is generated to store the variables associated with the simulation. These variables are stored for all nodes and include position, velocity, elastic force, list of neighbors, and weight of each neighbor. In addition to these variables, there are others that are specific to each of the methods that we compare, and therefore are declared separately. In the MSM the original length of each spring is stored. One of the features of our LSHM method is that, instead, the vector from the center of mass of the neighborhood to each node is stored. In the Debunne method only the original positions are saved. In PBA the matrix $M^{- 1} = {(\sum_{j} x_{i j} x_{i j}^{T} w_{i j})}^{- 1}$ is retained. All of the arrays described above are created and initialized on the host (CPU) at the beginning of the simulation and copied to the device (GPU). During the simulation, however, only the positions and velocities are sent back and forth between host and device. Positions and velocities are transferred to the host to perform collision detection, and the corrected values are then sent back to the GPU for the next iteration. CUDA provides OpenGL interoperability, which enables the user to map a CUDA buffer to an OpenGL buffer and vice versa. Unfortunately, since this operation is not implemented efficiently, it turns out to be much faster for us to send the positions to the GPU twice, once for rendering and once for the simulation. The calculated forces are also transferred to the host at each iteration if haptic rendering is required.

Table 1 shows the properties of each array that is created on the GPU. Velocity, position and force have four float components instead of three. The reason is that in our CUDA kernels we use float4, since the platform is not optimized for accessing float3.³ The neighbors array keeps the indices of the $n_{Neighbors}$ closest nodes to each node, and the effect of each neighbor is weighted according to a weight function. We have used two kernel functions in our simulation. The first kernel calculates the forces (or accelerations) and the second kernel is used for explicit integration. For each node, a thread is generated; each thread gathers the required data from memory and updates the force, velocity and position of its corresponding node. Some of the arrays do not change during the simulation, and therefore we have used the read-only cached texture memory for them. The rest of the arrays are allocated in global memory. Global memory has high latency compared with shared memory. Unfortunately we cannot use shared memory for those arrays, since shared memory is only accessible to threads of the same thread block. The size of the thread block is limited and is chosen based on shared memory and register requirements. For a typical mesh the threads are scattered over several thread blocks; therefore, the neighbors of a node might reside in another thread block.

Table 1.

Data arrays and their sizes.

Array name	Type	Size	GPU memory	Transfer
Position	float4	$4 × n_{Nodes}$	Global	Each iteration
Velocity	float4	$4 × n_{Nodes}$	Global	At initialization
Force	float4	$4 × n_{Nodes}$	Global	Each iteration
Neighbors	int	$n_{Neighbors} × n_{Nodes}$	Texture	At initialization
Nei_Weights	float	$n_{Neighbors} × n_{Nodes}$	Texture	At initialization
Method specific
Node distances (MSM)	float	$n_{Neighbors} × n_{Nodes}$	Texture	At initialization
$ν_{i}$ (LSHM)	float4	$4 × n_{Nodes}$	Texture	At initialization
Original positions (DEB, PBA)	float4	$4 × n_{Nodes}$	Texture	At initialization
$M^{- 1}$ (PBA)	float	$9 × n_{Nodes}$	Texture	At initialization
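As a rough worked example of what Table 1 implies for device memory, the following Python sketch adds up the array sizes for the MSM case. It assumes 4-byte `float` and `int` elements, so that a float4 occupies 16 bytes; the function name and constants are hypothetical.

```python
FLOAT4_BYTES = 16  # four 32-bit floats (assumed element size)
SCALAR_BYTES = 4   # one 32-bit float or int (assumed element size)

def msm_device_bytes(n_nodes, n_neighbors):
    """Approximate device memory for the MSM arrays listed in Table 1."""
    # Position, velocity and force: one float4 each per node.
    per_node_vectors = 3 * FLOAT4_BYTES
    # Neighbors, Nei_Weights and node distances: n_neighbors scalars each.
    per_node_lists = 3 * SCALAR_BYTES * n_neighbors
    return n_nodes * (per_node_vectors + per_node_lists)
```

Under these assumptions the per-node cost is 240 bytes for 16 neighbors, of which only the position and force arrays (32 bytes per node) are marked in Table 1 as transferred at each iteration.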

4.2 Weight function

Different formulas have been used for the weights of the neighbors. Some of these functions are shown in Table 2. The weight function should be symmetric ($w_{i j} = w_{j i}$), smooth, positive, and monotonically decreasing. All of the functions in Table 2 satisfy these conditions. We have chosen the cubic spline function as our weight function. In addition, we have to make sure that the neighborhood relationship is mutual so that Newton’s third law (action equals reaction) is preserved. Therefore, if for example $j$ is in the list of neighbors of $i$, then $i$ must be in the list of neighbors of $j$.

Table 2.

Weight functions used in mesh-free methods (Li and Liu 2007).

Linear	$w_{i j} = \begin{cases} 1 - q & \text{if } q < h \\ 0 & \text{otherwise} \end{cases}$
Gaussian function	$w_{i j} = \begin{cases} e^{- q^{2} / σ^{2}} & \text{if } q < 1 \\ 0 & \text{otherwise} \end{cases}$
Cubic spline function	$w_{i j} = \begin{cases} 1 - \frac{3}{2} q^{2} + \frac{3}{4} q^{3} & \text{if } 0 \leq q \leq 1 \\ \frac{1}{4} (2 - q)^{3} & \text{if } 1 \leq q \leq 2 \\ 0 & \text{otherwise} \end{cases}$
Quartic spline function	$w_{i j} = \begin{cases} 1 - 6 q^{2} + 8 q^{3} - 3 q^{4} & \text{if } q < 1 \\ 0 & \text{otherwise} \end{cases}$
Point-based animation (Müller et al. 2004)	$w_{i j} = \begin{cases} (1 - q^{2})^{3} & \text{if } q < 1 \\ 0 & \text{otherwise} \end{cases}$

where $h$ is a threshold and $q = | x_{i j} | / h$.
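The cubic spline weight and the mutual-neighbor requirement can be illustrated with a small Python sketch. The function names are hypothetical, and the way neighbor lists are made mutual here (adding the missing reverse entries) is only one possible choice; the text does not specify how the framework enforces it.

```python
def cubic_spline_weight(dist, h):
    """Cubic spline weight from Table 2, with q = dist / h."""
    q = dist / h
    if 0.0 <= q <= 1.0:
        return 1.0 - 1.5 * q * q + 0.75 * q ** 3
    if q <= 2.0:
        return 0.25 * (2.0 - q) ** 3
    return 0.0

def make_mutual(neighbors):
    """Return neighbor lists in which j in N(i) implies i in N(j).

    Missing reverse entries are added, so some lists may grow beyond
    their original length.
    """
    sets = [set(nbrs) for nbrs in neighbors]
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            sets[j].add(i)
    return [sorted(s) for s in sets]
```

Because the weight depends only on the distance between the two nodes, $w_{i j} = w_{j i}$ holds automatically once the neighbor lists are mutual. The two polynomial branches agree at $q = 1$, where both give $1/4$.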

5 Results

5.1 Accuracy

In order to compare the accuracy of the different methods quantitatively, we have compared our results with the Truth Cube experiment (Kerdok et al. 2003). Kerdok et al. developed a physical standard to validate soft tissue deformation models. They took CT images of a cube of silicone rubber with a set of embedded Teflon spheres that underwent uniaxial compression and spherical indentation tests. They used silicone rubber (RTV6166, General Electric Co.), which exhibits linear behavior to at least 30% strain. Two experiments were performed on the Truth Cube. In the first experiment the silicone rubber phantom with embedded fiducial markers (Teflon spheres) was imaged under uniaxial compression. In the second experiment a spherical indentation loading condition was applied. They also scanned the experimental setup when the cube was unloaded.

In the first experiment the flat compression plate was held level as it was lowered onto the oiled top surface of the cube to a set displacement. Three loading conditions were scanned: 4 mm, 10 mm, and 14.6 mm displacements producing 5.0%, 12.5%, and 18.25% nominal strain, respectively.

A similar setup was used for the large deformation spherical indentation test except that a 2.54 cm diameter spherical indentor mounted on a 1.9 cm diameter by 4.5 cm long cylinder was added to the compression plate. The first scan was also performed with the cube in an unloaded state for reference locations of the internal spheres. The compression plate with the indentor was then lowered in a manner similar to that of the uniaxial test. Two loading conditions were scanned: 18 mm and 24 mm displacements producing 22% and 30% local nominal strain.

In order to make sure that we track the right positions, we add 343 nodes to each reference mesh. The positions of these additional mesh nodes were initialized from the measured Truth Cube data at zero displacement. In the Truth Cube experiments the cube is in an equilibrium state in the presence of gravity. In other words, the stress and strain are not equal to zero at the beginning. For our simulation we assume that gravity is zero: if we did not neglect gravity, our model would deform as soon as the simulation starts, whereas in the Truth Cube experiment there is no deformation if there is no contact. For the silicone rubber used, Young’s modulus is equal to 15 kPa and Poisson’s ratio is close to 0.5. In our simulation we have assumed that Poisson’s ratio is equal to 0.499. MSM and LSHM are not based on continuum elasticity; therefore, they cannot be parameterized by Young’s modulus. For these methods the spring coefficient is manually adjusted to achieve reasonable results for the comparisons.

We then simulated the Truth Cube experiment by lowering a plate to simulate the uniaxial compression test and a sphere to simulate the spherical indentation test. The plate/sphere was moved manually and the positions of the nodes corresponding to the Truth Cube experiment were saved in a file to be processed later. By comparing the result of our simulation with the Truth Cube we obtain a performance measure for the linear elastic state. We define a relative error measure for each point as follows:

e_{i} = 100 \frac{∥ x_{i}^{s} - x_{i}^{t c} ∥}{80},

where $x_{i}^{s}$ is the simulation result for node $i$ and $x_{i}^{t c}$ is the position of the corresponding node in the Truth Cube experiment. We have divided the error by 80 mm to normalize it with respect to the dimension of the cube. The average is then defined as follows:

E = \frac{1}{343} \sum_{i = 1}^{343} e_{i} .
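The error measure is straightforward to compute once the simulated and measured marker positions are available. A minimal Python sketch, assuming both are given as equally ordered lists of 3-component positions in millimetres (the function names are hypothetical):

```python
import math

def relative_errors(sim, truth, cube_size_mm=80.0):
    """Per-marker relative error e_i in percent of the cube dimension."""
    return [100.0 * math.dist(s, t) / cube_size_mm for s, t in zip(sim, truth)]

def mean_error(sim, truth, cube_size_mm=80.0):
    """Average relative error E over all tracked markers."""
    errors = relative_errors(sim, truth, cube_size_mm)
    return sum(errors) / len(errors)
```

For example, a marker that is 8 mm away from its measured position contributes an error of 10%.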

Uniaxial compression

Figure 3 shows the result of simulating the first Truth Cube experiment on a cube with 2207 nodes and 16 neighbors for each method. The figure shows the results for four strain stages using four different methods. In Figure 5 the errors for different combinations of mesh size and number of neighbors are compared for the uniaxial compression test. PBA was unstable for a Poisson ratio of 0.49; we were forced to use a Poisson ratio at times as low as 0.40, which raises concern about the applicability of such methods in general. We are pursuing this general modeling issue in our current research.

Figure 3.

Simulation of uniaxial compression of a cube with four methods in our framework. Number of nodes: 2207; number of neighbors: 16.

Figure 4.

Simulation of the Truth Cube spherical indentation with four methods in our framework. Number of nodes: 2207; number of neighbors: 16.

Figure 5.

Accuracy of the simulation in the uniaxial compression experiment.

Spherical indentation

The second experiment is of higher importance for us since it better resembles the interactions in surgical simulation. Figure 4 shows the result of simulating the second Truth Cube experiment on a cube with 2207 nodes and 16 neighbors for each method. The figure shows the results for three strain stages using four different methods. In Figure 6 the errors for different combinations of mesh size and number of neighbors are compared for the spherical indentation test. As in the previous simulation, a Poisson ratio of 0.40 was used instead of 0.49 for PBA.

Figure 6.

Accuracy of the simulation in the spherical indentation experiment.

Discussion

As expected, the error is higher where there is greater deformation (strain). By manually adjusting the stiffness by trial and error we were able to achieve good results most of the time for MSM; however, we had to use different values for different meshes. For example, in the first experiment (Figure 5), for the mesh with 3232 nodes and 32 neighbors, the selected stiffness coefficient resulted in low error for 5% and 12.5% strain but in a total collapse of the cube for 18.25% strain. The LSHM method has resulted in lower accuracy compared with MSM since it does not conserve volume. The error for LSHM is in the same range for all mesh sizes. This is a desirable feature since it ensures that we get the same behavior when the resolution is increased or decreased. The Debunne method has attained lower errors in general; however, it shows sensitivity to the mesh size. Although PBA is the most sophisticated method in our framework, it has not achieved the lowest error. This method is suitable for materials with chosen physical properties. As mentioned by Gerszewski et al. (2009), PBA is not appropriate for large elastic deformations. When there are large elastic deformations, using moving least squares for approximation results in an ill-conditioned deformation gradient and instability of the simulation. Another reason for the instability of this method could be a singular or near-singular matrix $M = (\sum_{j} x_{i j} x_{i j}^{T} w_{i j})$.

In general, increasing the resolution of the meshes did not improve accuracy in all cases. When increasing the resolution of the mesh we had to increase the number of neighbors to obtain better results. In our simulations, increasing the number of neighbors from 16 to 32 resulted in better accuracy for the mesh with 2207 nodes. In fact we had to use 32 neighbors for bigger meshes to obtain better stability. Another factor that we should take into account is that we have used 32-bit single-precision floating point numbers, since our CUDA card did not support double-precision floating point numbers. That could, potentially, affect the simulations for high-resolution meshes.

It should be noted that this comparison only provides a measure for linear elastic simulation. Different results might be achieved in a dynamic simulation. In our simulations there were no significant differences in accuracy between the CPU and CUDA implementations. As mentioned in Section 4, the boundary conditions in the CUDA implementation are enforced every 10 iterations to get better speed. This might have an effect on accuracy in rapid movements; however, this deficiency is partly compensated in the CUDA implementation, since the time steps will be smaller, resulting in a more accurate explicit integration.

5.2 Performance

In Figure 8 the calculation time is compared for the different implementations. We have used a logarithmic scale to better observe the differences. The simulation is repeated for all four methods for different mesh sizes (639, 2207, 3232, 5567, and 10,932 nodes), and the average calculation time is given in milliseconds when running the algorithm on a single core of a CPU, on four cores of the CPU, and on the GPU. Our LSHM method is faster than the other methods since there is no square root operation in its calculations. On a typical processor, a square root requires 18 cycles, while addition, subtraction, and multiplication require 2 cycles each and division requires 12 cycles.
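The cycle counts quoted above can be turned into a rough worked example. The operation counts below are assumptions for a straightforward implementation of each quantity on 3D points; they are not measurements of the kernels in the framework.

```python
# Cycle counts per operation as quoted in the text.
CYCLES = {"add": 2, "sub": 2, "mul": 2, "div": 12, "sqrt": 18}


def cost(op_counts):
    """Total cycles for a dictionary mapping operation name to number of uses."""
    return sum(CYCLES[op] * n for op, n in op_counts.items())


# Assumed operation counts for two 3D points (illustrative only).
squared_distance = {"sub": 3, "mul": 3, "add": 2}   # |a - b|^2
distance = dict(squared_distance, sqrt=1)           # |a - b|
unit_direction = dict(distance, div=3)              # (a - b) / |a - b|

cycles_squared = cost(squared_distance)   # 16 cycles
cycles_distance = cost(distance)          # 34 cycles
cycles_direction = cost(unit_direction)   # 70 cycles
```

Under these assumptions, a single square root more than doubles the cost of a distance evaluation, and normalizing the direction roughly doubles it again. A method that avoids the square root saves this cost for every neighbor pair in every iteration.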

Figure 7. Stability of the simulation. The bars show the maximum time step (in ms) that can be used without instability.

Figure 8. Comparison of calculation time for different methods when 16 neighbors were considered: (a) MSM; (b) LSHM; (c) discretized finite element method (Debunne); (d) PBA.

For all methods, the calculation time grows as the number of nodes increases. Using all four cores of the CPU we were able to accelerate the simulation almost three times, but this is not fast enough for real-time applications if the number of nodes is too high. On the other hand, by using CUDA we obtain an acceptable frame rate for all methods even with 10,932 nodes. As shown, the slopes of the curves are almost the same for the single-core and multicore CPU, but CUDA results in a much lower slope for all methods. There is a sudden increase in CUDA calculation time for all methods at around 3000 nodes. We can conclude that for smaller meshes the speed is bounded by the calculation, while for larger meshes it is bounded by the bandwidth of the host–device transfer.

5.3 Stability

Explicit integration is not unconditionally stable. Therefore, the time step should be kept small to ensure a stable simulation. In this section we compare the different methods based on the time step at which they become unstable. For each simulation we started with a small time step and gradually increased it until the simulation became unstable. We used a global damping ratio of 0.995 and ran the simulation on a cube placed on a surface with gravity applied. In Figure 7 we compare the threshold time steps for the different methods applied to different meshes. We also repeated the simulation on both the CPU and the GPU. As shown, LSHM achieved the best stability. MSM and the Debunne method became unstable at around the same time step. To obtain better stability in PBA we had to use a Poisson's ratio of 0.45 instead of the 0.499 used in the Debunne method, but the stability of this method is still poor. In general, the model became unstable at smaller time steps when using the GPU. The reason is that floating point operations are implemented differently on the GPU and have lower accuracy than the corresponding operations on the CPU.
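The threshold search described above can be sketched on a much simpler system. The example below is an assumption-laden illustration in Python, not the procedure run on the cube: it uses a single damped one-dimensional spring, a semi-implicit Euler update, the damping value applied as a per-step velocity multiplier, and hypothetical stiffness, mass, and growth parameters.

```python
def is_stable(dt, stiffness=100.0, mass=1.0, damping=0.995,
              steps=2000, blow_up=1e6):
    """Integrate a damped 1D spring explicitly and report whether it stays bounded."""
    x, v = 1.0, 0.0  # initial displacement and velocity
    for _ in range(steps):
        acceleration = -stiffness / mass * x
        v = damping * (v + dt * acceleration)  # global damping applied to the velocity
        x = x + dt * v
        if abs(x) > blow_up:
            return False
    return True


def max_stable_time_step(dt_start=1e-3, growth=1.05):
    """Increase dt geometrically and return the last value that remained stable."""
    dt, last_stable = dt_start, None
    while is_stable(dt):
        last_stable = dt
        dt *= growth
    return last_stable


threshold = max_stable_time_step()
```

For this toy system the search stops just below a time step of 0.2, which agrees with the analytical limit of roughly $2/\omega$ for $\omega = \sqrt{k/m} = 10$. The resolution of the result is set by the growth factor; a smaller factor gives a tighter estimate at the cost of more trial simulations.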

6 Conclusion

In this paper we have introduced an efficient framework for implementing mesh-free deformable object modeling methods in CUDA. We have shown how deformable object modeling methods can be implemented in this framework. Our framework is better suited to mesh-free methods; however, simple mesh-based methods such as MSM can be implemented with minor adjustments. Four different methods for modeling deformable objects, including the new LSHM method, were included in the framework to take advantage of GPU parallel processing. We have shown that although the calculation can be accelerated using multicore CPUs, the calculation time grows at the same rate as on a single-core CPU. With CUDA, however, we were able to achieve up to 20 times faster simulations when the number of nodes was more than 10,000. The reason CUDA is faster than the CPU is not just that it has more processing cores; it is also related to the calculation method. When a CPU processing core is waiting for data from memory, it tries to keep itself busy by running pending non-dependent instructions out of order; the GPU, on the other hand, hides the memory latency by performing the same instruction on other threads.

We have compared the accuracy and stability of the implemented methods in the linear static state by comparing the simulation results with the experimental results of the Truth Cube (Kerdok et al. 2003). The Debunne method is promising since it yields low errors while being reasonably fast. LSHM is very fast and robust; however, it yields high errors since it does not conserve volume. In the future we will add a volume-conserving force to this method, to move towards the desired property of volume preservation in tissue.

Acknowledgement

This work was funded by NSERC operating grant A2630 to RE, and by the Canadian NCE on Graphics and New Media under the HLTHSIM project.

References

Baraff, Witkin (1998) Large steps in cloth simulation. In SIGGRAPH’98: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM Press, pp. 43–54.

Chapman, Jost, van der Pas (2007) Using OpenMP. Portable Shared Memory Parallel Programming. Cambridge, MA: MIT Press.

Che, Boyer, Meng, Tarjan, Sheaffer, Skadron (2008) A performance study of general-purpose applications on graphics processors using CUDA. J Parallel Distrib Comput 68: 1370–1380.

Comas, Taylor, Allard, Ourselin, Cotin, Passenger (2008) Efficient nonlinear FEM for soft tissue modelling and its GPU implementation within the open source framework SOFA. In ISBMS’08: Proceedings of the 4th International Symposium on Biomedical Simulation. Berlin: Springer-Verlag, pp. 28–39.

Debunne, Desbrun, Barr, Cani M-P (1999) Interactive multiresolution animation of deformable models. In Magnenat-Thalmann, Thalmann (eds), Eurographics Workshop on Computer Animation and Simulation’99, September, 1999, Computer Science. Milan: Springer, pp. 133–144.

Georgii, Westermann (2005) Mass–Spring Systems on the GPU. Amsterdam: Elsevier. Available at: http://wwwcg.in.tum.de/Research/Publications/MassSpringGPU2.

Gerszewski, Bhattacharya, Bargteil (2009) A point-based method for animating elastoplastic solids. In SCA’09: Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. New York: ACM Press, pp. 133–138.

Gibson SFF, Mirtich (1997) A Survey of Deformable Modeling in Computer Graphics. Technical report, Mitsubishi Electric Research Laboratories.

Kerdok, Cotin, Ottensmeyer, Galea, Howe, Dawson (2003) Truth Cube: Establishing physical standards for soft tissue simulation. Med Image Anal 7: 283–291.

Kähler, Haber, Seidel (2001) Geometry-based muscle modeling for facial animation. In Proceedings of Graphics Interface 2001, pp. 37–46.

Lasseter (1987) Principles of traditional animation applied to 3D computer animation. SIGGRAPH Comput Graph 21(4): 35–44.

Liu (2007) Meshfree Particle Methods. New York: Springer.

Müller, Heidelberger, Teschner, Gross (2005) Meshless deformations based on shape matching. ACM Trans Graph 24: 471–478.

Müller, Keiser, Nealen, Pauly, Gross, Alexa (2004) Point based animation of elastic, plastic and melting objects. In SCA’04: Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. Aire-la-Ville, Switzerland: Eurographics Association, pp. 141–151.

Nealen, Müller, Keiser, Boxerman, Carlson (2006) Physically based deformable models in computer graphics. Computer Graphics Forum 25: 809–836.

NVIDIA (2008) NVIDIA CUDA (Compute Unified Device Architecture) Programming Guide. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf.

Owens, Luebke, Govindaraju, Harris, Krüger, Lefohn, et al. (2007) A survey of general-purpose computation on graphics hardware. Computer Graphics Forum 26: 80–113.

Peercy, Segal, Gerstmann (2006) A performance-oriented data parallel virtual machine for GPUs. In SIGGRAPH’06: ACM SIGGRAPH 2006 Sketches. New York: ACM Press, p. 184.

Pfister, Zwicker, van Baar, Gross (2000) Surfels: surface elements as rendering primitives. In SIGGRAPH’00: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM Press/Addison-Wesley, pp. 335–342.

Provot (1995) Deformation constraints in a spring–mass model to describe rigid cloth behavior. In Proceedings of Graphics Interface, pp. 147–154.

Rasmusson, Mosegaard, Sørensen (2008) Exploring parallel algorithms for volumetric mass–spring–damper models in CUDA. In ISBMS’08: Proceedings of the 4th International Symposium on Biomedical Simulation. Berlin: Springer-Verlag, pp. 49–58.

Shahingohar, Eagleson (2010) International Workshop on GPUs and Scientific Applications (GPUSCA 2010). Technical report, Department of Computer Science, The University of Vienna. Available at: http://eprints.cs.univie.ac.at/27/.

Taylor, Cheng, Ourselin (2008) High-speed nonlinear finite element analysis for surgical simulation using graphics processing units. IEEE Trans Med Imaging 27: 650–663.

Terzopoulos, Fleischer (1988) Modeling inelastic deformation: viscoelasticity, plasticity, fracture. In SIGGRAPH’88: Proceedings of the 15th Annual Conference on Computer Graphics and Interactive Techniques. New York: ACM Press, pp. 269–278.

Terzopoulos, Platt, Barr, Fleischer (1987) Elastically deformable models. In SIGGRAPH’87: Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, pp. 205–214.

Zhang, Huang (2005) Real-time simulation of deformable soft tissue based on mass–spring and medial representation. In Computer Vision for Biomedical Image Applications: First International Workshop, CVBIA 2005. Berlin: Springer-Verlag, pp. 419–426.