Fast event-based epidemiological simulations on national scales

Abstract

We present a computational modeling framework for data-driven simulations and analysis of infectious disease spread in large populations. For the purpose of efficient simulations, we devise a parallel solution algorithm targeting multi-socket shared-memory architectures. The model integrates infectious dynamics as continuous-time Markov chains and available data such as animal movements or aging are incorporated as externally defined events. To bring out parallelism and accelerate the computations, we decompose the spatial domain and optimize cross-boundary communication using dependency-aware task scheduling. Using registered livestock data at a high spatiotemporal resolution, we demonstrate that our approach not only is resilient to varying model configurations but also scales on all physical cores at realistic workloads. Finally, we show that these very features enable the solution of inverse problems on national scales.

Keywords

Computational epidemiology, discrete event simulation, multicore implementation, stochastic modeling, task-based computing

1 Introduction

Livestock diseases have a major economic impact on farmers, the livestock industry, and countries (Hasonova and Pavlik, 2006; Knight-Jones and Rushton, 2013). Modeling and simulation of infectious disease spread are important in designing cost-efficient surveillance and control (Willeberg et al., 2011). One challenge is that disease dynamics and transmission routes for various pathogens are fundamentally different. Indirect transmission of pathogens via the environment for fecal–oral diseases requires a different model compared to diseases that spread with direct contact between individuals (Brooks-Pollock et al., 2015). Another challenge is to incorporate the increasing amount of epidemiologically relevant data into the models (Pellis et al., 2015). It is therefore desirable to have simulation tools that are flexible to various disease spread models yet efficient to handle the large amounts of available livestock data.

Due to uncertainties in the exact details of pathogen transmission (Greenwood and Gordillo, 2009) and the inherent random nature of animal interactions, stochastic modeling is natural and often required. Spatial models that include proximity to infected farms with local clustering of disease spread gained popularity during the foot-and-mouth disease epidemic in 2001 (Keeling, 2005; Harvey et al., 2007; Stevenson et al., 2013). Another important route for disease spread is animal trade, creating a temporal network of contacts between farms (Masuda and Holme, 2013). It has been shown that the topology and connectivity of the network have great impact on the disease spread and on the effect of control measures (Büttner et al., 2016; Shirley and Rushton, 2005).

Stochastic models on discrete state-spaces are typically simulated using discrete event simulation (DES), a general approach to evolve dynamical systems consisting of discrete events including, in particular, continuous-time Markov chains (CTMCs) (Cassandras and Lafortune, 2008). As most realistic epidemiological models are formulated on a large state-space and/or need to be studied over comparably long periods of time, parallelization is desirable. The highest degree of parallelism is typically achieved by a decomposition of the spatial information, often represented as a graph or network, into a set of subdomains (Fujimoto, 1990). It is then up to the strategy for event handling at domain boundaries how well the concurrent execution scales and which overall degree of parallelism is extractable. As it may hinder scalability, a constraint that plays a crucial role in the design of parallel DES is to maintain the sequential ordering of events, that is, to preserve the underlying causality of the model.

In general, there are two types of boundary events that can occur during a simulation, which ultimately decide what will be the optimal parallelization strategy: those which are deterministic and essentially of fully predictable character and those which are stochastic and not predictable at an earlier simulation time (Fujimoto, 1999). To parallelize events that belong to the latter group, sophisticated approaches such as optimistic parallel DES algorithms have been proposed (Jefferson, 1985). These approaches may use speculative execution to enable scalability but must implement rollback mechanisms in case the event causality is violated (Carothers et al., 1999). Alternatively, in simulations where the domain crossing events are deterministic and thus predictable, conservative simulation may be used as it is possible to avoid causal violations altogether (Fujimoto, 1990; Heidelberger and Nicol, 1993). In particular, a parallel scheduler (Drozdowski, 2009) can be used to create an execution order which guarantees causality, as has been previously shown by Nicol and Liu (2002) and Xiao et al. (1999), notably with the focus on simulation of telecommunication networks.

In this article, we present an efficient and flexible framework for data-driven modeling of disease spread simulations. The model integrates disease dynamics as CTMCs and real livestock data as deterministic events. This allows us to create a temporal network of disease transmission, which has been shown to be a key aspect in modeling and simulation of spatial disease spread (Büttner et al., 2016; Shirley and Rushton, 2005). Previously, agent-based simulations based on synthetic data have been studied by others (Barrett et al., 2008; Yeom et al., 2014).

The way the model is defined allows us to predict future boundary events at any simulation time, and hence we are able to create parallel execution traces which respect causality. In particular, we find that dependency-aware task computing (Leijen et al., 2009; Subhlok et al., 1993) can be used to implement this approach with high efficiency, as all the necessary information to maintain spatial and temporal causality of events can be specified via dynamic creation of tasks and dependencies.

This is in contrast to previous approaches (Nicol and Liu, 2002; Xiao et al., 1999), as the scheduler is not an implicit part of the parallel simulation algorithm but can be chosen by the user from a wide selection of openly available libraries (e.g. OpenMP 4.0 and OmpSs, Duran et al., 2011; or StarPU, Augonnet et al., 2011). We show how the selected library is integrated into our simulation framework, by assigning parts of the sequential algorithm to independent tasks that are scheduled using a certain set of rules. We evaluate this approach using the task-parallel run-time library SuperGlue (Tillenius, 2015), which has been demonstrated to be an efficient scheduler of fine-grained tasks. Using our simulator on models with realistic workloads, we demonstrate scalability on a multi-socket shared-memory system and investigate when this approach is preferable in comparison to traditional parallelization techniques. As the achievable scalability clearly depends on the properties of the individual model, we in particular choose to investigate the influence of the model’s connectivity pattern.

The article is organized as follows. In Section 2, we introduce the mathematical foundation for our framework. In Section 3, we discuss the sequential simulation algorithm and the strategy for parallelization. In Section 4, we present numerical experiments carried out on benchmarks consisting of a recently proposed epidemiological model incorporating large amounts of registered data. We also include an example of an inverse problem for an epidemic model on national scales. Finally, in Section 5, we offer a concluding discussion around the central themes of the article.

2 Epidemiological modeling

We consider in this section a highly general approach to epidemiological modeling. Proceeding stepwise we start with a description of single-node stochastic SIR-type models in the form of CTMCs, using a compact notation that also encompasses externally defined events. We next couple an ensemble of such single-node models into a network with prescribed transitions in between the nodes to arrive at a global description. Finally, since most realistic models on multiple scales will typically incorporate also quantities for which a continuous description is more natural, we consider a mixed approach in which CTMCs are coupled to ordinary differential equations.

2.1 Discrete states

We shall use a compact notation for jump stochastic differential equations (jump SDEs) as follows. We assume a probability space $(Ω, F, P)$ where the filtration $F_{t \geq 0}$ contains Poisson processes of any finite dimensionality. The time-dependent state vector $X_{t} = X (t; ω) \in Z_{+}^{N_{c}}$ , with $ω \in Ω$ , counts at time t the number of individuals of each of $N_{c}$ different categories or compartments. Since the random process is of discrete character, the map $t \to X (t)$ is right continuous only; by $X (t -)$ , we therefore denote the value of the state before any events scheduled at time t.

Given a rate function $r : Z_{+}^{N_{c}} \to R_{+}$ and a stoichiometric coefficient $s \in Z^{N_{c}}$ , we write a CTMC in the form:

d X_{t} = s μ (d t),

(1)

with scalar counting measure $μ (dt) = μ (r (X (t -)); dt)$ . This notation expresses a dynamics consisting of events with exponentially distributed waiting times of intensities $r (X (t -))$ ; specifically $E [μ (dt)] = E [r (X (t -)) dt]$ . An event at time t implies that the state is to be changed according to the prescription $X (t) = X (t -) + s$ . In equation (1), note that if some stoichiometric coefficient $s_{i} < 0$ , then we must have that $r (x) = 0$ for $x_{i}$ small enough, or otherwise the chain will reach negative states with positive probability.

The generalization of equation (1) to non-scalar counting measures is straightforward. Assuming $N_{t}$ different transitions specified by a vector intensity $R : Z_{+}^{N_{c}} \to R_{+}^{N_{t}}$ and a stoichiometric matrix $S \in Z^{N_{c} \times N_{t}}$ , we simply write:

d X_{t} = S μ (dt),

(2)

with $μ (dt) = [μ_{1} (dt), \dots, μ_{N_{t}} (dt)]^{T}$ and, for each k, $μ_{k} (dt) = μ (R_{k} (X (t -)); dt)$ .

As a concrete example, consider the classical SIR model (Kermack and McKendrick, 1927):

\left. \begin{matrix} S + I & \overset{β}{\to} 2 I \\ I & \overset{γ}{\to} R \end{matrix} \right\} .

(3)

With state vector $X = [S, I, R]$ this can be understood as:

S = [\begin{matrix} - 1 & 0 \\ 1 & - 1 \\ 0 & 1 \end{matrix}],

(4)

R (x) = [β x_{1} x_{2}, γ x_{2}]^{T} .

(5)

With one small additional convention, the above notation also encompasses events that have been defined externally. Suppose, for example, in the SIR model, that susceptible individuals are to be added one by one at known deterministic times $(t_{i})$ . To accomplish this, we replace equation (4) with:

S = [\begin{matrix} - 1 & 0 & 1 \\ 1 & - 1 & 0 \\ 0 & 1 & 0 \end{matrix}],

(6)

and additionally define in terms of the Dirac measure:

μ_{3} (dt) = \sum_{i} δ (t_{i}; dt) .

(7)

Equation (2) now evolves the full dynamics of the coupled stochastic–deterministic model. Note that when removing individuals using this scheme, some care is required to be able to guarantee a non-negative chain.
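As an illustration of the notation, the following minimal Python sketch encodes the extended stoichiometric matrix of equation (6) and the two stochastic intensities of the SIR model, with the recovery intensity taken proportional to the number of infected individuals as in the reaction scheme (3). The helper names `rates` and `apply_transition` and the numerical values are assumptions made for this example only; the non-negativity check reflects the care required when individuals are removed.

```python
# Columns: infection, recovery, external arrival of one susceptible (equation (6)).
S = [
    [-1, 0, 1],   # susceptible
    [1, -1, 0],   # infected
    [0, 1, 0],    # recovered
]


def rates(x, beta, gamma):
    """Intensities of the two stochastic transitions for the state x = [S, I, R]."""
    return [beta * x[0] * x[1], gamma * x[1]]


def apply_transition(x, k):
    """Return X(t) = X(t-) + S[:, k]; refuse to produce a negative state."""
    new_x = [xi + row[k] for xi, row in zip(x, S)]
    if min(new_x) < 0:
        raise ValueError("transition would produce a negative state")
    return new_x


x = [99, 1, 0]
x = apply_transition(x, 0)  # one infection event
x = apply_transition(x, 2)  # one externally defined arrival of a susceptible
```

In this sketch the third column is never selected by a random draw; it is applied only at the prescribed times $(t_{i})$ .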

2.2 Network model

Although the previous discussion is of completely general character, it makes sense to handle the collective dynamics of a possibly very large collection of nodes in a slightly more streamlined fashion. Assuming $N_{n}$ nodes in total, we consider the state matrix $X \in Z_{+}^{N_{c} \times N_{n}}$ and evolve the local dynamics by a version of equation (2):

d X_{t}^{(i)} = S μ^{(i)} (dt) .

(8)

Given an undirected graph $G$ each node i is modeled to affect the state of the nodes in the connected components $C (i)$ of i, and in turn, to be affected by all nodes j such that $i \in C (j)$ . The interconnecting dynamics can then be written as:

d X_{t}^{(i)} = - \sum_{j \in C (i)} C ν^{(i, j)} (dt) + \sum_{j; i \in C (j)} C ν^{(j, i)} (dt) .

(9)

Note that in equation (9), global consistency is enforced as follows. The kth “outgoing” event is a change of state according to $X^{(i)} (t) = X^{(i)} (t -) - C_{k}$ and, for some $j \in C (i)$ , $X^{(j)} (t) = X^{(j)} (t -) + C_{k}$ . By inspection, the intensity for this transition is $E [ν_{k}^{(i, j)} (dt)] = E [N_{k}^{(i, j)} (X^{(i)} (t -)) dt]$ , say, where the dependency is only on the state of the “sending” node i.

Using superposition of equations (8) and (9) the overall dynamics becomes:

d X_{t}^{(i)} = S μ^{(i)} (dt) - \sum_{j \in C (i)} C ν^{(i, j)} (dt) + \sum_{j; i \in C (j)} C ν^{(j, i)} (dt) .

(10)

As before, we conveniently allow externally defined deterministic events to be included in this description using the equivalent construction in terms of Dirac measures.

2.3 Continuous states

In the previous description, we assumed essentially that individuals were counted, such that a discrete stochastic model was needed to accurately capture the dynamics of a possibly small and noisy population. In a multiscale model, however, it makes sense to allow also continuous state variables, representing, for example, environmental properties more naturally described in a macroscopic language.

Assuming an additional continuous state matrix $Y \in R^{N_{d} \times N_{n}}$ to be available, we find that a general model corresponding to equation (10) is:

\begin{matrix} \frac{d Y^{(i)} (t)}{dt} = f (X^{(i)} (t -), Y^{(i)} (t)) \\ - \sum_{j \in C (i)} g (X^{(i)} (t -), Y^{(i)} (t)) + \sum_{j; i \in C (j)} g (X^{(j)} (t -), Y^{(j)} (t)) . \end{matrix}

(11)

Importantly, with this addition, equation (10) can now depend on the continuous state variable:

E [μ_{k}^{(i)} (dt)] = E [R_{k} (X^{(i)} (t -), Y^{(i)} (t)) dt],

(12)

E [ν_{k}^{(i, j)} (dt)] = E [N_{k}^{(i, j)} (X^{(i)} (t -), Y^{(i)} (t)) dt],

(13)

where of course k is in the range where the dynamics is stochastic rather than defined externally from a database.

Equations (10), (11), (12), and (13) form the basis for our epidemiological computational framework, next to be described.

3 Implementation

In the following section, we discuss the implementation details of our computational framework. We begin with indicating how numerical methods can be consistently designed to approximate the mathematical model arrived at previously. A description of the sequential solution algorithm and a presentation of the chosen parallelization strategy based on domain decomposition then follow. We propose to process events that cross domain boundaries as tasks and thus conclude with the introduction of dependency-aware task computing and an associated scheduling scheme.

3.1 Numerical methods

In order to be able to effectively incorporate finitely resolved temporal data as well as to obtain a parallelizable framework, we discretize time as $0 = t_{0} < t_{1} < t_{2} < \dots$ . We thus write the epidemiological model in equations (10) and (11) in integral form, using a global notation which incorporates the whole network:

X_{n + 1} = X_{n} + \int_{t_{n}}^{t_{n + 1}} G Λ (ds),

(14)

Y_{n + 1} = Y_{n} + \int_{t_{n}}^{t_{n + 1}} F (X (s), Y (s)) ds,

(15)

with the understanding that $(X, Y)_{n} = (X, Y) (t_{n})$ .

Typical numerical approaches to equations (14) and (15) are constructed via operator splitting and finite differences (Engblom, 2015). As a representative example we take:

X_{n + 1} = X_{n} + \int_{t_{n}}^{t_{n + 1}} G Λ (X (s -), Y_{n}; ds),

(16)

Y_{n + 1} = Y_{n} + \int_{t_{n}}^{t_{n + 1}} F ([X_{n} + X_{n + 1}] / 2, Y (s)) ds .

(17)

In equation (16), we freeze the variable Y at a previous time step and integrate the stochastic dynamics only. Next, in equation (17), we insert an average effective value of $X$ and integrate the deterministic part using any suitable deterministic numerical method.

To describe a more concrete numerical method, some assumptions are in order. Firstly, in equation (10), we assume that events connecting two nodes have been externally defined. In particular, this assumption is satisfied for the important case of domesticated herds of animals who move between nodes due to human interventions only. Secondly, in equation (11), we put $g = 0$ and thus remove all direct influence between continuous variables in connected nodes. This is reasonable for macroscopic variables that are not easily transported, like bacteria in soil, but could of course be violated for other media like groundwater or air.

For this scenario, we can write down a concrete numerical method per node i as follows:

{\tilde{X}}_{n + 1}^{(i)} = X_{n}^{(i)} + \int_{t_{n}}^{t_{n + 1}} S μ_{s}^{(i)} ({\tilde{X}}^{(i)} (s -), Y_{n}^{(i)}; ds),

(18)

\begin{matrix} X_{n + 1}^{(i)} = {\tilde{X}}_{n + 1}^{(i)} + \int_{t_{n}}^{t_{n + 1}} S μ_{d}^{(i)} (X^{(i)} (s -), Y_{n}^{(i)}; ds) \\ - \int_{t_{n}}^{t_{n + 1}} \sum_{j \in C (i)} C ν_{d}^{(i, j)} (X^{(i)} (s -), Y_{n}^{(i)}; ds) \\ + \int_{t_{n}}^{t_{n + 1}} \sum_{j; i \in C (j)} C ν_{d}^{(j, i)} (X^{(i)} (s -), Y_{n}^{(i)}; ds), \end{matrix}

(19)

Y_{n + 1}^{(i)} = Y_{n}^{(i)} + f ({\tilde{X}}_{n + 1}^{(i)}, Y_{n}^{(i)}) Δ t_{n} .

(20)

In equation (18), the stochastic part (subscript s) of the measure is evolved in time to produce the temporary variable $\tilde{X}$ . Next, equation (19) incorporates all externally defined deterministic events (subscript d), both locally on the node, and according to the connectivity of the network. Finally, equation (20) is just the usual Euler forward method in time with time step $Δ t_{n} = t_{n + 1} - t_{n}$ evolving the continuous state Y. The particular splitting method (equations (18) to (20)) forms the basis for many of the results reported in Section 4.
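The forward Euler update of the continuous state in equation (20) can be sketched for a single node as follows. The right-hand side `shedding_and_decay`, in which infected individuals shed a pathogen that decays in the environment, and its parameters `alpha` and `beta` are hypothetical and serve only to make the example concrete.

```python
def euler_forward(y, x_tilde, f, dt):
    """Equation (20): Y_{n+1} = Y_n + f(X~_{n+1}, Y_n) * dt for one node."""
    return y + f(x_tilde, y) * dt


def shedding_and_decay(x, y, alpha=1.0, beta=0.5):
    """Hypothetical f: infected x[1] shed at rate alpha, the load y decays at rate beta."""
    return alpha * x[1] - beta * y


# Discrete state after the stochastic step, and the continuous state at t_n.
x_tilde = [98, 2, 0]
y_next = euler_forward(2.0, x_tilde, shedding_and_decay, dt=1.0)
```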

3.2 External events

Similar to the epidemiological events (equation (4)), the external events modify the discrete state according to a transition vector (equation (6)), but at a predefined time t.

We divide external events into two types: events of type $E_{1}$ operate on the state of a single node, while events of type $E_{2}$ operate on the states of two nodes. It is meaningful to distinguish between these types of events, as they are processed differently by the parallel algorithm discussed later. They are defined by a set of attributes:

E_{1} = \{R, t, n, i\},

(21)

E_{2} = \{R, t, n, i, j\},

(22)

where t is the time of the event, $R$ the transition vector, n the number of individuals affected, and i and j the indices of the affected nodes. This is a minimal set of attributes which can be further extended for specific models. As an example, within the context of the SIR model (equation (3)), we can define a birth event $\{R, t, n, i\}$ of type $E_{1}$ with the transition vector $R = [1, 0, 0]^{T}$ .

In the actual implementation, the transition vector is a column of the stoichiometric matrix (equation (6)) that is indexed by the event. When the event is processed at time t, it changes the state of node i according to:

X_{t + 1}^{(i)} = X_{t}^{(i)} + R n .

(23)

The overall spatial domain of the model can be understood as a graph $G = (V, E)$ . The edges E result from events of type $E_{2}$ acting on source and destination nodes $X^{(i)}$ and $X^{(j)}$ .
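The two event types and the update in equation (23) can be sketched as a small Python record. The class name `ExternalEvent` and the function `apply_event` are hypothetical. For events of type $E_{2}$ , the sketch assumes that the transition vector is subtracted at the source node i and added at the destination node j, in line with the consistency condition stated for equation (9).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExternalEvent:
    R: List[int]             # transition vector
    t: float                 # event time
    n: int                   # number of individuals affected
    i: int                   # affected node (source node for E2 events)
    j: Optional[int] = None  # destination node, set for E2 events only


def apply_event(X, ev):
    """Apply equation (23) in place; X is a list of per-node state vectors."""
    if ev.j is None:
        # E1 event: acts on a single node.
        X[ev.i] = [x + r * ev.n for x, r in zip(X[ev.i], ev.R)]
    else:
        # E2 event (assumption): remove at the source, add at the destination.
        X[ev.i] = [x - r * ev.n for x, r in zip(X[ev.i], ev.R)]
        X[ev.j] = [x + r * ev.n for x, r in zip(X[ev.j], ev.R)]


X = [[10, 0, 0], [5, 0, 0]]
apply_event(X, ExternalEvent(R=[1, 0, 0], t=1.0, n=2, i=0))       # birth of two susceptibles
apply_event(X, ExternalEvent(R=[1, 0, 0], t=2.0, n=3, i=0, j=1))  # move three susceptibles
```

Every $E_{2}$ event applied in this way contributes an edge between nodes i and j to the graph $G$ .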

3.3 Sequential simulation algorithm

The sequential simulation algorithm is divided into three parts: the processing of stochastic events (equation (18)), hereafter referred to as the stochastic step, the processing of external events (equation (19)), or deterministic step, and the update of the continuous state variable (equation (20)). These steps are processed repeatedly in the abovementioned order until the simulation reaches the end.

The stochastic step (Algorithm 1) is an adaptation of Gillespie’s Direct Method (Gillespie, 1977). The algorithm generates a trajectory from a CTMC. At first, the rates $ω_{n}^{(i)}$ for all stochastic events $n = 1, \dots, N_{t}$ are evaluated in all nodes $X^{(i)}, i = 1, \dots, N_{n}$ . Then, in each node, we sum up the transition rates into $λ^{(i)} = \sum_{n} ω_{n}^{(i)}$ . Next, the algorithm uses inverse transform sampling to obtain an exponentially distributed random variable representing the next stochastic event time $τ^{(i)}$ for each node $X^{(i)}$ ,

τ^{(i)} = - \log (rand) / λ^{(i)} .

(24)

Here, $rand$ denotes a uniformly distributed random number in the range $(0, 1)$ . To obtain the index of the stochastic event that occurred within the node $X^{(i)}$ , we generate a new random number $rand$ and find n such that:

\sum_{j = 1}^{n - 1} ω_{j} (X^{(i)}) < λ^{(i)} rand \leq \sum_{j = 1}^{n} ω_{j} (X^{(i)}) .

(25)

Algorithm 1: Sequential simulation loop

1: Initialize: Compute all stochastic rates $ω_{i}$ in all nodes $X^{(i)}, i = 1, \dots, N_{n}$

2: while $t < T_{End}$ do

3: for all nodes $i = 1$ to $N_{n}$ do

4: while $t < (t_{n} + Δ t)$ do

5: Compute the sum $λ$ of all transition intensity functions.

6: Sample the next stochastic event time by $τ = - \log (rand) / λ$ using a uniformly distributed random variable $rand$

7: Determine which event happened. Sample the next event (by inversion); find n such that $\sum_{j = 1}^{n - 1} ω_{j} (X^{(i)}) < λ rand \leq \sum_{j = 1}^{n} ω_{j} (X^{(i)})$

8: Update the state $X^{(i)}$ using the stoichiometric matrix $S$

9: Update $ω_{n}$ using the dependency graph G to recalculate only affected stochastic rates.

10: end while

11: end for

12: $t_{n + 1} = t_{n} + Δ t_{n}$

13: Incorporate externally defined events in lists $E_{1, 2}$

14: Loop over all nodes $X^{(i)}$ and update the continuous state variable $Y^{(i)}$

15: end while

When n is found, we compute the state update according to the transition matrix (equation (4)), setting $X_{t + τ}^{(i)} = X_{t}^{(i)} + S_{n}$ , where $S_{n}$ denotes the nth column of $S$ , and advance the simulation time, $t = t + τ^{(i)}$ . Finally, to obtain the next event time, the rate $ω_{n}^{(i)}$ of the event that just occurred and the rates of all events that depend on it need to be recomputed before a new waiting time is sampled as in equation (24). For fast execution, these dependencies are stored in a dependency graph that is traversed at this stage. The algorithm repeats until a defined stopping time is reached, where the external events will next be processed.
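The stochastic step for a single node can be sketched in Python as follows. This is a simplified illustration and not the C implementation discussed in this article. It recomputes all rates after every event instead of traversing a dependency graph, and it discards a tentative event time that falls outside the current time window, which is admissible for exponentially distributed waiting times because they are memoryless. The function name `direct_method_window` and the rate parameters are assumptions.

```python
import math
import random


def direct_method_window(x, S, rate_fn, t, t_end, rng):
    """Advance one node from t to t_end with Gillespie's Direct Method."""
    x = list(x)
    while True:
        omega = rate_fn(x)
        lam = sum(omega)
        if lam <= 0.0:
            return x, t_end  # no stochastic event can occur
        # Equation (24); 1 - u lies in (0, 1], which avoids log(0).
        tau = -math.log(1.0 - rng.random()) / lam
        if t + tau >= t_end:
            return x, t_end  # next event falls outside the window
        t += tau
        # Equation (25): find the event index n by inversion.
        target = lam * rng.random()
        acc = 0.0
        n = max(k for k, w in enumerate(omega) if w > 0.0)  # round-off fallback
        for k, w in enumerate(omega):
            acc += w
            if target < acc:
                n = k
                break
        # State update with the nth column of the stoichiometric matrix.
        x = [xi + row[n] for xi, row in zip(x, S)]


def sir_rates(x, beta=0.002, gamma=0.1):
    """Infection and recovery intensities for the state x = [S, I, R]."""
    return [beta * x[0] * x[1], gamma * x[1]]


S_sir = [[-1, 0], [1, -1], [0, 1]]
x_end, t_reached = direct_method_window(
    [99, 1, 0], S_sir, sir_rates, 0.0, 1.0, random.Random(1)
)
```

Since both transitions of the SIR model conserve the number of individuals, the entries of `x_end` sum to the initial population size for any seed.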

The deterministic step works as a read and incorporate algorithm. It moves through the list of external events and processes them at the defined event time. In particular, if the event specifies a single compartment where the transition occurs, it can be directly applied to $X^{(i)}$ .

Finally, the continuous state variable is updated. As discussed in Section 3.1, in this step, different numerical methods can be applied. Note that the thus updated continuous state generally affects the rate of stochastic events $λ^{(i)}$ . Thus, before the simulation proceeds with the next iteration of the stochastic step, the event times need to be rescaled (Gibson and Bruck, 2000) using:

τ_{new}^{(i)} = t + (τ_{old}^{(i)} - t) \frac{λ_{old}^{(i)}}{λ_{new}^{(i)}} .

(26)
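A direct transcription of equation (26) is given below as a minimal sketch. The function name `rescale_event_time` is hypothetical, and returning infinity for a vanishing new rate is an assumption made for this example. The case $λ_{old}^{(i)} = 0$ , in which no finite old event time exists and a fresh waiting time has to be sampled, is not handled.

```python
def rescale_event_time(t, tau_old, lam_old, lam_new):
    """Equation (26): rescale the absolute next-event time after the total rate changed."""
    if lam_new <= 0.0:
        # Assumption: with a vanishing total rate no event is scheduled.
        return float("inf")
    return t + (tau_old - t) * lam_old / lam_new


# Doubling the total rate halves the remaining waiting time:
# 10 + (14 - 10) * 2 / 4 = 12.
tau_new = rescale_event_time(10.0, 14.0, 2.0, 4.0)
```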

The implementation of the algorithm is written in C. The overall design is inspired by, and partly adapted from, the Unstructured Mesh Reaction-Diffusion Master Equation (URDME) framework (Drawert et al., 2012; Engblom et al., 2009).

3.4 Parallel simulation algorithm

The parallelization starts with a decomposition of the spatial domain of the model understood as a graph $G = (V, E)$ . The target of this graph partitioning problem is to divide the set of vertices V of size $N_{n}$ into k approximately equally sized subdomains $V_{1}, V_{2}, \dots, V_{k}$ . The cutting of edges E follows straightforwardly from the consecutive assignment of vertices to subdomains. This partitioning strategy does not guarantee a minimum amount of edge cuts, but as the distribution of edges is predominantly homogeneous in our data, we believe that the partitioning will not benefit from more sophisticated approaches. Nonetheless, if edges are distributed heterogeneously, a minimum bisection algorithm (Andreev and Räcke, 2004) may generate an optimized cut that contributes to a better performance of the parallel solver.

After partitions $V_{1 \dots k}$ are defined, the preprocessing algorithm continues to rearrange the external events into a structure that is more convenient for parallel processing. Firstly, the external events of type $E_{1}$ are assigned to k lists $E_{1}^{k}$ , such that all $E_{1}$ events affecting the nodes $X^{(i)} \in V_{k}$ are stored in the kth list.

Secondly, external events of type $E_{2}$ are divided into two categories. Events of type $E_{2}$ where the source and destination nodes lie within the same subdomain $V_{k}$ are assigned to lists $E_{2}^{k}$ . Events in lists $E_{1}^{k}$ and $E_{2}^{k}$ can be processed privately by the thread assigned to the kth subdomain. Events of type $E_{2}$ where the source node and destination node do not lie within the same subdomain are assigned to a separate list $E_{2}^{c}$ . This list then contains the domain-crossing external events, which the simulator has to handle separately.

The complexity of the data rearrangement is $O (n)$ , where n is the number of external events. Although n can be large, the workload is typically negligible for real-scale models. For example, in the national scale model presented in Section 4.1, the operation takes ∼0.1% of the total simulation time on one core and ∼1% of the simulation time on 32 cores, respectively. Moreover, rearranged lists can be stored and reused for models that are simulated using the same decomposition, as is done in the simulation study in Section 4.3.
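The consecutive partitioning and the single-pass rearrangement of the external events can be sketched as follows. Events are represented here as tuples `(t, i, j)` with `j` set to `None` for events of type $E_{1}$ . This layout, the zero-based node indices, and the function names are assumptions of the example.

```python
def partition_consecutive(num_nodes, k):
    """Assign nodes 0..num_nodes-1 to k subdomains of near-equal size, in index order."""
    return [node * k // num_nodes for node in range(num_nodes)]


def rearrange_events(events, owner, k):
    """Split the events into private lists and the crossing list in one O(n) pass."""
    e1 = [[] for _ in range(k)]          # E_1^k: single-node events per subdomain
    e2_private = [[] for _ in range(k)]  # E_2^k: both nodes in the same subdomain
    e2_cross = []                        # E_2^c: domain-crossing events
    for ev in events:
        _, i, j = ev
        if j is None:
            e1[owner[i]].append(ev)
        elif owner[i] == owner[j]:
            e2_private[owner[i]].append(ev)
        else:
            e2_cross.append(ev)
    return e1, e2_private, e2_cross


owner = partition_consecutive(6, 3)
events = [(1.0, 0, None), (1.0, 2, 3), (2.0, 1, 4)]
e1, e2_private, e2_cross = rearrange_events(events, owner, 3)
```

Because the input order is preserved within each list, the total order of the events in $E_{2}^{c}$ given by the model input is retained.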

Finally, the decomposed problem can be simulated in parallel. For simplicity, let us assume that each subdomain k is bound to one computing thread. Then every thread processes the stochastic step (equation (18)) and the update of the continuous variable (equation (20)) on the private nodes of $V_{k}$ as well as the deterministic step (equation (19)) on the external event lists $E_{1}^{k}$ and $E_{2}^{k}$ (Algorithm 2). Since time has been discretized, these computations are embarrassingly parallel in the sense that no communication between neighboring threads is necessary during the processing of the nth time window $[t_{n}, t_{n} + Δ t_{n}]$ . The potential bottleneck of the simulation lies in the processing of the cross-boundary events in $E_{2}^{c}$ .

Algorithm 2: Parallel simulation loop

1: Initialize: Decompose the nodes V into k subdomains $V_{k}$ . Rearrange the external events of type $E_{1}$ into private lists $E_{1}^{k}$ of each subdomain k, where all n affected nodes $X_{n} \in V_{k}$ . Further divide all $E_{2}$ events into the private list $E_{2}^{k}$ or the list of domain-crossing events $E_{2}^{c}$

2: while $t < T_{End}$ do

3: for all i = 1 to k do

4: % Parallel task $T_{S}$ ;

5: Execute line 14 of Algorithm 1 for all nodes in subdomain $V_{k}$ , $n > 1$

6: Execute lines 3–10 of Algorithm 1 for all nodes in subdomain $V_{k}$ evolving time $t \in [t_{n}, t_{n} + Δ t]$

7: Execute line 13 of Algorithm 1 for all events in lists $E_{1}^{k}$ and $E_{2}^{k}$ at time $t \in [t_{n}, t_{n} + Δ t]$

8: % End of parallel task $T_{S}$

9: end for

10: % Parallel task $T_{M}$ ;

11: Execute line 13 of Algorithm 1 for all events in the list $E_{2}^{c}$ at time $t \in [t_{n}, t_{n} + Δ t]$

12: % End of parallel task $T_{M}$ ;

13: $t_{n + 1} = t_{n} + Δ t_{n}$

14: end while

In our study, we consider different ways of handling the events in $E_{2}^{c}$ . The first possibility is to compute them entirely in serial. This is a valid approach if there are very few events in $E_{2}^{c}$ in relation to the private events, as the scaling of the private computations will then not be affected. On the other hand, if the overall simulation is dominated by the processing of $E_{2}^{c}$ events, it can be regarded as serialized, as little concurrency will be extractable using such an approach. Hence we focus on an intermediate ratio of private and global work, where events in $E_{2}^{c}$ occur at every deterministic time step $Δ t$ but at a lower frequency than the private events. We investigate whether scaling in this regime is achievable when $E_{2}^{c}$ events are scheduled using dependency-aware task computing.

3.5 Task-based computing

An increasing number of scientific computations are parallelized using task-based computing (Berry et al., 2012; Haidar et al., 2014; Meng and Berzins, 2014). In order to apply this pattern, the programmer typically has to divide a larger chunk of work into a group of smaller tasks which can be processed asynchronously. A run-time library (Leijen et al., 2009; Subhlok et al., 1993) is then used to create an execution schedule of the tasks on the available parallel hardware.

If the granularity of tasks is sufficiently fine, the schedule will be denser and the idle time shorter. On the other hand, the scheduler synchronizes a larger number of small tasks which usually implies more overhead. See Gerasoulis and Yang (1993) for a thorough discussion on the impact of granularity.

If the scheduler supports dependency awareness (Perez et al., 2008), the programmer can further define a number of task dependencies. This is a critical feature if data are shared between tasks and therefore a processing order has to be enforced. The scheduler then manages the dependencies in the form of a directed acyclic graph and spawns tasks whenever all dependencies are met.

We believe that the usage of task-based computing is beneficial in our computational framework, as a small granularity of processes is given by the underlying modeling. In our approach, we aim to divide our computations into tasks and define a scheduling policy which guarantees causality of events although they are processed in parallel.

These scheduling rules can be implemented on any dependency-aware task scheduler; the only requirement for some of the scheduling policies is support for dynamic addressing of a subset of dependencies, for example, via an array of pointers. For example, OpenMP 4.0 does not support this (OpenMP Architecture Review Board, 2013). In our computational experiments, we make use of the run-time library SuperGlue (Tillenius, 2015). In SuperGlue, dependencies are assigned to data and expressed via data versioning (Zafari et al., 2012). If a chunk of data is being processed by a task, a version counter representing the data access will be increased. Other tasks that are dependent on the chunk will be spawned whenever the new version becomes available. SuperGlue has been demonstrated to be an efficient shared-memory task scheduler that is capable of operating at a comparably low synchronization overhead. The processing of dependencies and spawning of tasks is dynamic, and SuperGlue additionally supports load balancing by work stealing from over-utilized threads.

3.6 Scheduling and dependencies

We now define the tasks and their dependencies that are used in the task-based implementation of the parallel algorithm. Task $T_{S} (k, n)$ executes private computations on the decomposed data of the kth subdomain (lines 5 to 7 of Algorithm 2). That is, it performs the stochastic step (equation (18)) on all nodes $X_{n} \in V_{k}$ , processes the private external events in lists $E_{1}^{k}$ and $E_{2}^{k}$ (equation (19)), and updates the continuous variables (equation (20)). The counter n indicates the iteration of the time window $t_{n + 1} = t_{n} + Δ t_{n}$ .

Task $T_{M}$ processes state updates due to the domain crossing events stored in list $E_{2}^{c}$ . In order to estimate the possible impact of granularity of $T_{M}$ tasks, we compare two different scheduling policies; if the task is constructed for coarse-grained processing, we compute all $E_{2}^{c}$ events occurring at the nth time window $Δ t$ in one single task. Thus, the task takes only one argument, $T_{M} (n)$ .

If tasks are constructed for fine-grained processing, we schedule each event in $E_{2}^{c}$ as a distinct task. We then denote the task by $T_{M} (k_{1}, k_{2}, i)$ , where $k_{1}$ and $k_{2}$ are the subdomains subject to an $E_{2}$ -event update, in which two nodes $X^{(n)} \in V_{k_{1}}$ and $X^{(m)} \in V_{k_{2}}$ are affected. The counter i now denotes the total order of the $E_{2}$ events in $E_{2}^{c}$ as given by the model input. This implies that if events $1, \dots, n$ exist in the window $[t_{n}, t_{n + 1}]$ , they have to be processed in this order.

Both tasks $T_{S}$ and $T_{M}$ are scheduled repeatedly until the simulation reaches its end time. Precedence dependencies between tasks are expressed using the “≺” operator. For example, $T_{S} (k_{1}, n) ≺ T_{S} (k_{1}, m)$ means that task $T_{S} (k_{1}, n)$ must complete its execution before task $T_{S} (k_{1}, m)$ is spawned. Our task-based implementation contains the following dependencies:

1. $T_{S} (k, n) ≺ T_{S} (k, m)$ if $n < m$, to maintain the causality of private updates of subdomain $V_{k}$.

2. To maintain the causality of domain-crossing events:

(a) $T_{M} (n) ≺ T_{M} (m)$ if $n < m$, at coarse-grained processing.

(b) $T_{M} (k_{a}, k_{b}, n) ≺ T_{M} (k_{c}, k_{d}, m)$ if $n < m$ and $k_{c} \in \{k_{a}, k_{b}\}$ or $k_{d} \in \{k_{a}, k_{b}\}$, at fine-grained processing.

3. To maintain the causality between private subdomain updates and domain-crossing events:

(a) $\{T_{S} (k_{1}, m), T_{S} (k_{2}, m), \dots, T_{S} (k_{n}, m)\} ≺ T_{M} (m)$ for all subdomains $V_{k_{1}}, \dots, V_{k_{n}}$ that will be affected by an $E_{2}$ event processed in task $T_{M} (m)$, at coarse-grained processing.

(b) $\{T_{S} (k_{a}, n), T_{S} (k_{b}, n)\} ≺ T_{M} (k_{c}, k_{d}, i)$ if $i \in [t_{n}, t_{n + 1}]$ and $k_{c} \in \{k_{a}, k_{b}\}$ or $k_{d} \in \{k_{a}, k_{b}\}$, at fine-grained processing.

The presented processing policies lead to a different utilization of the task scheduler. First, the tasks $T_{M}$ will be of different size, which leads to a different synchronization behavior. Second, rules 3(a) and 3(b) imply that a different number of dependencies will be created for each single task $T_{M}$. In the fine-grained case, a task $T_{M}$ is spawned when two dependencies are met. In the coarse-grained case, the number of dependencies per task is set dynamically at run time and can potentially be larger. This can clearly have an impact on the bookkeeping overhead.
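To make the difference in bookkeeping concrete, the following sketch counts the dependencies implied by the rules above for one synchronization window. It is an illustration under stated assumptions, not the simulator's code: the function names and the example event list are hypothetical, and each domain-crossing event is reduced to the pair of subdomains it touches.

```python
def coarse_dependencies(events):
    """One coarse T_M waits for the T_S task of every touched subdomain."""
    return {k for k1, k2 in events for k in (k1, k2)}


def fine_dependencies(events):
    """Each single-event T_M waits for the T_S tasks of its two subdomains."""
    return [{k1, k2} for k1, k2 in events]


def fine_ordering(events):
    """Ordering among fine tasks that share a subdomain.

    Only the most recent earlier event per subdomain is recorded,
    since precedence is transitive.
    """
    last = {}  # subdomain -> index of the last event touching it
    preds = []
    for i, (k1, k2) in enumerate(events):
        preds.append(sorted({last[k] for k in (k1, k2) if k in last}))
        last[k1] = last[k2] = i
    return preds


# Hypothetical window: four crossing events among six subdomains.
events = [(0, 1), (2, 3), (1, 4), (0, 5)]
coarse = coarse_dependencies(events)
fine = fine_dependencies(events)
order = fine_ordering(events)
print(len(coarse), [len(d) for d in fine], order)
```

In this example the single coarse task depends on all six subdomains, whereas every fine task depends on exactly two, and only events that share a subdomain are ordered with respect to each other.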

4 Computational experiments

In this section, we present results of computational experiments with our simulator. The measurements were obtained on Sandy, a Dell PowerEdge R820 computer system equipped with four Intel Xeon E5-4650 processors with eight cores per socket. We restricted the execution to the available physical cores, as timing results on hyper-threads fluctuated strongly. We begin with a real-world simulation using animal movement data on national scales, followed by a synthetic benchmark for scalability at varying connectivity load, and we conclude with a compute-intensive parameter estimation example.

4.1 National scale simulation of VTEC bacteria spread

Verotoxigenic Escherichia coli O157:H7 (VTEC O157) is a zoonotic bacterial pathogen with the potential to cause severe disease in humans, notably children (Karmali et al., 1983a, 1983b; Riley et al., 1983). Cattle infected with VTEC O157 are an important reservoir for the bacteria and they shed the bacteria in the feces without any signs of clinical disease (Hancock et al., 2001). Reducing the prevalence of infected cattle in the population could potentially reduce the number of human cases. However, the epidemiology of VTEC O157 in cattle is complex and targeted interventions to control the bacteria require a thorough understanding of the source and transmission routes (Hancock et al., 2001).

To explore the feasibility of national scale simulations to improve the understanding of the underlying disease spread mechanisms, we have created a model of the VTEC O157 dynamics, using the presented framework. European Union legislation requires member states to keep a register of bovine animals including the location and the date of birth, movements between holdings, and date of death or slaughter (Anonymous, 2000, 2004). These records enable data-driven disease spread simulations that include spatiotemporal dynamics of the cattle population with regard to age structures, births, herd size, slaughter, and trade patterns.

The present computational experiment is based on all cattle reports to the Swedish Board of Agriculture over the period 1 July 2005 to 31 December 2013. From these reports, three types of $E_{1}$ external events (enter, exit, and aging) and a single type of $E_{2}$ event (animal movements) were condensed. In total, $\sim 10^{8}$ external events were processed during the total run time of $T = 3106$ days. We let each integer in $0, 1, \dots, T$ represent a synchronization window for external events, where in each window $3707 \pm 670$ $E_{1}$ events and $235 \pm 104$ $E_{2}$ events were processed. A subset of the spatial network consisting of $N_{n} = 37221$ nodes is visualized in Figure 1.

Figure 1.

Visualization of cattle movements in the VTEC O157 disease spread simulation (Section 4.1). The arcs shown are a random subset of the complete dataset of ~10⁸ recorded events. The source of the data is the national cattle register at the Swedish Board of Agriculture. VTEC O157: verotoxigenic Escherichia coli O157:H7.

Most infected cattle shed the bacteria less than 30 days before returning to the susceptible state, but calves shed for a longer period than adult cattle (Cray and Moon, 1995; Davis et al., 2006). To capture this, we let the intensity of the transitions between the states depend on the jth age category:

\begin{matrix} S_{j} & \overset{η_{j}}{\to} I_{j}, \\ I_{j} & \overset{γ_{j}}{\to} S_{j} . \end{matrix}

(27)

The rate for a susceptible individual on the ith node to become infected per unit of time is given by:

η_{j} = u υ_{j} φ_{i} (t),

(28)

for $i = 1, \dots, N_{n}$ and $j \in \{\text{calves}, \text{young stock}, \text{adults}\}$. In turn, an infected individual in age category j returns to the susceptible state at a rate that is the inverse of the expected time $δ_{j} / u$ spent in the infected state:

γ_{j} = \frac{u}{δ_{j}},

(29)

where $δ = [28, 25, 22]$ and $υ = [8, 7, 1] \times 10^{- 3}$ are age-dependent constants. The factor u can be understood as a time scale and is difficult to estimate accurately; in our experiments, it is in fact varied such that $u = 1$ closely resembles the parameterization of the model found in Widgren et al.’s (2016) study.
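As a small numerical illustration of equations (28) and (29), the sketch below evaluates the transition rates at a single node. The function names are hypothetical, and the total transition intensities are obtained by multiplying the per-capita rates by the compartment counts, which is the usual convention for such models but is an assumption here.

```python
# Age-dependent constants quoted in the text (calves, young stock, adults).
DELTA = (28.0, 25.0, 22.0)
UPSILON = (8e-3, 7e-3, 1e-3)


def infection_rate(u, j, phi):
    """Equation (28): per-capita rate S_j -> I_j at concentration phi."""
    return u * UPSILON[j] * phi


def recovery_rate(u, j):
    """Equation (29): per-capita rate I_j -> S_j."""
    return u / DELTA[j]


def node_propensities(u, S, I, phi):
    """Total intensities of the six transitions at one node.

    S and I hold the per-age counts; scaling by the counts is an assumption.
    """
    infect = [infection_rate(u, j, phi) * S[j] for j in range(3)]
    recover = [recovery_rate(u, j) * I[j] for j in range(3)]
    return infect, recover


# Hypothetical node state, chosen only for illustration.
infect, recover = node_propensities(u=1.0, S=[10, 20, 70], I=[1, 0, 2], phi=0.5)
print(infect, recover)
```

Since every rate is proportional to u, increasing u increases the expected number of stochastic events per time window proportionally.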

Finally, the continuous variable $φ_{i}$ represents the environmental bacterial concentration that exerts an infectious pressure on each individual at the ith node. A suitable model is given by:

\frac{d φ_{i}}{dt} = \frac{α \sum_{j} I_{i, j} (t)}{\sum_{j} \left( S_{i, j} (t) + I_{i, j} (t) \right)} - β (t) φ_{i} (t) .

(30)

Again, $i = 1, \dots, N_{n}$ are the nodes and $S_{i, j}$ and $I_{i, j}$ refer to the number of susceptible and infected individuals in the jth age compartment at the ith node, respectively. The constant $α$ is the average shedding rate of bacteria to the environment per infected individual, while $β$ captures the decay and removal of bacteria. In our experiments, we used the constant value $α = 1$ while $β (t)$ varied according to the season:

β (t) = \begin{cases} \log (2) / 14, & 0 \leq (t \bmod 365) \leq 91, \\ \log (2) / 26, & 91 < (t \bmod 365) \leq 182, \\ \log (2) / 20, & 182 < (t \bmod 365) \leq 273, \\ \log (2) / 12, & 273 < (t \bmod 365) \leq 364. \end{cases}

(31)
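Equations (30) and (31) can be sketched in a few lines. The forward-Euler update and the guard for an empty node are assumptions made for illustration; the time-stepping scheme actually used by the simulator is not restated in this section. Since each value of $β$ has the form $\log (2) / h$, the constant h can be read as a half-life in days.

```python
import math


def beta(t):
    """Seasonal decay rate of equation (31); t is measured in days."""
    day = t % 365
    if day <= 91:
        half_life = 14
    elif day <= 182:
        half_life = 26
    elif day <= 273:
        half_life = 20
    else:
        half_life = 12
    return math.log(2) / half_life


def phi_step(phi, S, I, t, dt=1.0, alpha=1.0):
    """One forward-Euler step of equation (30) for a single node.

    S and I hold the per-age counts at the node. The Euler scheme and the
    zero shedding term for an empty node are assumptions of this sketch.
    """
    total = sum(S) + sum(I)
    shedding = alpha * sum(I) / total if total > 0 else 0.0
    return phi + dt * (shedding - beta(t) * phi)


phi = 0.0
for day in range(3):
    phi = phi_step(phi, S=[9, 0, 0], I=[1, 0, 0], t=day)
print(phi)
```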

We first parallelized the simulation by spreading tasks $T_{S}$ over multiple cores using OpenMP and serially processing the intermediate $E_{2}^{c}$ events, hereafter referred to as the fork–join approach. Next, we simulated the model using the task-based approach, scheduling tasks with coarse-grained and fine-grained policies as described in Section 3.6. We chose the number of subdomains k to be a multiple of the number of threads c. Note that this is also the number of tasks $T_{S}$ scheduled for each time window $Δ t$ . As a higher factor u creates a higher load for the tasks $T_{S}$ , we vary u to inspect boundary regions of the parallel performance.

The scaling of the different approaches is shown in Figure 2. For the case of $u = 1$ , we find that task sizes are too small to be efficient in both task-based approaches, and thus the fork–join approach reaches a higher efficiency.

Figure 2.

Performance measurements of the VTEC O157 model simulation on Sandy at varying scheduling approaches, task sizes, and scale factor u. The number of tasks k is chosen to be proportional to the number of threads c. In the OpenMP parallelization (“fork–join”), cross-boundary events are processed entirely in serial. Error bars represent the standard error of the mean (n = 10). VTEC O157: verotoxigenic Escherichia coli O157:H7.

At $u = 10$ , we observe that coarse-grained processing performs better than the fork–join parallelization, optimally at task sizes $k = 16 c$ and $k = 32 c$ . In the case of fine-grained task processing, we found that the choice of k has a strong impact on the performance scaling. While all task densities scale strongly at a lower thread count, only the $k = 16 c$ density reaches a high efficiency of 0.58 at full thread consumption. Thus the efficiency is more than doubled in comparison to the efficiency of the fork–join parallelization, which was found to be 0.23.

The dependency of the parallel efficiency on the factor u is further detailed in Figure 3. We observe that scheduling overhead and small task sizes prohibit a high efficiency of both task-based approaches if $u < 1$ while the full potential of the approaches is extractable at $u > 1$ and a larger thread count. Note that the thread affinity of tasks was varied throughout the performed experiments in order to investigate the impact of data locality.

Figure 3.

Parallel efficiency of the VTEC O157 model simulation on Sandy at varying factor u. For all task-based approaches the task size is $k = 16 c$. Error bars represent the standard error of the mean (n = 10). VTEC O157: verotoxigenic Escherichia coli O157:H7.

We further present a set of characteristics of the coarse-grained and fine-grained simulations in Tables 1 and 2. As shown in Table 1, the granularity of the fine-grained task $T_{M}$ is $\sim 1 / 30$ of the granularity of the coarse-grained task $T_{M}$; however, the fine-grained task needs to be scheduled about 235 times more often throughout the simulation run.

Table 1.

Average task granularity ± the standard deviation and the total number of tasks created during the simulation at a given partitioning k.

	Task granularities ( $10^{3}$ cycles)			Number of tasks
k	$T_{S}$	$T_{M}$ coarse	$T_{M}$ fine	$T_{S}$	$T_{M}$ coarse	$T_{M}$ fine
256	596 ± 799	314 ± 132	11 ± 5.2	794112	3102	731889
512	301 ± 461	328 ± 145	11 ± 5.2	1588224	3102	731889
1024	152 ± 294	327 ± 145	11 ± 5.2	3176448	3102	731889
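The two ratios quoted for Table 1 can be checked directly from the tabulated values. The snippet below only repeats that arithmetic; the variable names are illustrative.

```python
# Values copied from Table 1 (granularities in 10^3 cycles).
tm_coarse_granularity = {256: 314, 512: 328, 1024: 327}
tm_fine_granularity = 11
tm_coarse_tasks = 3102
tm_fine_tasks = 731889

# How much larger one coarse-grained T_M is than one fine-grained T_M.
granularity_ratio = {
    k: g / tm_fine_granularity for k, g in tm_coarse_granularity.items()
}
# How many more fine-grained T_M tasks are scheduled over the whole run.
task_ratio = tm_fine_tasks / tm_coarse_tasks

print(granularity_ratio, task_ratio)
```

The granularity ratio lies between 28 and 30 for all three partitionings, and the task-count ratio is close to 236, consistent with the average of about 235 $E_{2}$ events per time window.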

Table 2.

Maximum ± full width at half maximum of the (right-skewed) histogram of waiting times for fulfilled task dependencies, and the average amount ± standard deviation of dependencies assigned to tasks at each discrete time interval $[t_{n}, t_{n + 1}]$ , at a given partitioning k on 32 computing cores.

	Waiting for dependencies ( $10^{3}$ cycles)				Number of dependencies
k	$T_{S}$ coarse	$T_{M}$ coarse	$T_{S}$ fine	$T_{M}$ fine	$T_{S}$	$T_{M}$ coarse	$T_{M}$ fine
256	15 ± 10	550 ± 200	12.5 ± 5	12.5 ± 5	1	146 ± 33	2
512	12.5 ± 5	850 ± 200	12.5 ± 5	12.5 ± 5	1	202 ± 58	2
1024	12.5 ± 5	1350 ± 500	11 ± 4	11.5 ± 4	1	248 ± 83	2

On the other hand, the advantage of the fine-grained scheduling is emphasized by the measurements shown in Table 2: the average waiting time to fulfill the dependencies for the fine-grained $T_{M}$ is 44–108× lower than for the coarse-grained $T_{M}$. This is explained by the larger number of dependencies associated with the coarse-grained $T_{M}$, which grows with the partitioning k.

The resulting execution trace is also visualized in Figure 4, where we show that fine-grained $T_{M}$ tasks interleave more densely with tasks $T_{S}$ , thus leading to lower idle times and higher parallel efficiency. The percentage of total work spent on the processing of tasks, the synchronization of worker threads, as well as the time spent waiting for fulfilled dependencies are shown in Figure 5 for the $u = 10$ configuration.

Figure 4.

Scheduling trace of the task-based approach; tasks $T_{M}$ (red color) are processing aggregated or single $E_{2}$ events while tasks $T_{S}$ (other colors) compute private work on partitioned subdomains. As coarse-grained tasks control a higher number of dependencies, blocking may occur. Fine-grained scheduling leads to better interleaving but higher overhead cost.

Figure 5.

Percentage of total work spent on processing of tasks, synchronization (“sync”), and time spent waiting for dependencies (“idle”) for various scheduling policies when measured on 32 cores. Note that the relation of work to overhead agrees well with Figure 2.

For further details on the scheduling performance of the SuperGlue library with regard to task sizes and the number of dependencies, we refer the reader to the benchmarks available in the study by Tillenius (2015).

4.2 Synthetic benchmark

The results in Section 4.1 indicate a delicate performance dependency on the balance between the local events and the effective connectivity of the network. To further investigate this, a synthetic benchmark with a fixed load of local events was created. This model consists of two compartments S and I only, both residing on $N_{n} = 1000$ nodes. The transitions are simply:

\begin{matrix} S & \overset{1}{\to} I, \\ I & \overset{1}{\to} S, \end{matrix}

(32)

where the initial population size of each compartment $I_{i}$ and $S_{i}$ was set to 1000. This model is considered at times $t = [0, Δ t, 2 Δ t, \dots T]$ , with $Δ t = 1$ and $T = 1000$ , thus generating about 2000 local events per synchronization time window $Δ t$ and node i.
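The local workload of this benchmark can be reproduced for a single node with a short stochastic simulation. The sketch uses Gillespie's direct method (Gillespie, 1977) as a stand-in for the simulator's stochastic step, which is an assumption; the function name and the random seed are likewise illustrative.

```python
import random


def simulate_window(S, I, dt, rng):
    """Simulate S -> I and I -> S, each at unit per-capita rate, over one window."""
    t, events = 0.0, 0
    while True:
        total = S + I  # sum of the two propensities (rate 1 per individual)
        if total == 0:
            break
        t += rng.expovariate(total)  # waiting time to the next event
        if t > dt:
            break
        if rng.random() * total < S:  # choose the transition proportionally
            S, I = S - 1, I + 1
        else:
            S, I = S + 1, I - 1
        events += 1
    return S, I, events


rng = random.Random(0)
S, I, events = simulate_window(1000, 1000, dt=1.0, rng=rng)
print(S, I, events)
```

Because the total population of 2000 individuals is conserved and every individual changes state at unit rate, the expected number of events per node and window is 2000, in line with the figure quoted above.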

The nodes were arranged into k subdomains and a total of $ρ k (k - 1) / 2$ distinct $E_{2}$ events were generated at the end of each time window, each connecting two randomly sampled nodes $X^{(i)}$ and $X^{(j)}$ belonging to different subdomains. Hence $ρ = 1$ means that all subdomains have to communicate with all other subdomains at each synchronization point. The number of tasks for the coarse-grained and fine-grained approach was set to the number of threads ( $k = c$ ).
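One possible way to generate such a set of domain-crossing events is sketched below. The assignment of node i to subdomain `i % k`, the rounding of the event count, and the sampling of distinct subdomain pairs are assumptions of this sketch; the text does not specify these details.

```python
import random
from itertools import combinations


def make_events(n_nodes, k, rho, rng):
    """Draw about rho*k*(k-1)/2 events, each linking two different subdomains."""
    # Assumption: node i belongs to subdomain i % k.
    members = [[i for i in range(n_nodes) if i % k == d] for d in range(k)]
    pairs = list(combinations(range(k), 2))  # all unordered subdomain pairs
    n_events = round(rho * len(pairs))
    chosen = rng.sample(pairs, n_events)  # distinct subdomain pairs
    return [(rng.choice(members[a]), rng.choice(members[b])) for a, b in chosen]


rng = random.Random(1)
events = make_events(n_nodes=1000, k=32, rho=0.1, rng=rng)
print(len(events))
```

With $ρ = 1$ every pair of subdomains is selected, so that each subdomain exchanges one event with every other subdomain at each synchronization point.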

The measurements obtained on the Sandy computer system at full thread consumption are presented in Figure 6. The parallel efficiency of all methods lies at $\sim 0.7$ for $ρ \leq 0.1$ and remains there even as $ρ \to 0$, from which we deduce that the problem is memory bound. The coarse-grained task-based implementation and the fork–join approach scale very similarly with increasing connectivity $ρ$. The fine-grained task-based approach attains the highest parallel efficiency at $ρ \leq 0.1$, but the performance drops at a higher global connectivity. This phenomenon arises because each $T_{M}$ task creates dependencies on two subsequent $T_{S}$ tasks at every synchronization window, thus creating a higher overhead and limiting asynchronous task execution.

Figure 6.

Parallel efficiency for the different methods on the synthetic benchmark. Error bars represent the standard error of the mean (n = 10).

4.3 Feasibility of parameter estimation

A typically very compute-intensive load case is the fitting of model parameters, usually by numerical optimization of some kind. The problem can briefly and ideally be described as follows: the set of parameters $k^{*}$ is unknown, while a time series of data $X (t; k^{*})$ has been observed. The parameters $k^{*}$ are estimated by repeatedly simulating a whole family of trajectories with parameters k, where k is modified until the observed data and the simulations match in some suitable sense. The feasibility of fitting an epidemiological model within the framework is of course very important, since the modeling process will at some point involve the calibration of parameters with respect to reference data.

To demonstrate the feasibility of parameter estimation in the current context, we use the epidemiological model introduced in Section 4.1 and first identify the set of parameters $[k_{1}, k_{2}, k_{3}]$ that have a high degree of observability. A suitable such set is:

k_{j} = υ_{j} δ_{j}, \quad j \in \{\text{calves}, \text{young stock}, \text{adults}\} .

(33)

We let a reference solution be given by a single trajectory $X (t; k^{*})$ , with $δ^{*} = [28, 25, 22]$ , $υ = [8.8, 3.2, 1] \times 10^{- 3}$ , and with $α$ and $β = β (t)$ as in Section 4.1.

To obtain a robust procedure, some kind of smoothing statistics should be considered. We chose to aggregate counts of animals in neighboring nodes into larger regions. To be precise, the overall domain was divided into 21 areas (coinciding with the Swedish county codes), after which the goodness of fit $G (k)$ was defined by:

G (k)^{2} = \int_{0}^{T} \sum_{j} \left\| {\bar{x}}_{j} (t; k) - \sum_{l \in C (j)} X^{(l)} (t; k^{*}) \right\|^{2} dt,

(34)

with:

{\bar{x}}_{j} (t; k) = \frac{1}{N} \sum_{i = 1}^{N} {\bar{x}}_{j}^{i} (t; k),

(35)

and where the individual sample trajectories are given by:

{\bar{x}}_{j}^{i} (t; k) = \sum_{l \in C (j)} X^{(l)} (t; k, ω_{i}),

(36)

where N is the number of trajectories and $C (j)$, $j \in \{1, \dots, 21\}$, is the set of nodes $X^{(l)}$ that belong to county j. To quantify the uncertainty in $G (k)$, we compute the variance as:

VG (k)^{2} = \int_{0}^{T} \sum_{j} {\bar{σ}}_{j}^{2} (t; k) dt,

(37)

where:

{\bar{σ}}_{j}^{2} (t; k) = \frac{1}{N - 1} \sum_{i} ∥ {\bar{x}}_{j}^{i} (t; k) - {\bar{x}}_{j} (t; k) ∥^{2},

(38)

and use $\pm 2 VG (k) / \sqrt{N}$ as a measure of the uncertainty.
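A discretized version of equations (34) to (38) can be sketched as follows. The sketch assumes that each county total is a scalar (for example, the number of infected animals), that the time integral is approximated by a rectangle rule with step `dt`, and that at least two trajectories are available; the function name and data layout are hypothetical.

```python
import math


def goodness_of_fit(samples, reference, dt=1.0):
    """Return (G, VG) from county-aggregated trajectories.

    samples[i][j][n]: county total of trajectory i, county j, time index n.
    reference[j][n]:  county total of the reference trajectory.
    """
    N = len(samples)
    n_counties, n_times = len(reference), len(reference[0])
    g2 = v2 = 0.0
    for j in range(n_counties):
        for n in range(n_times):
            values = [samples[i][j][n] for i in range(N)]
            mean = sum(values) / N  # sample mean, equation (35)
            var = sum((x - mean) ** 2 for x in values) / (N - 1)  # equation (38)
            g2 += (mean - reference[j][n]) ** 2 * dt  # integrand of equation (34)
            v2 += var * dt  # integrand of equation (37)
    return math.sqrt(g2), math.sqrt(v2)


# Tiny hypothetical data set: two trajectories, one county, two time points.
samples = [[[1.0, 2.0]], [[3.0, 4.0]]]
reference = [[2.0, 2.0]]
G, VG = goodness_of_fit(samples, reference)
half_width = 2.0 * VG / math.sqrt(len(samples))
print(G, VG, half_width)
```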

Next, the parameter estimation problem is approached by solving the minimization problem:

\hat{k} = \arg\min_{k} G (k)^{2} .

(39)

In practice, we make use of the pattern search routine of Hooke and Jeeves (1961), which conceptually resembles the golden section search (Kiefer, 1953) in its narrowing of the search space. The numerical optimization routine evaluates equation (34) until the residual error reaches a defined threshold. In our tests, we varied the initial guess of the parameters $k_{0}$ but found that the results did not vary substantially. In the results below, we conveniently put $k_{0} = 1.6 k^{*}$.

Since an increasing number of trajectories yields better estimates of the mean and variance, we simulate using different numbers of trajectories. We measure the total solver time on 12 and 32 computing cores, respectively, and we fix the total number of optimizer iterations to 20 in all cases. The results are presented in Table 3, where the relative residual is defined as:
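For readers unfamiliar with pattern search, a minimal sketch of the Hooke and Jeeves idea is given below. The step size, tolerance, and stopping rule are assumptions and do not reproduce the settings used in the experiments, and the quadratic objective is only a deterministic stand-in for the stochastic goal function.

```python
def explore(f, base, fbase, step):
    """Exploratory move: try +/- step along each coordinate, keep improvements."""
    x, fx = list(base), fbase
    for i in range(len(x)):
        for delta in (step, -step):
            trial = list(x)
            trial[i] += delta
            ft = f(trial)
            if ft < fx:
                x, fx = trial, ft
                break
    return x, fx


def hooke_jeeves(f, x0, step=0.5, tol=1e-6, max_iter=1000):
    """Minimize f by pattern search; returns the best point and its value."""
    base, fbase = list(x0), f(x0)
    for _ in range(max_iter):
        x, fx = explore(f, base, fbase, step)
        if fx < fbase:
            # Pattern move: extrapolate along the successful direction,
            # explore around the extrapolated point, repeat while it helps.
            while True:
                pattern = [2 * xi - bi for xi, bi in zip(x, base)]
                base, fbase = x, fx
                x, fx = explore(f, pattern, f(pattern), step)
                if fx >= fbase:
                    break
        else:
            step /= 2.0  # no improvement: refine the step size
            if step < tol:
                break
    return base, fbase


target = [1.0, 2.0, 3.0]


def toy_objective(k):
    """Quadratic stand-in for the goal function (an assumption)."""
    return sum((ki - ti) ** 2 for ki, ti in zip(k, target))


k0 = [1.6 * ti for ti in target]  # initial guess, as in k_0 = 1.6 k*
k_hat, f_hat = hooke_jeeves(toy_objective, k0)
print(k_hat, f_hat)
```

The method needs only function values, which makes it a natural choice when the goal function is itself the result of a simulation.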

R (k) = \frac{| G (k) - G (k^{*}) |}{| G (k^{*}) |} .

(40)

Table 3.

Solver time of the parameter estimation problem on 12 and 32 cores, respectively, and using a different number of simulated trajectories.

Trajectories	Relative residual	Time (c = 12; min)	Time (c = 32; min)
10	0.1738	46.6	30.2
20	0.0900	94.2	61.5
40	0.0363	189.3	123.7

The optimization landscape of the goal function (equation (34)), and hence the definiteness of the setup itself, is visualized in Figure 7. Due to the simple bisection search behavior of the numerical routine, the obtained parameters k are in fact the same for all displayed cases, although the relative residuals differ considerably.

Figure 7.

Goal function $G (k)$ in the form of confidence intervals $G (k) \pm 2 VG (k)$ , visualized for each parameter $k_{j}$ , $j = {1, 2, 3}$ , when the other parameters $k_{i}$ , $i \neq j$ , are held at the target value $k_{i}^{*}$ . Vertical lines indicate the target $k^{*}$ and the obtained estimates $k_{1 \dots 3}$ . The parameter ranges have been normalized for ease of comparison.

Note that the obvious approach of parallelizing by computing the N independent trajectories on separate threads, each with a sequential algorithm, is unfavorable here, for two related reasons. First, each running instance needs to store a rather large state space in working memory. Second, each simulation must also access the complete database of externally scheduled events.

5 Conclusions

Modeling and simulation are important in designing surveillance and control of livestock diseases, which are of major economic importance. However, various pathogens require different models to capture the disease dynamics and transmission routes. Moreover, an increasing amount of epidemiologically relevant data is becoming available. We have addressed these challenges and present a flexible and efficient computational framework for modeling and simulation of disease spread on a national scale. The simulation involves two parts. Firstly, the algorithm evolves stochastic dynamics of the disease process. Secondly, a list processor incorporates database events such as entering, exit, or movement of individuals into the model state. The framework is highly flexible in that most conceivable epidemiological models are either directly expressible, or the framework may be straightforwardly extended to encompass also non-standard models. As a concrete example, it would be relatively easy to include intervention strategies such as vaccination programs in order to simulate the impact on the global dynamics.

We have explored different strategies to parallelize the simulator on multi-socket architectures. Firstly, we decomposed the spatial information and the list of deterministic events. We then observed that the decomposed problem can be simulated at a high parallel efficiency, which is limited only by the processing of cross-boundary events. We then created three parallel implementations of the simulator core; we used OpenMP to only parallelize private computations in a fork–join fashion, while cross-boundary events were processed in serial. Two further implementations use a dependency-aware task scheduler to create execution traces that interleave cross-boundary events and private computations with respect to their dependencies. We find that this strategy allows us to exploit shared-memory parallelism at a higher degree than the fork–join approach if task sizes are sufficiently large. We benchmark this approach using the SuperGlue task library but present a set of scheduling rules defining the parallel simulator on general terms, thus allowing it to be implemented also with other dependency-aware task libraries.

We benchmarked our simulator using a model of the spatiotemporal spread of VTEC O157 bacteria in the Swedish cattle population. The model contains 37,221 nodes and evolves $\sim 10^{8}$ external events from register data. We found that at a low private workload, the fork–join approach performs best, mainly due to the scheduling overhead of the task-based approaches. For higher private workloads, the simulation benefits from task-based computing, doubling the parallel efficiency on 32 cores in comparison to the fork–join approach.

To further inspect the performance dependency on network properties, we constructed a synthetic benchmark where cross-boundary events were generated randomly. Here we found that the performance of the fork–join approach and the coarse-grained task approach scales well with a growing amount of cross-boundary events. Notably, the performance of the fine-grained task processing depends more strongly on the connectivity of boundary crossing events, thus favoring a more fragmented network.

In a final example, we used the simulator to carry out an experimental parameter fitting within the VTEC O157 bacteria spread model. We emphasize the high computational complexity of this task with multiple unknown parameters to fit and the need to use several full simulation runs to evaluate each parameter candidate. A similar load case results when different intervention strategies are to be evaluated. For example, even when several interventions reduce the infectious spread globally, a policy maker could be interested in finding the most cost-efficient strategy. With this work, we provide a powerful, highly general and freely available software, which can contribute to a rapid and more efficient development of realistic large-scale epidemiological models.

Future research will encompass studies of larger inverse problems, including more realistic data input, and more complex dynamics. Yet another point for future study is the scalability of the task-based approach in a distributed environment.

Acknowledgements

We thank Martin Tillenius for providing assistance with the use of SuperGlue.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was financially supported by the Swedish Research Council within the UPMARC Linnaeus center of Excellence (P. Bauer, S. Engblom).

Author biographies

Pavol Bauer is a PhD candidate at the Department of Information Technology, Uppsala University. He is associated with the Linnaeus center of excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center. His research interests include scientific computing, high-performance computing, and modeling and simulation with applications in Biomedicine in general. He received his master degree in Computer Science from Vienna University of Technology in 2011.

Stefan Engblom is an associate professor at the Department of Information Technology, Uppsala University and is associated with the Linnaeus center of excellence UPMARC. His research is centered around scientific computing and numerical analysis with a focus in the biosciences. He received his PhD from the Uppsala University in 2008 and became a Docent there in 2013.

Stefan Widgren is a veterinary epidemiologist at the Swedish National Veterinary Institute and a PhD candidate at the Swedish University of Agricultural Sciences. His work is focused on animal disease surveillance as well as research on risk factors for introduction and spread of VTEC O157:H7 in Swedish cattle herds.

References

Andreev

Räcke

(2004) Balanced graph partitioning. In: Proceedings of the 16th ACM SPAA, SPAA ‘04. New York, NY, USA, 27–30 June 2004, pp. 120–124. New York: ACM.

Anonymous (2000) Regulation (EC) No 1760/2000 of the European Parliament and of the Council of 17 July 2000 establishing a system for the identification and registration of bovine animals and regarding the labelling of beef and beef products. Official Journal of the European Union L 204: 1–10.

Anonymous (2004) Commission Regulation (EC) No 911/2004 of 29 April 2004 implementing Regulation (EC) No 1760/2000 of the European Parliament and of the Council as regards eartags, passports and holding registers. Official Journal of the European Union L 163: 65–70.

Augonnet

Thibault

Namyst

. (2011) StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation Practice and Experience 23(2): 187–198.

Barrett

Bisset

Eubank

. (2008) EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In: Proceedings of Supercomputing 2008, Austin, TX, 15–21 November 2008, pp. 1–12. Piscatway, NJ: IEEE.

Berry

Elwasif

Reynolds-Barredo

. (2012) Event-based parareal: a data-flow based implementation of parareal. Journal of Computational Physics 231(17): 5945–5954.

Brooks-Pollock

de Jong

Keeling

. (2015) Eight challenges in modelling infectious livestock diseases. Epidemics 10: 1–5.

Büttner

Krieter

Traulsen

. (2016) Epidemic spreading in an animal trade network - comparison of distance-based and network-based control measures. Transboundary and Emerging Diseases 63: e122–e134.

Carothers

Perumalla

Fujimoto

(1999) Efficient optimistic parallel simulations using reverse computation. ACM Transactions on Modeling and Computer Simulation 9(3): 224–253.

10.

Cassandras

Lafortune

(2008) Systems and models. In: Introduction to Discrete Event Systems. Boston, MA: Springer, pp. 1–51.

11.

Cray

Moon

(1995) Experimental infection of calves and adult cattle with Escherichia coli O157: H7. Applied and Environmental Microbiology 61(4): 1586–1590.

12.

Davis

Rice

Sheng

. (2006) Comparison of cultures from rectoanal-junction mucosal swabs and feces for detection of Escherichia coli O157 in dairy heifers. Applied and Environmental Microbiology 72(5): 3766–3770.

13.

Drawert

Engblom

Hellander

(2012) URDME: a modular framework for stochastic simulation of reaction-transport processes in complex geometries. BMCSystems Biology 6(1): 76.

14.

Drozdowski

(2009) Parallel tasks. In: Sammes

(ed.) Scheduling for Parallel Processing (Computer Communications and Networks). London: Springer, pp. 87–208.

15.

Duran

Ayguadé

Badia

. (2011) OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters 21(02): 173–193.

16.

Engblom

(2015) Strong convergence for split-step methods in stochastic jump kinetics. SIAM Journal on Numerical Analysis 53(6): 2655–2676.

17.

Engblom

Ferm

Hellander

. (2009) Simulation of stochastic reaction-diffusion processes on unstructured meshes. SIAM Journal on Scientific Computing 31(3): 1774–1797.

18.

Fujimoto

(1990) Parallel discrete event simulation. Communications of the ACM 33(10): 30–53.

19.

Fujimoto

(1999) Parallel and Distribution Simulation Systems. 1st ed.New York: John Wiley & Sons, Inc.

20.

Gerasoulis

Yang

(1993) On the granularity and clustering of directed acyclic task graphs. IEEE Transactions on Parallel and Distributed Systems 4(6): 686–701.

21.

Gibson

Bruck

(2000) Efficient exact stochastic simulation of chemical systems with many species and many channels. Journal of Physical Chemistry A 104(9): 1876–1889.

22.

Gillespie

(1977) Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 81(25): 2340–2361.

23.

Greenwood

Gordillo

(2009) Stochastic epidemic modeling. In: Chowell

Hyman

Bettencourt

LMA

. (eds) Mathematical and Statistical Estimation Approaches in Epidemiology. Amsterdam, Netherlands: Springer, pp. 31–52.

24.

Haidar

Tomov

Dongarra

. (2014) A novel hybrid CPU-GPU generalized eigensolver for electronic structure calculations based on fine-grained memory aware tasks. International Journal of High Performance Computing Applications 28(2): 196–209.

25.

Hancock

Besser

Lejeune

. (2001) The control of VTEC in the animal reservoir. International Journal of Food Microbiology 66(1-2): 71–78.

26.

Harvey

Reeves

Schoenbaum

. (2007) The North American animal disease spread model: a simulation model to assist decision making in evaluating animal disease incursions. Preventive Veterinary Medicine 82(3–4): 176–197.

27.

Hasonova

Pavlik

(2006) Economic impact of paratuberculosis in dairy cattle herds: a review. Veterinarni Medicina-Czech 51(5): 193–211.

28.

Heidelberger

Nicol

(1993) Conservative parallel simulation of continuous time Markov chains using uniformization. IEEETransactions on Parallel and Distributed Systems 4(8): 906–921.

29.

Hooke

Jeeves

(1961) Direct search: solution of numerical and statistical problems. Journal of the ACM 8(2): 212–229.

30.

Jefferson

(1985) Virtual time. ACM Transactions on Programming Languages and Systems 7(3): 404–425.

31.

Karmali

Petric

Lim

. (1983a) Escherichia coli cytotoxin, haemolytic-uraemic syndrome, and haemorrhagic colitis. Lancet 2(8362): 1299–1300.

32.

Karmali

Steele

Petric

. (1983b) Sporadic cases of haemolytic-uraemic syndrome associated with faecal cytotoxin and cytotoxin-producing escherichia coli in stools. Lancet 1(8325): 619–620.

33.

Keeling

(2005) Models of foot-and-mouth disease. Proceedings of the Royal Society B 272(1569): 1195–1202.

34.

Kermack

McKendrick

(1927) A contribution to the mathematical theory of epidemics. Proceedings of the Royal Society A 115: 700–721.

35.

Kiefer

(1953) Sequential minimax search for a maximum. Proceedings of the American Mathematical Society 4(3): 502–506.

36.

Knight-Jones

TJD

Rushton

(2013) The economic impacts of foot and mouth disease – what are they, how big are they and where do they occur?Preventive Veterinary Medicine 112(3-4): 161–173.

37.

Leijen

Schulte

Burckhardt

(2009) The design of a task parallel library. In: Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA. New York, NY, USA, pp. 227–242. New York: ACM.

38.

Masuda

Holme

(2013) Predicting and controlling infectious disease epidemics using temporal networks. F1000Prime Reports 5: 6.

39.

Meng

Berzins

(2014) Scalable large-scale fluid–structure interaction solvers in the Uintah framework via hybrid task-based parallelism algorithms. Concurrency and Computation: Practice and Experience 26(7): 1388–1407.

40.

Nicol

Liu

(2002) Composite synchronization in parallel discrete-event simulation. IEEE Transactions on Parallel and Distributed Systems 13(5): 433–446.

41.

OpenMP Architecture Review Board (2013) OpenMP 4.0 Application Program Interface, p. 117, Lines 14–15.

42.

Pellis

Ball

Bansal

. (2015) Eight challenges for network epidemic models. Epidemics 10: 58–62.

43.

Perez

Badia

Labarta

(2008) A dependency-aware task-based programming environment for multi-core architectures. In: 2008 IEEE International Conference on Cluster Computing, Tsukuba, 29 September–1 October 2008, pp. 142–151. IEEE.

44.

Riley

Remis

Helgerson

. (1983) Hemorrhagic colitis associated with a rare Escherichia coli serotype. New England Journal of Medicine 308(12): 681–685.

45.

Shirley

Rushton

(2005) The impacts of network topology on disease spread. Ecological Complexity 2(3): 287–299.

46.

Stevenson

Sanson

Stern

. (2013) InterSpread Plus: a spatial and stochastic simulation model of disease in animal populations. Preventive Veterinary Medicine 109(1–2): 10–24.

47.

Subhlok

Stichnoth

O’Hallaron

. (1993) Exploiting task and data parallelism on a multicomputer. In: Proceedings of the 4th PPOPP, New York, NY, USA: ACM, pp. 13–22.

48.

Tillenius

(2015) SuperGlue: a shared memory framework using data versioning for dependency-aware task-based parallelization. SIAM Journal on Scientific Computing 37(6): C617–C642.

49.

Widgren

Engblom

Bauer

. (2016) Data-driven network modeling of VTEC O157 transmission in Swedish cattle using complete population movement data. Under revision.

50.

Willeberg

Grubbe

Weber

. (2011) The World Organisation for Animal Health and epidemiological modelling: background and objectives. Revue Scientifique et Technique 30(2): 391–405.

51.

Xiao

Unger

Simmonds

. (1999) Scheduling critical channels in conservative parallel discrete event simulation. In: Proceedings of the 13th Workshop of PADS, Atlanta, GA, 1–4 May, 1999, pp. 20–28. Washington, DC: IEEE.

52.

Yeom

Bhatele

Bisset

. (2014) Overcoming the scalability challenges of epidemic simulations on Blue Waters. In: Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, Phoenix, AZ, pp. 755–764.

53.

Zafari

Tillenius

Larsson

(2012) Programming models based on data versioning for dependency-aware task-based parallelisation. In: Proceedings of the 15th International Conference on Computational Science and Engineering, Nicosia, Cyprus, 5–7 December 2012, pp. 275–280. Los Alamitos: IEEE Computer Society.