Enable: A CPU-GPU framework for dynamic data-driven agent-based population health simulations

Abstract

Simulating large-scale, high-fidelity population health models demands immense computational resources and efficient parallelization strategies. We present the design and performance of Enable (Efficient National-scale Agent-Based Learning Environment), a hybrid CPU-GPU agent-based modeling framework optimized for the Frontier supercomputer. Enable generates contact networks directly from activity schedules, enabling location-based parallel execution without relying on precomputed contact graphs. A GPU device constructs contact networks and uses an efficient load balancing algorithm to assign locations to processors, while CPUs perform the simulation tasks. This design achieves high parallel efficiency by ensuring balanced edge distributions across processors, resulting in uniform execution times. We evaluate both strong and weak scaling using synthetic and real-world datasets generated from UrbanPop, Uber H3, and OSM maps. Scaling performance is studied with city- to national-scale population sizes on up to 1200 GPUs of the Frontier supercomputer. Enable addresses key computational challenges in large-scale high-fidelity agent-based simulations that beset the development of national-scale virtual population health twins.

Keywords

agent-based modeling, epidemic simulation, CPU-GPU heterogeneous computing, load balancing, contact networks, exascale, Frontier supercomputer, population health

1. Introduction and background

Agent-based modeling (ABM) has historically been widely used in infectious disease epidemiology to model the transmission of disease and the efficacy of interventions. Its reach as a decision-making tool for hospital administrators, public health policy makers, industrial health, and human health security is, however, much broader. ABMs are well suited for simulating complex systems characterized by high variability and shaped by multiple interacting determinants operating across different levels, often resulting in emergent phenomena. These simulations can serve as virtual representations, or twins, for population health. Virtual population health twins enable in silico exploration of system behaviors and testing the effects of exposure to potentially harmful substances in the work or everyday environment, public health interventions, and spread of infectious diseases in a controlled setting Colasanti et al. (2022); Li et al. (2016); Lu and Habre (2023); Luke et al. (2017); Scott et al. (2015); Raab et al. (2022); Tracy et al. (2018). The utility of these simulations, however, is dependent upon how well the models can characterize the population and there is a sharp trade-off between computational cost and model realism.

Computational tools that enable in silico experimentation can leverage large volumes of data to provide an integrated view of population health that mirrors the complexity and heterogeneity that exist in the real world. A prototype for virtual, dynamic, data-driven representations of human populations that can be used to predict disease, develop targeted interventions and prevent chronic diseases is under development within a larger project. Upon completion, these virtual population health twins are expected to serve as tools that can be used to optimize human health and longevity. This paper reports the parallel design and performance of a high-fidelity data-driven simulation engine, called Enable (see Figure 1), at the core of this larger digital twin capability.

Figure 1.

The data-driven dynamic Enable framework.

ABM is well suited to investigate population health dynamics because it captures the heterogeneity of individuals and their behaviors within a complex environment. Unlike aggregate models, ABMs simulate individual agents – each with unique attributes such as age, occupation, health status, and daily routines – and model their interactions with other agents and their environment over time. This allows for a granular representation of how diseases spread, how people respond to interventions, and how social, spatial, and behavioral factors influence health outcomes. ABMs are particularly powerful for exploring “what-if” scenarios, such as the impact of targeted intervention strategies for public health planning and epidemic response. However, key computational challenges must be addressed before the benefits of large-scale high-fidelity ABM simulations for real-world decision support can be realized.

1.1. Computational challenges

National-scale ABMs have demonstrated the epidemiological importance of preserving fine-grained population heterogeneity Chen et al. (2024). High-fidelity ABMs often involve simulating millions or even hundreds of millions of agents that interact over time and space, often with fine-grained individual behaviors and environmental contexts. Achieving strong and weak scaling across thousands of computing processes in a distributed memory setting on a heterogeneous CPU-GPU parallel computer is non-trivial due to the irregular, dynamic nature of agent behaviors. Agents do not interact uniformly in space or time – urban centers may see dense, overlapping interactions while rural regions remain sparse, leading to unbalanced workloads.

Scalability is also adversely affected by the diversity of agent behaviors and model heterogeneity. In many simulations, agents engage in different activities with varying computational costs – e.g., disease transmission calculations, policy compliance checks, and mobility updates – which makes uniform parallel execution difficult. Unlike numerical simulations with predictable computation patterns, ABMs are event-driven and highly asynchronous. This unpredictability challenges traditional load-balancing strategies.

Furthermore, as simulations are scaled up to national levels, inter-node communication and synchronization can become scalability bottlenecks, especially when agents need to frequently update shared data structures such as time-varying contact networks. Achieving scalability requires not only dividing computation evenly but also reducing inter-dependencies and optimizing communication between distributed components. Communication overhead increases when agent interactions span partition boundaries, especially in spatially distributed simulations. Efficient domain decomposition, interaction locality, and communication minimization are essential to maintaining high parallel efficiency.

ABMs often lack fine-grained representations of the actual location of individuals over time, failing to faithfully represent population movement and interactions. As models are often developed to predict macro-level outcomes, trade-offs between model specificity and computational complexity are often accepted. However, these concessions limit the ability of ABMs to capture contextual dependencies, because they ignore interactions between agents, their geospatial contexts, and the resulting effects on the disease of interest. Additionally, temporal modeling of contacts that considers an agent’s sequential pattern of exposure and the evolution of networks over time enables greater insight into spatiotemporal trajectories of exposure and disease. These methods have become more prevalent, supported by techniques such as time-sliced contact graphs and dynamic location-based interaction modeling.

Despite these advances, challenges remain in harmonizing heterogeneous data sources, achieving real-time simulation speeds, and ensuring reproducibility. Our work builds on these foundations by integrating high-fidelity activity schedules, real-world spatial data, and scalable contact network generation into a unified, high-performance ABM framework designed for context-aware population-scale health simulations. We improve spatiotemporal realism through the integration of platforms like UrbanPop Tuccillo et al. (2023), OpenStreetMap (OSM) OpenStreetMap contributors (2017), and Uber H3 Technologies (2022), enabling more accurate modeling of movement, collocation, and urban infrastructure. The Enable framework represents an important step toward extreme-scale ABMs by leveraging heterogeneous hardware and distributed simulation of contact networks derived from activity schedules.

In this preliminary study, we address a number of these challenges using a data-driven approach complemented by efficient load-balancing and communication-hiding strategies that maximize the heterogeneous parallel performance of ABM simulations on the Frontier supercomputer. We present initial performance results on up to 1200 GPUs across multiple population scales.

1.2. Contributions

In this paper, we present a hybrid CPU-GPU framework, called Enable (Efficient National-scale Agent-Based Learning Environment), for scalable data-driven agent-based population health simulations that is optimized for CPU-GPU architectures like the Frontier supercomputer. Here, “data-driven” refers to assimilation of demographic, mobility, and geospatial data to construct dynamic, schedule-derived contact networks, and not to AI/ML-based state update rules, as is the common practice today. The paper reports the following distinguishing capabilities:

• Dynamic Contact Networks: Few solutions exist for data assimilation between ABM tools and other programs that can be used to integrate time-dependent factors, such as agent mobility, into ABMs. Typically, ABM simulations are executed using static contact networks that are generated prior to the start of the simulation run. Here, we demonstrate how the integration of dynamic contact networks can be achieved using a population with activity trajectories that vary over time and space. While we use historical survey data for this paper, the assimilation approach presented here opens up the possibility of reconfiguring it for streaming sensor or cellular data to allow for the integration of real-time human mobility data.

• Efficient Parallel Load Balancing: Location-based contact networks often result in load balancing issues as the number of agents in a single location varies by location type. For example, locations such as metro stations or urban cores are hubs that draw a large number of people relative to rural areas. To minimize inter-process communications, computational load is partitioned based on locations per process, not agents per process. We demonstrate the performance improvement from integrating our load balancing approach based on the Uniform Cost Partitioning (UCP) algorithm to balance the edges in each parallel process uniformly.

• Architecture-Aware Parallel Execution: Parallel executions on heterogeneous CPU-GPU architectures are prone to performance bottlenecks due to frequent device-host data exchanges, which further extend the already long run-times of ABMs. Here, we describe a novel CPU-GPU framework that allows efficient use of the CPU-GPU environment of the Frontier supercomputer.

• Memory-Efficient Execution: In traditional ABM frameworks, the contact networks need to be stored explicitly for processing. For a population of 350 M agents across 3.5 M locations, this amounts to 34.5 billion edges per simulation hour, occupying approximately 250 GB. In contrast, we store only the list of people in each location, which takes only 1.3 GB, resulting in substantial memory savings.

• New Load Balancing CUDA Kernel: Since the Uniform Cost Partitioning algorithm (and its variant used in this work) was initially designed for distributed-memory systems, no GPU-based implementation of UCP existed. Here, we report a novel CUDA implementation of the CPU-only load balancing algorithm, which was subsequently adapted to HIP using hipify-perl.
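
The memory figures quoted in the Memory-Efficient Execution item can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below assumes that an explicit edge is stored as two 4-byte integer IDs and that each entry in a people-per-location list is one 4-byte ID. These storage sizes are illustrative assumptions, not values stated in the paper.

```python
# Rough memory estimate. Counts come from the text; byte sizes are assumed.
EDGES_PER_HOUR = 34.5e9   # explicit contact edges per simulation hour (from the text)
POPULATION = 350e6        # agents (from the text)
BYTES_PER_ID = 4          # ASSUMPTION: 32-bit integer IDs
GIB = 2**30


def explicit_edge_list_bytes(n_edges, bytes_per_id=BYTES_PER_ID):
    """Each stored edge holds a source ID and a target ID."""
    return n_edges * 2 * bytes_per_id


def people_per_location_bytes(n_people, bytes_per_id=BYTES_PER_ID):
    """Each agent appears once in the list of the location it occupies."""
    return n_people * bytes_per_id


edge_gib = explicit_edge_list_bytes(EDGES_PER_HOUR) / GIB
list_gib = people_per_location_bytes(POPULATION) / GIB
print(f"explicit edges: {edge_gib:.0f} GiB, location lists: {list_gib:.2f} GiB")
```

Under these assumptions the two estimates land near 250 GB and 1.3 GB respectively, a ratio of roughly 200 to 1. Per-location offsets (3.5 M entries) would add only a few megabytes to the compact representation.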

Evaluations are presented to demonstrate excellent weak and strong scaling performance using both synthetic and real-world datasets that include UrbanPop (activity schedule generation), Uber H3 Hex (location indexing and spatial partitioning), and OpenStreetMap (geographic context and points of interest identification).

1.3. Related work

Early ABMs, such as FluTE Chao et al. (2010) and EpiSims Barrett et al. (2008), demonstrated the potential of simulating disease dynamics at the level of individual agents with detailed movement and contact patterns. These efforts laid the groundwork for more scalable and data-rich approaches capable of realistically simulating complex public health scenarios. In recent years, data-driven ABMs have gained momentum across multiple health domains, incorporating real-world datasets such as census demographics, mobility data, electronic health records, and contact patterns. Frameworks like Gleamviz Van den Broeck et al. (2011b), Emod Bershteyn et al. (2018), and Fred Grefenstette et al. (2013) integrate population data and behavior models to inform simulations. However, these systems often struggle to scale to national-level populations with high spatiotemporal fidelity. Common workarounds include simplifying agent populations and their contact with other agents and their environment at the cost of reducing realism and potentially limiting their utility for real-world decision support and forecasting.

Recent advancements in ABM have focused on addressing computational barriers through high-performance computing and hybrid architectures. For instance, Covasim Kerr et al. (2021) introduced stochastic disease models enriched with policy scenarios, but its scalability is constrained by centralized processing and limited use of GPU acceleration.

Enable uses EpiHiper Chen et al. (2025) as the core simulator. EpiHiper has been deployed in production pipelines for real-time federal pandemic response Bhattacharya et al. (2023) and is ideally suited for HPC environments, particularly for large-scale epidemic modeling with dynamic contact structures. EpiHiper has been operationalized through multi-cluster workflow orchestration at US scale Bhattacharya et al. (2024). Covasim is ideal for small-to-medium-scale simulations but is unsuitable for HPC executions. Fred exhibits limited efficient scaling on national-level models or highly dynamic systems. Table 7 shows a comparison of Enable with other existing tools.

1.3.1. Population scale and fidelity

Enable is explicitly designed for city- to national-scale simulations that target the modeling of the entire U.S. population (hundreds of millions of agents) and millions of geo-located places. Each agent is assigned a full-week schedule with hourly activities drawn from national travel surveys (e.g., the National Household Travel Survey (NHTS)), synthetic microdata (UrbanPop), and geospatial information (e.g., H3-indexed locations, LandScan, and OpenStreetMap (OSM)). Epicast similarly targets the full US population using UrbanPop-derived agents but uses nighttime/daytime network switching rather than hourly activity-schedule-derived contact networks Alexander et al. (2025). Together, these elements give Enable high spatiotemporal fidelity: hourly, place-specific schedules for each individual at scale.

Gleam/Gleamviz emphasize global coverage but at coarser spatial resolution: populations are aggregated into thousands of metapopulation patches with mobility flows between them Balcan et al. (2010); Van den Broeck et al. (2011b). Emod focuses on detailed within-host and between-host biology across geographic nodes for specific diseases, but less on a single unified national synthetic population and activity schedules based on survey data Bershteyn et al. (2018). AnyLogic epidemic ABMs are typically used for smaller-scale, policy-oriented models with $O(10^{3})$ to $O(10^{4})$ agents Emrich et al. (2007). ChiSIM/CityCOVID are closest to Enable from a behavioral standpoint: they simulate millions of agents and over a million locations for a single metropolitan region (Chicago), with activity-based mobility and detailed intervention modeling Macal et al. (2018, 2021). FlameGPU/Flares demonstrate very large agent counts on GPUs Richmond et al. (2022); Peng et al. (2024), but the realism of the population and behaviors is left to the user, and there is no built-in notion of a national population health twin.

1.3.2. Contact networks generation

In Enable, the contact structure is an explicitly modeled temporal contact network: 24 static “snapshots” per simulation day, each corresponding to 1 hour. For each hour, a GPU kernel constructs contact edges directly from people-per-location-per-hour arrays produced by the activity-schedule pipeline. Enable never stores the full set of contact edges; instead, it generates edges on demand on the GPU and efficiently partitions them across ranks. This reduces memory needs from tens of billions of explicit edges to compact location-based lists and makes it straightforward to alter interventions (e.g., closing location types, modifying schedules) by editing activity and location mappings.

Gleam and similar metapopulation models represent contacts implicitly via mixing within patches and mobility between patches Balcan et al. (2010). Emod uses households, schools, workplaces, and sexual-contact networks combined with location-specific contagion pools, which yield dynamic contacts as individuals move but are not derived from national-scale daily activity diaries Bershteyn et al. (2018). AnyLogic models, FlameGPU, and Flares give modelers flexibility to define contact structures, but do not ship with a standard pipeline for schedule-derived temporal contact networks. ChiSIM/CityCOVID are conceptually close but operate at city scale: they build endogenous co-location networks as agents follow activity schedules in an urban environment Macal et al. (2018, 2021).

1.3.3. Parallelism and load balancing

Enable adopts a hybrid CPU–GPU design specifically tuned for leadership-class GPU-powered supercomputers such as Frontier. GPUs are responsible for constructing hourly contact networks and performing load balancing via a HIP/CUDA implementation of the Uniform Cost Partitioning (UCP) algorithm, which partitions locations to balance the number of contact edges per rank. EpiHiper executes disease dynamics on CPUs using these balanced partitions. The framework also employs architecture-aware GPU oversubscription (multiple MPI ranks per GPU) to better overlap communication, contact construction, and CPU-side simulation work across thousands of GPUs.

chiSIM/CityCOVID employ a CPU load-balancing strategy within RepastHPC to distribute agents and locations across MPI processes Macal et al. (2018, 2021), but published work does not extend this design to GPUs or national-scale populations. FlameGPU 2 and Flares are GPU-first frameworks that provide excellent kernel-level performance and accommodate very large agent counts Richmond et al. (2023); Peng et al. (2024), but they do not support edge-aware partitioning of national-scale social contact networks or hybrid CPU–GPU task parallelism.

Overall, Enable distinguishes itself from other epidemiological tools as follows:

• National-scale, schedule-derived ABM on exascale-class GPUs. Many frameworks provide either high-fidelity activity-based ABMs at city scale (chiSIM/CityCOVID) or global metapopulation models (Gleam), or GPU ABM engines without a population-health twin pipeline (FlameGPU, Flares). Enable integrates a national synthetic population, hourly activity schedules, and GPU-accelerated execution on thousands of GPUs in a single, end-to-end framework.

• Memory-efficient dynamic contact generation. By storing only compact people-per-location arrays and generating hourly contact networks on GPUs, Enable avoids the prohibitive memory requirements of explicit national-scale contact graphs while preserving high temporal and spatial fidelity.

• GPU-resident, edge-aware load balancing. The HIP/CUDA implementation of UCP for balancing contact edges across ranks/GPUs is a distinctive contribution: load balancing is performed on the GPU, and the partitioning is directly tied to contact edges rather than only to agent or location counts.

• Hybrid CPU–GPU task harmonization. Enable assigns data-parallel graph-construction and partitioning to GPUs while retaining the disease-specific logic on CPUs via EpiHiper. This design enables simulations of more complex and more realistic epidemiological models while still exploiting the massive parallelism and memory bandwidth of modern CPU–GPU heterogeneous leadership-class supercomputers.

• Tight integration with high-fidelity demographic and mobility data. The framework is built around UrbanPop, NHTS, LandScan, OSM, and H3-based indexing that together construct high-fidelity national-scale populations, locations, and schedules, aligning it with the needs of population health twins rather than generic ABM benchmarks.

Taken together, these features distinguish Enable from existing epidemiological simulators and establish it as a purpose-built framework for national-scale, high-fidelity, energy- and performance-aware population health simulations on leadership-class GPU supercomputers.

2. Data preparation

Since Enable is a data-driven framework, the role of data is crucial for its execution. This section describes the data pre-processing details that precede the execution of Enable simulations.

2.1. Population data

ORNL’s UrbanPop model Tuccillo et al. (2023) was used to generate a synthetic population for Utah (roughly 3.1 million individuals) representing the state’s civilian non-institutionalized population, which excludes both active-duty military and institutional group quarters (prisons, nursing homes). Synthetic populations are produced at the census block group level (600–3000 people) based on statistical matching of the American Community Survey (ACS) Public-Use Microdata Sample (PUMS) – which consists of anonymized person/household level ACS responses – to block groups based on population characteristics reported in the ACS Summary File (SF) Nagle et al. (2014).

2.2. Activity data

An activity schedule is a chronological sequence of activities assigned to an individual agent, specifying start and end times, the type of activity (e.g., home, work, school, shopping), and the location at which the activity occurs. Activity data of a population of n agents refers to the activity schedule of each agent in the population. An example is shown in Table 1.

Table 1.

Sample activity schedule sorted by agent ID.

Agent ID	Start time	End time	Activity	Location
A001	07:00	08:00	Home	Home_1
A001	08:00	08:30	Commute	Bus_101
A001	08:30	11:30	Work	Office_A
A001	11:30	12:00	Meals	Cafe_C
A002	07:00	08:00	Home	Home_2
A002	08:15	08:45	Commute	Bus_101
A002	08:45	12:00	Work	Office_A
A003	07:00	08:00	Home	Home_3
A003	08:30	09:00	Commute	Bus_102
A003	09:00	11:00	School	School_B
A003	11:00	12:00	Home	Home_4

We generated full-week activity schedules for each agent in Utah’s synthetic population based on the 2017 National Household Travel Survey (NHTS) US Department of Transportation, Federal Highway Administration (2017). The NHTS consists of a large series of trip diaries for people and households weighted to be representative of the United States population. The trips encompass a variety of activity types, including home, work, school, errands, meals, services, leisure, exercise, medical, social, religious/community, and volunteering. While a given person’s trip diary in NHTS typically does not span an entire week, trips recorded for various lifestyle cohorts – defined by primary role/occupation during the week, age, and geographic context – can be grouped into subsamples to estimate aggregate travel behavior. These cohort definitions are conflated with the UrbanPop synthetic population to generate activity schedules tailored to different agent types.

We used several criteria to harmonize UrbanPop with NHTS:

• Primary role, including worker (age 16+), student (ages 5-9, 10-15, and 16+), childcare (under age 5), youth not enrolled in school (ages under 5, 5-9, 10-15, and 16+), worker/student (age 16+), unemployed (age 16+), retired (age 55+), homemaker (age 16+), and other/undefined (age 16+). These labels were merged with UrbanPop based on age, employment (employed, hours worked per week), and school enrollment characteristics.

• Occupation: four categories defined by the NHTS job category variable, including sales/service, clerical or administrative, manufacturing/construction/maintenance/farming, professional/managerial/technical. These were merged with UrbanPop based on 2018 Standard Occupational Classification system (SOC) categories describing worker agents.

• Geographic context, including Census Region (“West”, applied to all agents) and urbanicity, based on approximating NHTS urbanicity (Rural, Small Town, Suburban, Second City, Urban) with population density (persons per square mile) categories mapped to each residential block group. We used this approach to infer urbanicity because the source data, developed by Claritas (2018), is proprietary and we do not have the actual mappings of these categories to block groups.

Based on these criteria, we stratified the Utah synthetic population into 110 cohorts. We then defined NHTS subsamples matching each cohort to support activity schedule simulation. We relied on a target subsample size of at least 500 trips to ensure that each subsample would be representative of an entire week. We defined the subsamples first by United States (US) Census Region (West, Midwest, Northeast, South) and primary role, then further pruned them by occupation and urbanicity if the number of trip responses remained above the threshold subsample size.
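
The subsample definition described above can be sketched as a hierarchical filter: start from Census Region and primary role, then narrow by occupation and urbanicity only while the subsample stays at or above the 500-trip target. The `Trip` record, its field names, and the order in which the two pruning attributes are applied are assumptions made for illustration; they do not reflect the actual NHTS schema or the authors' code.

```python
from dataclasses import dataclass

MIN_TRIPS = 500  # target subsample size from the text


@dataclass(frozen=True)
class Trip:
    """Hypothetical, heavily reduced trip record."""
    region: str
    role: str
    occupation: str
    urbanicity: str


def select_subsample(trips, region, role, occupation=None, urbanicity=None,
                     min_trips=MIN_TRIPS):
    """Filter by region and role, then prune by occupation and urbanicity
    only if the narrowed subsample still has at least min_trips trips."""
    sample = [t for t in trips if t.region == region and t.role == role]
    for attr, value in (("occupation", occupation), ("urbanicity", urbanicity)):
        if value is None:
            continue
        narrowed = [t for t in sample if getattr(t, attr) == value]
        if len(narrowed) >= min_trips:
            sample = narrowed  # keep the finer cohort definition
    return sample


# Toy data: 900 worker trips in the West with uneven attribute coverage.
trips = (
    [Trip("West", "worker", "professional", "Urban")] * 600
    + [Trip("West", "worker", "professional", "Rural")] * 100
    + [Trip("West", "worker", "sales", "Urban")] * 200
)
urban_professionals = select_subsample(trips, "West", "worker", "professional", "Urban")
rural_professionals = select_subsample(trips, "West", "worker", "professional", "Rural")
```

In this toy data the rural professional cohort falls back to all professional workers in the region, because pruning by urbanicity would leave only 100 trips.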

We developed an R package, Itineraire, to generate activity schedules tailored to the synthetic population. For a subsample of the NHTS on a given day, Itineraire generates a sequence of trips based on the probability of taking a trip at time m from starting activity a to destination activity b. For the sake of simplicity, the trip chain starts and ends at home each day. A “last call” time, based on the 75th percentile time of trips home, is used to identify when an agent’s final trip home takes place. The trip chain simulator includes additional mechanisms to handle the duration of trips, both total trip time and dwell time, as well as the number of allowable repeats for a given activity. Hourly activity schedules are then derived from the trip chain according to the hourly intervals covered by the portion of each trip spent doing the activity (the dwell time).
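
As a rough illustration of the trip-chain idea, the following Python sketch samples one day of activities from time-dependent transition probabilities, starting at home and forcing a return home at the "last call" hour. It is not Itineraire (which is an R package), and it makes simplifying assumptions: hourly time steps, no trip durations or dwell-time handling, no limits on repeated activities, and entirely made-up probabilities.

```python
import random


def simulate_day(transitions, last_call, rng, hours=24):
    """Return one activity label per hour, starting and ending at home.

    transitions[(hour, activity)] maps destination activities to
    probabilities; a missing key means the agent stays where it is.
    """
    schedule = ["home"]  # the chain starts at home
    for hour in range(1, hours):
        current = schedule[-1]
        if hour >= last_call:
            schedule.append("home")  # final trip home at the last-call time
            continue
        options = transitions.get((hour, current))
        if options:
            dests, probs = zip(*options.items())
            current = rng.choices(dests, weights=probs, k=1)[0]
        schedule.append(current)
    return schedule


# Hypothetical transition table for a single cohort (values are invented).
toy_transitions = {
    (8, "home"): {"work": 0.7, "school": 0.2, "home": 0.1},
    (12, "work"): {"meals": 0.5, "work": 0.5},
    (13, "meals"): {"work": 1.0},
    (17, "work"): {"home": 0.6, "errands": 0.4},
}
day = simulate_day(toy_transitions, last_call=19, rng=random.Random(0))
```

Seeding the random generator keeps a generated schedule reproducible, which matters when pools of schedules are later matched across days.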

For each agent cohort defined for Utah, we generated a pool of 200 full-week activity schedules, linking Monday schedules with successive weekdays based on nearest-neighbor matching on activity sequences, and repeating the procedure for weekends using Saturdays as a baseline. Finally, we randomly assigned activity labels to agents in the synthetic population by cohort label.

Note that our current activity data generation pipeline (Itineraire with NHTS-derived trip chains) builds daily chains that start/end at home ensuring consistent hour-to-hour coverage. This pipeline can be generalized to include non-returning agents and/or agents that cross regional boundaries by extending Itineraire to stitch multi-day activity chains for specific cohorts (truckers, long-haul), and assimilating exogenous multi-day trajectories with their hour-by-hour location lists. Contacts are constructed from hourly co-location at H3-indexed locations; when activity datasets span multiple states (including larger synthetic and national-scale configurations), agents mix across state boundaries wherever location assignment and schedules place them. Agent mobility and interactions are data-driven and encapsulated in the pre-processed time-series activity data provided as an input to Enable.

2.3. Location data

The UrbanPop synthetic population is attributed with anchor nighttime (residential) and daytime (work, school) locations within census block groups. To assign residential point locations for each agent, we downscaled synthetic households to the openly available FEMA USA Structures dataset Yang et al. (2024) by conflating synthetic household dwelling characteristics with properties of residential structures available in each block group Tuccillo (2023).

Daytime activity locations were then assigned to workers and students. The process of generating activity schedules for Utah encompasses all rural counties, core-based metro/micropolitan statistical areas (CBSAs), and combined statistical areas (CSAs) overlapping state boundaries. The worker module, WorkerSim, employs data from the Census Longitudinal Employer-Household Dynamics (LEHD) Origin-Destination Employment Statistics (LODES). WorkerSim first combines worker totals by North American Industry Classification System (NAICS) industry reported in the LEHD Workplace Area Characteristics (WAC) with origin-destination statistics by industry (goods production, transportation, all other industries) from the LODES, then probabilistically assigns workplace destinations to synthetic workers by NAICS industry based on reported origin-destination flows. Synthetic worker flows are reweighted to match job totals from WAC using two-dimensional Iterative Proportional Fitting (IPF) Kolenikov (2014).

The SchoolSim module encompasses students in childcare, primary education, and post-secondary education. SchoolSim combines enrollment totals from the National Center for Education Statistics (NCES) with school locations from Homeland Infrastructure Foundation-Level Data (HIFLD). Schools are stratified by grade level, private/public status, and sex of student (post-secondary schools only). SchoolSim estimates school proximity based on road network distance to schools, using residential street intersections within home block groups as a proxy. We rely on an initial assumption that each student has two school choices, save for post-secondary education, which is unconstrained (e.g., a university student can choose to live across town from the school they attend). As with WorkerSim, student flows are reweighted to match school enrollment totals by school type and grade using IPF. Additionally, a subset of students with at least one adult at home during the day is withheld from SchoolSim to approximate a home-schooled population.
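
The reweighting step used by both modules relies on two-dimensional IPF, which alternately rescales the rows and columns of a seed matrix until its margins match given targets. The sketch below is a minimal pure-Python version with invented numbers. Reading the rows as home block groups and the columns as workplace (or school) destinations is an assumption for illustration, and the code is not the WorkerSim or SchoolSim implementation.

```python
def ipf_2d(seed, row_targets, col_targets, iters=100, tol=1e-9):
    """Scale a non-negative seed matrix so that its row and column sums
    match the targets. Assumes sum(row_targets) == sum(col_targets)."""
    m = [row[:] for row in seed]
    for _ in range(iters):
        # Row step: rescale every row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(m[i])
            if s > 0:
                m[i] = [v * target / s for v in m[i]]
        # Column step: rescale every column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in m)
            if s > 0:
                for row in m:
                    row[j] *= target / s
        # Columns now match; stop once the rows also match.
        if all(abs(sum(row) - t) <= tol for row, t in zip(m, row_targets)):
            break
    return m


# Toy flows: 2 origins x 2 destinations (all numbers are invented).
seed_flows = [[1.0, 2.0], [3.0, 4.0]]
fitted = ipf_2d(seed_flows, row_targets=[40.0, 60.0], col_targets=[30.0, 70.0])
```

The fitted matrix keeps the interaction pattern of the seed flows while reproducing both sets of totals, which is why IPF suits matching synthetic flows to published job or enrollment counts.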

Following location assignments, anchor activities were matched to Uber H3 hex cells at a level 8 resolution ($\sim 0.8\ \mathrm{km}^{2}$). For home and school point locations, this was simply handled by parsing the H3 cell enclosing each point. We approximated H3 cells for workplace locations by sampling among the H3 cells associated with each workplace block group based on ORNL LandScan USA daytime population for 2019 estimated for each H3 cell.
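
The workplace approximation can be read as a weighted draw over the H3 cells of a block group. The sketch below assumes that the sampling probability is proportional to each cell's daytime population, which the text implies but does not state exactly. The cell identifiers and population values are placeholders, not real H3 indexes or LandScan data.

```python
import random


def sample_workplace_cell(cell_daytime_pop, rng):
    """Pick one cell with probability proportional to its daytime population."""
    cells = list(cell_daytime_pop)
    weights = [cell_daytime_pop[c] for c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]


# Hypothetical block group covering three cells.
block_group_cells = {"cell_a": 900.0, "cell_b": 100.0, "cell_c": 0.0}
rng = random.Random(42)
draws = [sample_workplace_cell(block_group_cells, rng) for _ in range(1000)]
```

Cells with no daytime population are never selected, so workers concentrate in the cells where daytime activity is estimated to occur.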

3. Efficient contact network generation

Contact networks represent interactions or collocations between individuals, where nodes correspond to people and edges represent their contacts. Because interactions are time-dependent and often brief, contact networks are typically time-varying. In this work, a temporal contact network composed of 24 static hourly networks – one for each hour of the day – is directly extracted from activity logs of the population and location data of the activities, as described in Sections 2.2 and 2.3.

Many traditional agent-based models Barrett et al. (2008) rely on generated contact networks represented as edge lists, where each edge denotes a direct contact between individuals residing in a given location. In these models, each person’s next disease state is determined by both their internal state and the states of the people they are connected to through these contact edges. When executed sequentially, such a model can easily update the state of each individual.

However, problems arise when the model is parallelized. The main challenge is that the edge list must be partitioned across multiple ranks, and even an efficient partitioning scheme cannot guarantee that the nodes allocated to different ranks are independent. Interdependent nodes may reside on different ranks, necessitating complex communication protocols to ensure accurate and consistent disease progression across the simulation.

An alternative strategy involves using activity schedules to generate the contact network. By grouping individuals based on their co-location at the same time, these models can create mutually exclusive groups where disease progression is confined within each group. This grouping allows each rank to handle an independent subset of people without the overhead of managing inter-dependencies across ranks, thereby simplifying parallel execution and improving scalability.

Figure 2 illustrates an activity-scheduled contact network with 8 people across 3 locations, distributed over 2 computing ranks. Under a location-agnostic scheme, edges are typically grouped by node (e.g., rank 1 handles agents p₁ to p₄, while rank 2 handles agents p₅ to p₈), which can introduce communication overhead when, for example, p₁ is linked to p₅ across different ranks. In contrast, location-aware partitioning treats each location as a complete subgraph and assigns all agents in one location to the same rank. This approach ensures disease progression can be computed locally without inter-rank communication.

Figure 2.

Example of a contact network generation with 8 people and 3 locations using location-aware and location-agnostic partitioning approaches.
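To make the contrast between the two schemes concrete, the following minimal Python sketch counts the contacts whose endpoints land on different ranks. The assignment of the 8 people to the 3 locations is hypothetical and is not taken from Figure 2; the names `cross_rank_edges`, `agnostic`, and `aware` are illustrative only.

```python
from itertools import combinations

# Hypothetical layout: 8 people (p1..p8) in 3 locations; not the layout of Figure 2.
locations = {"A": [1, 5, 6], "B": [2, 3, 7], "C": [4, 8]}

def cross_rank_edges(locations, rank_of_person):
    """Count undirected contact pairs whose endpoints live on different ranks."""
    return sum(
        1
        for people in locations.values()
        for u, v in combinations(people, 2)
        if rank_of_person[u] != rank_of_person[v]
    )

# Location-agnostic: split people by ID (p1..p4 -> rank 0, p5..p8 -> rank 1).
agnostic = {p: 0 if p <= 4 else 1 for p in range(1, 9)}

# Location-aware: every person in a location inherits that location's rank.
rank_of_location = {"A": 0, "B": 1, "C": 1}
aware = {p: rank_of_location[loc] for loc, people in locations.items() for p in people}

agnostic_cut = cross_rank_edges(locations, agnostic)
aware_cut = cross_rank_edges(locations, aware)
```

In this toy layout the location-agnostic split leaves several contacts spanning both ranks, whereas the location-aware split leaves none, which is the property that removes inter-rank communication during disease progression.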

To efficiently generate a contact network from a given activity schedule, we use location-aware partitioning. The pseudocode for this process is shown in Algorithm 1. As a first step, we group agents based on their location. Specifically, we are given an activity schedule A for N agents in L locations over H hourly time steps. Each agent has one activity listed per hour, yielding up to 24 activities per agent for a day. During preprocessing, we create an intermediate data structure D that records the list of agents present in each location for each hour. This co-location matrix, D, is then used to generate the contact networks.
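The preprocessing step can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Algorithm 1 itself: the schedule is assumed to be a list of per-agent location IDs indexed by hour, and `build_colocation` is a hypothetical helper name.

```python
from collections import defaultdict

def build_colocation(schedule, hours=24):
    """schedule[agent][h] is the location ID of `agent` at hour h.
    Returns D where D[h][loc] lists the agents present at loc during hour h."""
    D = [defaultdict(list) for _ in range(hours)]
    for agent, activities in enumerate(schedule):
        for h, loc in enumerate(activities[:hours]):
            D[h][loc].append(agent)
    return D

# Toy schedule (hypothetical data): 4 agents, 3 hours, locations 0..2.
A = [
    [0, 2, 0],
    [0, 2, 1],
    [1, 2, 1],
    [1, 0, 0],
]
D = build_colocation(A, hours=3)
```

Each `D[h]` is independent of the others, so the hourly structures can be built, and later consumed, one hour at a time.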

Next, we generate the contact networks in parallel. Notably, the contact network for a location at a given hour is independent of the other locations at that hour. For each location ℓ at hour h, all agents present form a complete graph. If a location ℓ has n_ℓ agents, it contributes n_ℓ(n_ℓ − 1) directed edges. We, therefore, treat each location as a separate task when constructing the contact network and distribute these tasks among P ranks. As we will demonstrate, the way locations are distributed across ranks significantly affects the simulation’s load balance. To achieve a balanced workload, we use a distributed memory parallel algorithm called the Uniform Cost Partitioning (UCP) scheme Alam and Khan (2017); Alam et al. (2016). In this scheme, each location’s cost is considered to be n_ℓ(n_ℓ − 1), corresponding to the number of edges processed. The UCP algorithm determines the optimal set of locations ⟨ℓ_s, ℓ_e⟩ assigned to each rank using an MPI-based parallel implementation. Each rank then uses this assignment to generate the complete graphs for its allocated locations.
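As a small illustration of the per-location task, the sketch below enumerates the directed edges of the complete graph formed by the agents in one location and checks the count against n_ℓ(n_ℓ − 1). The helper names and the agent IDs are hypothetical.

```python
from itertools import permutations

def location_edges(agents):
    """All ordered pairs (u, v) with u != v among co-located agents."""
    return list(permutations(agents, 2))

def location_cost(n):
    """Number of directed edges contributed by a location with n agents."""
    return n * (n - 1)

agents_at_loc = [10, 11, 12, 13]   # hypothetical agent IDs in one location
edges = location_edges(agents_at_loc)
```

Because the cost grows quadratically with occupancy, a handful of crowded locations can dominate the total work, which is why the partitioning is driven by edge counts rather than location counts.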

3.1. Reduced memory footprint

In practice, Enable does not need to store the contact network. Recall that the purpose of the contact network is to propagate the disease model through contacts. ABMs that do not use activity schedules are aspatial: they use only person-to-person connectivity, which forces them to store the edge list to preserve that connectivity. In Enable, only the list of people present in each location is stored, and the connectivity graph within a location is complete. All person-to-person connectivity can therefore be determined by iterating over each pair of people in the location, which saves a substantial amount of memory at runtime. For example, a population of 350M distributed over 3.5M locations produces 34.5 billion edges in each simulation hour, taking approximately 250 GB of space. In contrast, Enable stores only the list of people in each location, which requires 1.3 GB to represent the same contact network.
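The order of magnitude of these figures can be checked with a short calculation. The sketch below rests on assumptions that are not stated in the text: agents are spread uniformly (100 per location), each agent ID occupies 4 bytes, an explicit edge stores two IDs, and sizes are reported in binary gigabytes.

```python
AGENTS = 350_000_000
LOCATIONS = 3_500_000
ID_BYTES = 4  # assumption: 32-bit agent IDs

agents_per_location = AGENTS // LOCATIONS   # 100 under a uniform spread (assumption)
edges = LOCATIONS * agents_per_location * (agents_per_location - 1)

edge_list_bytes = edges * 2 * ID_BYTES   # (source, target) for every directed edge
occupancy_bytes = AGENTS * ID_BYTES      # one ID per agent in its location list

GIB = 2**30
edge_list_gib = edge_list_bytes / GIB    # explicit edge list
occupancy_gib = occupancy_bytes / GIB    # location-centric representation
```

Under these assumptions the explicit edge list needs a few hundred gigabytes while the location-centric lists need roughly one, in line with the savings reported above.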

4. Load balancing algorithm

To distribute the tasks with good load balancing, an accurate estimation of the computational cost of each task is needed. Estimating costs and distributing tasks to get the best load balancing are the most challenging parts of scaling Enable to population sizes ranging from city-to national-scale on a CPU-GPU architecture. Additionally, for the best speedup, the estimation and distribution tasks must also be parallelized so that each can leverage parallelism.

The work of generating edges is divided into many independent tasks. Let $T_{ℓ}$ be the task of generating edges for location ℓ where 0 ≤ ℓ < L. Note that all the tasks are mutually independent, i.e., for any 0 ≤ ℓ, ℓ′ < L such that ℓ ≠ ℓ′, tasks $T_{ℓ}$ and $T_{ℓ^{'}}$ can be executed independently by two different ranks. When the location ℓ contains n_ℓ agents, $T_{ℓ}$ generates a total of n_ℓ(n_ℓ − 1) directed edges, since each contact is stored in both directions. Let Π denote the set of all tasks. In the next section, we describe how the tasks in Π are executed by the P ranks such that the loads are well balanced.

4.1. Computational cost

Let c_ℓ be the computational cost of executing task $T_{ℓ}$. Assume that α units of time are required to initialize a task and β units of time to generate an edge. Let m_ℓ be the expected number of edges generated by task $T_{ℓ}$. Then the expected computational cost for $T_{ℓ}$ is

$$E [c_{ℓ}] = α + β m_{ℓ} = α + β n_{ℓ} (n_{ℓ} - 1) \qquad (1)$$

Therefore, the total expected computational cost $C$ is given by

$$C = \sum_{0 \leq ℓ < L} E [c_{ℓ}] = \sum_{0 \leq ℓ < L} (α + β m_{ℓ}) = α L + β m \qquad (2)$$

where m is the expected number of generated edges. For the optimal load balancing, the tasks need to be distributed in such a way that each rank has a computational cost of

$$\hat{C} = C / P$$
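A direct transcription of the cost model into Python is given below as a worked example. The values chosen for α, β, the location sizes, and P are arbitrary illustrations, and the function names are hypothetical.

```python
def expected_cost(n, alpha, beta):
    """Equation (1): initialization cost plus per-edge cost for n co-located agents."""
    return alpha + beta * n * (n - 1)

def total_and_target(sizes, alpha, beta, P):
    """Equation (2) and the per-rank target cost C / P."""
    C = sum(expected_cost(n, alpha, beta) for n in sizes)
    return C, C / P

sizes = [2, 3, 5, 10]   # hypothetical location occupancies
C, C_hat = total_and_target(sizes, alpha=1.0, beta=0.5, P=2)
```

Here the four locations generate 2 + 6 + 20 + 90 = 118 edges, so C = 4α + 118β, and the largest location alone exceeds the per-rank target, which motivates the divisible variant discussed later.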

4.2. Task distribution

Parallel edge generation proceeds by splitting the task set Π into P disjoint subsets $Π_{0}, Π_{1}, \ldots, Π_{P - 1}$, such that $Π_{k} \cap Π_{l} = \emptyset$ for k ≠ l and $\bigcup_{k} Π_{k} = Π$. Each rank P_k is then assigned the subset Π_k, executing the tasks $\{T_{x} \in Π_{k}\}$. The cost of a rank P_k is given by $c (P_{k}) = \sum_{T_{x} \in Π_{k}} c_{x}$, and the objective is to choose subsets Π_k so that all ranks have nearly the same cost $(c (P_{k}) \approx \hat{C})$.

Finding such subsets is a well-known problem called the chains-on-chains partitioning (CCP) problem Pinar and Aykanat (2004); Manne and Sørevik (1995); Olstad and Manne (1995). For improved speedup, the CCP problem must be solved in parallel. In Alam and Khan (2017), the authors introduce the Uniform Cost Partitioning (UCP) algorithm, a distributed memory parallel method that runs in $O (| Π | / P + P)$ time. UCP uses the cumulative cost $C_{x} = \sum_{i = 0}^{x} c_{i}$ to guide the distribution of tasks. Let any subset Π_k start with the task $T_{q_{k}}$ and end with the task $T_{q_{k + 1} - 1}$ where q_k is called the lower boundary of Π_k. The lower boundary q_k satisfies the following condition: $C_{q_{k} - 1} < k \hat{C} \leq C_{q_{k}}$ for 0 < k < P. Then $q_{k} = \arg\min_{x} (C_{x} \geq k \hat{C})$, i.e., a task $T_{x}$ is executed by the rank P_k where $k = ⌊C_{x} / \hat{C}⌋$.
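The boundary condition can be illustrated with a sequential Python sketch that computes the cumulative costs and locates each lower boundary with a binary search. This is only a single-process illustration of the rule $q_{k} = \arg\min_{x} (C_{x} \geq k \hat{C})$, not the distributed implementation of Alam and Khan (2017); `ucp_boundaries` and the task costs are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

def ucp_boundaries(costs, P):
    """Return q[0..P], where rank k executes tasks q[k] .. q[k+1]-1."""
    prefix = list(accumulate(costs))   # C_x = c_0 + ... + c_x
    target = prefix[-1] / P            # C-hat = C / P
    q = [0]
    for k in range(1, P):
        # Smallest x whose cumulative cost reaches k * C-hat.
        q.append(bisect_left(prefix, k * target))
    q.append(len(costs))
    return q

costs = [4, 1, 1, 2, 6, 2, 3, 5]   # hypothetical task costs
q = ucp_boundaries(costs, P=3)
loads = [sum(costs[q[k]:q[k + 1]]) for k in range(3)]
```

With indivisible tasks the per-rank loads can only approximate the target, since a task that straddles a boundary is assigned whole to one rank.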

The original UCP algorithm treats each task as non-divisible, i.e., the entire task is assigned to a single rank. When tasks are divisible, an extension of the UCP algorithm called UCP–DIV Alam et al. (2016) can be used; it achieves fine-grained load balancing by dividing tasks into multiple sub-tasks. A careful examination of the task $T_{ℓ}$ shows that when the location ℓ contains n_ℓ agents, the task creates, for each agent, an edge to all of the other n_ℓ − 1 agents. Because we create bi-directional edges, creating the edges for one agent does not depend on the other agents. Therefore, the task can be divided into n_ℓ sub-tasks, where each sub-task produces the edges for one agent, making UCP–DIV ideal for fine-grained load distribution.
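The effect of dividing tasks can be sketched as follows. This is a simplified single-process illustration of the idea, assuming each per-agent sub-task costs n_ℓ − 1 edges and ignoring the initialization cost α; it is not the UCP–DIV implementation of Alam et al. (2016), and `ucp_div` is a hypothetical name.

```python
from bisect import bisect_left
from itertools import accumulate

def ucp_div(sizes, P):
    """Split per-agent sub-tasks across P ranks.
    Returns, per rank, a list of (location, first_agent, last_agent_exclusive)."""
    # One sub-task per agent; its cost is the n-1 edges that agent emits.
    sub = [(loc, i, n - 1) for loc, n in enumerate(sizes) for i in range(n)]
    prefix = list(accumulate(c for _, _, c in sub))
    target = prefix[-1] / P
    cuts = [0] + [bisect_left(prefix, k * target) for k in range(1, P)] + [len(sub)]
    plan = []
    for k in range(P):
        chunks = {}
        for loc, i, _ in sub[cuts[k]:cuts[k + 1]]:
            lo, hi = chunks.get(loc, (i, i))
            chunks[loc] = (min(lo, i), max(hi, i + 1))
        plan.append([(loc, lo, hi) for loc, (lo, hi) in chunks.items()])
    return plan

sizes = [2, 2, 9]   # hypothetical: one crowded location dominates the cost
plan = ucp_div(sizes, P=2)
edge_load = [sum((hi - lo) * (sizes[loc] - 1) for loc, lo, hi in rank) for rank in plan]
```

In this example the crowded location accounts for 72 of the 76 edges, so an indivisible assignment could do no better than 4 versus 72 edges, while splitting its agents across the two ranks yields 36 versus 40.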

4.3. Implementation of UCP–DIV on the GPU

Although the UCP and UCP–DIV algorithms were originally proposed and implemented for distributed-memory systems, a GPU-based implementation of the UCP algorithm had not been available until now. Our work provides a CUDA implementation, later adapted to HIP using hipify-perl.

The kernel was initially developed in CUDA due to CUDA’s more mature tooling and documentation, which allowed for faster prototyping and debugging during early development. Once the CUDA implementation was validated for correctness and performance, it was translated to HIP to ensure cross-platform compatibility and support for AMD GPUs. This approach allowed us to leverage existing CUDA resources while ultimately achieving our portability goals.

The translation from CUDA to HIP is largely syntactic: cuda APIs map directly to hip equivalents, and cub::DeviceScan is replaced by the binary-compatible hipcub::DeviceScan, backed by rocPRIM. The HIP port required only standard practices (e.g., verifying launch bounds, block sizes for wavefront-64 and careful use of Local Data Share (LDS)/shared memory to avoid bank conflicts). Performance tuning was limited to hardware-specific adjustments, including configuring block sizes as multiples of 64 threads to match the MI250X Wave64 execution model, and computing grid sizes at runtime using hipOccupancyMaxActiveBlocksPerMultiprocessor based on the available Compute Units. The core algorithms (prefix-sum, UCP boundary finding) remained unchanged. Performance parity was observed within expected variance after applying routine HIP best practices and did not require AMD-specific algorithmic changes.

The core of the GPU approach relies on an efficient prefix-sum procedure derived from Blelloch’s algorithm Blelloch (1990), carefully optimized to avoid bank conflicts and handle large arrays Poscablo (2025). Because GPU computation can outpace MPI communication at large processor counts, our current design executes the entire UCP partitioning computation on a single GPU attached to each MPI rank, thereby removing the need for inter-rank communication overhead.

The pseudocode for the GPU kernel used in UCP–DIV, developed for this study, is shown in Algorithm 2. The algorithm processes the entire cost array in SIMD fashion, incrementally stepping through the data with an offset equal to the total thread count to reduce bank conflicts and enhance memory coalescing. It also removes the need for the binary search performed in the distributed-memory parallel version. Task boundaries are written to a pre-allocated GPU buffer, eliminating any intra-node communication overhead. In practice, this approach effectively scales to handle millions of tasks.
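One way a boundary search can avoid a binary search is for every element of the prefix-sum array to test the boundary condition locally. The Python sketch below emulates that data-parallel pattern with a strided loop per "thread". It is an assumption about how such a kernel could be organized, not a transcription of Algorithm 2, and it runs sequentially; `boundaries_grid_stride` and `num_threads` are hypothetical names.

```python
from itertools import accumulate

def boundaries_grid_stride(costs, P, num_threads=4):
    """Emulate a data-parallel boundary search: each 'thread' walks the prefix
    array with a stride of num_threads and records any boundary it owns."""
    prefix = list(accumulate(costs))   # stands in for the GPU prefix-sum kernel
    target = prefix[-1] / P
    q = [0] * (P + 1)                  # stands in for the pre-allocated buffer
    q[P] = len(costs)
    for tid in range(num_threads):     # threads would run concurrently on a GPU
        for x in range(tid, len(costs), num_threads):
            prev = prefix[x - 1] if x > 0 else 0
            # Every k with C_{x-1} < k * target <= C_x has its boundary at x.
            k_lo = int(prev // target) + 1
            k_hi = min(int(prefix[x] // target), P - 1)
            for k in range(k_lo, k_hi + 1):
                q[k] = x
    return q

q_gpu_style = boundaries_grid_stride([4, 1, 1, 2, 6, 2, 3, 5], P=3)
```

Each boundary index k is written by exactly one element x, so the emulated threads never write to the same slot, which is what makes the pattern suitable for SIMD execution.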

5. Architecture-aware simulations

To achieve highly granular simulations of disease propagation, a finer-grained model that uses hourly contact networks rather than daily ones is used. Our activity schedule is similarly granular, specifying up to 24 activities per agent per day. With a typical ABMS framework, the contact network for each time step is first determined and next all disease propagation paths according to the current contact network and the chosen epidemiological model are processed. Once the epidemiological model completes for time step t, the contact network for the next hour needs to be generated again. Thus, the tasks for each time step t involve two main phases: (1) generating the contact network for hour t from the activity schedule, and (2) running the epidemiological model on that newly generated network. Because both phases are computationally demanding, careful consideration is required to parallelize them effectively.

Modern high-performance computing (HPC) systems typically comprise multiple CPU cores and several GPUs that are attached to the CPUs. These processing units excel in different areas. CPUs are most efficient when the workload involves random memory accesses, branching, and complex decision making. GPUs, on the other hand, are well-suited for operations where the same instruction can be applied to many data items in parallel. Previous studies have demonstrated that epidemic simulations, which rely on irregular memory access patterns and intricate propagation rules, are better handled by CPUs Bisset et al. (2012); Xiao et al. (2019). Therefore, in our agent-based modeling and simulation (ABMS) system, CPUs focus on running the core epidemic simulations using the disease propagation model.

However, GPUs can still be leveraged to accelerate an important complementary task: generating the contact networks for upcoming time steps. Figure 3 demonstrates this idea. While the CPUs are busy progressing the epidemic simulation, GPUs can consume the activity schedule data and build the next hour’s contact network. Because GPU kernel calls are asynchronous, they return control to the CPU immediately after launching, and the overhead for these launches is typically small. As a result, while the CPU cores are advancing the disease model for time step t, the GPU can concurrently process the activity schedule, determine the partition of contact networks based on estimated costs using the UCP algorithm (see Algorithm 1), and prepare the new contact network.

Figure 3.

CPU-GPU execution of Enable simulations on a single MPI rank.

Before moving on to time step t + 1, the CPU switches its contact network pointer to the GPU-generated network from the previous time step, instructs the GPU to generate the next hour’s network, and proceeds with the new simulation time step. Note that the contact network is represented in a location-centric manner, listing which agents occupy each location, rather than explicitly storing all edges (see Figure 2). The epidemic simulation engine then iterates over these co-located agents, treating them as complete graphs, to compute the next state of the disease.
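The hand-off between the two stages follows a double-buffering pattern, which can be sketched in Python with a single background worker standing in for the GPU. The functions `build_network` and `simulate` are hypothetical placeholders, and a thread pool is used here only to mimic an asynchronous kernel launch.

```python
from concurrent.futures import ThreadPoolExecutor

def build_network(hour):
    """Stand-in for the GPU stage: returns the location-centric network for `hour`."""
    return {"hour": hour % 24}

def simulate(network, state):
    """Stand-in for the CPU disease model; here it only records the hour processed."""
    return state + [network["hour"]]

def run(num_steps):
    state = []
    with ThreadPoolExecutor(max_workers=1) as gpu:       # one worker plays the GPU
        pending = gpu.submit(build_network, 0)           # prime the pipeline
        for t in range(num_steps):
            current = pending.result()                   # swap in the prepared network
            pending = gpu.submit(build_network, t + 1)   # overlap: build hour t+1 ...
            state = simulate(current, state)             # ... while simulating hour t
    return state

history = run(26)
```

The only synchronization point is the call to `result()` at the start of each step, which corresponds to the CPU switching its contact network pointer to the network prepared during the previous step.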

We conducted experiments to measure wall-clock time under two configurations: (1) a synchronous baseline, in which GPU-based partition boundary computation completes before the CPU-based disease simulation begins at each time step, and (2) the asynchronous pipelined execution described in this paper, where contact generation for time step t + 1 overlaps with disease simulation for time step t. Across our benchmark scenarios (Utah state-scale synthetic population of approximately 3 million agents), the asynchronous pipeline reduces per-step wall-clock time by up to roughly 20%. The degree of improvement depends on the relative costs of the two stages; in practice, the CPU-based disease simulation is typically more expensive than the GPU-based load-balancing phase. As a result, the overlap can hide most of the shorter GPU stage, and the benefit is largest when the two stages are well matched in runtime.

In our experiments, the GPU-based UCP runtime typically accounts for approximately 1%–20% of the total per-timestep runtime for the Utah state-scale synthetic population of approximately 3 million agents. The exact fraction varies depending on problem size and system configuration, but remains a relatively small portion of the overall simulation cost, with the CPU-based disease simulation dominating the timestep runtime.

The GPU-based UCP load-balancing phase incurs an approximately constant per-timestep cost, as it depends solely on the number of locations and is independent of disease state. The CPU-based disease propagation phase, however, varies with hourly contact network density and the current epidemiological state. During low-contact overnight periods, when agents are predominantly co-located within small household groups, the UCP phase can account for up to 20% of total per-timestep wall-clock time. During peak-contact periods (roughly hours 8–14), when network density is highest, disease propagation dominates and the UCP fraction falls below 1%. Across the full diurnal cycle and all problem sizes studied, 1%–20% is a reasonable characterization of the GPU load-balancing overhead.
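An idealized timing model helps relate the GPU fraction to the benefit of overlapping. The sketch below assumes that a synchronous step costs the sum of the two stages, that a perfectly overlapped step costs the longer of the two, and that the GPU fraction is measured relative to the synchronous step; launch overheads and resource contention are ignored, and the function names are hypothetical.

```python
def step_time_sync(t_gpu, t_cpu):
    """Synchronous baseline: the stages run back to back."""
    return t_gpu + t_cpu

def step_time_async(t_gpu, t_cpu):
    """Ideal overlap: the step lasts as long as the slower stage."""
    return max(t_gpu, t_cpu)

def reduction(gpu_fraction):
    """Relative saving when the GPU stage is `gpu_fraction` of the synchronous step."""
    t_gpu, t_cpu = gpu_fraction, 1.0 - gpu_fraction
    return 1.0 - step_time_async(t_gpu, t_cpu) / step_time_sync(t_gpu, t_cpu)

overnight_saving = reduction(0.20)   # GPU stage at the upper end of the reported range
daytime_saving = reduction(0.01)     # GPU stage at the lower end
```

Under this model, the saving equals the GPU fraction whenever the CPU stage is the longer one, so a GPU share of 1% to 20% bounds the achievable reduction at 1% to 20%.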

6. Performance results

The Frontier HPE Cray EX supercomputer housed at the Oak Ridge Leadership Computing Facility was used for this performance study. The system comprises 9408 AMD compute nodes, each featuring a 64-core “Optimized 3rd Gen EPYC” CPU and 512 GB of DDR4 memory. In addition, every node hosts four AMD MI250X accelerators, each containing two Graphics Compute Dies (GCDs), thereby offering eight GCDs per node with 64 GB of high-bandwidth memory (HBM2E) each. Connectivity is powered by AMD’s Infinity Fabric, enabling up to 36 GB/s bidirectional bandwidth between CPU and GPU, 200 GB/s peak bandwidth between GCDs on the same MI250X, and 50–100 GB/s between GCDs located on different MI250X accelerators.

6.1. CPU-GPU process mapping

Since each MI250X physically houses two GCDs that the operating system exposes as independent GPU devices, each Frontier node exposes 8 logical GPUs. Not all 64 CPU cores are available to user applications. Frontier operates in low-noise mode, which permanently pins all OS and system processes to core 0. In addition, SLURM core specialization (invoked via the -S 8 flag at job allocation) reserves one core per L3 cache region, consuming 8 cores in total across the node’s cache topology. The combined effect of these two system-level constraints reduces the number of user-accessible CPU cores from 64 to 56 per node.

Enable maps one MPI rank per CPU core. EpiHiper running on the CPUs involves irregular memory access patterns, probabilistic disease-state transitions and branching logic. Maximizing CPU utilization is, therefore, a primary design objective. With 56 usable cores per node, this yields 56 MPI ranks per node for all datasets except the 350M synthetic population.

In GPU-centric HPC applications, the standard practice is a one-to-one mapping of MPI ranks to GPUs. Enable’s CPU-first design breaks this convention: with 56 MPI ranks competing for only 8 GPUs per node, each GPU must serve 7 MPI ranks simultaneously, a 7:1 oversubscription ratio. This is a deliberate architectural choice. GPUs in Enable are used exclusively for contact network construction and load balancing via the UCP-DIV kernel, and these tasks execute asynchronously while CPUs advance the disease simulation (see Figure 3). Multiple ranks can therefore share a GPU without creating a hard serialization bottleneck, provided the GPU completes its network construction task within one simulation tick.

GPU affinity is enforced through two SLURM binding directives: --gpu-bind=closest, which binds each MPI rank to the physically nearest GPU to minimize PCIe and AMD Infinity Fabric traversal latency, and --distribution=*:block, which distributes ranks in contiguous blocks across the node’s NUMA regions to preserve locality between CPU cores and their associated GPU device.
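The resulting sharing pattern can be illustrated with a short calculation. The mapping function below is a simplified contiguous-block assumption used only to show the 7:1 ratio; the actual binding is decided by SLURM from the node topology, and `gpu_for_local_rank` is a hypothetical helper.

```python
CORES_PER_NODE = 64
RESERVED_CORES = 8    # one per L3 region, reserved through core specialization (-S 8)
GPUS_PER_NODE = 8     # 4 MI250X accelerators x 2 GCDs each

usable_cores = CORES_PER_NODE - RESERVED_CORES
ranks_per_gpu = usable_cores // GPUS_PER_NODE

def gpu_for_local_rank(local_rank):
    """Simplified block mapping (assumption); SLURM performs the real binding."""
    return local_rank // ranks_per_gpu

sharing = {}
for r in range(usable_cores):
    sharing.setdefault(gpu_for_local_rank(r), []).append(r)
```

Each logical GPU ends up serving a contiguous group of seven node-local ranks, which matches the block distribution requested with `--distribution=*:block`.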

Table 2 summarizes the process-to-GPU mapping configurations used across all experimental datasets. The mapping is not a fixed policy but a memory-aware, workload-adaptive decision: the ratio of MPI ranks to GPUs shifts depending on which resource is the binding constraint, namely CPU parallelism for moderate-scale datasets and memory capacity for national-scale workloads, as explained further in Section 6.3. In the context of the experimental scenarios studied here, this adaptability is essential for a framework that must operate coherently across population sizes spanning two orders of magnitude on the same hardware platform.

Table 2.

Summary of CPU-GPU process mapping configurations. The rank-to-GPU ratio adapts based on per-node memory constraints rather than a fixed policy.

Dataset	# of agents	# of nodes	MPI ranks (GPU)/Node	Total MPI ranks (GPUs)
UT	3.1M	1 to 20	56 (8)	1120 (160)
KY	4.4M	1 to 20	56 (8)	1120 (160)
NJ	8.8M	1 to 20	56 (8)	1120 (160)
10M	10M	1 to 20	56 (8)	1120 (160)
350M	350M	1 to 150	8 (8)	1200 (1200)

6.2. Epidemic disease model

Our experiments were conducted using a Susceptible – Infected – Recovered (SIR) model operating on a contact network. At each discrete time step, every infected agent has a probability β of transmitting the disease to any susceptible neighbor in the contact network. Each infected agent then recovers with probability γ, transitioning to the recovered state. The contact network is updated at an hourly timestep, reflecting changes in agents’ locations and interactions.
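A single hourly update of this model on a location-centric network can be sketched as follows. The sketch assumes a synchronous update in which infections and recoveries are both decided from the state at the start of the tick; the ordering used inside Enable is not specified in the text, and `sir_step` is a hypothetical name.

```python
import random

S, I, R = 0, 1, 2

def sir_step(state, occupancy, beta, gamma, rng):
    """One hourly tick. occupancy maps location -> list of agent IDs present."""
    new_state = list(state)
    for agents in occupancy.values():
        infected = [a for a in agents if state[a] == I]
        if not infected:
            continue
        for a in agents:
            if state[a] == S:
                # Each infected co-located agent gets one transmission attempt.
                if any(rng.random() < beta for _ in infected):
                    new_state[a] = I
    for a, s in enumerate(state):
        if s == I and rng.random() < gamma:
            new_state[a] = R
    return new_state

rng = random.Random(0)
state = [I, S, S, S]
occupancy = {0: [0, 1], 1: [2, 3]}   # hypothetical co-location lists for one hour
certain = sir_step(state, occupancy, beta=1.0, gamma=1.0, rng=rng)
frozen = sir_step(state, occupancy, beta=0.0, gamma=0.0, rng=rng)
```

Note that the loop never materializes an edge list: transmission is evaluated directly over the agents listed for each location, and agents in different locations cannot affect one another within a tick.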

The classical Susceptible–Infected–Recovered (SIR) model generates characteristic epidemic trajectories where the susceptible population decreases monotonically, the infected population follows a bell-shaped curve with a distinct peak, and the recovered population increases toward saturation. We applied an SIR model to demonstrate epidemic behavior using hourly contact network data from the Utah synthetic population. The resulting SIR curves are presented in Figure 4.

Figure 4.

SIR curves for the Utah synthetic population, where S = Susceptible, I = Infected, and R = Recovered. The simulation used a classical SIR epidemic model executed for 30 simulated days with hourly activity-driven contact networks generated from the Utah dataset comprising 3.1 M agents.

As shown in the figure, the model exhibits the expected classical SIR patterns. However, the hourly contact network reveals temporal variations in disease transmission. During daytime hours, when interpersonal contacts reach their maximum frequency, the disease spreads more rapidly than during nighttime hours, as evidenced by sharp increases in the number of infected individuals. In our model, contact frequency peaks between 8 AM and 4 PM, corresponding to periods of heightened transmission activity.

6.3. Experimental scenarios

In this section, the datasets and experiments designed to evaluate the performance of the Enable framework are presented.

The population sizes used in this scalability study range from a few million agents (Utah, Kentucky and New Jersey) to tens and hundreds of millions of agents (10M and 350M). Unlike the Utah, Kentucky and New Jersey datasets, activity schedules for population sizes of 10M and 350M agents are currently unavailable. These larger-scale datasets were synthetically generated and are used in these scaling studies only to demonstrate the scalability of the Enable framework at national-scale population sizes. The parallel performance characteristics are expected to mirror those presented here for the synthetic populations when activity schedules for such large populations become available.

The first synthetic dataset contains 10 million people and 100 thousand locations. For each time step (an hour) it produces about 990 million edges to process. The second synthetic dataset has 350 million people and 3.5 million locations for each simulation hour. Therefore, it creates approximately 34.5 billion edges for each 1-h time step (called a tick).

Construction of the UT state-level activity dataset, with 3.1 million individuals distributed across 1.04 million households and 1596 schools, was described in Section 2.2. We also estimated 101 thousand workplaces and 100 thousand additional locations for non-home/work/school activities. Each person is associated with a geoid and can move to a workplace within that region. For every individual, the dataset specifies their hourly activities for each day of the week, producing 7 × 24 = 168 entries per person. Each entry includes the activity type, the relevant location, and the specific day and hour. Finally, these schedules were used to generate hourly contact networks. The number of connections varies from millions during the night to billions in the daytime, as expected in realistic activity patterns of any population that is overwhelmingly home-bound at night.

Since each Frontier computing node holds 512 GB of memory, the number of MPI ranks that can be executed in a computing node depends on the memory required by the MPI processes. To generate the contact networks over the course of the simulation, the activity schedule is loaded and stored in the CPU memory. Each activity is denoted by a location (4 bytes) and an activity type (2 bytes). In the case of the 10M synthetic population, storing 24 activities per agent requires 10M × 24 × 6 B = 1.44 × 10⁹ bytes (about 1.34 GiB) of memory. For Utah, the activity schedule is defined for 7 days and requires 3.1M × 24 × 7 × 6 B ≈ 3.12 × 10⁹ bytes (about 2.9 GiB) of memory. Therefore, both cases used all 56 available cores in a computing node to execute 56 MPI ranks in parallel.

However, the 350M synthetic population dataset requires a different mapping due to main-memory storage constraints. Storing the full activity schedule for 350M agents, covering 24 hourly activities per agent across a simulation day, requires approximately 47 GB of CPU memory per rank. Supporting 56 MPI ranks simultaneously would demand roughly 56 × 47GB ≈ 2.6 TB, far exceeding the 512 GB DDR4 capacity of a single Frontier node. To accommodate this, Enable reduces the rank count to 8 ranks per node for the 350M case, restoring a conventional 1:1 GPU-to-rank mapping and reducing per-node memory pressure to within the hardware budget. These configurations are summarized in Table 2.
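The memory argument can be reproduced with a short calculation. The sketch assumes that every rank holds its own full copy of the schedule, that node memory is 512 GiB, and that sizes are expressed in binary gigabytes; memory used by the rest of the simulation is ignored, and the helper names are hypothetical.

```python
BYTES_PER_ACTIVITY = 4 + 2   # location ID + activity type
NODE_MEMORY_GIB = 512        # assumption: node capacity counted in GiB
GIB = 2**30

def schedule_gib(agents, hours):
    """Size of one full copy of the activity schedule, in GiB."""
    return agents * hours * BYTES_PER_ACTIVITY / GIB

def fits(agents, hours, ranks_per_node):
    """True if every rank on a node can hold its own copy of the schedule."""
    return ranks_per_node * schedule_gib(agents, hours) <= NODE_MEMORY_GIB

national = schedule_gib(350_000_000, 24)
fits_56 = fits(350_000_000, 24, 56)
fits_8 = fits(350_000_000, 24, 8)
```

Under these assumptions one copy of the national schedule is about 47 GiB, so 56 copies exceed the node capacity several times over while 8 copies fit, and the 10M case fits comfortably with 56 ranks.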

6.4. Load balancing performance

As highlighted in Section 4, estimating and balancing computational loads of tasks with irregular data dependencies, as is the case in ABM simulations of epidemiology, in a distributed memory parallel setting are challenging. For optimal speedup, both estimating and balancing need to be executed in parallel. One common approach is to partition the set of locations among the ranks so that the computational workload is evenly distributed. However, a naive scheme that merely assigns an equal number of locations to each rank often leads to imbalanced loads.

In this section, we report the efficacy of the load balancing algorithm that was implemented (see Section 4) and integrated into the Enable simulation framework. To illustrate its efficacy, an Enable agent-based simulation using an activity schedule for Salt Lake City with 843K people and 240K locations was executed on 16 CPU cores. The load-balancing results for Salt Lake County, Utah are intentionally presented for up to 16 ranks. Salt Lake City is selected because its highly skewed population distribution in urban areas exhibits severe load imbalance. Additionally, using a smaller number of ranks allows us to clearly expose the imbalance present in the baseline partitioning and to highlight the relationship between heterogeneous task costs and runtime variation. The locations are distributed among the CPU cores and each CPU rank executes only on its distribution of locations. We measure the number of contact edges processed and the time required to process the disease progression in each CPU rank.

Figure 5 illustrates the number of edges processed and the runtime per rank when locations are uniformly partitioned among ranks. Specifically, rank P_i handles locations from i × n_L/P to (i + 1) × n_L/P − 1, where n_L is the total number of locations and P is the number of ranks. As Figure 5(a) shows, the uniform location assignment does not yield a balanced edge load, because some locations contain significantly more people. Consequently, as shown in Figure 5(b), the runtime for processing edges in each rank also varies substantially among ranks, leading to runtime imbalance. Looking into both figures, it is evident that the run times correlate strongly with the number of edges each rank handles. If the locations were partitioned in such a way that edges are more evenly distributed across the CPU ranks, the overall runtime is likely to become more balanced as well. Figure 5 demonstrates the necessity for more sophisticated partitioning strategies to ensure efficient parallel performance.

Figure 5.

Uneven distribution of loads and parallel execution time on CPU ranks without the UCP load balancing algorithm.

Using the UCP partitioning algorithm addresses the load imbalance evident in the naive approach, where the strong correlation between the number of edges assigned to each rank and its runtime leads to uneven workloads (see Figure 5). By distributing edges more evenly among ranks (Figure 6), UCP achieves a more uniform runtime across ranks, a trend evident for contact networks at all hours. It is worth noting that slight “jitters” occur near the lower end of the runtime scale, likely due to hardware effects or measurement noise, but overall, the UCP method maintains significantly more balanced performance.

Figure 6.

Improvement in parallel load balance and execution time due to UCP load balancing algorithm in Enable.
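The difference between the two partitioning schemes can be reproduced on a small synthetic example. The occupancies below are hypothetical and are not the Salt Lake data; the imbalance metric (maximum load divided by mean load) is likewise an illustrative choice, and the UCP function is the sequential, indivisible-task variant.

```python
from bisect import bisect_left
from itertools import accumulate

def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))

def uniform_loads(costs, P):
    """Naive scheme: rank i gets locations i*n/P .. (i+1)*n/P - 1."""
    n = len(costs)
    return [sum(costs[i * n // P:(i + 1) * n // P]) for i in range(P)]

def ucp_loads(costs, P):
    """Cost-aware scheme: cut the cumulative edge count into P equal targets."""
    prefix = list(accumulate(costs))
    target = prefix[-1] / P
    q = [0] + [bisect_left(prefix, k * target) for k in range(1, P)] + [len(costs)]
    return [sum(costs[q[k]:q[k + 1]]) for k in range(P)]

# Hypothetical skewed occupancy: five crowded locations among many small ones.
sizes = [20] * 5 + [4] * 295
costs = [n * (n - 1) for n in sizes]
naive = uniform_loads(costs, 4)
balanced = ucp_loads(costs, 4)
```

In this example the uniform split gives one rank about twice the mean edge load, whereas the cost-aware split keeps every rank within roughly 16% of the mean.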

6.5. GPU performance

We conducted a comparative performance evaluation between GPU-accelerated and MPI-distributed implementations of the UCP-DIV algorithm. The distributed MPI version employed standard optimization techniques including parallel prefix operations and optimized inter-rank communication across P ranks. The GPU implementation utilized single-device execution with thread-level parallelism and high-bandwidth memory access.

Performance benchmarks demonstrate superior GPU execution times compared to the MPI approach (Table 3). This advantage derives from the GPU’s unified memory architecture enabling efficient parallel reductions and high-throughput memory operations. The asynchronous nature of GPU kernel execution provides additional computational overlap, making this approach particularly suitable for applications requiring frequent load balancing operations, such as dynamic simulations and iterative graph processing.

Table 3.

Execution runtime (in secs) of the UCP-DIV Algorithm on the GPU compared to MPI-based Implementation with 100 and 400 MPI ranks.

No. of tasks	GPU	CPU-100	CPU-400
1M	107.5	364.3	431.7
10M	255.7	631.2	709.0
100M	2071.7	3675.8	3304.3
1B	23,674.4	33,312.1	29,787.5
2B	47,429.7	66,065.2	59,165.7

There are two main kernels used to compute UCP: PrefixSum and UCP-DIV. The PrefixSum kernel performs the prefix-sum operations on the GPU. The UCP-DIV kernel finds the partition boundaries according to the UCP-DIV algorithm. Hardware performance counters were collected via rocprofiler-compute (v3.1.0) on AMD Instinct MI250X GPUs. All measurements were taken on a single GCD (Graphics Compute Die) at N = 10,000,000 UCP tasks against the MI250X hardware with 23.9 TFLOP/s peak FP64 and 1600 GB/s peak HBM2e bandwidth.

Table 4 and the roofline plot Williams et al. (2009) in Figure 7 together confirm that both PrefixSum and UCP-DIV are firmly memory-bandwidth-bound, with arithmetic intensities of 0.21 and 3.75 FLOP/byte, respectively. Despite their contrasting computational roles, the two kernels achieve a strikingly similar fraction of peak HBM2e bandwidth — 77.7% (1242.7 GB/s) for PrefixSum and 78.1% (1250.1 GB/s) for UCP-DIV.

Table 4.

Hardware performance counters for the two primary UCP kernels on AMD Instinct MI250X, N = 10⁷ tasks. HBM bandwidth utilization is normalized to the 1600 GB/s per-GCD peak; FP64 utilization to 23,900 GFLOP/s.

Metric	PrefixSum	UCP-DIV
FP64 FLOPs (×10⁶)	33.32	300.00
Performance (GFLOP/s)	258.40	4687.50
% of peak FP64	1.08	19.61
VALU instruction fraction (%)	50.20	77.10
HBM BW achieved (GB/s)	1242.70	1250.10
% of peak HBM	77.70	78.10
Arithmetic intensity (FLOP/byte)	0.21	3.75
L2 hit rate (%)	44.10	14.80
Avg. L1 → L2 latency (cycles)	843.30	722.50

Figure 7.

Roofline model for the UCP kernels on AMD Instinct MI250X. Both PrefixSum (AI = 0.208 FLOP/byte) and UCP-DIV (AI = 3.750 FLOP/byte) lie well to the left of the ridge point, confirming that both kernels are memory-bandwidth-bound.

Although both kernels are memory-bound, UCP-DIV is substantially more compute-efficient. Its arithmetic intensity of 3.750 FLOP/byte, which is 18× higher than PrefixSum’s 0.208 FLOP/byte, reflects the FP64 division and comparison operations required to evaluate partition-boundary conditions per element, yielding 4687.5 GFLOP/s (19.6% of peak FP64) versus only 258.4 GFLOP/s (1.1%) for the scan. This is further confirmed by the VALU instruction fraction: UCP-DIV devotes 77.1% of all instructions to vector arithmetic, compared with 50.2% for PrefixSum, indicating a considerably denser use of the compute pipeline once data arrives from HBM. VALU utilization on GPUs represents the percentage of time the vector processing units are actively performing calculations rather than waiting for data (latency) or sitting idle. Neither kernel incurs LDS bank conflicts, ruling out shared-memory contention as a bottleneck in line with established GPU optimization principles Hijma et al. (2023).
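The roofline placement of the two kernels can be checked from the numbers in Table 4. The sketch below applies the standard roofline bound, min(peak compute, arithmetic intensity × peak bandwidth), to the reported per-GCD peaks; the variable names are illustrative.

```python
PEAK_FP64_GFLOPS = 23_900.0   # per-GCD FP64 peak used in Table 4
PEAK_HBM_GBS = 1_600.0        # per-GCD HBM2e peak used in Table 4

def attainable(ai):
    """Classic roofline bound: min(compute peak, AI x memory bandwidth)."""
    return min(PEAK_FP64_GFLOPS, ai * PEAK_HBM_GBS)

# Arithmetic intensity (FLOP/byte) at which the two roofs meet.
ridge = PEAK_FP64_GFLOPS / PEAK_HBM_GBS

# (arithmetic intensity, achieved GFLOP/s) as reported for each kernel.
kernels = {"PrefixSum": (0.208, 258.4), "UCP-DIV": (3.750, 4687.5)}
fraction_of_roof = {
    name: achieved / attainable(ai) for name, (ai, achieved) in kernels.items()
}
```

The ridge point is about 14.9 FLOP/byte, so both kernels sit on the bandwidth-limited slope, and each reaches roughly 78% of its bandwidth-imposed ceiling, consistent with the HBM utilization figures in Table 4.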

6.6. Scaling studies

Strong scaling of a parallel application characterizes how its parallel execution time changes with the number of ranks for a fixed problem size. Figure 8(a) displays the speedup of the Enable simulation framework for the UT, KY and NJ population datasets, and Figure 8(b) shows the strong-scaling speedup for populations of 10M and 350M agents. Unlike the UT, KY and NJ datasets, the 10M- and 350M-agent datasets are synthetically generated; they are used in these scaling studies to demonstrate the scalability of the Enable framework at national-scale population sizes for which activity schedules are currently unavailable. The average time per simulation time step was measured using MPI_Wtime, and the strong-scaling speedup was computed by dividing the runtime on one rank by the runtime on multiple ranks.

Figure 8.

Strong scaling with real-world and synthetic population data of varying sizes.

The strong scaling results of the parallel Enable framework shown in Table 5 and Figure 8 demonstrate robust scalability across a wide range of MPI ranks, with performance trends strongly influenced by problem size. Speedup increases near-monotonically as the number of MPI ranks increases, with the 350M-agent scenario achieving the highest absolute speedup. This behavior indicates that larger problem sizes are better able to amortize communication and synchronization overheads, thereby sustaining scalability at higher concurrency levels. Similarly, the state-level simulations (UT, KY, NJ) show consistent scaling trends, with the largest instance (NJ, 8.8 M agents) achieving the highest speedup, further reinforcing the strong dependence of scalability on workload size.

Table 5.

Strong-scaling speedup.

Ranks	UT	KY	NJ	10M	350M
56	29.2	35.5	37.7	44.6	180.8
280	133.3	159.1	169.7	190.1	368.5
560	233.3	263.6	282.9	357.8	610.4
1120	416.7	477.3	518.6	649.3	922.3

Parallel efficiency follows the expected behavior for strong scaling, gradually decreasing as the number of MPI ranks increases because the computational work per rank shrinks while the relative communication cost grows. For a small population at a large rank count (UT at 1120 ranks), the strong-scaling efficiency is poor (about 37%), which is to be expected as communication overheads dominate. For larger workloads, however, such as the 350M-agent population, the efficiency remains above 80% at 1120 ranks despite the irregular computational workload of data-driven ABMs. This behavior indicates that the Enable simulator effectively balances computation and communication and is capable of exploiting large-scale parallel systems for high-fidelity epidemic modeling. Overall, the results demonstrate that Enable achieves scalable performance across both synthetic and state-level workloads, with larger problem sizes exhibiting superior scalability and efficiency, consistent with strong scaling theory in distributed agent-based simulations.
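The efficiencies quoted above follow from Table 5 by dividing each speedup by its rank count. A small sketch of that arithmetic, with the table values copied verbatim and illustrative function names:

```python
# Strong-scaling speedups copied from Table 5 (rows: ranks, columns: datasets).
DATASETS = ["UT", "KY", "NJ", "10M", "350M"]
SPEEDUP = {
    56:   [29.2, 35.5, 37.7, 44.6, 180.8],
    280:  [133.3, 159.1, 169.7, 190.1, 368.5],
    560:  [233.3, 263.6, 282.9, 357.8, 610.4],
    1120: [416.7, 477.3, 518.6, 649.3, 922.3],
}

def speedup(t_one_rank, t_p_ranks):
    """Strong-scaling speedup: single-rank runtime over p-rank runtime."""
    return t_one_rank / t_p_ranks

def efficiency(speedup_value, ranks):
    """Parallel efficiency: speedup divided by the number of ranks."""
    return speedup_value / ranks

for ranks, row in SPEEDUP.items():
    cells = ", ".join(f"{name}={100 * efficiency(s, ranks):.1f}%"
                      for name, s in zip(DATASETS, row))
    print(f"{ranks:>4} ranks: {cells}")
```

For the 350M column, the tabulated speedups at 56, 280 and 560 ranks exceed the rank count, so the computed efficiency is above 100% there and falls to roughly 82% at 1120 ranks.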

Weak scaling measures the performance of a parallel algorithm when the input size per rank remains constant. For this experiment, the number of ranks was varied from 1 to 1200. Each rank processed approximately 291.7 K people and 2.9 K locations, for a maximum of 350M people on 1200 ranks (GPUs). Figure 9 shows that the Enable framework achieves excellent weak scaling with almost constant runtime. The run times vary between 0.16 and 0.21 s with parallel efficiency sustained above 75% throughout, confirming that the UCP-DIV load balancing and asynchronous CPU–GPU pipeline absorb communication overheads at US-population scales.
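As a quick consistency check on these figures, the sketch below recomputes the total population and a lower bound on the weak-scaling efficiency. It assumes that efficiency is defined as baseline time over scaled time and that the fastest reported run time (0.16 s) serves as the baseline; neither assumption is stated explicitly in the text.

```python
def weak_efficiency(t_base, t_scaled):
    """Weak-scaling efficiency: baseline time over time at a larger rank count."""
    return t_base / t_scaled

AGENTS_PER_RANK = 291_700      # ~291.7 K agents per rank (from the text)
MAX_RANKS = 1200

total_agents = AGENTS_PER_RANK * MAX_RANKS
worst_case = weak_efficiency(0.16, 0.21)   # fastest vs. slowest reported run time
print(f"total agents ~ {total_agents / 1e6:.1f}M")
print(f"lower bound on weak-scaling efficiency ~ {100 * worst_case:.1f}%")
```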

Figure 9.

Weak scaling for up to 350M agents across 1 to 1200 MPI ranks (GPUs) with ∼291.7 K agents and ∼2.9 K locations per rank.

6.7. Model validity

6.7.1. Comparison with compartmental models

The primary goal of this paper is to report the architectural design and scaling performance of Enable on leadership-class heterogeneous architectures, as shown in the previous sections. In this section, we present small-scale unit test results using the SIR disease model to validate the ABM-powered Enable computations against those of a compartmental (ODE) model.

The unit test scenarios presented here consist of a population of N = 4000 agents divided into three compartments, namely Susceptible (S), Infectious (I) and Recovered (R) individuals, representing the SIR compartmental disease model. The disease dynamics are modeled by the following set of ordinary differential equations (ODEs):

\frac{dS}{dt} = -\beta \frac{S I}{N}, \qquad \frac{dI}{dt} = \beta \frac{S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I \tag{3}

where β is the effective transmission rate, or the probability of contracting the disease per unit of time upon contact between a susceptible and an infected individual, and γ is the rate at which infected individuals recover or die and are removed from the infected compartment. We do not consider dynamics of births and deaths in these unit tests.
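For reference, equation (3) can be integrated numerically in a few lines. The sketch below uses a classical fourth-order Runge-Kutta scheme with β = 0.5, γ = 0.1 and N = 4000; the single initial infection, the 200-day horizon and the step size are assumptions chosen for illustration, and the function names are hypothetical.

```python
def sir_rhs(s, i, r, beta, gamma, n):
    """Right-hand side of equation (3)."""
    new_infections = beta * s * i / n
    recoveries = gamma * i
    return (-new_infections, new_infections - recoveries, recoveries)

def integrate_sir(beta=0.5, gamma=0.1, n=4000.0, i0=1.0, days=200.0, dt=0.1):
    """Integrate the SIR system with classical fourth-order Runge-Kutta."""
    y = (n - i0, i0, 0.0)                      # (S, I, R) at t = 0
    trajectory = [(0.0,) + y]
    steps = int(round(days / dt))

    def shifted(state, slope, h):
        return tuple(v + h * k for v, k in zip(state, slope))

    for step in range(1, steps + 1):
        k1 = sir_rhs(*y, beta, gamma, n)
        k2 = sir_rhs(*shifted(y, k1, 0.5 * dt), beta, gamma, n)
        k3 = sir_rhs(*shifted(y, k2, 0.5 * dt), beta, gamma, n)
        k4 = sir_rhs(*shifted(y, k3, dt), beta, gamma, n)
        y = tuple(v + dt / 6.0 * (a + 2.0 * b + 2.0 * c + e)
                  for v, a, b, c, e in zip(y, k1, k2, k3, k4))
        trajectory.append((step * dt,) + y)
    return trajectory

trajectory = integrate_sir()
t_peak, _, i_peak, _ = max(trajectory, key=lambda row: row[2])
_, s_end, i_end, r_end = trajectory[-1]
print(f"peak I = {i_peak:.0f} at t = {t_peak:.1f}; final R = {r_end:.0f}")
```

Because the three right-hand sides sum to zero, S + I + R stays equal to N throughout the integration, which is a convenient sanity check when comparing against agent counts.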

The mathematical and computational framework used to define the disease model within the agent-based EpiHiper simulator is called a probabilistic timed transition system (PTTS). To model within-host disease progression in EpiHiper, the PTTS uses timed and probabilistic rules to determine how an individual’s illness evolves. For example, once an agent is exposed, the PTTS dictates the probability distribution of how many days it will take for the agent to transition into an infectious or symptomatic state. To model between-host disease transmission that governs how the contagion spreads across the social contact network, the PTTS evaluates variables like contact duration, environment, mask compliance, and base infectivity to calculate the probability of a transmission event occurring.

Transmissibility (τ) is a global scaling parameter in EpiHiper that defines the base rate (likelihood) of an infection being passed from an infectious agent to a susceptible agent during a single unit of contact time. Using the transmissibility parameter, EpiHiper computes a propensity score for a transmission event across any given edge in the social contact network instead of assuming a flat probability for every interaction. While transmissibility governs between-host disease transmission, another EpiHiper parameter, the dwell time, governs the within-host clock of the disease: it defines how many simulation iterations an agent remains in a specific health state before moving to the next one. To capture natural human variance, EpiHiper does not use fixed, uniform dwell times. Instead, it draws dwell times from continuous probability distributions with two key parameters: the shape parameter (α), which dictates the skew or peak of the distribution, and the scale/rate parameter (κ), which stretches or compresses the distribution along the time axis.

Figure 10 demonstrates that the SIR curves computed using the ODE model and Enable agree closely for a fully connected contact network (Figure 10(a)), a moderately connected contact network (Figure 10(b)) and a sparsely connected contact network (Figure 10(c)) of the 4000 agents. The values of β and γ for the ODE model are chosen to be equal to 0.5 and 0.1, respectively. The transmissibility parameter, τ, in EpiHiper for these comparisons is chosen to be equal to β/d where d is the average number of connections per agent while the shape parameter (α) and the scale/rate parameter (κ) are chosen to equal 1 and 10, respectively.
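The parameter mapping can be made concrete with a toy agent-based counterpart of the ODE model. The sketch below is not EpiHiper's PTTS: it assumes a fully connected network (so d = N − 1), synchronous daily steps, κ interpreted as a scale parameter (giving a mean dwell time of ακ = 10 = 1/γ), dwell times rounded up to whole days, and ten initially infectious agents to avoid early stochastic extinction. All names are hypothetical.

```python
import math
import random

def abm_sir_fully_connected(n=4000, beta=0.5, alpha=1.0, kappa=10.0,
                            initial_infected=10, max_days=1000, seed=1):
    """Daily-step stochastic SIR on a fully connected network (illustrative)."""
    rng = random.Random(seed)
    d = n - 1                       # connections per agent in a complete graph
    tau = beta / d                  # transmissibility, tau = beta / d
    S, I, R = 0, 1, 2
    state = [S] * n
    days_left = [0] * n             # remaining infectious days per agent

    def dwell():
        # Gamma dwell time with shape alpha and scale kappa, in whole days.
        return max(1, math.ceil(rng.gammavariate(alpha, kappa)))

    for agent in range(initial_infected):
        state[agent] = I
        days_left[agent] = dwell()

    history = []
    for _ in range(max_days):
        counts = (state.count(S), state.count(I), state.count(R))
        history.append(counts)
        num_infectious = counts[1]
        if num_infectious == 0:
            break
        # Every susceptible agent has num_infectious infectious contacts.
        p_infect = 1.0 - (1.0 - tau) ** num_infectious
        newly_infected = []
        for agent in range(n):
            if state[agent] == S:
                if rng.random() < p_infect:
                    newly_infected.append(agent)
            elif state[agent] == I:
                days_left[agent] -= 1
                if days_left[agent] == 0:
                    state[agent] = R
        for agent in newly_infected:      # synchronous update of new infections
            state[agent] = I
            days_left[agent] = dwell()
    return history

history = abm_sir_fully_connected()
final_s, final_i, final_r = history[-1]
print(f"days simulated: {len(history) - 1}, final S/I/R: {final_s}/{final_i}/{final_r}")
```

With these settings the basic reproduction number is roughly β × (mean dwell time) ≈ 5, so the stochastic run should end with nearly the whole population recovered, as the ODE solution does.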

Figure 10.

Comparison of SIR curves computed by an ODE model (β = 0.5; γ = 0.1) and Enable (τ = β/d; α = 1; κ = 10), where d is the average number of connections per agent.

6.7.2. Validity of parallel execution across multiple ranks

Ensuring deterministic and reproducible stochastic behavior across different processor configurations is essential for verification and validation in large-scale parallel simulations. During the development of the parallel epidemic simulation workflow, differences were observed in the behavior of random number generation depending on the distribution used and the number of MPI ranks involved.

EpiHiper makes use of a gamma distribution to model the duration of disease states, in particular the time between infection and becoming infectious (the latent, or exposed, period) and the duration during which an infected individual can transmit the disease (the infectious period). The gamma distribution is also used to model stochastic delays in transmission processes, such as the time until an infected individual becomes capable of transmitting, variation in infectiousness over time, and generation intervals between successive infections. The behavior of the gamma distribution provided by the C++ standard library was found to depend on the number of MPI processes. Even when the underlying pseudo-random number generator was initialized with the same seed, the resulting gamma-distributed samples differed when the processor count changed. Consequently, simulation runs executed with different MPI configurations produced different statistical realizations despite using identical seed values.

One possible workaround involved re-initializing the seed generator and the random number generator for each individual sample. Under this approach, each gamma sample was generated from a newly constructed generator initialized with the same seed, which ensured identical values across different MPI configurations. While this strategy restored determinism, it incurred a substantial performance penalty because repeated initialization of random number generation objects is computationally expensive. This overhead becomes significant in large-scale simulations that require billions of stochastic samples.

To address these limitations, a custom gamma distribution generator named GammaRNG was implemented that maintains statistical accuracy while supporting deterministic seeding across MPI ranks.

The custom GammaRNG class implements gamma-distributed random variate generation in C++ without any dependency on <random> or standard library distributions. This deliberate design choice is motivated by critical shortcomings of std::gamma_distribution. Although std::gamma_distribution is nominally portable, the C++ standard does not mandate a specific algorithm or bit-for-bit reproducibility across compilers, standard library implementations (libstdc++, libc++, MSVC STL), or platforms. As a result, two processes seeded identically but compiled with different toolchains, a routine occurrence in heterogeneous HPC environments, can produce entirely different sample sequences, undermining reproducibility in large-scale stochastic simulations. Furthermore, std::mt19937, the most common engine paired with std::gamma_distribution, carries a large internal state (624 32-bit words) and is poorly suited to MPI workloads with thousands of ranks, each requiring an independent, lightweight stream.

The underlying pseudorandom number generator in GammaRNG is instead the xoshiro256 algorithm Blackman and Vigna (2021), initialized from a single 64-bit seed value via the splitmix64 expansion scheme, which maps the scalar seed into a high-quality 256-bit PRNG state across four uint64_t words. Gamma variates for shape parameter α ≥ 1 are produced using the Marsaglia & Tsang acceptance-rejection method Marsaglia and Tsang (2000), which includes a squeeze test that bypasses the computationally expensive log(⋅) call in approximately 98% of draws. For α < 1, the Ahrens–Dieter boosting technique is applied, drawing from Γ(α + 1) and scaling by U^(1/α), where U is a uniform variate on (0, 1). Normal variates required internally are generated via the Marsaglia polar method Marsaglia and Bray (1964), with a spare-value cache to amortize the cost of paired draws. The class is fully instance-based: each GammaRNG object carries its own private PRNG state, so two instances constructed with distinct seeds (e.g., GammaRNG rng_a(100), rng_b(200)) produce statistically independent streams with no shared mutable state. This makes rank-local instantiation via GammaRNG rng(base_seed + rank) a natural and correct MPI usage pattern. Correctness was validated by drawing N = 1,000,000 samples for multiple (α, β) configurations, where β here denotes the gamma scale parameter rather than the transmission rate of Section 6.7.1, and confirming that the empirical mean and variance converge to their theoretical values αβ and αβ², respectively (e.g., Γ(2.0, 1.5): expected mean 3.0000, sampled 3.0006; expected variance 4.5000, sampled 4.4968).
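The ingredients listed above can be assembled into a compact reference sketch. The version below is written in Python for brevity and is not the authors' C++ class: it assumes the xoshiro256** variant (the text says only "xoshiro256"), maps 53 random bits to (0, 1] so that logarithms stay finite, and uses hypothetical method names.

```python
import math

MASK64 = (1 << 64) - 1

def _rotl(x, k):
    """Rotate a 64-bit integer left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK64

class GammaRNG:
    """Python sketch of a self-contained gamma sampler (not the authors' C++ class)."""

    def __init__(self, seed):
        # Expand one 64-bit seed into four 64-bit state words with splitmix64.
        self._s = []
        x = seed & MASK64
        for _ in range(4):
            x = (x + 0x9E3779B97F4A7C15) & MASK64
            z = x
            z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
            z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
            self._s.append(z ^ (z >> 31))
        self._spare = None  # cached second normal variate from the polar method

    def next_u64(self):
        # One xoshiro256** step (the variant is an assumption).
        s = self._s
        result = (_rotl((s[1] * 5) & MASK64, 7) * 9) & MASK64
        t = (s[1] << 17) & MASK64
        s[2] ^= s[0]
        s[3] ^= s[1]
        s[1] ^= s[2]
        s[0] ^= s[3]
        s[2] ^= t
        s[3] = _rotl(s[3], 45)
        return result

    def uniform(self):
        # 53 random bits mapped to (0, 1]; the open lower end keeps log() finite.
        return ((self.next_u64() >> 11) + 1) * (1.0 / (1 << 53))

    def normal(self):
        # Marsaglia polar method with a spare-value cache.
        if self._spare is not None:
            z, self._spare = self._spare, None
            return z
        while True:
            u = 2.0 * self.uniform() - 1.0
            v = 2.0 * self.uniform() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                f = math.sqrt(-2.0 * math.log(s) / s)
                self._spare = v * f
                return u * f

    def gamma(self, shape, scale=1.0):
        if shape <= 0.0 or scale <= 0.0:
            raise ValueError("shape and scale must be positive")
        if shape < 1.0:
            # Boost: draw Gamma(shape + 1) and scale by U**(1/shape).
            return self.gamma(shape + 1.0, scale) * self.uniform() ** (1.0 / shape)
        d = shape - 1.0 / 3.0
        c = 1.0 / math.sqrt(9.0 * d)
        while True:                              # Marsaglia-Tsang rejection loop
            x = self.normal()
            v = 1.0 + c * x
            if v <= 0.0:
                continue
            v = v * v * v
            u = self.uniform()
            if u < 1.0 - 0.0331 * x ** 4:        # squeeze test, no log needed
                return d * v * scale
            if math.log(u) < 0.5 * x * x + d * (1.0 - v + math.log(v)):
                return d * v * scale

rng = GammaRNG(12345)
samples = [rng.gamma(2.0, 1.5) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)
print(f"mean={mean:.4f} (theory 3.0), variance={var:.4f} (theory 4.5)")
```

Because the generator state lives entirely inside each instance and every arithmetic step is explicit, two instances built from the same seed replay the same sequence regardless of which library or toolchain is in use, which is the property the custom class is designed to provide.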

Using this custom generator, simulations executed with different numbers of MPI ranks generate consistent stochastic realizations, enabling reliable comparison of results and improving the robustness of large-scale parallel simulation studies with the Enable framework. Figure 11 illustrates the robustness of the S, I and R trajectories across a varying number of ranks using the custom generator.

Figure 11.

SIR epidemic trajectories showing the evolution of susceptible (S), infected (I), and recovered (R) populations over time.

7. Portability

Parallel performance results of the Enable framework reported in this paper are limited to its execution on the Frontier supercomputer which is a leadership-class computing platform. The underlying design of the Enable framework is, however, equally applicable to any heterogeneous CPU-GPU parallel computing architecture and the parallel execution performance trends and characteristics reported here are expected to remain unchanged with appropriate architecture-specific tuning of the process/rank-to-GPU/CPU mapping.

The use of Frontier in this work is a matter of evaluation platform, not design constraint. See Table 6. The core algorithmic contributions of Enable, viz., dynamic contact network generation from activity schedules, location-aware parallel partitioning via the Uniform Cost Partitioning (UCP-DIV) algorithm, asynchronous CPU–GPU task harmonization and memory-efficient location-centric contact representation, are each independent of any particular vendor architecture or machine. Specifically:

Table 6.

Portability of the Enable framework.

Feature	Frontier-specific element	Portable/General element
GPU kernels	HIP + AMD MI250X profiling via ROCm rocprof & rocprofiler-compute	CUDA implementation validated on DGX Volta V100; HIPIFY-PERL port
Inter-node communication	Evaluated on Slingshot-11 interconnect but exclusively uses standard MPI primitives	Compatible with any high-speed fabric (InfiniBand, Slingshot, Omni-Path, etc.)
Rank-to-GPU mapping	56 CPU cores and 8 GPUs per node (Frontier node topology)	Parameterized adaptive policy; applies to any CPU-GPU node ratio
Data pipeline	Evaluated with Utah, Kentucky, and New Jersey synthetic populations	Any UrbanPop/NHTS/H3/OSM dataset without platform-specific modification
Load-balancing algorithm	HIP UCP-DIV kernel, 30 ms per tick on MI250X	Same algorithm in CUDA; 67 ms per tick on V100

GPU kernel portability: The UCP-DIV load-balancing kernel was prototyped in CUDA and subsequently ported to HIP using HIPIFY-PERL. This two-stage development process was deliberate: the CUDA codebase targets NVIDIA GPUs (as demonstrated by the comparative profiling on a DGX Volta V100 system reported in Table 3), while the HIP translation supports AMD GPUs including the MI250X devices on Frontier as well as any future AMD architecture.

MPI-based distributed design: All inter-node communication relies exclusively on standard MPI primitives. The process-to-GPU mapping strategy (one rank per CPU core for compute-bound workloads, one rank per GPU for memory-bound workloads) is a general policy applicable to any heterogeneous CPU-GPU cluster that exposes GPU devices through a standard runtime (ROCm, CUDA, or oneAPI/SYCL with appropriate HIP translation layers). The specific core counts and GPU-per-node ratios used in our experiments reflect Frontier’s node topology, but the adaptive mapping logic is parameterized by those values, not hard-coded to them. Table 7 compares Enable with existing epidemiological simulators.
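One way to picture such a parameterized policy is sketched below. The block assignment of node-local ranks to GPUs, the assumption that ranks are placed on nodes in consecutive blocks, and the function names are all illustrative choices; the actual Enable mapping logic may differ.

```python
def ranks_per_node(cores_per_node, gpus_per_node, workload):
    """One rank per core for compute-bound work, one per GPU for memory-bound work."""
    if workload == "compute-bound":
        return cores_per_node
    if workload == "memory-bound":
        return gpus_per_node
    raise ValueError(f"unknown workload type: {workload!r}")

def gpu_for_rank(rank, cores_per_node, gpus_per_node, workload):
    """Map a global rank to a node-local GPU index (hypothetical block policy)."""
    per_node = ranks_per_node(cores_per_node, gpus_per_node, workload)
    local_rank = rank % per_node          # assumes ranks fill nodes consecutively
    return (local_rank * gpus_per_node) // per_node

# Frontier-like node: 56 CPU cores and 8 GPUs.
frontier = [gpu_for_rank(r, 56, 8, "compute-bound") for r in range(56)]
print(frontier)

# A different, hypothetical node shape needs only different parameters.
other = [gpu_for_rank(r, 64, 4, "compute-bound") for r in range(64)]
print(other)
```

With 56 cores and 8 GPUs this policy places seven consecutive node-local ranks on each GPU; changing the two node parameters is all that is needed to target a different node shape.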

Table 7.

Comparison of Enable with existing epidemiological simulators.

Tool	Strengths	Limitations
Enable (this work)	Hybrid CPU–GPU framework built on EpiHiper for city- to national-scale, activity-based ABMs; GPU-accelerated temporal contact generation and edge-aware load balancing (Uniform Cost Partitioning); demonstrated strong and weak scaling on leadership-class CPU–GPU systems.	Implementation complexity due to hybrid CPU–GPU design and GPU-aware load balancing; currently optimized for specific accelerator-rich HPC architectures; requires high-quality synthetic populations and activity data, which may not be readily available for all regions.
Covasim	Lightweight Python implementation that runs on standard desktop systems; well-suited for small to medium-scale simulations and rapid policy prototyping.	No built-in parallel or distributed execution support; limited scalability to very large populations or leadership-class systems.
FRED	C++ backend enables modest optimization and improved performance over interpreted models; moderate scalability for regional- to state-level models.	Not designed for true HPC-scale workloads; limited support for large, dynamic contact networks and exascale-class deployments.
GLEaM/GLEaMviz	Mature global metapopulation framework using census and empirical mobility data; well-established for global pandemic scenario analysis and policy evaluation Balcan et al. (2010); Van Den Broeck et al. (2011a).	Coarse-grained metapopulation representation (well-mixed patches rather than activity-based individuals); CPU-only client–server implementation with limited focus on GPU or exascale scaling; no explicit, schedule-derived person-to-person contact networks.
EMOD	Individual-based, multi-disease platform with detailed within-host and between-host dynamics; supports complex disease biology (e.g., malaria, HIV, TB, dengue) and heterogeneous interventions Bershteyn et al. (2018).	Primarily CPU-based (multi-threaded and distributed) with no published GPU backend; model configuration is complex and not oriented toward a single national synthetic population plus unified activity-diary pipeline; HPC use focuses on clusters rather than GPU-accelerated supercomputers.
AnyLogic (Epidemic ABMs)	General-purpose ABM/DES/SD environment with rich GUIs, visualization, and hybrid modeling; suitable for rapid development of small to medium-scale policy and operations models AnyLogic Company (2024).	Java-based engine with limited parallelism and no native GPU or MPI support; not intended for national-scale or exascale HPC studies; lacks built-in pipelines for national synthetic populations and high-resolution activity schedules.
chiSIM/CityCOVID	High-fidelity city-scale ABMs (e.g., ∼2.7 M residents and ∼1.2 M locations for Chicago) with activity-based mobility and detailed interventions; runs on CPU clusters using RepastHPC and workflow tools such as EMEWS Macal et al. (2018); Ozik et al. (2021).	Focused on large metropolitan areas rather than full national populations; CPU-only implementation with load balancing at the process level but no GPU acceleration.
FlameGPU 2	Domain-agnostic GPU ABM framework with excellent raw performance and scaling to hundreds of millions of agents in suitable models; supports single-node multi-GPU and ensemble simulations, with mature C++ and Python APIs Richmond et al. (2023, 2022).	Does not provide a built-in epidemiological or population-health pipeline; users must design all epidemiological state machines, contact structures, and data pipelines; multi-node, exascale deployments for a single national epidemic ABM are not the primary focus.
Flares	GPU-accelerated, epidemic-focused ABM framework built on the FlameGPU ecosystem; designed for rapid epidemic simulation and large scenario ensembles on GPUs Peng et al. (2024).	Emphasizes framework-level scalability more than integration with realistic synthetic populations; support for multi-node graph partitioning and national-scale activity-schedule–derived contact networks not reported.
Hmes	Scalable human mobility and epidemic simulation system with support for fast intervention modeling; uses realistic mobility and contact models to explore large what-if spaces Geng et al. (2022).	Implemented for CPU-based clusters, with no GPU backend reported; focuses on scalable mobility/contact modeling rather than hybrid CPU–GPU co-design; does not yet target national-scale population health twins with activity schedule-derived temporal networks.
Loimos	Large-scale ABM for pandemics on realistic social contact networks, emphasizing parallelization and scaling on HPC systems Kitson et al. (2025).	Current implementations are CPU-centric without GPU support; does not support dynamic, activity-schedule driven contact network generation comparable to Enable.
EPICAST 2.0	National-scale ABM for respiratory pathogen spread, built atop UrbanPop Tuccillo et al. (2023); explicit nighttime and daytime contact networks; supports vaccination, NPIs, and geography- and industry-specific intervention policies Alexander et al. (2025).	CPU-only distributed execution; no hourly, activity-based temporal contact networks; no GPU-accelerated contact network generation; no network-aware load balancing; no multi-node GPU scalability.

Architecture-neutral data pipeline: The upstream data pipeline (UrbanPop, NHTS-derived activity schedules, Uber H3 spatial indexing, OpenStreetMap) is fully decoupled from the simulation runtime. Synthetic populations for any geographic region can be ingested without platform-specific modification.

8. Conclusions and future work

We introduced Enable (Efficient National-scale Agent-Based Learning Environment), a hybrid CPU–GPU framework for scalable, data-driven agent-based population health simulations on heterogeneous architectures such as the Frontier supercomputer. Key contributions include: dynamic contact network generation from time-varying activity trajectories, enabling realistic spatio-temporal mobility patterns and laying the foundation for real-time data assimilation; a GPU-resident implementation of the Uniform Cost Partitioning (UCP-DIV) algorithm (first in CUDA, subsequently ported to HIP) that balances location-based edge distributions across ranks, reducing inter-process communication and memory overhead; and an asynchronous CPU–GPU execution pipeline that overlaps contact network construction with disease propagation, minimizing host-device communication overheads. Strong and weak scaling evaluations on synthetic and real-world datasets (UrbanPop, Uber H3, OpenStreetMap) show weak-scaling efficiency above 75% on up to 1200 GPUs and strong-scaling efficiency above 80% at 1120 ranks for the 350M-agent population. Future work will target full Frontier-scale deployment and integration of real-time data pipelines for a broader class of policy and emergency response scenarios.

Footnotes

Acknowledgements

This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan. This research used resources at the Oak Ridge Leadership Computing Facility which is a DOE Office of Science User Facility. This research was sponsored by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing under Award Number DE-SC-ERKJ422.

ORCID iDs

Sudip K. Seal

Ritwick Mishra

John Gounley

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author biographies

Maksudul Alam is a Research Scientist in the Computer Science and Mathematics Division at Oak Ridge National Laboratory. He received his Ph.D. in Computer Science from Virginia Tech. His research interests include high-performance computing, machine learning, mathematical modeling, graph algorithms, performance optimization and benchmarking, and nonlinear optimization. His work spans the design and implementation of parallel algorithms for complex problems in network science, epidemiological simulation, artificial intelligence, and scientific computing on modern accelerated architectures.

Dr. Sudip Seal is a Distinguished Scientist at Oak Ridge National Laboratory and leads the Systems and Decision Sciences Group within the Computer Science and Mathematics Division. He is a Joint ORNL-UT Faculty member in the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville. He holds a PhD in computer engineering and a PhD in theoretical high energy physics, and specializes in the design, development and analysis of scalable algorithms for traditional and AI-driven methods in large-scale science and engineering applications. Dr. Seal is an Associate Editor of the Journal of Parallel and Distributed Computing and a Senior Member of both IEEE and ACM.

Joseph V. Tuccillo is a Research Scientist in the Geospatial Science and Human Security Division at Oak Ridge National Laboratory. His research interests include spatial microsimulation and geodemographic analysis, with applications in areas including environmental hazards, energy affordability, and public health.

Elizabeth McBride is an R&D Scientist in the Location Intelligence Group at Oak Ridge National Lab. She specializes in human travel behavior, and for over 10 years, she has been working with travel surveys and other mobility data to better understand people’s travel decisions.

Ritwick Mishra is a PhD student in the Department of Computer Science at the University of Virginia. His research interests lie at the intersection of network science and AI for multi-agent systems, with a specific focus on network cascades. He has authored multiple peer-reviewed publications presented at conferences including AAAI, AAMAS, and KDD.

Adam Spannaus is a research scientist with the Advanced Computing for Health Sciences section at Oak Ridge National Lab. In brief, his research interests lie in the fertile intersection of data science, statistics and health sciences, and using insights gained from solving real-world problems to inform and push-forward theoretical understanding. He is the primary developer of the FrESCO library (Framework for Exploring Scalable Computational Oncology) as part of the joint DOE-NCI MOSSAIC (Modeling Outcomes Using Surveillance Data and Scalable Artificial Intelligence for Cancer) project. Additionally, he is interested in developing uncertainty quantification and explainability methods for high-risk clinical decision settings, especially as related to cancer informatics. Presently he is developing uncertainty quantification methods from probabilistic and geometric viewpoints utilizing tools from Bayesian inference and Topological Data Analysis respectively.

Sifat Moon is a Research Scientist for HPC and AI in Health within the Biostatistics and Multiscale Systems Modeling Group in the Computational Sciences and Engineering Division. Prior to joining ORNL, she was a Postdoctoral Research Associate in the Network Systems Science and advanced Computing (NSSAC) section at the University of Virginia. She received her Ph.D. in Computer Engineering from Kansas State University in 2021 for her dissertation “Modeling and analysis of stochastic contagion processes over large networks from limited data.” Dr. Moon’s research areas include computational science, scalable data mining, graph algorithms, machine learning, and discrete algorithms on HPC systems. In her Ph.D., she worked on solving discrete algorithmic questions and developed large network systems to solve real-world spatiotemporal concerns. Before embarking on her Ph.D. journey, Dr. Moon gained valuable industry experience as a software engineer at Samsung Research & Development Institute, where she contributed to the design of sketching algorithms for the Samsung drawing engine and enhanced its performance. Her diverse background and research interests position her as a skilled expert in computational network science.

Dr. James Nutaro is a Distinguished Research & Development Scientist and Group Lead at Oak Ridge National Laboratory in the Computational Systems Engineering & Cybernetics group. Dr. Nutaro received a Doctorate in Computer Engineering in 2003 from the Electrical & Computer Engineering Department at the University of Arizona in Tucson, Arizona. His research interests center on modeling and simulation methods with an emphasis on systems integration and modeling systems from data. Dr. Nutaro is an associate editor for the journals SIMULATION and ACM Transactions on Modeling and Computer Simulation. He is a member of the Society for Computer Simulation and a senior member of the IEEE.

John Gounley is a computational scientist in the Computational Sciences and Engineering Division at Oak Ridge National Laboratory, where he leads the Scalable Biomedical Modeling group. John’s research focuses on scalability of algorithms for biomedical simulations and data.

Heidi A. Hanson received an MS and PhD in Demography and Sociology from the University of Utah. She is the Group Leader of the Biostatistics and Biomedical Informatics Group in the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory (ORNL), Oak Ridge, TN, USA. Prior to her tenure at ORNL, she served as the co-director of the Surgical Population Analytic Research Core in the Department of Surgery at the University of Utah and the Assistant Director of Research at the Utah Population Database. Her research interests include population-scale biomedical informatics, life-course epidemiology, statistical demography, and the development of machine learning and AI methods for multimodal health data. She leads efforts to integrate large-scale clinical, genomic, and environmental data to improve disease surveillance and understand temporal and intergenerational patterns in cancer and other complex diseases.

References

Alam

Khan

(2017) Parallel algorithms for generating random networks with given degree sequences. International Journal of Parallel Programming 45(1): 109–127. https://doi.org/10.1007/s10766-015-0389-y

Alam

Khan

Vullikanti

, et al. (2016) An efficient and scalable algorithmic method for generating large-scale random graphs SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 372–383.

Alexander

Harris

Kitson

, et al. (2025) Epicast 2.0: A large-scale, demographically detailed, agent-based model for simulating respiratory pathogen spread in the United States. https://doi.org/10.48550/arXiv.2504.03604

AnyLogic Company (2024) Anylogic Multimethod Simulation Modeling Environment. https://www.anylogic.com/

Balcan

Gonçalves

, et al. (2010) Modeling the spatial spread of infectious diseases: the GLobal epidemic and mobility computational model. Journal of Computational Science 1(3): 132–145. https://doi.org/10.1016/j.jocs.2010.07.002

Barrett

Bisset

Eubank

, et al. (2008) EpiSimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks Procs. of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–12.

Bershteyn

Gerardin

Bridenbecker

, et al. (2018) Implementation and applications of EMOD, an individual-based multi-disease modeling platform. Pathogens and Disease 76(5): 59. https://doi.org/10.1093/femspd/fty059

Bhattacharya

Chen

Hoops

, et al. (2023) Data-driven scalable pipeline using national agent-based models for real-time pandemic response and decision support. Int. J. High Perform. Comput. Appl 37(1): 4–27. https://doi.org/10.1177/10943420221127034

Bhattacharya, Machi, Chen, et al. (2024) Novel multi-cluster workflow system to support real-time HPC-enabled epidemic science: investigating the impact of vaccine acceptance on COVID-19 spread. Journal of Parallel and Distributed Computing 191: 104899. https://doi.org/10.1016/j.jpdc.2024.104899

Bisset, Aji, Marathe, et al. (2012) High-performance biocomputing for simulating the spread of contagion over large contact networks. BMC Genomics 13(2): S3. https://doi.org/10.1186/1471-2164-13-S2-S3

Blackman and Vigna (2021) Scrambled linear pseudorandom number generators. ACM Transactions on Mathematical Software 47(4): 1–32. https://doi.org/10.1145/3460772

Blelloch (1990) Prefix Sums and their Applications. Carnegie Mellon University. Technical Report CMU-CS-90-190.

Chao, Halloran, Obenchain, et al. (2010) FluTE, a publicly available stochastic influenza epidemic simulation model. PLoS Computational Biology 6(1): e1000656. https://doi.org/10.1371/journal.pcbi.1000656

Chen, Bhattacharya, Hoops, et al. (2024) Role of heterogeneity: National scale data-driven agent-based modeling for the US COVID-19 scenario modeling hub. Epidemics 48: 100779. https://doi.org/10.1016/j.epidem.2024.100779

Chen, Hoops, Mortveit, et al. (2025) EpiHiper: a high performance computational modeling framework to support epidemic science. PNAS Nexus 4(1): 557. https://doi.org/10.1093/pnasnexus/pgae557

Claritas (2018) Assessing the role of urbanicity. https://nhts.ornl.gov/assets/Assessing_the_Role_of_Urbanicity.pdf

Colasanti, MacLachlan, Silverman, et al. (2022) Using agent-based models to address non-communicable diseases: a review of models and their application to policy. The Lancet 400: S33. https://doi.org/10.1016/s0140-6736(22)02243-7

Emrich, Suslov and Judex (2007) Fully agent-based modellings of epidemic spread using AnyLogic. Proc. EUROSIM 2007 - 6th EUROSIM Congress on Modeling and Simulation, p. 7.

Geng, Zheng, Han, et al. (2022) HMES: a scalable human mobility and epidemic simulation system with fast intervention modeling. IEEE Smartworld, Ubiquitous Intelligence & Computing, Scalable Computing & Communications, Digital Twin, Privacy Computing, Metaverse, Autonomous & Trusted Vehicles.

Grefenstette, Brown, Rosenfeld, et al. (2013) FRED (A framework for reconstructing epidemic dynamics): an open-source software system for modeling infectious diseases and interventions. BMC Public Health 13(1): 940. https://doi.org/10.1186/1471-2458-13-940

Hijma, Heldens, Sclocco, et al. (2023) Optimization techniques for GPU programming. ACM Computing Surveys 55(11): 239:1–239. https://doi.org/10.1145/3570638

Kerr, Stuart, Mistry, et al. (2021) Covasim: an agent-based model of COVID-19 dynamics and interventions. PLoS Computational Biology 17(7): e1009149. https://doi.org/10.1371/journal.pcbi.1009149

Kitson, Costello, Chen, et al. (2025) Pandemics in silico: scaling agent-based simulations on realistic social contact networks. 2025 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 484–496.

Kolenikov (2014) Calibrating survey data using iterative proportional fitting (raking). The Stata Journal 14(1): 22–59. https://doi.org/10.1177/1536867x1401400104

Li, Zhang and Pagán (2016) Social norms and the consumption of fruits and vegetables across New York City neighborhoods. Journal of Urban Health 93: 244–255. https://doi.org/10.1007/s11524-016-0028-y

Lu and Habre (2023) Impacts of distinct travel behaviors on potential air pollution exposure measurement error. Atmospheric Environment 306: 119820. https://doi.org/10.1016/j.atmosenv.2023.119820

Luke, Hammond, Combs, et al. (2017) Tobacco town: computational modeling of policy options to reduce tobacco retailer density. American Journal of Public Health 107(5): 740–746. https://doi.org/10.2105/AJPH.2017.303685

Macal, Collier, Ozik, et al. (2018) chiSIM: an agent-based simulation model of social interactions in a large urban area. Proceedings of the 2018 Winter Simulation Conference (WSC), pp. 810–821.

Macal, Ozik, Collier, et al. (2021) CityCOVID: a city-scale agent-based model of COVID-19. https://www.anl.gov/dis/citycovid

Manne and Sørevik (1995) Optimal partitioning of sequences. Journal of Algorithms 19(2): 235–249. https://doi.org/10.1006/jagm.1995.1035

Marsaglia and Bray (1964) A convenient method for generating normal variables. SIAM Review 6(3): 260–264. https://doi.org/10.1137/1006063

Marsaglia and Tsang (2000) A simple method for generating gamma variables. ACM Transactions on Mathematical Software 26(3): 363–372. https://doi.org/10.1145/358407.358414

Nagle, Buttenfield, Leyk, et al. (2014) Dasymetric modeling and uncertainty. Annals of the Association of American Geographers 104(1): 80–95. https://doi.org/10.1080/00045608.2013.843439

Olstad and Manne (1995) Efficient partitioning of sequences. IEEE Transactions on Computers 44(11): 1322–1326. https://doi.org/10.1109/12.475128

OpenStreetMap contributors (2017). https://www.openstreetmap.org

Ozik, Wozniak, Collier, et al. (2021) A population data-driven workflow for COVID-19 modeling and learning. The International Journal of High Performance Computing Applications 35(5): 483–499. https://doi.org/10.1177/10943420211035164

Peng, Liu and Yang (2024) FLARES: a framework for large-scale agent-based rapid epidemic simulation. Procs. of the 22nd IEEE International Conference on Industrial Informatics, pp. 1–6.

Pinar and Aykanat (2004) Fast optimal load balancing algorithms for 1D partitioning. Journal of Parallel and Distributed Computing 64(8): 974–996. https://doi.org/10.1016/j.jpdc.2004.05.003

Poscablo (2025) GPU prefix-sum/parallel-scan implementation in CUDA. https://github.com/mark-poscablo/gpu-prefix-sum/tree/master

Raab, Lenger, Stickler, et al. (2022) An initial comparison of selected agent-based simulation tools in the context of industrial health and safety management. Procs. of the 8th International Conference on Computer Technology Applications, pp. 106–112.

Richmond, Chisholm, Heywood, et al. (2022) FLAME GPU 2.0.0-rc. https://doi.org/10.5281/zenodo.7434228

Richmond, Chisholm, Heywood, et al. (2023) FLAME GPU 2: a framework for flexible and performant agent-based simulation on GPUs. Software: Practice and Experience 53(8): 1659–1680. https://doi.org/10.1002/spe.3207

Scott, Livingston, Hart, et al. (2015) SimDrink: an agent-based NetLogo model of young, heavy drinkers for conducting alcohol policy experiments. The Journal of Artificial Societies and Social Simulation 19: 10. https://doi.org/10.18564/jasss.2943

Technologies (2022) H3: A Hexagonal Hierarchical Spatial Index. https://h3geo.org

Tracy, Cerdá and Keyes (2018) Agent-based modeling in public health: current applications and future directions. Annual Review of Public Health 39(1): 77–94. https://doi.org/10.1146/annurev-publhealth-040617-014317

Tuccillo (2023) Downscaling synthetic populations to realistic residential locations. Procs. of the US-RSE Conference 2023: Software Enabled Discovery and Beyond. Chicago, IL, USA, pp. 1–3.

Tuccillo, Stewart, Rose, et al. (2023) UrbanPop: a spatial microsimulation framework for exploring demographic influences on human dynamics. Applied Geography 151: 102844. https://doi.org/10.1016/j.apgeog.2022.102844

US Department of Transportation, Federal Highway Administration (2017) National Household Travel Survey. https://nhts.ornl.gov

Van den Broeck, Gioannini, Gonçalves, et al. (2011a) The GLEaMviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale. BMC Infectious Diseases 11: 37. https://doi.org/10.1186/1471-2334-11-37

Van den Broeck, Gioannini, Gonçalves, et al. (2011b) The GLEaMviz computational tool, a publicly available software to explore realistic epidemic spreading scenarios at the global scale. BMC Infectious Diseases 11: 37. https://doi.org/10.1186/1471-2334-11-37

Williams, Waterman and Patterson (2009) Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52(4): 65–76. https://doi.org/10.1145/1498765.1498785

Xiao, Andelfinger, Eckhoff, et al. (2019) A survey on agent-based simulation using hardware accelerators. ACM Computing Surveys 51(6): 1–35. https://doi.org/10.1145/3291048

Yang, Laverdiere, Hauser, et al. (2024) A baseline structure inventory with critical attribution for the US and its territories. Scientific Data 11(1): 502. https://doi.org/10.1038/s41597-024-03219-x