PANORAMA: An approach to performance modeling and diagnosis of extreme-scale workflows

Abstract

Computational science is well established as the third pillar of scientific discovery and is on par with experimentation and theory. However, as we move closer toward the ability to execute exascale calculations and process the ensuing extreme-scale amounts of data produced by both experiments and computations alike, the complexity of managing the compute and data analysis tasks has grown beyond the capabilities of domain scientists. Thus, workflow management systems are absolutely necessary to ensure current and future scientific discoveries. A key research question for these workflow management systems concerns the performance optimization of complex calculation and data analysis tasks. The central contribution of this article is a description of the PANORAMA approach for modeling and diagnosing the run-time performance of complex scientific workflows. This approach integrates extreme-scale systems testbed experimentation, structured analytical modeling, and parallel systems simulation into a comprehensive workflow framework called Pegasus for understanding and improving the overall performance of complex scientific workflows.

Keywords

Performance modeling extreme scale scientific workflow

1 Introduction

Modern science often requires the processing and analysis of vast amounts of data in search of postulated phenomena and the validation of core principles through the simulation of complex system behaviors and interactions. Not surprisingly, this trait is common in fields as diverse as astronomy, bioinformatics, physics, and ocean and ice sheet modeling.

In order to support the computational and data needs of today’s science, new knowledge must be gained about how to deliver the growing, distributed and high-performance computing (HPC) capabilities to the scientist’s desktop in an accessible, reliable and scalable way. As applications and infrastructure are growing in scale and complexity, our understanding of application behavior is not keeping up with technological and algorithmic advances. As a result, it is difficult for scientists and infrastructure providers to determine the expected and realistic behavior of a complex scientific workflow. Furthermore, when failures or performance anomalies are encountered, it is extremely difficult to pinpoint the root cause of the problem, and to correct or mitigate the issue.

Many science applications today are composed of custom scripts that tie several community or custom codes together. These scripts are frequently edited by hand to suit each new computational platform or to expand the parameter and data sets they operate on. The computing infrastructure that these applications run on is very complex and diverse. It includes NSF-funded systems, such as Open Science Grid,¹ XSEDE,² and cloud-based infrastructures,³ DOE computing facilities at Oak Ridge National Laboratory, Lawrence Berkeley National Laboratory, and others. Input, output, and intermediate data are transferred over high-speed national and regional transit networks such as Internet2⁴ ESnet.⁵

Given the complexity of applications and infrastructures, modeling their behavior is very tedious and sometimes infeasible. Part of the solution for understanding the behavior of the applications is to formalize their structure in the form of a workflow. Workflows are able to declaratively express complex applications that are composed of several individual codes with data and control dependencies. A workflow management system can take this application description and manage and optimize its execution on a variety of computational platforms.

Tools can be created that analyze the workflow and that develop models of expected behavior given a particular computing environment, such as an HPC system, clusters distributed over wide area networks, or clouds. Having a coupled model of the application and execution environment, decisions can be made about resource provisioning, application task scheduling, data management within the application, etc. As the application is executing, correlated real-time monitoring of the application and infrastructure can be used to verify the application behavior, to detect and diagnose faults, and to support adaptivity.

This paper describes the PANORAMA⁶ project that aims to further our understanding of the behavior of scientific workflows as they are executing in heterogeneous environments. The central contribution of this article is a description of the PANORAMA approach for modeling and diagnosing the run-time performance of complex scientific workflows. This approach integrates extreme-scale systems testbed experimentation, structured analytical modeling and parallel systems simulation into a comprehensive workflow framework called Pegasus (Deelman et al., 2005, 2015) for understanding and improving the overall performance of complex scientific workflows on today’s and future-generation architectures.

The paper is organized as follows. Section 2 describes applications that are motivating our work. Section 3 describes the Pegasus workflow management system that manages the application execution. Section 4 describes the tools and techniques we use to capture and profile the observed behavior of workflows. Section 5 shows our approach to analytical workflow modeling. Section 6 describes our approach to correlating the real-time application and infrastructure monitoring data, while Section 7 explores anomaly detection based on that data. Section 5.3 describes how analytical models can be augmented with detailed simulations. Section 9 describes related work. Section 10 presents the work ahead and concludes the paper.

2 Motivating examples

We have identified two important application use-cases involving advanced workflows that are the initial focus of our modeling efforts: parameter refinement workflows for the Spallation Neutron Source (SNS) and climate simulation automation for the Accelerated Climate Modeling for Energy (ACME) project.

2.1 SNS

The SNS (Mason et al., 2006) is a DOE research facility at Oak Ridge National Laboratory that provides pulsed neutron beams for scientific and industrial research. SNS uses a particle accelerator to impact a mercury-filled target with short proton pulses to produce neutrons by the process of spallation. A wide variety of experiment instruments provide different capabilities for researchers across a broad range of disciplines, including: physics, chemistry, materials science, and biology.

SNS hosts hundreds of researchers every year who conduct experiments within short reservations of a few days to a few weeks. Providing these researchers with efficient, user-friendly and highly configurable workflows that reduce the turnaround time from data collection to analysis and back is essential to the success of SNS. Figure 1 shows the data flow for a typical SNS instrument, in this case NOMAD. Neutron events scattered from the scientific sample under investigation are collected by an array of detectors. These raw events are processed into a representation familiar to the domain scientist depending on the type of experiment. For NOMAD, the reduced form is a powder diffraction pattern. This reduced data is then analyzed and compared with materials simulations to extract scientific information.

Figure 1:

SNS example workflow.

In addition to workflows for processing data from SNS experiments, there are also workflows for data analysis and simulation to support and guide SNS experiments, and to validate computer models against experimental data. These workflows automate tedious manual processes to reduce time to solution and improve researcher productivity. In collaboration with the Center for Accelerating Materials Modeling (CAMM) of SNS data, we are adapting a workflow that executes simulations to support experimental design and the validation of molecular models as a use case for the PANORAMA project. The workflow executes an ensemble of molecular dynamics and neutron scattering simulations to optimize a model parameter value. This workflow is being used to investigate temperature and hydrogen charge parameters for models of water molecules. The results are compared with data from QENS experiments using the BASIS instrument at SNS.

An illustration of this parameter refinement workflow is shown in Figure 2. The workflow executes one batch job to unpack the database along with five jobs for each set of parameters. First, each set of parameters is fed into a series of parallel molecular dynamics simulations using NAMD (Phillips et al., 2005). The first simulation calculates an equilibrium and the second inputs that equilibrium and calculates the production dynamics. Each NAMD simulation runs on 288 cores for 1–6 hours. The output from the MD simulations has the global translation and rotation removed using AMBER’s (Case et al., 2015) ptraj utility⁷ and is passed into Sassena (Lindner and Smith, 2012) for the calculation of coherent and incoherent neutron scattering intensities from the trajectories. Each Sassena simulation runs on 144 cores for up to 6 hours. The final outputs of the workflow are transferred to the user’s desktop and loaded into Mantid (Arnold et al., 2014) for analysis and visualization.

Figure 2:

The SNS refinement workflow executes a parameter sweep of molecular dynamics and neutron scattering simulations to optimize the value for a target parameter to fit experimental data.

2.2 ACME

The ACME project is using coupled models of ocean, land, atmosphere and ice to study the complex interaction between climate change and societal energy requirements. One of the flagship workflows of this effort is the fully-coupled climate model running at high resolution.

The complete workflow for ACME is illustrated in Figure 3. The ACME climate modeling effort involves the interaction of many different software and hardware components distributed across computing resources at several DOE laboratories. As part of the ACME project, many of the workflow activities that were previously done manually are being automated. The goal is to have an automated, end-to-end workflow that provides full data provenance.

Figure 3:

The complete Accelerated Climate Modeling for Energy (ACME) includes many interacting components distributed across DOE labs.

One important step towards that goal is to automate the small portion of the full workflow that involves running the ACME climate model. The PANORAMA project is developing a workflow that automates the manual effort involved in monitoring and resubmitting the model code in case of failures, and provides periodic reporting for validation of science outputs. The workflow, illustrated in Figure 4, divides a large climate simulation into several stages. Each stage completes a portion of the total target simulation time. For example, a 40-year simulation may be divided into eight 5-year stages. This enables each stage of the workflow to be completed within the maximum walltime permitted for batch jobs on the target DOE leadership class computing system. Restart files generated at the end of each stage are used as input to the next stage in order to continue the simulation. Each stage also produces history files, which are used by the workflow to automatically compute summary data called climatologies. This climatology data can be reviewed periodically by project scientists to ensure that the simulation is progressing as expected, so that problems can be identified, and corrections made, before computing resources are wasted. Both the history files and the climatologies are transferred to HPSS and CADES (open infrastructure) for long-term storage and future analysis.

Figure 4:

The ACME workflow runs one climate simulation in several stages. The output of each stage is used to compute climatologies for validation. All outputs are stored in HPSS and CADES for archiving and further analysis.

3 Workflow execution

In order to facilitate workflow creation, scientists need to be able to formulate the workflows in a way that is meaningful to them, in a resource-independent way, using high-level abstractions to specify the structure of the analysis and the data to be operated on (via a visual or textual representation). This abstract workflow (or workflow instance) is important because it uniquely identifies the analysis to be conducted at the application level, without including operational details of the execution environment. The workflow instance can be published along with the results of a computation to describe how a particular data product was obtained. This approach supports reproducibility, a cornerstone of the scientific method. In order to support the use of abstract workflow specifications, planning technologies are needed to automatically interpret and map user-defined, abstract workflows onto the available resources.

Workflow technologies have been demonstrated to be very effective in exploiting coarse-grain parallelism in applications running on distributed infrastructures such as grids and clouds (Deelman et al., 2009). Our system, Pegasus (Deelman et al., 2015), focuses on scalable, reliable and efficient workflow execution on a wide range of systems, from user’s desktops to leadership class machines (Deelman et al., 2006; Vahi et al., 2013b). The cornerstone of our approach is the separation of the workflow description from the description of the execution environment, which results in: (1) workflows that are portable across execution environments; and (2) the ability for the workflow management system to make performance- and reliability-focused decisions at “compile time” and/or at “runtime”. Pegasus pioneered the use of planning in scientific workflow systems (Deelman et al., 2002). It takes a resource-independent, or abstract, workflow description, automatically locates the input data and computational resources necessary for workflow execution, maps/plans this description onto the available execution resources, then reliably executes the plan. When errors occur, Pegasus tries to recover when possible by retrying tasks, by retrying the entire workflow, by providing workflow and task-level checkpointing, by re-planning portions of the workflow, by trying alternative data sources for staging data, and, when all else fails, by providing a rescue workflow containing a description of only the work that remains to be done. A Pegasus workflow can process millions of tasks and terabytes of data, so storage management is important. Pegasus has a sophisticated model for reasoning about and optimizing data transfers and it cleans up storage as the workflow is executed so that data-intensive workflows have enough space to execute on storage-constrained resources (Srinivasan et al., 2014).

4 Characterizing workflows

Workflow behavior modeling and analysis methods need accurate execution event traces and monitoring information to build models of system behavior and conduct anomaly detection and diagnosis. Here, we describe some of the tools and techniques we are using in PANORAMA.

4.1 Workflow monitoring

Workflow-level monitoring collects information about events in the execution of a workflow that are critical to understanding its behavior. The STAMPEDE (Gunter et al., 2011; Samak et al., 2011a, b; Vahi et al., 2013a) framework collects monitoring data by parsing and correlating log files that are generated by the workflow management system. These logs contain information about the submit time of the jobs, the time when the jobs are sent to the remote system for execution, and the time when they are reported finished or failed. The logs also have information about the progress of the workflow through time, for example, when the data dependencies have been satisfied. This data can be used to compute runtimes, queue delays, and other important performance metrics.

4.2 Task monitoring and profiling

Task-level monitoring can be used to measure and predict the resource requirements of workflows. In addition to workflow-level data, the STAMPEDE framework also collects information about the execution of workflow tasks on remote resources using Kickstart (Vöeckler et al., 2006), a monitoring tool launched along with the computation and data management tasks to collect information about the behavior of the tasks and their execution environment. As part of the DOE dV/dt project⁸ we added functionality to Kickstart to automatically capture resource usage metrics of workflow tasks (Juve et al., 2014). This functionality uses operating system monitoring facilities as well as system call and library call interposition to collect fine-grained profile data that includes process I/O, file accesses, runtime, memory usage, and CPU utilization. Using this information, we developed techniques based on machine learning algorithms for automatically modeling and predicting the resource usage of workflow tasks (Ferreira da Silva et al., 2013b, 2015), such as runtime, disk usage and memory consumption. This approach combined with a Monitor Analyze Plan Execute (MAPE-K) autonomic computing loop (Kephart and Chess, 2003), is used for online estimation of task needs as the workflow executes. Experimental results show that this online process produces more accurate estimates than a comparable offline method. Table 1 shows an example of using this approach for a bioinformatics application (Epigenomics) and quantifies the prediction errors.

Table 1:

Average estimation errors of task runtime, I/O write, and peak memory usage for the Epigenomics workflow.

Task	Estimation	Runtime	I/O write	Memory
		Average error	Average error	Average error
		(%)	(%)	(%)
fastqSplit	Offline	8.36	3.28	9.14
	Online	8.36	3.28	9.14
filterContams	Offline	59.31	109.81	102.83
	Online	29.13	5.35	8.15
sol2sanger	Offline	54.93	98.20	96.68
	Online	34.74	1.23	1.96
fast2bfq	Offline	27.13	128.18	99.98
	Online	17.09	15.11	10.65
map	Offline	23.62	0.00	21.07
	Online	1.39	0.00	3.33
mapMerge	Offline	53.74	93.34	1.01
	Online	10.22	9.39	1.00
pileup	Offline	6.00	4.17	49.42
	Online	5.11	3.87	19.31

4.3 Application monitoring

As part of the PANORAMA project we will develop the capability to capture monitoring data that is generated by application tasks. Unlike task-level monitoring data, which is collected using operating system facilities, application-level monitoring data cannot be computed by tools outside the application because it requires access to the application’s internal state. This includes progress information such as the percent of the task that has been completed, performance metrics such as the number of timesteps being completed per hour, and diagnostic information such as error messages. We will provide an interface for application code to easily export this data in a structured format. This interface will also enable us to collect application metadata that is only known at runtime.

4.4 Infrastructure monitoring

The previous monitoring techniques are not sufficient for modeling the behavior of scientific workflows. Additional information about the infrastructure is also required to understand the observed performance of workflows. Network performance monitoring using tools such as perfSONAR (Tierney et al., 2009), for example, is crucial to discover “soft failures” in the network, where the network seems to be up, but is performing at just a fraction of it’s peak efficiency, which translates to poor performance for data-intensive workflows. perfSONAR instances can run continuous checks for latency changes and packet loss, and run periodic throughput tests using a suite of already available tools such as BWCTL, OWAMP, and NDT. There are currently around 1200 perfSONAR instances worldwide, providing us with critical network performance information.

5 Analytical models

5.1 Performance modeling with Aspen

Our Aspen system (Lee et al., 2015; Spafford and Vetter, 2012), was designed to bridge the gap between structured analytical models and functional simulation. Specifically, Aspen’s domain-specific language (DSL), was designed to create analytical models of traditional scientific applications and HPC architectures, in contrast to scientific workflows. In this work, we are extending the Aspen modeling framework to include workflow scenarios and resources to be able to develop structure analytical models and predictions for scientific workflows.

Aspen’s DSL approach to analytical performance modeling provides several advantages over traditional approaches. For instance, Aspen’s kernel construct helps to fully capture control flow, and preserves more algorithmic information than traditional frameworks (Spafford and Vetter, 2012). Aspen’s model also captures the important concepts of data capacities and data movement in very explicit terms. Similarly, the abstract machine model is more expressive than frameworks that reduce machine specifications to a small set of parameters. The formal language specification forces scientists to construct models that can be syntactically checked and consumed by analysis tools. This formal specification also facilitates collaboration between domain experts and computer scientists, and enables scientists to include application specific parameters in their model definitions, which would otherwise be difficult to infer. Finally, Aspen is modular, and therefore it is easy to compose, reuse, and extend performance models. With these features, Aspen can help answer important application-specific questions.

Aspen is complementary to other performance prediction techniques including simulation (Janssen et al., 2011; Rodrigues et al., 2011), emulation, or measurement on early hardware prototypes. Compared to these techniques, Aspen’s analytical model is machine-independent, has fewer prerequisites (e.g. architectural descriptions, application source code), and can be computed very efficiently.

5.1.1 Performance modeling of workflows

Many workflow systems have been integrated with performance monitoring and analysis tools (da Cruz et al., 2008; Ostermann et al., 2009; Truong and Fahringer, 2004). These systems typically collect only coarse-grained information, such as task runtime and data size. Using Kickstart (Section 4) enables us to collect fine-grained profiles including I/O, memory, CPU usage, and runtime data for use in performance modeling.

In most workflow models, the focus is on performance prediction of individual computational tasks. Data transfer and management tasks that are present in workflows are either not modeled, or are modeled very simplistically. Thus, existing performance predictions are not realistic and not detailed enough to help scientists and infrastructure providers estimate application performance, and pinpoint and diagnose issues, and schedule resources dynamically.

Profiling data, which is automatically collected during workflow execution and includes detailed runtime and resource usage statistics, can be used to create a historical performance model of the workflow, which is able to generate these estimates. In addition, profiling data can be used to guide resource provisioning algorithms that require estimates of resource usage (Mao and Humphrey, 2011; Ostermann et al., 2010; Singh et al., 2006)

Aspen (Spafford and Vetter, 2012) was originally designed with the capability to model large scale scientific applications running on traditional HPC systems (e.g. Titan, with hybrid CPU/GPU architecture, or Mira with the Blue Gene/Q architecture). Current Aspen constructs model low-level resources such as cores, memory access patterns, MPI communication, and memory capacity. We have been extending Aspen to include application features and resources necessary for workflows, and, in particular, the following.

Compute resources. Workflows represent a much broader yet more abstract set of resources, which include large storage systems of varying capability and capacity, high speed networks, and a variety of instruments and distributed computing systems and services. We are extending Aspen to include entire supercomputers as a resource, including their HPC architecture, job scheduling queues, and storage systems.

Networks. One area of focus for our modeling will be networks, particularly wide area networks, which are often used during workflow execution. Wide area communication models must deal with network protocol issues. For example, TCP throughput, and thus data transfer throughput, is directly impacted by packet loss. Newer protocols and long haul fiber networks will need to be represented in Aspen.

Storage systems. For data-intensive applications, storage system (e.g. GPFS, HPSS, Lustre) performance has a significant impact on overall workflow performance. The increased data movement and storage performance requirements created by large workflows can heavily impact overall workflow performance and, thus, will need to be modeled by Aspen.

5.2 Automatic generation of Aspen performance models from Pegasus descriptions

One major benefit to our approach is symmetry in representations of Pegasus workflows and Aspen performance models. In fact, we can use the structured Pegasus framework to generate Aspen performance models automatically. Because Aspen is a structured language that mirrors concepts (e.g. subroutines, loops) in traditional languages such as C and Java, it is straightforward to generate performance models from other structured representations (Lee et al., 2015).

We have started the implementation of a post-processing phase during the Pegasus compilation that will use the Pegasus intermediate abstract workflow representation (i.e. DAX) to construct an Aspen model, which maps the dependencies in the workflow to an Aspen control flow sequence. (The infrastructure model is different from the workflow model, and is developed in a later time.)

Figure 2 shows an example DAX developed for an SNS workflow. It contains two sequences with different parameter values within a single workflow. Listing 1 shows the results as an Aspen kernel, with three major steps. The first step runs a pair of sequential equilibrium and production dynamics NAMD calculations. The second step unpacks the database in parallel with the two AMBER ptraj tasks. The final step runs the two Sassena tasks in parallel with each other.

Listing 1:

Automatically generated Aspen model for example SNS workflow.

In Aspen, statements within a par clause execute independently, i.e. with task parallelism, and statements within a seq clause must be executed in order, i.e. they have sequential dependencies. Our initial scheduler performed a wavefront-like algorithm, grouping each set of tasks with completed dependencies into a par clause, repeating within one master sequential region until all tasks completed. Because Pegasus operates at the level of individual tasks, however, this schedule was only accurate if all tasks within a group took the same amount of time. As such, we implemented an advanced scheduler, which is a more direct translation of the dependencies and more accurately matches Pegasus scheduling behavior.

5.3 Evaluating the use of analytical models

Validation of any general performance model is a difficult task. The most obvious validation step would be to use the model to generate a performance prediction, and compare the performance against a measured value. This strategy has weaknesses, generally stemming from one important fact: performance is machine-dependent, and so cannot effectively disambiguate the correctness of our generated application/workflow models from that of the machine model. This means that we must generate multiple predictions and measurements in an attempt to disambiguate not only across machines, but across scales, and across scaling rates.

In the case of Aspen, two facets work in our favor. First, Aspen models are not based on runtime, but on resource usage. This means we can query the Aspen models to derive, for example, counts of floating point operations or messages. While these values might vary based on compiler or library, they are far more machine-independent and can be measured with tools such as hardware counters and MPI library interposition to give a ground truth against which we can judge our application models. Second, Aspen is an analytical tool, and can output not simply values for runtime and resource usage, but also symbolic equations. In some cases, this makes validation possible against algorithmic expectations. For example, we can use Aspen to validate that the number of floating point operations in a matrix multiplication is equal to $2 n^{3}$ .

As Section 5.1 points out, the context of workflows adds complexity. However, extending the MPI measurement to TCP/IP communication, and adding I/O measurement can give us additional metrics for validating our workflow models. Sections 6 and 7 also describe additional monitoring which can aid this process, and we note that anomaly detection is not only more useful with a performance model in hand, but can assist in validating that same performance model.

6 Correlating monitoring of workflows and infrastructure

An accurate attribution of the anomalies to the observed workflow performance is critical for fixing the problem, or adapting the system. In some cases the observed anomaly may have complex causes. For example, poor task performance can be caused by a slow CPU, by poor cache performance, by a slow disk, etc. Low task throughput within a workflow execution can be attributed to bottlenecks in the workflow engine, a lack of computational resources, or application errors. The end-to-end performance of data-intensive workflows is even more complex as it often depends on efficiently moving large data sets. This can be accomplished using bandwidth provisioned high-speed networks, however, performance in such networks is affected by a large number of factors, including: congestion, packet loss, end-host network and I/O performance, and end-host network tuning, among others. Thus, to diagnose these complex problems, we need to correlate observed performance with monitoring information coming from a number of systems (CPU, network, I/O, storage, etc.), and determine which of these systems is the most likely cause of the problem.

Using a combination of system monitoring utilities like the sysstat suite, nethogs, iotop, and the multi-domain networking monitoring tool, perfSONAR, we are conducting a parametric study of data transfer performance for workflows in a controlled and isolated environment using a testbed called ExoGENI (Baldin et al., 2012). We are exploring different workflow configurations based on throttling of data volumes, different data transfer modes (sharedfs, nonsharedfs/pegasus-lite, nonsharedfs-condorio) (Deelman et al., 2015), and multiple distributed domains. This will enable us to define baseline expected data-transfer performance and the monitoring manifestations of different performance profiles. We plan to combine this with other monitoring and profiling tools described in Section 4. We will use this monitoring information and the output of analytical performance models from Aspen to conduct a thorough evaluation of these different kinds of correlations. This will help in detecting anomalies in an online fashion.

7 Anomaly detection and diagnosis

Anomalies should be detected when there is a deviation of observed workflow performance from the end-to-end performance model. At the same time, for accurate diagnosis, the observed performance deviations need to be traced to anomalies at the application, workflow, and infrastructure levels. A combination of offline and online strategies needs to be developed for detecting performance anomalies during workflow execution, and for diagnosing the root causes of the observed problems.

7.1 Offline anomaly classification

Our initial approach is to classify the different types of anomalies that can cause problems with end-to-end workflow performance, and use offline methods to determine the severity of the anomalies. Infrastructure anomalies contribute to a number of issues with application performance. Among them are: resource unavailability (Russell et al., 2006) poor I/O performance (Mandal et al., 2013), disk failure, file system corruption, violations of disk quotas (Deelman and Chervenak, 2008), poor network performance due to congestion and packet loss (Mathis et al., 1997), firewalls (which often drop packets in flows over 1 Gbps), limitations in LAN switches (which may not have enough buffering to handle multiple simultaneous flows), bad optical fiber, and others. Typical application anomalies seen during workflow execution include: input data unavailability, application execution errors (Ferreira da Silva et al., 2013a), and poor performance for tasks with very short execution times (Chen et al., 2014; Ferreira da Silva et al., 2014b). Workflow-level anomalies are deviations in workflow-level performance metrics, such as: low task throughput, which may indicate problems with resource availability; high task error rates, or significant changes in task error rates; slow end-to-end data transfers; data sets that are larger or smaller than expected; long queue wait times, which may indicate low resource availability or fairness and priority issues; and low resource utilization, which may be caused by resource failures, over-provisioning, or misconfiguration.

We need to quantify anomaly severity in discrete anomaly levels so that different sets of actions can be triggered to address different levels of the diagnosed anomaly. For example, some anomalies may be critical to workflow performance, such as input data unavailability, while others may only reduce application performance, such as long queue wait times. Anomaly levels can be determined based on thresholds that are derived by clustering performance metrics into groups. The threshold value of an anomaly level will be determined from execution traces collected and published in the workflow archive developed as part of the DOE dV/dt project (Ferreira da Silva et al., 2014a),⁹ for which different thresholding approaches can be used. Anomaly levels and thresholds will, initially, be determined offline; thus they will not add any overhead to the workflow execution.

7.2 Real-time anomaly detection

We developed a persistent query agent (PQA) (Mandal et al., 2014) that enables persistent queries on data federated from different sources, including multi-domain infrastructure monitoring data, and workflow and application performance data. Using this agent, it is possible to run continuous queries on disparate data sources and get asynchronous, real-time notifications when relevant performance events occur. We will leverage the PQA framework to combine offline anomaly analysis, Aspen performance model evaluations, and multi-domain monitoring data for online, real-time detection and scalable distribution of the triggers to the appropriate stakeholders.

It is important to define, ahead of time, which metrics are important for particular application use cases by benchmarking executions of individual applications. It is also imperative to register the application-level metrics, workflow-level metrics, and infrastructure metrics with the system so that persistent queries can refer to these metrics while constructing event conditions that might result in anomaly triggers. This will make it possible to use a unified mechanism to understand anomalies in multi-domain infrastructure, anomalies in the application, and anomalies in the application’s view of the infrastructure. The Aspen analytical performance models will also help in identifying critical metrics, and hence aid in the design of the persistent queries. Predictions from the analytical models will also be pushed as events into the PQA making it possible to do “continuous diffs” of observed metric values with values predicted from the end-to-end models. Initially, our anomaly detection algorithms will follow a threshold-based approach, which not only considers deviations from performance models, but also employs a complex event processing engine to pull together relevant monitoring events from disparate sources to generate anomaly triggers. In the future, we will extend this real-time, event-based approach, which is purely based on discrete events (no/minimal history), to an approach that constructs temporal signatures from time-series data and compares them with known anomaly signatures.

Another advantage of the persistent query mechanism is that the anomaly triggers immediately identify the corresponding persistent query, and hence the precise event condition that led to the trigger. Since persistent queries are constructed using well-defined metrics, an analysis of associated metrics for each query that got triggered due to an anomaly condition would help identify or narrow down the root cause of performance deviations. The triggers are also distributed in a scalable fashion using a publish-subscribe framework, which makes it possible to inform only the interested stakeholders. For instance, different persistent queries will be relevant for different stakeholders. Since all of these queries can co-exist in the system, and triggers are associated with queries, only the stakeholders who are interested in the anomaly event condition will be notified. Hence, the design of persistent queries will take into account the different possible use cases: scientist use cases, workflow management system use cases, infrastructure adaptation use cases, resource provisioning use cases, and infrastructure operator use cases.

8 Hybrid modeling: interleaving simulation with analytical modeling

Workflow models will undoubtedly have many components and interactions among those components. We will use simulation to help understand workflow components for which analytical models are not sufficiently accurate. The two simulation components are described below.

Rensselaer’s Optimistic Simulation System (ROSS) (network simulation) is a framework for developing parallel discrete event simulations. ROSS has demonstrated highly scalable, massively parallel event processing capability for both conservative and optimistic synchronization approaches (Barnes et al., 2013; Bauer Jr. et al., 2009; Carothers and Perumalla, 2010; Liu et al., 2012; Mubarak et al., 2012). ROSS mitigates Time Warp state-saving overheads via reverse computation (Carothers et al., 1999). In this approach, rollback is realized by performing the inverse of the individual operations that were executed in the event computation. This eliminates the need to explicitly store prior logical process (LP) state, which represents a distinct component of the model state, leading to much more efficient memory utilization. We have leveraged reverse computation to demonstrate massively parallel performance of the ROSS simulation engine (Bauer Jr. et al., 2009) in several models. Most recently, ROSS optimistic event processing has demonstrated super-linear performance for the PHOLD benchmark using nearly 2 million Blue Gene/Q cores on the 120 rack Sequoia supercomputer system located at LLNL (Barnes et al., 2013).

CODES storage simulation: we also developed the CODES simulation framework (based on ROSS), which combines models of storage devices, high-performance networks, I/O forwarding infrastructure, and storage software into a unified parallel discrete event simulation model. CODES can simulate complete, large-scale storage systems all the way from individual application processes to individual storage device accesses.

Traditionally, users are required to create a simulation structure for the component of interest, and the interactions with those components. Since we already have the high-level structured analytical model of the target workflow in Aspen, we intend to use this Aspen model to drive the ROSS/CODES simulation of the component under consideration, and the external interactions that it expects (analytically). For example, in this combined framework, we will be able to use an Aspen model for the entire workflow (e.g. Climate model simulation), and then use the CODES simulator to simulate a component of the workflow (e.g. HPSS storage system) in great detail. The Aspen model will generate the I/O requests and transfers to the components (e.g. HPSS) in the simulator, and use analytical models for the remainder.

9 Related work

There have been several previous efforts to characterize the workloads of distributed systems (Hart, 2011; Iosup and Epema, 2011; Mahambre et al., 2012; Ren et al., 2012). For instance, Iosup and Epema (2011) and Hart (2011) presented analyses of Grid and HPC workload characteristics including system usage, user population, application characteristics, and characteristics of grid-specific application types. Analysis of MapReduce job characteristics such as CPU utilization, memory usage, slots allocation, I/O operations, and network transfers was presented in Ren et al. (2012). Cloud workload patterns (periodicity, threshold, relationship, variability, and image similarity) were identified in Mahambre et al. (2012). These studies typically use data provided by Grid and HPC workload archives (Germain-Renaud et al., 2011; Iosup et al., 2008; Kondo et al., 2010).¹⁰ These workloads mainly capture information about task executions, but lack critical information about scientific workflow executions such as task dependencies, task clustering, etc.

In the area of characterization and profiling, some efforts have been made to collect and publish traces and performance statistics for real scientific workflows, such as workflow-based workload traces from the Austrian grid (Ostermann et al., 2008a, b), provenance-based workloads from a workflow management system (Madougou et al., 2013), a survey of workflow characteristics from several research domains including bioinformatics, medical informatics, weather and ocean modeling, and astronomy (Ramakrishnan and Gannon, 2008), and workflow-based workloads from a science gateway (Ferreira da Silva and Glatard, 2013). We have characterized and profiled scientific workflows to understand and predict their resource requirements (Bharathi et al., 2008; Ferreira da Silva et al., 2013b, 2015; Juve et al., 2013), and recently, we published traces for a few workflows executed using Pegasus (Ferreira da Silva et al., 2014a), and synthetic workflows based on statistics from real applications for use in simulations (Ferreira da Silva et al., 2014a).¹¹

Performance modeling spans a large range of techniques, from analytical modeling to simulation. In analytical modeling, predictions can be obtained without directly using the application source code. These modeling methods range from ad-hoc, hand-written models to more structured methods such as BSP (Valiant, 1990) and LogP (Alexandrov et al., 1995; Culler et al., 1993). The strength of analytical approaches when evaluating far-future architectures is shown in the HPC space in two examples considering fast Fourier transform (FFT) for exascale systems in Gahvari and Gropp (2010) and Czechowski et al. (2012). Aspen extends these analytical modeling concepts to a formal DSL and adds extensibility and flexibility to analytical techniques; see Spafford and Vetter (2012) for a treatment of FFT on HPC architectures. Aspen is under active development and investigation for a variety of performance modeling contexts; see Spafford et al. (2013) for a more recent example of the use of Aspen in the HPC space.

10 Future work

In addition to real workflows we will also develop synthetic workflows that emulate the behavior of the real applications, but allow us to easily vary parameters such as the input data size and the task runtimes. This will be supported by profiling the applications using our workflow profiling tools, which will be extended to support parallel applications and to collect additional metrics that characterize application behavior.

We will also identify and develop mechanisms for collecting and correlating infrastructure, workflow, and application monitoring data. This will involve developing new tools and enhancing existing tools to extract monitoring data and transport it to a central location for correlation and analysis. We plan to use the empirical monitoring data to validate the Aspen performance models for the different workflow use-cases. We will also validate the models using the ExoGENI and ESnet testbeds.

We plan to develop offline techniques for clustering and classifying anomalies based on thresholds and time-series data. We will extend our persistent query tool to use both Aspen performance models and monitoring data, and design new persistent queries for real-time detection and diagnosis of anomalies. To evaluate the accuracy of our anomaly detection approach we plan on using a combination of fault and load injection.

We will extend the Aspen analytical models and integrate them further with ROSS/CODES to facilitate detailed simulations, where analytical models lack the necessary insights.

We will also explore and implement adaptation techniques, which will be automatically triggered when anomalies and failures are detected.

Figure 5 shows how all of these components will fit together. Pegasus will call Aspen tools to generate estimates of workflow and task performance based on application parameters and workflow, task and machine models. These tools will be constructed as libraries that can be called directly by Pegasus. Aspen, in turn, will call ROSS/CODES to simulate complex infrastructure components such as the network or storage system. As the workflow is executing, Kickstart will collect task- and application-level monitoring data in real time and report it back to Pegasus through a REST interface. The monitoring data, performance estimates, and workflow state information will be stored in a relational database. Infrastructure monitoring components will track the performance of network, storage, and computing systems and report that information to an InfluxDBSee http://influxdb.com time series database. Infrastructure and application monitoring data will be correlated by the persistent query agent, which will query data from the workflow and infrastructure monitoring databases. The correlated data will be examined to identify anomalies, which will be reported to users through a web-based dashboard, and sent back to Pegasus through the REST interface to guide workflow adaptation.

Figure 5:

PANORAMA architecture.

Footnotes

Funding

This work was funded by the DOE (contract number DE-SC0012636, “Panorama - Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows”). The development of the neutron scattering simulation workflow was supported by the US Department of Energy (DOE), Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division. The use of Oak Ridge National Laboratory’s Spallation Neutron was sponsored by the Scientific User Facilities Division, Office of Basic Energy Sciences. This manuscript has been authored, in part, by Oak Ridge National Laboratory, which is managed bb UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the US Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (). This research is sponsored by the Office of Advanced Scientific Computing Research in the US Department of Energy.

Notes

References

Alexandrov

Ionescu

Schauser

. (1995) LogGP: Incorporating long messages into the LogP model. In: 7th annual ACM symposium on parallel algorithms and architectures, pp. 95–105.

Arnold

Bilheux

Borreguero

. (2014) Mantid: Data analysis and visualization package for neutron scattering and

μ

SR experiments. Nuclear Instruments and Methods in Physics Research Section A 764: 156–166.

Baldin

Xin

Evans

. (2012) Exogeni: A multi-domain infrastructure-as-a-service testbed. In: 8th international ICST conference on testbeds and research infrastructures for the development of networks and communities, pp. 97–113.

Barnes

Jr Carothers

Jefferson

. (2013) Warp speed: executing time warp on 1,966,080 cores. In: ACM SIGSIM conference on principles of advanced discrete simulation, pp. 327–336. DOI:10.1145/2486092.2486134.

Bauer

Jr Carothers

Holder

(2009) Scalable time warp on blue gene supercomputers. In: ACM/IEEE/SCS 23rd workshop on principles of advanced and distributed simulation, pp. 35–44.

Bharathi

Chervenak

Deelman

. (2008) Characterization of scientific workflows. In: 3rd workshop on workflows in support of large-scale science, pp. 1–10. DOI:10.1109/WORKS.2008.4723958

Carothers

Perumalla

(2010) On deciding between conservative and optimistic approaches on massively parallel-platforms. In: Winter simulation conference’10, pp. 678–687.

Carothers

Perumalla

Fujimoto

(1999) Efficient optimistic parallel simulations using reverse computation. ACM Transactions on Modeling and Computer Simulation 9(3): 224–253. DOI:10.1145/347823.347828.

Case

Berryman

Betz

. (2015) AMBER 15. San Francisco, CA: University of California.

10.

Chen

Ferreira da Silva

Deelman

. (2014) Using imbalance metrics to optimize task clustering in scientific workflow executions. Future Generation Computer Systems. DOI:10.1016/j.future.2014.09.014.

11.

Culler

Karp

Patterson

. (1993) LogP: Towards a realistic model of parallel computation. ACM SIGPLAN Notices 28(7): 1–12.

12.

Czechowski

Battaglino

McClanahan

. (2012) On the communication complexity of 3D FFTs and its implications for exascale. In: 26th ACM international conference on supercomputing, pp. 205–214. DOI:10.1145/2304576.2304604.

13.

da Cruz

da Silva

Gadelha LMR

. (2008) A lightweight middleware monitor for distributed scientific workflows. In: 8th IEEE international symposium on cluster computing and the grid, pp. 693–698. DOI:10.1109/CCGRID.2008.89.

14.

Deelman

Blythe

Gil

. (2002) Pegasus: Planning for execution in grids. GriPhyN technical report 20: 1–6.

15.

Deelman

Callaghan

. (2006) Managing large-scale workflow execution from resource provisioning to provenance tracking: The Cybershake example. In: 2nd IEEE international conference on e-Science and grid computing, pp. 4–6.

16.

Deelman

Chervenak

(2008) Data management challenges of data-intensive scientific workflows. IEEE international symposium on cluster computing and the grid, pp. 687–692. DOI:10.1109/CCGRID.2008.24.

17.

Deelman

Gannon

Shields

. (2009) Workflows and e-science: An overview of workflow system features and capabilities. Future Generation Computer Systems 25(5): 528–540.

18.

Deelman

Singh

. (2005) Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming 13(3): 219–237.

19.

Deelman

Vahi

Juve

. (2015) Pegasus, a workflow management system for science automation. Future Generation Computer Systems 46: 17–35. DOI:10.1016/j.future.2014.10.008

20.

Ferreira da Silva

Chen

. (2014a) Community resources for enabling and evaluating research on scientific workflows. In: 10th IEEE international conference on e-Science, pp. 177–184. DOI:10.1109/eScience.2014.44.

21.

Ferreira da Silva

Glatard

(2013) A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions. In: Euro-Par 2012: Parallel Processing Workshops (Lecture Notes in Computer Science, vol. 7640). New York: Springer, pp. 79–88. DOI:10.1007/978-3-642-36949-0_10.

22.

Ferreira da Silva

Glatard

. (2013a) Self-healing of workflow activity incidents on distributed computing infrastructures. Future Generation Computer Systems 29(8): 2284–2294. DOI:10.1016/j.future.2013.06.012.

23.

Ferreira da Silva

Glatard

. (2014 b) Controlling fairness and task granularity in distributed, online, non-clairvoyant workflow executions. Concurrency and Computation: Practice and Experience 26(14): 2347–2366. DOI:10.1002/cpe.3303.

24.

Ferreira da Silva

Juve

Deelman

. (2013b) Toward fine-grained online task characteristics estimation in scientific workflows. In: 8th workshop on workflows in support of large-scale science, pp. 58–67. DOI:10.1145/2534248.2534254.

25.

Ferreira da Silva

Rynge

Juve

. (2015) Characterizing a high throughput computing workload: The Compact Muon Solenoid (CMS) experiment at LHC. In: 2015 international conference on computational science (ICCS 2015).

26.

Gahvari

Gropp

(2010) An introductory exascale feasibility study for FFTs and multigrid. In: IEEE international symposium on parallel distributed processing, pp. 1–9. DOI:10.1109/IPDPS.2010.5470417.

27.

Germain-Renaud

Cady

Gauron

. (2011) The grid observatory. In: IEEE international symposium on cluster computing and the grid, pp. 114–123.

28.

Gunter

Deelman

. (2011) Online workflow management and performance analysis with stampede. In: 7th international conference on network and service management, pp. 1–10.

29.

Hart

(2011) Measuring teragrid: workload characterization for a high-performance computing federation. International Journal of High Performance Computing Applications 25(4): 451–465.

30.

Iosup

Epema

(2011) Grid computing workloads. IEEE Internet Computing 15(2): 19–26. DOI:10.1109/MIC. 2010.130.

31.

Iosup

Jan

. (2008) The grid workloads archive. Future Generation Computer Systems 24(7): 672–686. DOI:10.1016/j.future.2008.02.003.

32.

Janssen

Adalsteinsson

Kenny

(2011) Using simulation to design extremescale applications and architectures: Programming model exploration. SIGMETRICS Performance Evaluation Review 38(4): 4–8.

33.

Juve

Chervenak

Deelman

. (2013) Characterizing and profiling scientific workflows. Future Generation Computer Systems 29(3): 682–692. DOI:10.1016/j.future. 2012.08.015.

34.

Juve

Tovar

. (2014) Practical resource monitoring for robust high throughput computing. Technical report, University of Southern California.

35.

Kephart

Chess

(2003) The vision of autonomic computing. Computer 36(1): 41–50. DOI:10.1109/MC.2003. 1160055.

36.

Kondo

Javadi

Iosup

. (2010) The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. In: 10th IEEE/ACM international conference on cluster, cloud and grid computing, pp. 398–407.

37.

Lee

Meredith

Vetter

(2015) Zermatt: A framework for automated performance modeling and online performance prediction. In: ACM international conference on supercomputing, Newport Beach, CA.

38.

Lindner

Smith

(2012) Sassena: X-ray and neutron scattering calculated from molecular dynamics trajectories using massively parallel computers. Computer Physics Communications 183(7): 1491–1501.

39.

Liu

Cope

Carns

. (2012) On the role of burst buffers in leadership-class storage systems. In: IEEE 28th symposium on mass storage systems and technologies, pp. 1–11. DOI:10.1109/MSST.2012.6232369.

40.

Madougou

Shahand

Santcroos

. (2013) Characterizing workflow-based activity on a production e-infrastructure using provenance data. Future Generation Computer Systems 29(8): 1931–1942. DOI:10.1016/j.future.2013.04.019.

41.

Mahambre

Kulkarni

Bellue

. (2012) Workload characterization for capacity planning and performance management in IAAS cloud. In: IEEE international conference on cloud computing in emerging markets, pp. 1–7. DOI:10.1109/CCEM.2012.6354624.

42.

Mandal

Baldin

Xin

. (2014) Enabling persistent queries for cross-aggregate performance monitoring. IEEE Communications Magazine 52(5): 157–164.

43.

Mandal

Ruth

Baldin

. (2013) Evaluating I/O aware network management for scientific workflows on networked clouds. In: 3rd international workshop on network-aware data management, pp. 2:1–2:10. DOI:10.1145/2534695.2534698.

44.

Mao

Humphrey

(2011) Auto-scaling to minimize cost and meet application deadlines in cloud workflows. In: International conference for high performance computing, networking, storage and analysis, pp. 49:1–49:12. DOI:10.1145/2063384.2063449.

45.

Mason

Abernathy

Anderson

. (2006) The Spallation Neutron Source in Oak Ridge: A powerful tool for materials research. Physica B: Condensed Matter 385: 955–960.

46.

Mathis

Semke

Mahdavi

. (1997) The macroscopic behavior of the TCP congestion avoidance algorithm. SIGCOMM Computer Communication Review 27(3): 67–82. DOI:10.1145/263932.264023.

47.

Mubarak

Carothers

Ross

. (2012) Modeling a million-node dragonfly network using massively parallel discrete event simulation. In: 3rd international workshop on performance modeling, benchmarking and simulation of high performance computer systems, pp. 366–376. DOI:10.1109/SC.Companion.2012.56.

48.

Ostermann

Plankensteiner

Prodan

. (2009) Workflow monitoring and analysis tool for askalon. In: Grid and Services Evolution. New York: Springer, pp. 1–14. DOI:10.1007/978-0-387-85966-8_6.

49.

Ostermann

Prodan

. (2008a) On the characteristics of grid workflows. In: CoreGRID Symposium - Euro-Par 2008, pp. 1–12.

50.

Ostermann

Prodan

. (2008b) A trace-based investigation of the characteristics of grid workflows. In: Priol

Vanneschi

(eds.) From Grids to Service and Pervasive Computing. New York: Springer, pp. 191–203. DOI:10.1007/978-0-387-09455-7_14.

51.

Ostermann

Prodan

. (2010) Dynamic cloud provisioning for scientific grid workflows. In: 11th IEEE/ACM international conference on grid computing, pp. 97–104. DOI:10.1109/GRID.2010.5697953.

52.

Phillips

Braun

Wang

. (2005) Scalable molecular dynamics with NAMD. Journal of Computational Chemistry 26(16): 1781–1802.

53.

Ramakrishnan

Gannon

(2008) A survey of distributed workflow characteristics and resource requirements. Technical Report, Indiana University. Available at: http://www.cs.indiana.edu/pub/techreports/TR671.pdf.

54.

Ren

Wan

. (2012) Workload characterization on a production Hadoop cluster: A case study on Taobao. In: IEEE international symposium on workload characterization, pp. 3–13. DOI:10.1109/IISWC.2012.6402895.

55.

Rodrigues

Hemmert

Barrett

. (2011) The structural simulation toolkit. SIGMETRICS Performance Evaluation Review 38(4): 37–42.

56.

Russell

Aalst

. (2006) Workflow exception patterns. In: Dubois

Pohl

(eds.) Advanced Information Systems Engineering (Lecture Notes in Computer Science, vol. 4001). Berlin: Springer, pp. 288–302. DOI:10.1007/11767138_20.

57.

Samak

Gunter

Goode

Deelman

Juve

Mehta

Silva

Vahi

(2011 a) Online fault and anomaly detection for large-scale scientific workflows. In: 2011 IEEE 13th international conference on high performance computing and communications (HPCC). IEEE, pp. 373–381.

58.

Samak

Gunter

Goode

Deelman

Mehta

Silva

Vahi

(2011 b) Failure prediction and localization in large scientific workflows. In: 6th workshop on workflows in support of large-scale science. ACM, pp. 107–116. DOI:10.1145/2110497.2110510.

59.

Singh

Kesselman

. (2006) Application-level resource provisioning on the grid. In: 2nd IEEE international conference on e-science and grid computing. DOI:10.1109/E-SCIENCE.2006.261167.

60.

Spafford

Vetter

(2012) Aspen: A domain specific language for performance modeling. In: ACM/IEEE international conference for high performance computing, networking, storage, and analysis, pp. 84:1–84:11.

61.

Spafford

Vetter

Benson

. (2013) Modeling synthetic aperture radar computation with Aspen. International Journal of High Performance Computing Applications 27(3): 255–262.

62.

Srinivasan

Juve

. (2014) A cleanup algorithm for implementing storage constraints in scientific workflow executions. In: 9th workshop on workflows in support of large-scale science, pp. 41–49. DOI:10.1109/WORKS. 2014.8.

63.

Tierney

Boote

Boyd

. (2009) Instantiating a Global Network Measurement Framework. Technical Report LBNL-1452E, Lawrence Berkeley National Lab.

64.

Truong

Fahringer

(2004) Scalea-g: A unified monitoring and performance analysis system for the grid. In: Dikaiakos

(ed.) Grid Computing (Lecture Notes in Computer Science, vol. 3165). Berlin: Springer, pp. 202–211. DOI:10.1007/978-3-540-28642-4_24.

65.

Vahi

Harvey

Samak

Gunter

Evans

Rogers

Taylor

Goode

Silva

Al-Shakarchi

. (2013a) A case study into using common real-time workflow monitoring infrastructure for scientific workflows. Journal of Grid Computing 11(3): 381–406.

66.

Vahi

Rynge

. (2013b) Rethinking data management for big data scientific workflows. In: Workshop on BigData and science: infrastructure and services, pp. 1–9.

67.

Valiant

(1990) A bridging model for parallel computation. Communications of the ACM 33(8): 103–111.

68.

Vöeckler

Mehta

. (2006) Kickstarting remote applications. In: 2nd international workshop on grid computing environments, pp. 1–8.