Abstract
New systems are often based on optimistic assumptions of how they will improve human performance. In the cognitive engineering tradition, these assumed benefits are regarded as hypotheses that need to be tested. An important element of a system user evaluation is to determine whether the hypothesized benefits are realized. Evaluation may also uncover unsupported aspects of performance or unanticipated side-effects of introducing the new technology that need to be addressed. We present a work-centered approach to user evaluation intended to meet these objectives, focusing specifically on design of tailored user-feedback questionnaires (work-centered questionnaires) that are intended to be diagnostic of how specific system elements do, or do not, support work. We summarize two recent evaluation studies we have conducted that illustrate our approach and the diagnostic power of work-centered questionnaires. We discuss how the goals and approach of a work-centered evaluation differ from more traditional approaches to usability evaluation that emphasize the use of standardized questionnaires and broad assessments of usability.
Introduction
New technologies are often introduced based on optimistic assumptions of how they will improve human performance. Many times, the assumed benefits do not materialize because the full complexities in the work domain were not appreciated (Hettinger et al., 2017; Pew & Mavor, 2007). One way to overcome this problem is to encourage multiple rapid cycles of prototyping and user evaluation during the design process. Effective user evaluation is important to justify the financial investment required to implement a new technology. It is particularly important in evaluating systems intended for complex, high-risk environments where there is a need to establish that the new system is reliable and safe and does not introduce any unintended negative consequences (O’Hara et al., 2012).
In the cognitive engineering (CE) tradition, prototypes are regarded as hypotheses about what constitutes effective support (Potter et al., 2000; Woods & Dekker, 2000). An important element of a user evaluation is to determine whether the hypothesized benefits are realized. Additionally, evaluations are used to uncover unsupported aspects of performance or unanticipated side-effects of introducing the new technology that need to be addressed (Roth & Eggleston, 2010). We have been engaged in the development and refinement of an approach to user evaluation, called work-centered evaluation, that is intended to meet these objectives.
The work-centered design and evaluation approach was first articulated by Robert Eggleston working with several of his colleagues (Eggleston, 2003; Eggleston et al., 2003). It has continued to be honed across more than 10 design and evaluation projects spanning multiple domains, including military airlift mission planning and scheduling software systems (Eggleston et al., 2003; Roth et al., 2006, 2017; Scott et al., 2009; Truxler et al., 2012); advanced power plant control rooms (Roth et al., 2010); and health care information technology systems (Clark et al., 2017; Wang et al., 2019).
Key elements of a work-centered evaluation include: (1) explicitly specifying the hypothesized benefits of the system framed as cognitive support objectives; (2) having users interact with the system across multiple (actual or simulated) test scenarios that reflect challenging work situations that arise in the domain; and (3) employing multi-faceted assessment measures, including a tailored work-centered questionnaire that elicits user feedback on the effectiveness of the system in meeting each of the cognitive support objectives, as well as the usability and usefulness of design elements of the system.
Previous papers have presented the overall work-centered evaluation approach and its philosophical underpinnings (Roth & Eggleston, 2010) as well as strategies for creating work-based test scenarios that are representative of the complexities that arise in a domain (Patterson et al., 2010). However, less attention has been paid to describing the rationale for, and design of, work-centered user-feedback questions.
This paper presents the work-centered approach to user-feedback questionnaire design and the role it plays in work-centered evaluations. We draw on two user-evaluation studies we recently conducted of novel displays intended for hospital emergency departments (EDs) to illustrate the approach (Clark et al., 2017; Wang et al., 2019). The two studies demonstrate the diagnostic power of work-centered questionnaires in establishing whether anticipated cognitive support objectives have been met, identifying aspects of the design that are suboptimal, as well as uncovering unsupported aspects of work to propel future innovation.
A Work-Centered Approach to Design and Evaluation
Work-centered design and evaluation has its roots in the CE tradition of grounding design in a deep analysis of the requirements of work (Eggleston, 2003). Figure 1 provides an overview of the main elements of the approach. The process begins with an analysis of the work objectives and processes as well as challenges and obstacles to effective performance. Typically, the analysis relies on a mix of cognitive analysis methods coming out of the cognitive task analysis and cognitive work analysis traditions that draw on field observations and structured interviews for insight into work demands and requirements (Bisantz & Roth, 2007; Hettinger et al., 2017).

Figure 1. Overview of work-centered approach to system design and evaluation. Copyright Aaron Z. Hettinger, MedStar Health National Center for Human Factors in Healthcare, MedStar Institute for Innovation.
From this analysis a set of cognitive support objectives and associated information needs are defined that specify requirements to support more effective performance. For example, as part of the process of designing an Emergency Department Information System (EDIS) for management of patient care and patient flow through the ED, the design team conducted in-depth cognitive analyses of the functions of an ED and the information that nurses and providers required (Guarrera et al., 2015). From this analysis they derived a set of 19 cognitive support objectives that were used to guide the design and evaluation of the EDIS (Clark et al., 2017). Examples of cognitive support objectives they identified are “The EDIS should support assessment of whether you have the resources required (e.g., beds, staffing) for the current patient demand”; “The EDIS should support maintaining awareness of overall acuity of patients waiting and currently being treated”; and “The EDIS should support ability to identify which patients are most critical.”
The set of cognitive support objectives derived from a cognitive analysis can be thought of as a model of support that specifies the hypothesized support requirements for effective performance. The phrase “model of support” is intended to highlight that the cognitive support objectives represent the design teams’ hypotheses about how the technology will affect performance. The cognitive support objectives are then used to inform both the design of the system and the work-centered evaluation that is used to test that the system design meets the cognitive support objectives. Finally, as we describe in greater detail below, the work-centered evaluation employs work-based scenarios and measures that are also informed by the cognitive support objectives.
Work-Centered Evaluation
Traditionally, a distinction has been made between formative and summative evaluations (Nielsen, 1993). Formative evaluations are intended to identify design deficiencies and opportunities for improvement and are typically conducted as part of iterative design cycles. Summative evaluations are intended to provide an overall assessment of the efficacy of a system and are typically conducted at the completion of the system design. Work-centered evaluations combine elements of both (Roth & Eggleston, 2010).
From a summative perspective the aim is to evaluate the “model of support” framed in terms of a set of cognitive support objectives that the system is designed to meet. Underlying the system being evaluated is an implicit model of support that asserts that the features embodied in the system effectively achieve the cognitive support objectives. One aspect of a work-centered evaluation is to explicitly test these claims. The goal is to evaluate how well each of the cognitive support objectives is being met.
Ideally, work-centered evaluations also serve a formative evaluation function. As in the classic usability paradigm, part of the goal of our approach to evaluation is to uncover design problems that need to be addressed prior to final implementation. An additional explicit goal is to probe for any previously unrecognized work demands and unanticipated support requirements, propelling further design innovation.
Figure 2 highlights two key aspects of a work-centered evaluation: (1) use of work-based scenarios that allow the study participant to exercise the system in a representative range of actual or simulated work situations; (2) use of multiple measures that in combination are intended to test both the model of support that underpins the system design and the usability and usefulness of specific elements of the system design.

Figure 2. Key elements of a work-centered evaluation. Copyright Aaron Z. Hettinger, MedStar Health National Center for Human Factors in Healthcare, MedStar Institute for Innovation.
Work-based scenarios are crafted to be representative of the work situations where the display/decision support design is anticipated to facilitate effective performance. Ideally this would include a sampling of both routine cases as well as more complicated conditions that are likely to stress performance (Patterson et al., 2010). The goal is to allow study participants to exercise the system across a sample of realistic situations enabling them to provide more informed feedback on the strengths and limitations of the support provided.
Work-centered evaluations employ multiple measures that collectively assess the usefulness and usability of system design features, as well as the ability of the system as a whole to meet the cognitive support objectives. Usability measures address the question of how easy each of the design features is to learn, understand and use. Usefulness measures address the question of how useful a feature is from the perspective of supporting the ability of users to achieve their work objectives. Usefulness and usability are not necessarily aligned (Davis, 1989; Lund, 2001). It is possible that a feature is easy to use but not useful from the perspective of achieving work objectives. Conversely a feature could be useful from the perspective of achieving work objectives but not very usable because of the details of how it was implemented. In principle it can be difficult to disentangle usability, usefulness and effectiveness of cognitive support because they are necessarily inter-related. For example, if the design features are poorly executed (e.g., the buttons are too small, the color choices are hard to discriminate), the lack of usability of these features will negatively impact their usefulness and in turn the ability to provide the intended cognitive support (e.g., ability to access the required information or to recognize important distinctions).
Figure 2 illustrates how work-centered evaluations employ a combination of measures in an attempt to disentangle these different aspects. Some measures assess the system elements as designed and implemented (Test System Design). These measures include users’ (1) ratings of the usability of individual design elements (Usability), (2) ability to correctly answer questions that rely on understanding the design elements (System Element Understanding), and (3) performance on work-based scenario tasks that depend on using these design elements (Task Performance). The underlying logic is that features that are easy to use will be rated highly. They will also be easy to understand and learn to use. This should be reflected in how well participants perform on the System Element Understanding questions. Ease of use will also contribute to task performance on the work-based scenarios because features that are challenging to use will likely create delays and errors in performance.
An overlapping set of measures is used to examine the effectiveness of the system in meeting the cognitive support objectives (Test Model of Support). These measures include users’ (1) performance on the work-based scenario tasks (Task Performance), (2) ratings of Usefulness of individual design elements; (3) ratings of frequency of use of the design elements; and (4) ratings of the extent to which the system meets the cognitive support objectives. The underlying logic is that if a system is providing effective cognitive support, then users will find its features to be useful and they will indicate that the system effectively achieves the cognitive support objectives.
Since cognitive support is typically achieved through a combination of design elements, the cognitive support ratings questions ask about how effectively the system taken as a whole meets the cognitive support objectives. The mapping between specific design elements and the cognitive support objectives is not explicitly presented to the study participants.
The frequency of use question is intended to assess how often the study participants would anticipate using the system. The goal of this question is to assess whether the system supports work activities that the individuals engage in regularly (e.g., once a day) vs. ones that they encounter less often (e.g., once a month). While users are generally poor at predicting future behavior, they should be able to reliably anticipate frequency of use based on their understanding of and prior experience with the temporal rhythm of their work (Osberg & Shrauger, 1986).
Effectiveness of cognitive support should also be reflected in better task performance on the work-based scenarios. Note that task performance appears under both Test System Design and Test Model of Support because performance is inevitably affected by both the usefulness of system design elements in supporting performance and how well those design elements are executed. By examining the pattern of results across these various measures it becomes more possible to disentangle usability, usefulness, and cognitive support, and better pinpoint which aspects of the system work well, which work less well and why.
Work-Centered Questionnaire
The work-centered questionnaire includes sections that assess effectiveness of meeting cognitive support objectives, usability of system elements, usefulness of system elements, and frequency of use. More specifically it includes:
A section organized around the cognitive support objectives: Closed-form ratings questions intended to probe the effectiveness of the system in meeting each of the cognitive support objectives. Typically, a rating scale is used with “not at all effective” and “extremely effective” as the low and high anchor point labels.
A section organized around core elements of the system: Closed-form ratings questions that ask for evaluation of the usability and usefulness of particular views, functions, and/or capabilities of the system (typically on a rating scale with “not at all usable/useful” and “extremely usable/useful” as the low and high anchor point labels). This section also asks about the anticipated frequency of use of each of the capabilities (e.g., never, once a week, once a shift, several times a shift).
An open-ended feedback section: Open-ended questions intended to solicit concerns users may have with the system, usability problems and/or suggestions for improvement. Importantly this includes identifying aspects of work, or situations that are not well supported by the system being evaluated.
The questionnaire can be presented in paper-based form or via computer-based entry. The sections are typically presented in the order shown above, although the ordering of sections is not critical.
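The three-section structure described above can be sketched as a small data model. This is only an illustrative sketch, not the authors' instrument: the class names, the `build_questionnaire` helper, the question wording, the default 9-point scale length, and the use of `None` for a "not applicable" answer are all assumptions made for the example. The anchor labels and frequency options are the ones quoted in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RatingQuestion:
    """Closed-form rating question with labeled low and high anchor points."""
    prompt: str
    low_anchor: str
    high_anchor: str
    points: int = 9  # assumed scale length, not prescribed by the approach

    def is_valid(self, response: Optional[int]) -> bool:
        # None is a hypothetical stand-in for a "not applicable" answer.
        return response is None or 1 <= response <= self.points


@dataclass
class ElementBlock:
    """Usability, usefulness, and frequency-of-use items for one system element."""
    element: str
    usability: RatingQuestion
    usefulness: RatingQuestion
    frequency_options: List[str]


@dataclass
class WorkCenteredQuestionnaire:
    objective_questions: List[RatingQuestion] = field(default_factory=list)
    element_blocks: List[ElementBlock] = field(default_factory=list)
    open_ended_prompts: List[str] = field(default_factory=list)


# Frequency options taken from the examples given in the text.
FREQUENCY_OPTIONS = ["never", "once a week", "once a shift", "several times a shift"]


def build_questionnaire(objectives, elements, open_ended_prompts):
    """Assemble the three sections from plain lists of strings."""
    objective_questions = [
        RatingQuestion(f"How effectively does the system support: {o}?",
                       "not at all effective", "extremely effective")
        for o in objectives
    ]
    element_blocks = [
        ElementBlock(
            element=e,
            usability=RatingQuestion(f"How usable is {e}?",
                                     "not at all usable", "extremely usable"),
            usefulness=RatingQuestion(f"How useful is {e}?",
                                      "not at all useful", "extremely useful"),
            frequency_options=list(FREQUENCY_OPTIONS),
        )
        for e in elements
    ]
    return WorkCenteredQuestionnaire(objective_questions, element_blocks,
                                     list(open_ended_prompts))
```

Keeping the objective questions separate from the element blocks mirrors the point that cognitive support is rated for the system as a whole, while usability and usefulness are rated element by element.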
Work-centered user-feedback questionnaires are given at the completion of the user evaluation session, after the user has had an opportunity to become familiar with the features and capabilities of the system, as well as had an opportunity to exercise the system on a series of realistic, work-related scenarios that sample situations and cognitive complexities that are likely to arise in the domain. This ensures that the user-feedback is grounded in experience with the system across multiple realistic conditions.
The use of work-centered questionnaires differentiates work-centered evaluations from other user evaluation approaches. Work-centered questionnaires differ substantially from standardized questionnaires such as the System Usability Scale (SUS) that are often used in usability evaluations (Lewis, 2018a, 2018b). Standardized usability questionnaires employ generic questions so that the same questionnaire can be applicable across a wide range of products and services (Gao et al., 2018). An example question from SUS is “I felt very confident using the system.” These generic questions can contribute to a global assessment of perceived usability of the system as a whole. They are also useful for comparison of usability scores across systems. However, they are less useful for understanding what aspects of the design led to a high or low rating.
In contrast, work-centered questionnaires are tailored to the specific system being evaluated. Rather than providing a holistic assessment of system usability, work-centered questionnaires are intended to be diagnostic, specifically linking system elements to assessments. The questionnaire includes questions explicitly intended to probe whether the system provides the cognitive support hypothesized by the developers as well as identify which features of the system are useful and usable and which are less so.
Illustrative Evaluations
The authors, collectively, have conducted over 10 work-centered evaluations as part of applied programs to develop and evaluate novel displays and decision support systems intended to improve the cognitive performance of individuals and teams (e.g., Clark et al., 2017; Roth et al., 2010, 2017; Wang et al., 2019). The evaluations included work-centered questionnaires tailored to the specific application. In all cases the results of the questionnaires were important to determining whether the cognitive support objectives underlying the systems were in fact met. They also proved to be diagnostic in pointing to aspects of the design that were less effective as well as identifying unanticipated aspects of work that were not well supported.
Here we summarize two recent work-centered evaluations we conducted of novel displays for hospital ED applications as a way of illustrating the use of work-centered questionnaires and the diagnostic power they provide. While in each case we provide a brief overview of the entire design and evaluation process, the primary focus is on the use of the work-centered questionnaires. More details on the design process and rationale are covered in prior publications (Clark et al., 2017; Guarrera et al., 2015; McGeorge et al., 2015; Wang et al., 2019).
Emergency Department Information System
The first example is a study evaluating a prototype EDIS intended to support ED clinical staff (e.g., nurses, physicians) in tracking patient care and ED resource allocation (Clark et al., 2017). The prototype was developed by an interdisciplinary team made up of ED domain experts and human factors professionals. An extensive design process was followed that began with development of an Abstraction Hierarchy (AH) for the ED domain based on interviews with ED nurses and physicians. An AH is intended to provide a comprehensive representation of the goals of a (socio-technical) system and a decomposition of the functions and processes required to achieve system goals represented at different levels of abstraction (Rasmussen, 1986; Vicente, 1999). The AH for the ED documented the high-level goals of the ED and the processes, constraints, and physical resources involved in achieving those goals (Guarrera et al., 2015). The AH informed subsequent design steps that included identifying information needs and creating and iteratively refining prototype displays for tracking patient care and ED resource allocation.
Nineteen cognitive support objectives emerged and were iteratively refined as part of this design process. These included the need to support awareness of the overall ED state and flow of patients through the ED, patient care, staff workload, and available resources. The complete set of cognitive support objectives is presented in Appendix 1.
While cognitive support objectives were not directly derived from the AH, they can be mapped back to functions within the AH. For example, the AH includes ‘maintain situational awareness over the ED’ as a function, with ‘personnel’, ‘patients’ and ‘facilities and equipment’ identified at the lower physical function level. Cognitive support objectives associated with maintaining awareness of overall ED status as well as maintaining awareness of Staff Workload status, status of ‘Resources and Equipment’ and status of patient flow within the ED map back to these functions.
The abstraction hierarchy representation, how it supported identification of information needs, and details of the iterative design process are documented in Guarrera et al. (2015). The cognitive support objectives derived from the design process and how they were used to inform the user evaluation are documented in Clark et al. (2017).
The resulting EDIS prototype contained seven display areas that could be accessed via a main dashboard (Figure 3). Areas included an overview status of the ED (e.g., number of patients in ED, average time to first evaluation); waiting room status (e.g., information on patients in the waiting room); ED patient flow (overview of patient flow from waiting room through disposition); individual patient progress (details on each patient); ED beds (bed availability); resources and equipment (e.g., number of people waiting and expected wait for orders such as imaging, laboratory tests); and staff workload (estimated workload of each staff member based on number and complexity of patients assigned to them).

Figure 3. EDIS prototype system dashboard. All patient data is fictitious (McGeorge et al., 2015). Copyright Ann Bisantz, University at Buffalo, The State University of New York.
Eighteen health care providers (nurses, physicians and physician assistants) participated in the user evaluation of the EDIS prototype in a simulated setting. The evaluation session included a prototype familiarization phase in which the prototype was first demonstrated and participants answered questions intended to assess their understanding of EDIS display elements (System Element Understanding); a task performance phase in which participants used the EDIS to complete two work-based scenario tasks; and a user assessment phase in which participants evaluated the system using a work-centered questionnaire consisting of the elements described above.
The two work-based scenarios were developed by the clinical members on the team as examples of the kind of cognitively complex tasks that clinical staff can find themselves needing to perform. These included a task in which participants re-oriented themselves to the status of the ED after returning from a patient resuscitation, and a planning task in which participants were notified of a mass casualty incident and asked to prepare for an influx of patients. The purpose of the tasks was to enable participants to experience use of the displays in a sampling of realistic, cognitively demanding situations that required them to collect and integrate information across different areas of the prototype in a non-scripted manner.
In this study, the work-centered questionnaire included questions regarding ease of use, usefulness, frequency of use, and extent to which the displays supported the work-oriented cognitive needs of emergency medicine clinical staff as reflected in the 19 cognitive support objectives. The usability, usefulness, and cognitive support objectives questions employed a 9-point rating scale with “1” labeled as “not at all effective” and “9” labeled as “extremely effective.” Participants could alternatively select N/A to indicate “not experienced during session.” The frequency of use question asked the participant “How frequently would you use this information” with four options ranging from “never” to “more than four times per shift.” They could also answer “don’t know/not experienced during session.”
The full results of the study are reported in Clark et al. (2017). Here we summarize the results obtained using the work-centered questionnaire. The mean score across participants for each of the 19 cognitive support objectives questions was computed. As illustration, Table 1 shows the mean ratings obtained for the top three and bottom three cognitive support objectives questions. As can be seen, the highest mean score was 8.56 (on a 9-point scale) and the lowest mean score was 5.89, indicating that all the cognitive support questions received ratings above neutral (5 being the midpoint of the scale), but that some cognitive support objectives were judged to be more effectively supported than others.
Table 1. Mean Scores for the Top 3 and Bottom 3 Scoring Cognitive Support Objective Questions
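The per-objective means and the top and bottom ranking described above can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, N/A answers are represented as `None` and left out of the mean, and the example ratings are made up for illustration and are not the study data.

```python
from statistics import mean


def mean_ratings(responses):
    """Mean rating per objective; None marks an N/A answer and is skipped."""
    means = {}
    for objective, ratings in responses.items():
        answered = [r for r in ratings if r is not None]
        if answered:  # leave out objectives that nobody rated
            means[objective] = mean(answered)
    return means


def top_and_bottom(means, k=3):
    """Return the k highest and k lowest objectives, each sorted high to low."""
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k], ranked[-k:]


# Made-up ratings on a 9-point scale, for illustration only.
example = {
    "objective A": [9, 8, 9, None],
    "objective B": [6, 5, 7, 6],
    "objective C": [8, 7, None, 9],
    "objective D": [4, 6, 5, 5],
}
means = mean_ratings(example)
top, bottom = top_and_bottom(means, k=1)
```

Excluding N/A answers rather than scoring them as zero keeps an objective that some participants did not experience from being pulled toward the low end of the scale.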
For applied settings where the goal is to evaluate a prototype to decide whether it sufficiently supports performance or needs to be improved, inspection of mean ratings may be sufficient to support design decisions. Because this was a research study rather than a user evaluation being conducted in an applied setting, a more rigorous analysis approach was required. Statistical analyses were conducted to establish which differences were statistically significant. A two-way ANOVA was calculated with role (Provider vs. Nurse) and cognitive support objective as the independent variables and cognitive support objective rating score as the dependent variable. There was a significant main effect of cognitive support objective. There was no significant effect of role and no interaction. Post hoc paired comparisons were then computed to determine which differences among the cognitive support objectives were statistically significant. One finding from this analysis was that the lowest scoring question received a score that was significantly lower than the top 10 scoring cognitive support objectives.
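A single paired comparison between two cognitive support objectives can be sketched as follows. The text does not say which post hoc procedure or multiple-comparison correction was used, so the paired t statistic here is an assumption chosen for illustration, as are the function name and the `None` convention for N/A answers. The sketch returns only the statistic and its degrees of freedom; obtaining a p-value would require a t distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(ratings_a, ratings_b):
    """Paired t statistic and degrees of freedom for two objectives.

    ratings_a[i] and ratings_b[i] come from the same participant; pairs with
    an N/A (None) on either side are dropped.
    """
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)
             if a is not None and b is not None]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(n)), n - 1


# Made-up ratings for five participants; the last pair is dropped (N/A).
t_stat, df = paired_t([9, 8, 9, 7, None], [6, 6, 8, 5, 4])
```

Because every participant rates every objective, comparing objectives within participants removes between-person differences in how the scale is used.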
These results illustrate the diagnostic power of the cognitive performance support-oriented questions. The rating questions enabled the team to identify which cognitive support objectives were well supported and which were less so. As an example, the results revealed that “support for prioritizing your task,” which received the lowest mean rating, was less well supported than most other cognitive support objectives. In particular, the statistical analysis indicated that its mean score was reliably lower than the 10 highest scoring cognitive support objectives. The authors noted that this cognitive support objective was included because the initial cognitive analysis conducted in support of the design project identified the need to support individual task prioritization as part of a full system. However, support for this cognitive task was beyond the design goals of this prototype. The results served to confirm the need to include support for individual task prioritization as part of a broader suite of ED support being planned.
The questionnaire section examining the usability, usefulness, and frequency of use of different elements of the display system also proved to be diagnostic, pointing to opportunities to improve the displays. Table 2 provides two examples of display elements together with the mean usability and usefulness scores they received, averaged across participants. As can be seen, each question described the display element as well as how it was anticipated to be useful to participants. The overall mean usability score across the 18 display elements was 7.19 and the mean usefulness score was 7.44, indicating that in general the display elements were perceived to be usable and useful.
Table 2. Mean Usability and Usefulness Scores for Two Example Display Elements
The usability and usefulness scores also proved to be diagnostic in identifying aspects of the display that were most useful and usable and which were less so. For example, mean usability and usefulness were computed for each of the seven display areas. The displays relating to waiting room information and patient progress information yielded statistically higher mean usefulness ratings than usability ratings. These findings confirmed that while waiting room and patient progress information were useful to ED staff members, the information presentation could benefit from redesign. Median frequency of use ratings were ‘4 or more times’ per shift for both the waiting room and patient progress information, reinforcing the conclusion that the information provided in these portions of the display was highly useful to the providers.
Examining usability vs. usefulness of different design elements can provide useful insight into the strengths and weaknesses of different elements of a system. This is illustrated in Figure 4 where we show a scatter plot of usability vs. usefulness of EDIS design elements that were included in the evaluation. To enhance readability, we only present a subset of the 18 design elements that were evaluated in the study. As can be seen, some design elements were clearly identified as both usable and useful (e.g., bed status, length of stay). Other design elements were clearly identified as both less useful and less usable (e.g., historical trends). Still others were identified as relatively useful but less usable. For example, abnormal test results had a usability score of 6.12 and a usefulness score of 7.95, indicating that users recognized that alerts regarding abnormal test results were very useful, but that the way this information was presented was not very effective.

Figure 4. Usefulness vs. Usability scores for EDIS display features.
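The reading of the scatter plot described above can be approximated by sorting each design element into one of four groups. This is a hedged sketch rather than the authors' analysis: the function names are hypothetical and the cut value of 7.0 is an arbitrary threshold chosen for illustration, since the text does not define one. The only scores used are those reported in the text for the abnormal test results element.

```python
def classify(usability, usefulness, cut=7.0):
    """Place a design element in one of four usability/usefulness groups.

    `cut` is a hypothetical threshold chosen for illustration; the source
    does not define one.
    """
    usable = usability >= cut
    useful = usefulness >= cut
    if usable and useful:
        return "usable and useful"
    if useful:
        return "useful but less usable"
    if usable:
        return "usable but less useful"
    return "less usable and less useful"


def usefulness_gap(usability, usefulness):
    """Usefulness minus usability; positive values suggest a redesign target."""
    return usefulness - usability


# Scores reported in the text for the abnormal test results element.
label = classify(usability=6.12, usefulness=7.95)
gap = usefulness_gap(6.12, 7.95)
```

Elements that land in the "useful but less usable" group are the natural candidates for redesign, because users want the information but find its presentation lacking.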
Patient-Focused ED Display
The work-centered evaluation approach was also used in a more recent study we conducted to develop and evaluate a novel, patient-focused electronic health record display for the ED (Wang et al., 2019). The patient-focused display was designed to help nurses and Emergency Medicine (EM) providers (Attending Physicians, Resident Physicians and Physician Assistants) develop shared understanding of a patient’s health status and progress through the plan of care.
The display design was based on an analysis of cognitive support needs derived from prior cognitive analyses (Guarrera et al., 2015) as well as additional observations, interviews and focus groups with nurses and EM providers that were conducted in support of this design effort (Wang et al., 2019). The cognitive support needs were translated into a list of 18 cognitive support objectives. Example cognitive support objectives are:
Quickly assess the patient’s current clinical condition (e.g., symptoms, lab results, vital signs)
Understand the status of orders for the patient (in process, waiting for results, completed)
Be alerted to/understand hold-ups in the patient’s care/progress through the ED
Know important non-medical information about the patient that may impact their care (e.g., patient does not have a ride home; patient speaks a language other than English; patient uses a walker)
These cognitive support objectives were used to guide what information was included in the patient-focused display and how it was presented. They also provided the basis for the cognitive support objectives questions included in the work-centered questionnaire. The complete set of cognitive support objectives is provided in Appendix 2.
Figure 5 presents a screenshot of a sample display. The display contained seven major sections: a top banner; a list of orders; a summary of current patient status; a summary of the plan of care; a listing of future actions (upcoming tasks); an event feed providing a chronological list of events and activities relating to the patient; and a timeline providing a birds-eye view of all the events that have occurred throughout the patient’s stay on both a small and large scale. A more detailed description of the patient-focused display can be found in Wang et al. (2019).

Image of the patient-focused display prototype (Data is simulated and/or de-identified). Copyright Aaron Z. Hettinger, MedStar Health National Center for Human Factors in Healthcare, MedStar Institute for Innovation.
Twenty EM clinicians, including 10 nurses and 10 providers, served as participants in a study to evaluate the patient-focused display. The evaluation methodology was very similar to that used in Clark et al. (2017). The one difference is that, in addition to having the participants fill out a tailored, work-centered questionnaire, we also had them fill out a SUS usability questionnaire (Sauro, 2011).
The detailed results of the study are presented in Wang et al. (2019). Here we present select findings that highlight the diagnostic value of the work-centered questionnaire as well as the relation between the work-centered questionnaire scores and the SUS scores.
Mean scores for cognitive support objective questions varied from a low of 5.75 to a high of 8.5, indicating a wide range of perceived effectiveness of the display in supporting the different cognitive support objectives. Table 3 lists the three cognitive support objectives that received the highest mean ratings and the three that received the lowest mean ratings (when averaged across participants) to provide a sense of the kinds of questions presented and the range of mean scores obtained.
Mean Scores for the Top 3 and Bottom 3 Scoring Cognitive Support Objective Questions
As in the EDIS evaluation study, a two-way ANOVA was calculated with role (Provider vs. Nurse) and cognitive support objective as the independent variables and rating score on the cognitive support objective question as the dependent variable. A statistically significant main effect of cognitive support objective was found. There was also a significant effect of role, with nurses giving higher scores than providers, but no interaction. Post hoc paired comparisons were then computed (using Tukey’s honest significant difference test) to determine which differences among the cognitive support objectives were statistically significant. One finding from this analysis was that the two lowest rated cognitive support objectives were rated significantly lower than the 12 highest rated cognitive support objectives. The range of ratings for cognitive support objectives, and the fact that some were statistically shown to be reliably lower than others, illustrate the diagnostic power of the cognitive support objectives questions. The questions eliciting the lowest scores pointed to opportunities to improve the displays as well as opportunities for additional views that would be valuable to develop. For example, participants gave the lowest mean cognitive support objectives score to “being alerted to significant changes.” Their verbal feedback reinforced this point. Participant feedback suggested that the low score was because the prototype did not use highlighting or color coding to flag significant lab values or test results. A recommendation coming out of the study was to employ more effective salience coding in the final implemented system. This is an interesting example where a usability problem (failure to use highlighting or color coding to flag significant test results) influenced the perceived effectiveness of cognitive support.
More generally, when participants provide low ratings to cognitive support objectives questions, it can be helpful to examine the usability and usefulness ratings for the corresponding system elements that were intended to provide that cognitive support.
The study also illustrated the diagnostic power of the usability and usefulness measures. Just as in the case of the cognitive support objectives questions, there was a wide range in the mean ratings obtained for the usability and usefulness questions on the 27 design elements. Mean usability averaged across participants ranged from 5.65 for the display element receiving the lowest mean usability rating to 8.56 for the display element receiving the highest mean usability rating. Similarly, mean usefulness ranged from 5.45 to 8.45. Lower mean ratings suggest opportunities for improvement.
Examining Usability vs. Usefulness Ratings for Opportunities for Improvement
One strategy for identifying opportunities for improvement is to look for instances where usefulness was rated to be relatively high while usability was rated relatively low. To search for examples of such instances we plotted mean usability vs. usefulness scores, across all usability/usefulness questions, for this study (Wang et al., 2019) as well as the EDIS evaluation study described earlier (Clark et al., 2017). This is displayed in Figure 6. For readability reasons we do not include the feature labels. Quadrants based on median scores were identified, in order to differentiate among features that have both usability and usefulness scores that are higher (or lower), and features where usability and usefulness scores diverge.

Usefulness vs. Usability scores for display features across two studies. Quadrants are defined based on median scores.
As can be seen in Figure 6, while usability and usefulness are generally correlated, items where usefulness is high but usability is low do occur. For example, one display element of the patient-focused ED display was notable in receiving a relatively high mean usefulness score (8.05) but a relatively low mean usability score (6.70) signaling a need to improve how this design feature was implemented. The display element in question showed abbreviated results of completed lab tests with the ability to hover for additional information. While the relatively high usefulness score indicated that study participants thought the information was important to show, the relatively low usability score suggests that it could be presented more effectively.
More generally this type of scatter plot display can be used to drive development decisions. For example, features that score high on usefulness, but lower on usability (in the top left quadrant), could be prioritized for improvement.
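The quadrant analysis described above can be sketched in a few lines of code. The following is a minimal Python illustration, not the analysis script used in the studies: the feature names, ratings, and function names are hypothetical, and assigning scores that fall exactly on the median to the "high" side is an assumption made here for definiteness.

```python
from statistics import median

def classify_quadrants(scores):
    """Assign each feature to a (usability, usefulness) quadrant.

    scores maps a feature name to its (mean usability, mean usefulness).
    Cut points are the medians across features; ties count as "high"
    (an assumption of this sketch).
    """
    usability_cut = median(u for u, _ in scores.values())
    usefulness_cut = median(f for _, f in scores.values())
    quadrants = {}
    for name, (usability, usefulness) in scores.items():
        usable = "high" if usability >= usability_cut else "low"
        useful = "high" if usefulness >= usefulness_cut else "low"
        quadrants[name] = (usable, useful)
    return quadrants

def improvement_candidates(scores):
    """Features rated low on usability but high on usefulness,
    ordered by the size of the usefulness-minus-usability gap."""
    quadrants = classify_quadrants(scores)
    picked = [n for n, q in quadrants.items() if q == ("low", "high")]
    return sorted(picked, key=lambda n: scores[n][1] - scores[n][0], reverse=True)

# Hypothetical mean ratings: (usability, usefulness)
example_scores = {
    "feature_a": (8.2, 8.4),
    "feature_b": (6.1, 8.0),
    "feature_c": (5.7, 5.5),
    "feature_d": (7.9, 6.0),
    "feature_e": (6.7, 8.05),
    "feature_f": (8.5, 7.0),
}

print(improvement_candidates(example_scores))
```

With these illustrative values, the median usability is 7.3 and the median usefulness is 7.5, so feature_b and feature_e fall in the low-usability, high-usefulness quadrant and would be listed first for redesign.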
Examining Concurrent Validity
One question that arises is how ratings obtained using the tailored work-centered questionnaires compare to ratings obtained using standardized usability questionnaires such as SUS. For example, is there a correlation between ratings obtained with the tailored work-centered questionnaire and standardized usability questionnaire ratings? A statistically significant correlation of .3 or more with an established usability questionnaire is generally considered evidence of concurrent validity (Lewis, 2018a).
Work-centered questionnaires include multiple sections, some intended to tap usability of different design features, some intended to assess usefulness of different design features, and some intended to assess whether pre-identified cognitive support objectives have been met. It is important to know which if any of these types of questions (usability, usefulness and cognitive support questions) correlate with standardized usability questionnaires such as SUS.
To explore this question, we included a SUS questionnaire in our most recent user evaluation study (Wang et al., 2019). This allowed us to directly compare SUS scores with scores obtained from the more tailored work-centered questionnaire. We selected the SUS questionnaire as it is the most widely used and intensely studied standardized usability questionnaire (Lewis, 2018a, 2018b). SUS usability scores were calculated for each participant using the standard calculation method (Lewis, 2018a). Mean scores were also calculated for the cognitive support objective questions, usability and usefulness on the work-centered questionnaire.
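For readers unfamiliar with SUS scoring, the sketch below shows the conventional scoring rule for the ten-item scale: odd-numbered (positively worded) items contribute the rating minus 1, even-numbered (negatively worded) items contribute 5 minus the rating, and the sum is multiplied by 2.5 to give a score from 0 to 100. We assume this is the "standard calculation method" referred to above; the function name and input format are illustrative only.

```python
def sus_score(responses):
    """Return the 0-100 SUS score for one participant.

    responses: the ten item ratings (each 1 to 5) in questionnaire order.
    """
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS requires ten ratings between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            # Items 1, 3, 5, 7, 9 are positively worded
            total += rating - 1
        else:
            # Items 2, 4, 6, 8, 10 are negatively worded
            total += 5 - rating
    return total * 2.5

# A participant who answers 3 (neutral) to every item scores 50
print(sus_score([3] * 10))
```

Each participant's SUS score computed this way can then be paired with that participant's mean usability, usefulness, and cognitive support objective ratings from the work-centered questionnaire.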
We computed Pearson correlations between SUS scores and mean usefulness, usability, and cognitive support objective ratings across the 20 participants. SUS scores were significantly correlated with usability (r = .66, p < .01) but not with usefulness (r = .30, p > .10) or cognitive support objective ratings (r = .36, p > .10).
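As a rough check on how such correlations relate to significance levels, a Pearson r based on N paired observations can be converted to a t statistic with N - 2 degrees of freedom, t = r * sqrt((N - 2) / (1 - r^2)). The sketch below assumes two-tailed tests and uses critical values for 18 degrees of freedom taken from a standard t table. The function names are illustrative, and the source does not state how its p values were obtained.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def t_from_r(r, n):
    """t statistic (n - 2 degrees of freedom) for a correlation r."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Two-tailed critical t values for 18 degrees of freedom (standard table)
T_CRITICAL_DF18 = {0.10: 1.734, 0.05: 2.101, 0.01: 2.878}

for label, r in [("usability", 0.66), ("usefulness", 0.30), ("cognitive support", 0.36)]:
    t = t_from_r(r, 20)
    print(label, round(t, 2), t > T_CRITICAL_DF18[0.01], t > T_CRITICAL_DF18[0.10])
```

With N = 20, r = .66 gives a t of roughly 3.73, which exceeds the .01 critical value of 2.878, whereas r = .30 and r = .36 give t values of roughly 1.33 and 1.64, both below the .10 critical value of 1.734. This is consistent with the significance levels reported above.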
The relatively high and statistically significant correlation between our usability scores and SUS scores provides evidence of concurrent validity of the usability measure in our work-centered questionnaire. This suggests that if one measures usability as is done in the work-centered approach, it may not be necessary to also administer SUS.
The lack of statistically significant correlation between SUS and the usefulness and cognitive support scores is also of some interest. In principle one would not expect a strong correlation between SUS and usefulness or cognitive support scores because these two measures are intended to tap different constructs than usability. However, given the low power of our analysis (N = 20), one cannot conclude that there is no correlation (since that would be tantamount to accepting the null hypothesis). In fact, since poor usability can negatively impact the usefulness of a design feature and this in turn can impact effectiveness of cognitive support, one might expect a positive correlation between those measures and SUS, though the correlations would be expected to be weaker than between the usability measure and SUS.
Discussion
This paper provides an overview of the work-centered approach to system design and evaluation. We have applied this approach across multiple design and evaluation projects over a wide range of domains. The applications have ranged in scope from rapid prototypes with a narrow focus of application, for example a system to support weather forecasting for a military air transport control center (e.g., Scott et al., 2005), to high-fidelity designs of large complex systems such as an advanced digital control room for a nuclear power plant (Roth et al., 2010). In this paper we summarized the design and evaluation of two prototypes intended to support the cognitive work of ED staff as a way of illustrating the approach. We primarily focused on the design of work-centered questionnaires and the kinds of insights they can yield because this is the aspect of the work-centered approach that most differs from other user-centered design and evaluation methods.
Our description of the work-centered design and evaluation approach focused on characterizing the objectives of the approach and the ideal design and evaluation process. In practice, there are likely to be pragmatic considerations that limit the ability to fully meet that ideal. System design and evaluation studies, particularly in applied settings, are likely to face resource limitations that include limited access to domain practitioners to serve as participants in the study, and limited time and financial resources to design and conduct the evaluation. In addition, the type and amount of evidence required for assessing the success of a system design will depend on the motivation for conducting the evaluation, who sponsored the work, and who are the target consumers of the findings. For example, academic journals may demand performance data and statistically significant findings, whereas people in applied settings responsible for making design implementation decisions may not desire (or be willing to pay for) this type of evidence.
In this section we discuss some of the major pragmatic challenges associated with conducting work-centered evaluation projects. We frame the discussion in terms of questions that arise in work-centered design and evaluation projects and provide answers based on our own experiences. The hope is that this will provide guidance for others embarking on design of work-centered design and evaluation projects.
When and How Do You Generate Cognitive Support Objectives?
Ideally, cognitive support objectives are generated prior to the system design. In the work-centered approach, cognitive support objectives are informed by the results of cognitive analyses of the work domain (e.g., interviews with and observations of domain practitioners) and are derived by an integrated multi-disciplinary team that includes cognitive engineers, graphic display designers and software engineers as part of the design process. However, in practice, the cognitive support objectives that drive the design are often not explicitly enumerated prior to the design effort. They may be informally understood by design team members and discussed during design meetings, but they may not be formally written down. This will especially be the case if the system is developed by a team using a different design approach, and the work-centered evaluation team is brought in after the design is completed. In those cases, one of the primary roles of the work-centered evaluation team is to ‘reverse engineer’ the design to identify the cognitive support objectives that implicitly underlie it. This is accomplished by inspecting the design, talking with members of the design team, and reviewing materials they may have generated (e.g., presentations prepared for stakeholders and decision-makers) that list the anticipated benefits to users and the organization of implementing the proposed design.
Even if the cognitive support objectives have been generated as part of the design process, they will most likely need to be refined and reworded so that they are sufficiently specific, distinct, and concise for inclusion in the work-centered questionnaire.
How Do You Ensure That Work-Based Scenarios Are Representative?
Work-based scenarios are intended to be representative of the range of work situations where the display/decision support system is anticipated to facilitate effective performance. Ideally the cognitive support objectives would define the bounds of support (i.e., the range of situations for which the system is intended to provide effective support). For example, is it intended to support only routine cases or do the claims for support extend to off-normal situations? Are there likely to be ‘edge cases’ where the automated elements of the system are likely to fail? What happens then? Do the cognitive support objectives extend to those cases as well? Thus, one answer to the question of identifying representative work-based scenarios is to look to the cognitive support objectives to understand the range of situations where the system is intended to provide effective support and sample broadly from that range of situations.
Another related answer is to draw on the results of cognitive analyses of the work domain to identify classes of work situations that should be sampled. Typical outputs of cognitive task analyses and cognitive work analyses include lists of cognitively demanding situations that arise in the domain (e.g., situations that stress workload, create goal conflicts, or are challenging from the perspective of situation assessment or situation awareness). The cognitively demanding situations identified through cognitive analyses can be used to create work-based scenarios that are representative of the cognitive challenges in the domain (see Ernst et al., 2019 for an example in the domain of Army helicopter operations and implications for next-generation designs). Yet another source of guidance in the design of complex work-based scenarios is theory-driven generic lists of domain characteristics that are known to challenge cognitive performance. For example, Patterson et al. (2010) assembled a list of complicating factors organized around macrocognitive functions (e.g., detecting, sensemaking, planning, coordinating). Examples of items from their list for the sensemaking function are missing information, misleading information, and information coming in gradually over time that must be detected and integrated. Finally, domain experts are an invaluable resource in identifying relevant classes of domain situations to include in an evaluation and helping to craft work-based scenarios that are representative of those situations.
In the ideal evaluation, scenarios should include not only commonly occurring routine cases but also ‘edge cases’ that may arise less frequently but are cognitively challenging and important to support. By including a broad range of test cases, it becomes possible to probe the boundaries of the support provided by the system. Situations where the system does not provide effective support are thus more readily identified, propelling additional design and evaluation cycles.
In practice the number and range of work-based scenarios included in an evaluation will depend on multiple factors including the phase of design and purpose of the evaluation (e.g., Is it an early or mature prototype? Is the evaluation to assess that a prototype has sufficient promise to continue development or is it a final validation of a fully implemented design to be submitted as part of the regulatory approval process?) and the resources available for conducting the evaluation (e.g., how much time is it reasonable to ask study participants to devote to the evaluation session?).
Is It Necessary to Collect Performance Measures to Establish That Displays Enhance Performance?
Generally, when new systems are designed and implemented, there are specific expectations as to how they are likely to affect performance. This may include that the system will improve situation awareness, reduce workload, reduce error, or make performance faster or better in other ways. In the ideal, a work-centered evaluation should include performance measures explicitly intended to evaluate these hypothesized benefits. For example, we conducted an evaluation of a prototype decision support system that was intended to improve dynamic replanning of airlift missions (Scott et al., 2009). The study included multiple performance measures including time to generate a solution and quality of solution. We also included measures of situation awareness and workload. These measures were in addition to administering a work-centered questionnaire that elicited user assessments of the decision-support system with respect to usability, usefulness, and extent to which it met its cognitive support objectives.
While establishing the benefits of a new display or decision-aid via performance measures is highly desirable, schedule and resource constraints do not always permit inclusion of such measures, particularly when evaluating early prototypes in applied design projects. In those cases, we have relied on user feedback via work-centered questionnaires, as well as open-ended comments, to evaluate the usability and effectiveness of the system (e.g., Truxler et al., 2012). To maximize the quality of the user feedback, we include multiple work-based scenarios so that participants have an opportunity to experience performance of the system across multiple realistic situations prior to filling out the work-centered questionnaire.
Is It Necessary to Perform Statistics to See if the Responses to Different Questions Are Significantly Different From Each Other?
In the two illustrative examples we provided of work-centered evaluations of novel displays for the ED, statistical analyses were conducted to assess which differences among rating questions were statistically reliable. In both cases the design and evaluations were conducted as part of academic research. The objective was to establish generalizable results, justifying the need for more rigorous statistical analysis. However, most applied projects use a small number of participants (e.g., 5 to 10 is often what is recommended for usability evaluations), resulting in low statistical power. In those cases, rating scores can be evaluated relative to each other or by comparison to benchmark values. With respect to relative comparisons, the evaluation team can look for scores that are much lower than others as an indication of the need for more follow-up. Benchmark values can be based on rating scale anchor points (e.g., one can decide that anything above the midpoint of a scale is considered a positive assessment); comparison to the typical score across all questions (e.g., one can decide that anything above the median score obtained in the study is considered a positive assessment); or comparison to predefined benchmark values that are derived from prior experience with using the rating scale (e.g., one can decide that any score above a 6 on a 9-point scale is a positive value based on prior experience in using the scale).
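The three benchmark strategies just described can be applied side by side. The following Python sketch is illustrative only: the question labels and mean scores are hypothetical, the scale is assumed to run from 1 to 9, the preset benchmark of 6 is taken from the example in the text, and treating a score that falls exactly on a benchmark as "not above" it is an assumption of this sketch.

```python
from statistics import median

def benchmark_flags(mean_scores, scale_min=1, scale_max=9, preset=6.0):
    """For each question, list the benchmarks its mean score fails to exceed.

    mean_scores maps a question label to its mean rating across participants.
    An empty list means the score is above all three benchmarks.
    """
    benchmarks = {
        "scale midpoint": (scale_min + scale_max) / 2,
        "study median": median(mean_scores.values()),
        "preset value": preset,
    }
    flags = {}
    for question, score in mean_scores.items():
        flags[question] = [name for name, cutoff in benchmarks.items() if score <= cutoff]
    return flags

# Hypothetical mean ratings on an assumed 1-to-9 scale
example_means = {
    "objective_1": 8.5,
    "objective_2": 7.0,
    "objective_3": 5.75,
    "objective_4": 4.8,
}

for question, failed in benchmark_flags(example_means).items():
    print(question, failed)
```

In this illustration the scale midpoint is 5 and the study median is 6.375, so objective_3 falls short of the study median and the preset value but clears the midpoint, while objective_4 falls short of all three. Questions flagged by more than one benchmark are natural candidates for follow-up in the verbal debrief.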
Are Work-Centered Questionnaires Suitable for Large Complex Systems as Well as Narrowly Focused Design Applications?
The two illustrative cases of work-centered design and evaluation were relatively narrowly focused applications intended to support situation awareness, communication, and coordination across ED staff. However, the same work-centered approach can be used to evaluate systems of varying scope of application and level of maturity. For example, we conducted an evaluation of an entire advanced control room design for a nuclear power plant that employed a dynamic high-fidelity control room simulator (Roth et al., 2010). In that case the evaluation was substantially more elaborate, involving multi-person crews, with each crew participating in high-fidelity simulator scenarios over a 4-day period. The general evaluation philosophy remained the same. We crafted scenarios that were representative of a range of normal and accident scenarios, collected multiple individual and team performance measures, and employed a final questionnaire to collect user assessments of cognitive support objectives, usability, and usefulness of various elements of the control room design.
Summary and Conclusions
In this paper we described a work-centered approach to design and evaluation of new displays and support systems. We presented results from two evaluations that we recently conducted to illustrate the work-centered approach to user evaluation. We ended with a discussion of some pragmatic considerations in conducting work-centered evaluations.
One of the aspects that differentiates work-centered evaluations from other approaches is the use of tailored, work-centered questionnaires that combine rating questions and open-ended feedback questions to probe how effective the system is in meeting pre-identified cognitive support objectives, how useful and usable different elements of the system are, and how frequently study participants anticipate they would use them.
The studies illustrated how work-centered questionnaires can be used diagnostically to establish whether a system under evaluation meets the cognitive support objectives which drove the design, and to identify which aspects of a design most need improvement. We believe these findings highlight the value of including work-centered questionnaires as part of user evaluation studies.
Our work-centered approach to questionnaire design differs markedly from standardized usability questionnaire approaches. Instead of using general questions intended to generate holistic assessments, work-centered evaluation questionnaires are made up of tailored questions intended to probe whether the system provides the cognitive support hypothesized by the developers (i.e., whether the cognitive support objectives are met) as well as questions intended to evaluate the usability and usefulness of specific design elements. In this sense work-centered questionnaires can provide more diagnostic information with which to iteratively improve designs than standard questionnaires. We acknowledge that development effort is required to create a tailored work-centered questionnaire (which translates to time and money) above what is needed to administer a standardized usability questionnaire. The two ED cases we presented in the paper were intended to demonstrate the kinds of additional insights that work-centered questionnaires can yield, justifying the relatively modest additional development effort incurred.
Ratings questionnaires also complement the use of more open-ended response approaches to eliciting user feedback such as open-ended written questions and verbal debriefs. Mean usability, usefulness and cognitive support objective scores obtained from a work-centered questionnaire can be reported as concise summary statistics to customers, sponsors and stakeholders as objective evidence that can be used to assess the validity of developer claims, and to measure progress over time. As such they provide an important additional tool for communicating results of user evaluations to stakeholders that complements more open-ended response approaches. In turn, open-ended responses are invaluable for understanding the rationale behind rating question scores.
The results reported here are encouraging, but subject to limitations. While the relatively high positive correlation between our usability measure and SUS points to concurrent validity, more research is needed to validate the rating scales more fully. For example, it would be desirable to examine the psychometric properties of the usability, usefulness, and cognitive support measures. Unlike standardized questionnaires such as SUS, for which benchmarks for “good” and “failing” scores exist, the work-centered rating scales do not have comparable established benchmarks. In addition, it would be useful to establish more clearly that our usability, usefulness, and cognitive support measures are tapping distinct constructs.
Appendix 1
Appendix 2
Acknowledgments
This study was funded by the Agency for Healthcare Research and Quality’s (AHRQ) R01 grant (R01HS022542) and R18 grant (R18HS020433). The authors would like to acknowledge Sudeep Hegde, Daniel J. Hoffman, Natalie C. Benda, Ella S. Franklin, David Lavergne, Shawna J. Perry, and Rollin J. Fairbanks as members of the R01 team. Additionally, Eva Hochberger produced Figures 1 and 2.
