Enhancing Ecological Validity in Wearable Sensor Stress Research: A Novel Laboratory Paradigm Integrating Cognitive, Physical, and Social-Evaluative Demands

Abstract

Popular laboratory stress tasks used to develop wearable stress detection systems lack key contextual features common in dynamic field environments. To address this gap, we developed a novel laboratory stress task that integrates contextual factors present in field settings while maintaining key features of traditional stress tasks. Twenty participants performed the Open Multi-Attribute Task Battery (OM) under varying difficulty levels, with or without an evaluator present, and with varying levels of light physical activity. During the Stress Trial, an evaluator provided critical evaluative feedback throughout the task. We measured physiological correlates of stress using electrodermal activity, electrocardiography, respiration, skin temperature, and physical activity. Self-report measures of stress were collected via questionnaires, and qualitative information was gathered through semi-structured interviews. Results indicated stress was significantly higher during the Stress Trial, confirming the efficacy of the experimental manipulations. This study demonstrates the effectiveness of a laboratory stress paradigm that integrates cognitive, physical, and social-evaluative demands, enhancing ecological validity compared to traditional stress tasks.

Keywords

psychophysiology, laboratory-induced stress, wearables

Introduction

The current study is part of a larger project aimed at developing wearable technology to support real-time tracking and display of reactions and recovery patterns among military personnel to a broad range of system casualties. Translation of wearable bio-behavioral monitoring systems from laboratory to real-world operating environments ultimately hinges upon the validity and generalizability of laboratory stress tasks. Despite advances in instrumentation, computing, and analytic methods, the accuracy and reliability of wearable systems in the wild continue to be significantly lower than in laboratory settings (Can et al., 2020). Although multiple factors likely contribute to this unreliability, poor ecological validity of laboratory stress tasks may be a significant source of this variance, yet it has received little attention. The primary aim of this study was to integrate key contextual factors present in field settings (e.g., human-computer interaction, physical activity) into the design of laboratory experiments while also retaining key features from conventional stress tasks (e.g., social-evaluative threat). Thus, we developed and tested a novel laboratory stress task that combined a computer-based task with social evaluation and ergometer cycling to mimic the combination of physical demands, mental demands, and social evaluation common in real-world environments.

Background

A meta-analysis of existing laboratory stress tasks revealed that the largest stress responses were observed when experimental tasks included a motivated performance episode containing elements of uncontrollability and social-evaluative threat, such as public speaking in front of judges (Dickerson & Kemeny, 2004). However, these tasks fail to mimic the interactive environment encountered in dynamic field settings and lack other contextual elements such as physical activity, system monitoring, and the need for corrective action. This is at odds with contemporary research in affective science (Hoemann et al., 2023), which suggests that emotional processes such as stress are inherently contextual and manifest differently across environments (e.g., fear during an automobile accident vs. fear during a roller coaster ride). The Trier Social Stress Test (TSST; Kudielka et al., 2007), for example, includes two long performance episodes without exposure to externally generated stimuli (e.g., visual display of varying system conditions) and therefore lacks the dynamic interactions encountered in live operating environments. The TSST also precludes experimental control of discrete time-varying experimental stimuli, making it difficult to model the stimulus-response or dynamic control relationships that are needed to causally link stressors with physiological responses.

The present study prioritized simulating cognitive, physical, and social-evaluative demands at levels that would commonly be found in critical task environments in military operations. To accomplish this, participants completed the Open Multi-Attribute Task Battery (OM; Cegarra et al., 2020) at different levels of difficulty, with and without an experimenter observing and offering critical performance feedback. OM allows the timing of stressors (i.e., system failures) to be programmed down to the second, thus providing stimuli that are consistent across participants and can be synchronized with physiological time series. This was intended to strengthen the ability to make causal inferences between predictors (e.g., stressors) and outcomes (e.g., physiological stress measures), a major challenge in field studies (Ganster et al., 2018).

Approach

Twenty participants recruited from the University of Connecticut completed the experiment. Due to technical issues, electrodermal activity (EDA) data were missing for seven participants (n = 13) and electrocardiography data were missing for four participants (n = 16). To retain the maximum amount of data, partial records were included in analyses. The average age was 20 years, and 60% of participants were female. The experimental procedure lasted approximately 2 hr and entailed (1) a 5-min Cycling Trial, (2) an 8-min Baseline Trial, (3) an 8-min Control Trial, and (4) an 8-min Stress Trial. Further, participants were assigned to either a cycling (n = 8) or non-cycling group (n = 12). Participants assigned to the cycling group performed the OM while cycling at a leisurely pace (approximately 50% of max heart rate). To reduce stress associated with the novelty and unpredictability of behavioral experiments (Gossett et al., 2018), participants watched a nature documentary video (National Geographic, 2023) for approximately 10 min prior to the Baseline Trial. Further, to reduce boredom and fatigue effects during rest periods, participants continued watching the same video prior to and following the Stress Trial.

Computer-Mediated Task

The OM was used for the main experimental task. It is an open-source adaptation of the Multi-Attribute Task Battery (MATB), originally designed to simulate aircraft piloting tasks. Participants were given an instructional pack and a demonstration of the OM, and then completed an 8-min training task. The OM comprises four distinct tasks: systems monitoring, tracking, communications, and resource management (Figure 1). In the Systems Monitoring Task, participants observed gauges and indicator lights, responding to system anomalies by pressing specific keys to restore nominal performance. The Tracking Task required participants to dynamically control a cursor’s position using a joystick, maintaining alignment with central axes to simulate aircraft tilt adjustments. The Communications Task emulated air traffic control interactions, with participants assigned call signs and instructed to respond only to auditory instructions directed at them amidst distractor communications. Lastly, the Resource Management Task simulated fuel distribution management, where participants balanced fuel levels between main tanks within a specified range.

Figure 1.

Experimental set-up illustration with OpenMATB-II and desk cycle.

Physiological Signals

Physiological signals were sampled continuously over the entire study using the Empatica E4 (a research-grade wristband sensor capable of measuring blood volume pulse, electrodermal activity, and three-axis acceleration) and the Hexoskin smart shirt (a research-grade wearable garment with integrated ECG, accelerometer, and respiration band sensors). The combination of these systems allows for collection of distinct and separable component signals of the sympatho-adrenomedullary (SAM) axis. In addition, both devices have high usability and durability, making this system well suited for deployment in field settings. All HR parameters were based on the Hexoskin electrocardiography (ECG) signal, which is less sensitive to motion artifact. Data from one participant were excluded for poor signal quality based on visual analysis and the expert judgment of a senior investigator.

Self-Report Questionnaires

Participants completed self-report assessments of stress, cognition, and affect both before and after experimental trials via the Short Stress State Questionnaire (SSSQ; Helton & Näswall, 2015), NASA Task Load Index (NASA-TLX; Hart & Staveland, 1988), and Appraisal of Life Events (ALE; Ferguson et al., 1999) scale. Race, ethnicity, age, and gender were assessed via self-report, and personality variables were assessed via the Big Five Inventory (BFI) questionnaire (John et al., 1991) and the Core Self-Evaluation (CSE) questionnaire (Judge et al., 2003) at the beginning of the research session. Change scores for the SSSQ were calculated by subtracting pre- from post-task scores.
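The change-score calculation described above amounts to a per-participant subtraction. The following is a minimal sketch of that step; the function name and the example values are hypothetical and are not study data.

```python
from statistics import mean


def change_scores(pre, post):
    """Return post-minus-pre change scores, one per participant."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain one score per participant")
    return [after - before for before, after in zip(pre, post)]


# Hypothetical Distress subscale sums for four participants (not study data).
pre_distress = [8, 10, 7, 12]
post_distress = [11, 15, 9, 13]

deltas = change_scores(pre_distress, post_distress)
print(deltas)        # positive values mean Distress rose over the trial
print(mean(deltas))  # group-level mean change
```

A positive change score indicates that the subscale increased from pre- to post-task, which matches the sign convention used for the SSSQ results.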

Cycling Trial

All participants performed a 5-min cycling-only trial. To keep physical exertion constant across participants, each participant underwent a calibration procedure where the resistance of the cycle ergometer was adjusted until participants maintained a heart rate roughly 50% of their maximum heart rate (max HR; Tanaka et al., 2001) while pedaling at 70 rpm. Participants sat in a traditional office chair positioned behind a portable cycle ergometer (Desk Cycle) as seen in Figure 1. Participants tracked their cycling revolutions per minute (rpm) on a visual display. Following the Cycling Trial, participants completed a 10-min rest period, followed by a pre-task survey. The rest period was intended to allow participants to acclimate to the laboratory and return their cardiovascular system to resting levels.
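To make the calibration target concrete, the sketch below computes a 50% target heart rate from the age-predicted formula reported by Tanaka et al. (2001), max HR = 208 − 0.7 × age. The function names, the 5-bpm tolerance band, and the adjustment rule are illustrative assumptions; the source does not describe how resistance was stepped.

```python
def predicted_max_hr(age_years):
    """Age-predicted maximal heart rate in bpm (208 - 0.7 * age)."""
    return 208.0 - 0.7 * age_years


def target_hr(age_years, fraction=0.5):
    """Target heart rate as a fraction of the predicted maximum."""
    return fraction * predicted_max_hr(age_years)


def resistance_adjustment(measured_hr, target, tolerance=5.0):
    """Suggest a resistance change (hypothetical rule, assumed tolerance)."""
    if measured_hr < target - tolerance:
        return "increase"
    if measured_hr > target + tolerance:
        return "decrease"
    return "hold"


target = target_hr(20)  # a 20-year-old, the sample's average age
print(round(target))
print(resistance_adjustment(88, target))
```

For a 20-year-old participant the target works out to roughly 97 bpm, so a measured heart rate of 88 bpm at 70 rpm would suggest raising the resistance under this assumed rule.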

Baseline Trial and Control Trial

After the pre-task survey, participants watched the documentary for an additional 8 min, which served as their Baseline Trial. This trial was intended to provide a resting comparison absent of (1) high mental workload, (2) social evaluation, and (3) physical activity. The Control Trial consisted of an 8-min OM trial. The difficulty of the OM trials was based on the number of tasks performed and the number of system events (i.e., alarms, malfunctions, and radio instructions) occurring in each 1-min period, similar to Kong et al. (2022). Participants were instructed to monitor only the tracking and resource management tasks during the Control Trial. There were no radio communications, and system events (e.g., pump failures) occurred at the low rate of 4 per minute.

Stress Trial

During the Stress Trial, participants were instructed to attend to all OM tasks, and the number of system events was increased to approximately 12 per minute. An evaluator standing alongside the participant monitored the participant’s performance and pointed out mistakes. Participants were told the evaluator was a senior lab member who was there to ensure high task performance. Evaluators displayed a neutral and unsympathetic demeanor during the task, consistent with prior stress tasks (Kudielka et al., 2007), and were trained to point out 1 to 3 errors per minute. Evaluators also used non-verbal indicators of performance evaluation, such as audible note-taking (i.e., loud scribbling) and exaggerated sighs. Following a 20-min recovery period, participants were informed of the nature of the study and took part in a short semi-structured interview focusing on the efficacy of the protocol (e.g., the effect of the evaluator’s presence) and other experiences not captured by the survey. By design, the Control Trial lacked the task failures and social evaluation found in the Stress Trial. Thus, differences between the Control Trial and Stress Trial should be attributable to unique stress elements (e.g., uncontrollability, task failure, and social evaluation).
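Because event timing is scripted to the second, the two difficulty levels can be viewed as fixed per-minute event counts over an 8-min trial. The sketch below is a hypothetical schedule generator, not the OpenMATB scenario format: the function name, the task labels, and the use of a fixed random seed to give every participant the same schedule are all assumptions made for illustration.

```python
import random


def build_schedule(duration_min, events_per_min, tasks, seed=0):
    """Return a time-ordered list of (onset_second, task) events.

    Each 1-min block receives exactly `events_per_min` events at distinct
    seconds. A fixed seed makes the schedule identical across participants.
    """
    rng = random.Random(seed)
    schedule = []
    for minute in range(duration_min):
        # Draw distinct onset seconds within this 1-min block.
        offsets = sorted(rng.sample(range(60), events_per_min))
        for offset in offsets:
            schedule.append((minute * 60 + offset, rng.choice(tasks)))
    return schedule


# Task labels are illustrative placeholders, not OpenMATB identifiers.
control = build_schedule(8, 4, ["tracking", "resource_management"])
stress = build_schedule(
    8, 12,
    ["monitoring", "tracking", "communications", "resource_management"],
)
print(len(control), len(stress))
print(control[:3])
```

Under these assumptions the Control schedule contains 32 events and the Stress schedule 96, and the onset seconds can be aligned directly with the timestamps of the physiological recordings.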

Analysis

Self-report data were analyzed using separate repeated-measures ANOVAs (rmANOVA) for SSSQ (pre-, post-, and change scores), ALE, and NASA-TLX, with Trial (Baseline, Control, Stress) as the within-subject factor and Cycle (Cycling or Non-cycling) as the between-subject factor. Post-hoc Bonferroni-corrected pairwise comparisons were conducted when omnibus effects were significant. In addition, SSSQ, ALE, and NASA-TLX scores from the Stress Trial were separately regressed on CSE, neuroticism, and conscientiousness using ordinary least squares (OLS) regression. Physiological data were processed and initially analyzed using MATLAB. Heart rate (mean and max), heart rate variability (HRV), skin conductance levels, respiration, skin temperature, and activity levels were extracted. Due to the small sample size and the goals of this preliminary work, only cardiovascular measures (i.e., HR mean, HR max, and HRV) were included in statistical comparisons. Separate rmANOVAs were conducted for max HR, mean HR, and HRV (root mean square of successive differences, RMSSD) with Trial as the within-subject factor and Cycle as the between-subject factor. Baseline Trials were excluded from the analysis of physiological data since participants in the Non-cycling Group would have artificially lower Baseline Trial scores due to the absence of physical activity. To assess the relationship between physiological stress responses and self-reported stress, we computed Pearson correlations between physiological markers and subjective stress ratings across trials. Further, to remove the effect of individual differences in baseline heart rate (mean and max) and HRV, HR reactivity (i.e., difference scores) was calculated by subtracting Control Trial scores from Stress Trial scores for correlation analysis. All statistical analyses were conducted using SPSS (version 27).
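The cardiovascular features named above can be illustrated from a series of RR intervals. The sketch below computes RMSSD and mean heart rate for two trials and then the Stress-minus-Control reactivity score. It is written in Python for illustration only (the study's processing was done in MATLAB), the RR values are made up, and deriving mean HR from the mean RR interval is an assumption about how that feature was obtained.

```python
import math


def rmssd(rr_ms):
    """Root mean square of successive differences of RR intervals (ms)."""
    if len(rr_ms) < 2:
        raise ValueError("need at least two RR intervals")
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def mean_hr(rr_ms):
    """Mean heart rate in bpm, derived from the mean RR interval (ms)."""
    return 60000.0 / (sum(rr_ms) / len(rr_ms))


# Made-up RR intervals in milliseconds (not study data).
control_rr = [800, 810, 790, 805, 795]
stress_rr = [700, 705, 695, 702, 698]

# Reactivity = Stress Trial score minus Control Trial score.
hr_reactivity = mean_hr(stress_rr) - mean_hr(control_rr)
rmssd_reactivity = rmssd(stress_rr) - rmssd(control_rr)
print(round(hr_reactivity, 2), round(rmssd_reactivity, 2))
```

With these example values, heart rate reactivity is positive and RMSSD reactivity is negative, the direction typically expected under stress.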

Outcome

Findings of the repeated-measures ANOVA on pre- and post-trial SSSQ subscale scores are illustrated in Figure 2, and summary statistics are reported in Table 1. A significant main effect was observed for the experimental trial on SSSQ Engagement, Worry, and Distress. Pairwise comparisons revealed Distress (change score) was significantly greater for the Stress Trial compared to Control and Baseline Trials, and Control Trial Distress was significantly greater than Baseline. Further probing of change scores revealed the effect size of the Stress Trial ($\eta_p^2$ = 0.64) was more than twice the size of the Control Trial ($\eta_p^2$ = 0.31). Similarly, Worry was significantly greater during Stress compared to Baseline and Control. There was a significant pre-post decrease in Worry for the Baseline Trial (mean difference = −5.02; p < .01), while there was no significant difference for the Control Trial (mean difference = 0.35; p = .68) and a significant increase for the Stress Trial (mean difference = 3.69; p < .01). Last, post-hoc comparison of Engagement revealed Engagement was significantly greater for the Stress Trial compared to the Control and Baseline Trials, and Control Trial Engagement was significantly greater than Baseline. A significant main effect was observed for the experimental trial on ALE-challenge (p < .01; $\eta_p^2$ = 0.41) and ALE-threat (p < .01; $\eta_p^2$ = 0.84). Post-hoc analysis revealed Challenge and Threat were significantly greater for the Stress Trial compared to the Control and Baseline Trials. There was no significant main effect of trial on NASA-TLX scores. Supplementary analysis did not find a significant association between personality measures and any of the self-report or physiological measures, which may be due to the small sample size.
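For readers less familiar with the effect size reported here, partial eta squared is conventionally defined as the share of variance attributable to an effect after excluding variance explained by other effects. The standard definition, and its equivalent form in terms of the F statistic, is given below; the symbols $SS$ and $df$ follow the usual ANOVA notation and are not taken from the source.

```latex
\[
\eta_p^2
  = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}
  = \frac{F \cdot df_{\text{effect}}}{F \cdot df_{\text{effect}} + df_{\text{error}}}
\]
```

On this definition, a value of 0.64 means the effect accounts for 64% of the variance that remains once other modeled effects are set aside.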

Figure 2.

Pre- and post-task sum scores for the Short Stress State Questionnaire (SSSQ).

Table 1.

SSSQ (Change Scores) and Cardiovascular Measures Across Trials.

Measures	Control M (SD)	Stress M (SD)	η_p²
Eng.	2.69 (4.54)	4.76 (3.79)	0.59*
Worry	−0.24 (3.56)	4.29 (4.29)	0.63*
Distress	1.76 (2.64)	4.29 (3.41)	0.42*
HR mean	87.24 (11.63)	90.47 (14.65)	0.31*
HR max	96.55 (12.79)	100.909 (13.64)	0.34*
RMSSD	19.44 (14.76)	20.04 (14.76)	0.01

Note. SSSQ = Short Stress State Questionnaire; Eng. = engagement; RMSSD = root mean square of successive differences; HR = heart rate.

*p < .01.

Preliminary analysis and visual examination of the physiological time series indicate that the experimental manipulation was associated with stress-related changes in the physiological measures. Figure 3 displays the complete physiological time series for one participant, showing a notable increase in skin conductance response (SCR) and a decrease in heart rate variability (HRV) during the Stress Trial, effects not present during the Baseline Trial or Control Trial. Repeated-measures ANOVAs (Control and Stress only) on HR mean, HR max, and HRV revealed a significant effect of experimental trial on HR mean and HR max, but not on HRV. Post-hoc pairwise comparisons revealed HR mean and HR max were significantly greater for the Stress Trial compared to the Control Trial. In addition, Control and Stress Trial HR mean and HR max were significantly greater for the Cycling group, but the increase in HR across trials was similar for both groups. In an attempt to assess fatigue, we compared physiological measures for the 1-min period immediately prior to the start of each trial. Pre-trial HR mean and max were not significantly different between Control and Stress Trials, while pre-trial HRV was significantly lower prior to the Stress Trial compared to Control. This may indicate participants had greater anticipatory stress prior to the Stress Trial.

Figure 3.

Time series visualization of physiological measures across full experiment for Participant nine.

Lastly, during debrief interviews participants reported experiencing greater stress during the Stress Trial, which many attributed to the increase in task difficulty and higher failure rates combined with the presence of an evaluator. One participant who was an orchestra musician compared the stress of the performance monitoring and evaluation of the Stress Trial to the stress of a musical audition (Figure 3).

Conclusion

This study evaluated a novel laboratory stress protocol combining cognitive, physical, and social-evaluative demands to enhance ecological validity, in hopes of improving the generalizability of using wearable systems for bio-behavioral monitoring outside of laboratory settings. Self-report and physiological stress measures were significantly higher during the Stress Trial than the Control Trial or Baseline Trial, even with moderate cycling-induced physical activity. Debriefing confirmed that stress responses were tied to the experimental manipulation of social evaluation.

There are several limitations to the current study. Trial order was fixed to amplify the stress response, an approach taken in prior work (Dedovic et al., 2005). Although we did not take steps to limit learning effects, the high difficulty and additional OM tasks utilized in the Stress Trial were intended to induce a high task failure rate across all participants, regardless of differences in learning effects. To limit cumulative fatigue, participants had extended breaks between trials. Post-hoc analysis found pre-trial heart rate (mean and max) was similar across trials, and most participants demonstrated sustained or increased mental (e.g., engagement) and physiological responses through the Stress Trial, reflecting sufficient effort and energy mobilization. Further, any mild fatigue would be expected to intensify stress, mirroring real-world operational conditions, thus strengthening the validity of the protocol. Last, the sample consisted of college students, potentially limiting the generalizability of these findings. However, since the target end-users (military personnel) share similar age and health profiles, these findings are aligned with study and project goals. Future work could assess how evaluator traits, stress-modulation techniques (e.g., competition, mindfulness), and/or real-time feedback influence stress outcomes. Although further testing is needed, our initial findings support the ecological validity of using this stress protocol when developing wearable bio-behavioral monitoring systems for high-stakes operating environments.

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This study was sponsored by the Office of Naval Research through the National Institute of Undersea Vehicle Technology.

ORCID iD

James Michael Hughes

References

Can, Y. S., Gokay, Kılıç, D. R., Ekiz, Chalabianloo, & Ersoy (2020). How laboratory experiments can be exploited for monitoring stress in the wild: A bridge between laboratory and daily life. Sensors, 20(3), 838.

Cegarra, Valéry, Avril, Calmettes, & Navarro (2020). OpenMATB: A multi-attribute task battery promoting task customization, software extensibility and experiment replicability. Behavior Research Methods, 52, 1980–1990.

Dedovic, Renwick, Mahani, N. K., Engert, Lupien, S. J., & Pruessner, J. C. (2005). The Montreal Imaging Stress Task: Using functional imaging to investigate the effects of perceiving and processing psychosocial stress in the human brain. Journal of Psychiatry and Neuroscience, 30(5), 319–325.

Dickerson, S. S., & Kemeny, M. E. (2004). Acute stressors and cortisol responses: A theoretical integration and synthesis of laboratory research. Psychological Bulletin, 130(3), 355–391.

Ferguson, Matthews, & Cox (1999). The appraisal of life events (ALE) scale: Reliability and validity. British Journal of Health Psychology, 4(2), 97–116.

Ganster, D. C., Crain, T. L., & Brossoit, R. M. (2018). Physiological measurement in the organizational sciences: A review and recommendations for future use. Annual Review of Organizational Psychology and Organizational Behavior, 5(1), 267–293.

Gossett, E. W., Wheelock, M. D., Goodman, A. M., Orem, T. R., Harnett, N. G., Wood, K. H., Mrug, Granger, D. A., & Knight, D. C. (2018). Anticipatory stress associated with functional magnetic resonance imaging: Implications for psychosocial stress research. International Journal of Psychophysiology, 125, 35–41.

Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology, 52, 139–183.

Helton, W. S., & Näswall (2015). Short Stress State Questionnaire: Factor structure and state change assessment. European Journal of Psychological Assessment, 31(1), 20–30.

Hoemann, Wormwood, J. B., Barrett, L. F., & Quigley, K. S. (2023). Multimodal, idiographic ambulatory sensing will transform our understanding of emotion. Affective Science, 4(3), 480–486.

John, O. P., & Srivastava (1999). The Big-Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (Vol. 2, pp. 102–138). New York: Guilford Press.

Judge, T. A., Erez, A. M. I. R., Bono, J. E., & Thoresen, C. J. (2003). The core self-evaluations scale: Development of a measure. Personnel Psychology, 56(2), 303–331.

Kong, Posada-Quintero, H. F., Gever, Bonacci, Chon, K. H., & Bolkhovsky (2022). Multi-attribute task battery configuration to effectively assess pilot performance deterioration during prolonged wakefulness. Informatics in Medicine Unlocked, 28, 100822.

Kudielka, B. M., Hellhammer, D. H., & Kirschbaum (2007). Ten years of research with the Trier Social Stress Test (TSST)–revisited. In Social neuroscience: Integrating biological and psychological explanations of social behavior (pp. 56–83). The Guilford Press.

National Geographic. (2023). Masterminds: Secrets of the octopus [Documentary]. IMDb. https://www.imdb.com/title/tt31137509/

Tanaka, Monahan, K. D., & Seals, D. R. (2001). Age-predicted maximal heart rate revisited. Journal of the American College of Cardiology, 37(1), 153–156.