Designing and Building a 3D Human Motion Dataset for Vision-Based Ergonomics Risk Assessments

Abstract

Artificial Intelligence (AI) is increasingly used in ergonomics, particularly for assessing musculoskeletal disorder (MSD) risks. Recent advancements in vision-based AI have enabled the monitoring of MSD risks using ordinary cameras, providing a more accessible and less intrusive alternative to traditional observation-based methods. However, existing AI models, trained on generic computer vision-domain datasets, lack the keypoints necessary for calculating intricate angles in high-degree-of-freedom (DoF) joints. We present the design and building process of a large-scale 3D human motion dataset for training vision-based AI models for ergonomics risk assessments. The dataset uses a 47-keypoint 3D human pose representation, selected to support both high-DoF joint angle calculation and vision-based pose estimation, and comprises 7 million frames of 10 subjects performing 9 categories of manual material handling tasks. A baseline MotionBERT model trained on our dataset achieved a mean absolute angle error of 3.5° and demonstrated its generalization capability on real-world industry videos.

Keywords

vision-based ergonomics risk assessment, work-related musculoskeletal disorders, manual material handling, 3D human motion capture, 3D human pose estimation

Introduction

Artificial Intelligence (AI), including machine learning, is increasingly being applied in ergonomics, particularly for assessing musculoskeletal disorder (MSD) risks through computer vision-based evaluation. Recent advancements in vision-based AI have enabled MSD risk monitoring using only an ordinary camera. Compared to traditional observation-based risk assessment methods, vision-based AI offers a more accessible and less intrusive alternative. These methods typically rely on 3D human pose estimation models, which are AI models that extract 3D joint locations from video frames to estimate body joint angles, a critical input for ergonomics risk assessments such as Rapid Entire Body Assessment (REBA) and Ovako Working Posture Assessment System (OWAS). However, existing pose estimation models are primarily designed for general-purpose computer vision tasks, such as human action recognition or motion capture for animation, rather than ergonomics risk assessments. As a result, these models often do not include the body keypoints necessary for calculating the intricate angles of high-degree-of-freedom (DoF) joints, such as distinguishing wrist flexion/extension, radial/ulnar deviation, and pronation/supination, or neck flexion/extension, lateral flexion, and rotation, all of which are crucial for ergonomics risk assessments. Additionally, these models are trained on datasets that primarily capture everyday activities, such as walking and sitting, which differ significantly from the motions observed in industrial settings. This mismatch in motion representation compromises the AI model's performance when applied in real-world environments. This study presents a method for designing and building a 3D human motion dataset tailored for ergonomics risk assessments. We also demonstrate how this dataset can enable a vision-based AI pipeline for ergonomics risk assessments.

Background

Previous studies have explored the application of vision-based AI models for ergonomics risk assessment by leveraging off-the-shelf 3D human pose estimation models trained on standard computer vision datasets, such as Human3.6M (Ionescu et al., 2014). However, these datasets primarily capture generic motions, such as walking and sitting, rather than the manual material handling (MMH) motions typically observed in industrial settings. Moreover, these datasets rely on simplified joint center pose representations, which lack the 3D spatial information needed for accurate joint angle calculations, particularly in high-DoF joints. As a result, pose estimation models trained on these datasets either apply simplified joint angle calculations that fail to distinguish intricate angles in high-DoF joints (e.g., neck flexion, side bend, and rotation; Chen et al., 2024) or rely on additional assumptions and approximations when calculating the joint angles required for ergonomics risk assessment (Chu et al., 2020). To address these limitations, researchers have developed custom human pose datasets tailored for ergonomics applications. Kwon et al. (2022) collected a manufacturing-specific dataset with additional hand keypoints for wrist angle calculation. However, they only collected 2D annotations, so their 3D model still had to be trained on 3D datasets from the computer vision domain. Fan et al. (2024) collected a 3D motion dataset capturing 7 subjects performing 14 construction tasks. However, the dataset's specific focus on construction tasks and its limited size (0.42 million frames) limit its suitability for training deep learning models for industrial ergonomics applications.

Approach

We designed and collected a large-scale MMH 3D human motion dataset to train vision-based AI models for ergonomics risk assessment. The dataset was designed with two key considerations. First, the keypoint selection aimed to capture critical biomechanical landmarks that provide sufficient information for high-DoF joint angle calculations while also including visually prominent features that enable accurate and consistent detection by vision-based AI models from videos. Second, the task motions were carefully designed to encompass typical MMH tasks commonly observed in industrial settings with sufficient motion variability to enhance the dataset’s generalizability for real-world applications.

For keypoint design, we defined 47 body surface keypoints based on the Plug-in-Gait marker set, a widely used motion capture marker set. The Plug-in-Gait marker set includes critical bony landmarks to minimize soft tissue artifacts and ensure accurate joint angle calculations, making it a strong foundation for our keypoint selection. Building on the Plug-in-Gait set, we replaced keypoints that lacked sufficient visual features for computer vision-based detection (e.g., left front of the head) with more visually prominent alternatives (e.g., ear). Featureless keypoints are effective in marker-based motion capture but are unsuitable for vision-based pose estimation AI models, which rely on visual features from a single camera to detect and track keypoints across video frames. From the body surface keypoints, we also derived joint angle calculation steps for 22 angles across 6 body joints (i.e., neck, shoulder, elbow, wrist, back, and knee), aligning with the inputs required for common risk assessment methods like REBA and OWAS.
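To illustrate how 3D surface keypoints can be turned into joint angles, the following Python sketch computes a simple included angle at a hinge-like joint and separates neck flexion from lateral bending by projecting a head vector onto a trunk-fixed coordinate frame. The keypoint names (e.g., `neck_base`, `head_top`), the frame construction, and the sign conventions are assumptions made for illustration; they are not the 22 angle definitions used in this study.

```python
import math


def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("zero-length vector; keypoints coincide")
    return tuple(x / norm for x in v)


def included_angle(proximal, joint, distal):
    """Angle in degrees at `joint` between the two adjacent segments."""
    u = unit(sub(proximal, joint))
    v = unit(sub(distal, joint))
    cosine = max(-1.0, min(1.0, dot(u, v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def trunk_frame(left_shoulder, right_shoulder, pelvis, neck_base):
    """Right-handed trunk frame (right, forward, up); hypothetical keypoints."""
    up = unit(sub(neck_base, pelvis))
    forward = unit(cross(up, sub(right_shoulder, left_shoulder)))
    right = cross(forward, up)
    return right, forward, up


def neck_flexion_and_bend(neck_base, head_top, frame):
    """Flexion (forward positive) and lateral bend (right positive), degrees."""
    right, forward, up = frame
    head = sub(head_top, neck_base)
    flexion = math.degrees(math.atan2(dot(head, forward), dot(head, up)))
    lateral = math.degrees(math.atan2(dot(head, right), dot(head, up)))
    return flexion, lateral


# Example: upright trunk facing +y, head tilted 30 degrees forward.
frame = trunk_frame((-0.2, 0.0, 1.4), (0.2, 0.0, 1.4), (0.0, 0.0, 1.0), (0.0, 0.0, 1.5))
head_top = (0.0, 0.1 * math.sin(math.radians(30)), 1.5 + 0.1 * math.cos(math.radians(30)))
flexion, lateral = neck_flexion_and_bend((0.0, 0.0, 1.5), head_top, frame)
elbow = included_angle((0.0, 0.0, 1.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

A single included angle is enough for hinge-like joints such as the elbow or knee, whereas high-DoF joints need a local reference frame so that the same segment vector can be decomposed into separate anatomical components. Building that frame is what requires additional surface keypoints beyond joint centers.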

Beyond keypoint selection, capturing a dataset with sufficient motion variability was also essential for training a generalizable pose estimation AI model. Participants were instructed to simulate nine categories of MMH tasks typically observed in manufacturing, including lifting/lowering, carrying, pushing/pulling, poking, unboxing, and assembling, within a controlled lab environment. An additional warm-up session was included to capture the full range of motion for each joint through guided stretching exercises. Participants were provided with task-related objects, including boxes, hand carts, and poles, to promote realistic task motions. To further promote variability, participants were instructed to randomly combine different task parameters, such as assembling heights, pushing weights, and lifting postures, rather than following a predefined script. The number of task parameter changes was counted for each participant and task category. On average, each participant performed 29.8 distinct combinations of task parameters during the 3-minute session for each task category.

Using marker-based Vicon motion capture, we collected 7 million frames (19.4 hr) of 3D human motion from 10 participants (5 male and 5 female) recruited from the University of Michigan student population, with body mass index (BMI) values ranging from 17.53 to 32.23. Compared to the commonly used computer vision-domain 3D dataset Human3.6M (Ionescu et al., 2014; 3.6 million frames featuring 11 subjects) or the construction-specific 3D dataset collected by Fan et al. (2024; 0.42 million frames featuring 7 subjects), our dataset provides substantially more training data and motion variability while maintaining a comparable number of subjects.

Using the collected dataset, we trained both image-to-2D and 2D-to-3D pose estimation models. For the 2D pose estimation model, we employed RTMPose, chosen for its leading performance on the COCO-WholeBody evaluation dataset (Jiang et al., 2023). To prevent overfitting to our lab background, we trained the 2D pose estimation model on a merged dataset that combines the diverse COCO-WholeBody dataset with our lab-collected dataset. For the 3D pose estimation model, we used MotionBERT, a transformer-based model with leading performance on the Human3.6M dataset (Zhu et al., 2023). We modified MotionBERT's input and output layers to accommodate the increased number of keypoints in our dataset. Once trained, the two models operate in a linear pipeline to estimate 3D keypoints from video frames, which are then used to compute joint angles that can serve as inputs for subsequent ergonomics risk assessments.
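The linear pipeline can be sketched as three chained stages. The Python below is a structural outline only: `detect_2d`, `lift_to_3d`, and `compute_angles` are hypothetical callables standing in for the trained 2D model, the 2D-to-3D model, and the angle calculation steps, and they do not correspond to the actual RTMPose or MotionBERT interfaces. The only value taken from the text is the keypoint count of 47.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

NUM_KEYPOINTS = 47  # keypoint count reported for the dataset

Pose2D = List[Tuple[float, float]]
Pose3D = List[Tuple[float, float, float]]


def run_pipeline(
    frames: Sequence[Any],
    detect_2d: Callable[[Any], Pose2D],
    lift_to_3d: Callable[[List[Pose2D]], List[Pose3D]],
    compute_angles: Callable[[Pose3D], Dict[str, float]],
) -> List[Dict[str, float]]:
    """Chain image-to-2D, 2D-to-3D, and angle calculation over a video clip."""
    # Stage 1: per-frame 2D keypoint detection.
    poses_2d = [detect_2d(frame) for frame in frames]
    for index, pose in enumerate(poses_2d):
        if len(pose) != NUM_KEYPOINTS:
            raise ValueError(f"frame {index}: expected {NUM_KEYPOINTS} keypoints, got {len(pose)}")

    # Stage 2: lift the whole 2D sequence to 3D so the lifter can use temporal context.
    poses_3d = lift_to_3d(poses_2d)
    if len(poses_3d) != len(poses_2d):
        raise ValueError("3D lifter must return one pose per input frame")

    # Stage 3: per-frame joint angles for downstream risk assessment.
    return [compute_angles(pose) for pose in poses_3d]


# Usage with trivial stand-in stages (placeholders, not real models).
def fake_detect(frame):
    return [(float(frame), 0.0)] * NUM_KEYPOINTS


def fake_lift(sequence):
    return [[(x, y, 1.0) for x, y in pose] for pose in sequence]


def fake_angles(pose):
    return {"dummy_angle": pose[0][0]}


angles = run_pipeline([0, 1, 2], fake_detect, fake_lift, fake_angles)
```

Keeping the stages behind plain callables makes it possible to evaluate the 3D stage on its own, for example by feeding it accurate 2D keypoints instead of detector output.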

Outcome

We evaluated the joint angle estimation performance of the pose estimation models trained on our dataset by comparing their predictions against ground-truth motion capture data on a designated test set. The models achieved a mean absolute error (MAE) of 7.5° for the end-to-end pipeline and 3.4° when accurate 2D input was assumed. In comparison, prior work (Chu et al., 2020) reported an MAE of 11.7° for estimating the 3D orientation of human body segments relative to the world coordinate frame. While differences in angle definitions and evaluation datasets limit direct comparisons, our results suggest a substantial improvement in joint angle estimation accuracy, which would enhance the reliability of subsequent ergonomics risk assessments. We also compared the angle accuracies across different joints. While most joints, such as the back (MAE = 1.75°) and knee (MAE = 2.21°), exhibited low errors, wrist joint angles showed higher errors (MAE = 5.58°), likely because the wrist and hand keypoints lie close together, which makes the resulting angles more sensitive to small inaccuracies in 3D pose estimation. We also applied our trained models to 23 real-world industry videos and visualized the estimated 3D postures and joint angles to demonstrate the models' generalization capability to industrial settings. While we conducted quantitative evaluations in the lab and qualitative evaluations in industry scenarios, further quantitative evaluations in industry scenarios that investigate the impact of real-world factors, such as multi-person scenes and occlusions, would provide more insights for practitioners applying the trained AI models.
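As a minimal sketch of the metric, the Python below computes the MAE between predicted and ground-truth angles, both overall and per joint. The wrap-around handling (treating 179° and -179° as 2° apart) and the dictionary layout keyed by joint name are assumptions for illustration; the text does not specify how the evaluation handled either.

```python
def angle_diff_deg(pred, true):
    """Smallest signed difference pred - true in degrees, in [-180, 180)."""
    return (pred - true + 180.0) % 360.0 - 180.0


def angle_mae(pred_angles, true_angles):
    """Mean absolute angle error in degrees over paired samples."""
    if len(pred_angles) != len(true_angles):
        raise ValueError("prediction and ground truth must have equal length")
    if not pred_angles:
        raise ValueError("at least one sample is required")
    total = sum(abs(angle_diff_deg(p, t)) for p, t in zip(pred_angles, true_angles))
    return total / len(pred_angles)


def per_joint_mae(pred_by_joint, true_by_joint):
    """MAE per joint; both inputs map a joint name to a list of angles."""
    return {
        joint: angle_mae(pred_by_joint[joint], true_by_joint[joint])
        for joint in true_by_joint
    }


# Example with made-up numbers (not results from this study).
predicted = {"knee": [10.0, 20.0, 30.0], "wrist": [179.0, -5.0]}
reference = {"knee": [12.0, 17.0, 30.0], "wrist": [-179.0, 5.0]}
errors = per_joint_mae(predicted, reference)
```

Reporting errors per joint, as in the comparison above, exposes joints whose angles are harder to estimate, which a single pooled MAE would hide.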

Conclusion

This study presents a method for constructing a 3D human motion dataset tailored for vision-based ergonomics risk assessment and outlines the process of training pose estimation AI models for accurate joint angle estimation. We identified key body landmarks suitable for both vision-based pose estimation and joint angle calculation, particularly for high-DoF joints. We established the corresponding angle calculation steps to derive detailed joint angles for ergonomics risk assessment from the estimated 3D keypoints. Models trained using our pipeline achieved a joint angle MAE of 3.5° on our validation set, demonstrating high angle estimation accuracy. We further validated the model’s practical applicability by visualizing the 3D pose and joint angle outputs from real industry video examples. The proposed method provides guidelines for future researchers to collect custom datasets and develop AI-driven ergonomics risk assessment solutions, enabling accessible and less intrusive posture monitoring using ordinary cameras, ultimately contributing to improved workplace safety and injury prevention.

Footnotes

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: This research was supported financially by VelocityEHS. SangHyun Lee, Daeho Kim, and Meiyin Liu are external consultants to VelocityEHS. Veeru Talreja, Julia Penfield, and Rick Barker are full-time employees of VelocityEHS.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported financially by VelocityEHS.

ORCID iDs

Leyang Wen

Veeru Talreja

Meiyin Liu

SangHyun Lee

References

Chen et al. (2024). Real-time ergonomic risk assessment in construction using a co-learning-powered 3D human pose estimation model. Computer-Aided Civil and Infrastructure Engineering, 39(9), 1337–1353. https://doi.org/10.1111/mice.13139

Chu, Han, Luo, Zhu (2020). Monocular vision–based framework for biomechanical analysis or ergonomic posture assessment in modular construction. Journal of Computing in Civil Engineering, 34(4), Article 04020018. https://doi.org/10.1061/(ASCE)CP.1943-5487.0000897

Fan, Mei (2024). 3D pose estimation dataset and deep learning-based ergonomic risk assessment in construction. Automation in Construction, 164, Article 105452. https://doi.org/10.1016/j.autcon.2024.105452

Ionescu, Papava, Olaru, Sminchisescu (2014). Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339. https://doi.org/10.1109/TPAMI.2013.248

Jiang, Zhang, Han, Lyu, Chen (2023). RTMPose: Real-time multi-person pose estimation based on MMPose. arXiv, arXiv:2303.07399. http://arxiv.org/abs/2303.07399

Kwon, Y.-J., Kim, D.-H., Son, B.-C., Choi, K.-H., Kwak, Kim (2022). A Work-Related Musculoskeletal Disorders (WMSDs) risk-assessment system using a single-view pose estimation model. International Journal of Environmental Research and Public Health, 19(16), Article 9803. https://doi.org/10.3390/ijerph19169803

Zhu, Liu, Wang (2023). MotionBERT: A unified perspective on learning human motion representations. arXiv, arXiv:2210.06551. http://arxiv.org/abs/2210.06551