Abstract
Introduction
Uptake of telehealth has surged, yet no previous studies have evaluated the clinimetric properties of clinician-administered performance-based tests of function, strength, and balance via telehealth in people with chronic lower limb musculoskeletal pain. This study investigated the: (i) test–retest reliability of performance-based tests via telehealth, and (ii) agreement between scores obtained via telehealth and in-person.
Methods
Fifty-seven adults aged ≥45 years with chronic lower limb musculoskeletal pain underwent three testing sessions: one in-person and two via videoconferencing. Tests included 30-s chair stand, 5-m fast-paced walk, stair climb, timed up and go, step test, timed single-leg stance, and calf raises. Test–retest reliability and agreement were assessed via intraclass correlation coefficients (ICC; lower limit of 95% confidence interval (CI) ≥0.70 considered acceptable). ICCs were interpreted as poor (<0.5), moderate (0.5–0.75), good (0.75–0.9), or excellent (>0.9).
Results
Test–retest reliability was good-excellent with acceptable lower CI for stair climb test, timed up and go, right leg timed single-leg stance, and calf raises (ICC = 0.84–0.91, 95% CI lower limit = 0.71–0.79). Agreement between telehealth and in-person was good-excellent with acceptable lower CI for 30-s chair stand, left leg single-leg stance, and calf raises (ICC = 0.82–0.91, 95% CI lower limit = 0.71–0.85).
Discussion
Stair climb, timed up and go, right leg timed single-leg stance, and calf raise tests have acceptable reliability for use via telehealth in research and clinical practice. If re-testing via a different mode (telehealth/in-person), clinicians and researchers should consider using the 30-s chair stand test, left leg timed single-leg stance, and calf raise tests.
Keywords
Introduction
Chronic lower limb musculoskeletal conditions, the most common of which include osteoarthritis, rheumatoid arthritis, and problems arising from accident/injury, 1 are major public health problems with enormous social, personal, and economic burden. 2 These conditions often give rise to chronic pain, causing individuals to seek care. In 2019, approximately 1.7 billion people worldwide had at least one musculoskeletal condition, an increase of 62% since 1990, 3 contributing to one of the highest expenditures of all health conditions.4,5
Chronic musculoskeletal conditions frequently lead to impairments in body functions, impacting an individual's ability to perform daily activities and limiting social and workforce participation.6–8 As such, an important part of healthcare is the measurement of impairments and function over time in order to evaluate treatment response and adjust treatment if required. A range of self-report measures (e.g. questionnaires) and/or performance-based tests 8 are typically clinician-administered and often used in both clinical and research settings. While easy to implement, questionnaires are highly subjective and influenced by personal and psychosocial factors like depression, self-efficacy, and pain, 9 and evidence suggests that performance-based measures in people with musculoskeletal conditions assess different aspects of physical function than self-reported measures alone.10–13 As such, the use of performance-based tests (e.g. measurement of walking speed, stair climbing ability, standing from sitting, balance, and strength) is important.
Physiotherapists are common providers of care for people with chronic musculoskeletal pain 14 and traditionally, care has been provided in person. However, physiotherapists are increasingly delivering care via telehealth to patients who may otherwise have difficulty accessing services,15,16 and there is evidence that these models of service delivery are effective.17,18 Implementation of telehealth-delivered physiotherapy services, particularly using videoconferencing,19,20 has exploded during the COVID-19 pandemic.21,22 Whilst clinicians have reported using a range of functional tests to assess patients via telehealth, 19 the reliability of administering such tests via videoconferencing, and their level of agreement with in-person test scores, is unclear. Indeed, clinicians have noted that difficulties with objectively assessing patients are a barrier to the implementation of telehealth.19,20,23
There is emerging evidence that assessment of musculoskeletal disorders via telehealth is reliable and has sufficient agreement with in-person assessment.24,25 A recent systematic review identified 39 studies examining the clinimetric properties of physiotherapy assessments via telehealth, concluding that they were reliable for specific types of assessment in limited populations. 25 However, only four of the 39 studies focused on telehealth assessment of people with chronic lower limb musculoskeletal conditions.26–29 Those four studies used sophisticated videoconferencing systems involving remote-controlled patient cameras and inbuilt measurement tools to quantify patient physical performance.26–29 Such systems are not readily available to clinicians or patients, and thus findings from those studies are not generalisable to the telehealth services that are typically implemented in routine clinical practice, which tend to utilise simple internet-based videoconferencing software (e.g. Zoom) or videoconferencing modules within clinical software applications (e.g. Physitrack). 19 Another recent systematic review of psychometric properties of performance-based tests via telehealth for people with chronic conditions concluded that the current evidence is of low to very low quality, reflecting the small number of studies that have been conducted. 30 Thus, the aims of this study were to investigate the: (i) test–retest reliability of clinician-administered performance-based tests via telehealth, and (ii) agreement between scores obtained via telehealth and in-person for people with chronic lower limb musculoskeletal pain.
Methods
Study design
A prospective within-participant repeated-measure design was used with participants assessed on three occasions (twice via telehealth and once in-person) over approximately two weeks. This study was reviewed and approved by the University of Melbourne Human Ethics Advisory Group and all participants provided informed written consent.
Participants and sample size
Participants with chronic lower limb musculoskeletal pain were recruited from the community in metropolitan Melbourne, Australia, via advertisements in online newsletters and on social media, as well as invitations emailed to our Centre's research volunteer database. Eligible participants were aged ≥45 years, had current chronic lower limb musculoskeletal pain (pain at any lower limb site on most days of the past three months) that interfered with function, had pain of ≥2 out of 10 on an 11-point numeric rating scale, and had access to the internet and a device capable of videoconferencing (e.g. tablet or laptop/desktop computer). Participants were ineligible if they failed Exercise and Sports Science Australia adult pre-exercise screening, 31 had any hearing or visual impairments that would preclude adequate participation in the telehealth assessment, had had any falls in the prior 12 months, had any neurological conditions that affected their balance, or were unable to understand written/spoken English.
A pre-defined intraclass correlation coefficient (ICC) of 0.80 was set as the optimal level of reliability or agreement for each performance-based test, with a minimum acceptable ICC of 0.70. 32 Using the estimation method, 33 a total sample size of 51 is required for an expected reliability of 0.80 with two measurements and a 95% confidence interval (CI) width of 0.2. To account for any potential dropouts (or participants excluded from completing an individual test due to safety concerns or excluded from analyses because of changes in their condition over time (described below)), we anticipated 10% attrition and aimed to recruit 57 participants.
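The sample size arithmetic above can be illustrated with a short sketch. The source cites only "the estimation method", so the formula used here (Bonett's large-sample approximation for the width of an ICC confidence interval) is an assumption rather than a confirmed description of the authors' calculation, and the function names are hypothetical. Under that assumption, the inputs stated in the text (expected ICC of 0.80, two measurements, 95% CI width of 0.2, 10% attrition) give 51 and 57.

```python
import math
from statistics import NormalDist


def icc_sample_size(rho: float, k: int, width: float, conf: float = 0.95) -> int:
    """Participants needed for an ICC confidence interval of a given total width.

    ASSUMPTION: uses Bonett's large-sample approximation; the source only
    refers to "the estimation method" and does not spell out the formula.
    """
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided critical value
    numerator = 8 * z**2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
    denominator = k * (k - 1) * width**2
    return math.ceil(numerator / denominator + 1)


def inflate_for_attrition(n: int, attrition: float) -> int:
    """Recruitment target so that n participants remain after expected dropout."""
    return math.ceil(n / (1 - attrition))


required = icc_sample_size(rho=0.80, k=2, width=0.2)
target = inflate_for_attrition(required, attrition=0.10)
print(required, target)  # 51 57
```

Rounding up at both steps is a conservative choice: it ensures the planned interval width is not exceeded even if the anticipated attrition occurs.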
Clinician assessors
Physiotherapists were recruited as our clinician assessors. To ensure physiotherapist availability for assessments and to maximise generalisability, we recruited four registered physiotherapists with at least one year of clinical experience to conduct assessments (demographic details in Appendix 1). Physiotherapists were trained in measurement protocols and underwent a mock in-person and telehealth testing session with one of the researchers prior to the recruitment of participants.
Performance-based tests
Participants performed up to seven performance-based tests at each testing session (Table 1): (1) 30-s chair stand; (2) 5-m fast-paced walk; (3) stair climb test; (4) timed up and go; (5) step test (on each leg); (6) timed single-leg stance (on each leg); and (7) calf raise test (on both legs and each single leg).
Overview of performance-based tests performed in-person and via telehealth.
Physiotherapists were provided with a standardised testing manual that described each performance-based test (Appendix 2 with a simplified clinician user manual available online 34 ), which was adapted from existing resources describing administration in in-person settings. 35 Physiotherapists used their discretion to determine whether an individual participant should be excluded from performing a test if there were safety risks (e.g. those with a gait aid did not perform the single leg stance test). Participants without necessary ‘equipment’ in their own homes were unable to complete all tests (e.g. those without any stairs in their homes did not perform the stair climb test).
Testing sessions
Participants underwent three testing sessions over approximately two weeks: one in-person and two via telehealth. The order of assessments (in-person or telehealth), the assessing physiotherapist, and the order of performance-based tests (clustered by whether the test required shoes or bare feet) for each participant were randomised using password-protected software (REDCap™). The first two sessions (i.e. the in-person and first telehealth session) were performed within 1–3 days of each other at approximately the same time of day. To evaluate test–retest reliability, the second telehealth session was performed approximately two weeks later to allow sufficient time to limit recall of test scores whilst also limiting the chance of real change in clinical status (consistent with other studies evaluating test–retest reliability of performance-based tests in people with musculoskeletal conditions39–41). The same physiotherapist assessed the same participant at all three sessions. The same order of tests as for the first session was followed for each subsequent session. Participant-reported global rating of change in pain was assessed (via a 5-point Likert scale with response options: ‘much worse’, ‘slightly worse’, ‘no change’, ‘slightly better’, and ‘much better’) prior to the second and third sessions to ensure only those whose condition was stable across test sessions were included. Participants who recorded ‘much worse’ or ‘much better’ did not complete the testing session.
Participants were assessed in-person in the human movement laboratory at the University of Melbourne. Equipment (e.g. chairs, tape measures, cones) was set up prior to each assessment. Physiotherapists actively guided participants through each test using the instructions described in the testing manual (Appendix 2).
Telehealth assessments were undertaken using Zoom (Zoom Video Communications, Inc., San Jose, CA). Participants were located in their own homes (or other private location) while physiotherapists were located at the University or their home/workplace. Prior to the session, participants were provided with instructions on how to download and use Zoom and a list of simple equipment to have on-hand during the session (including a chair without wheels (not of a specific height), a tape measure, and four cones/markers). Participants were encouraged to use a laptop or tablet, if possible, otherwise, a smartphone was acceptable. Physiotherapists verbally instructed the participant through each test using the instructions described in the testing manual (Appendix 2), including how to set up the environment (e.g. measure out a 5-m walkway) and where to position their device's camera so that the physiotherapist could view the participant performing the test.
Data collection
Participant demographic information was collected prior to the first assessment. Procedures for data collection for each performance-based test are described briefly in Table 1 and in detail in Appendix 2.
For each test, clinical utility data were also collected, including the duration of each assessment session and whether the participant was unable to undertake any test (and reasons why). The type of equipment used by participants at home during the telehealth sessions was recorded. At the conclusion of each session, participants were asked to respond verbally to a series of 11-point numeric rating scales (NRS) evaluating: (i) how confident they felt with the method of assessment; (ii) how comfortable they felt during the session; (iii) how safe they felt during the session, and (iv) how easy it was to perform the tests within the session (on average for all tests).
Data analysis
Descriptive statistics were summarised for demographic and clinical utility data. Percentages of maximal scores (ceiling effects) were calculated for the timed single-leg stance test since the score for this test was capped at 30 s.
Test–retest reliability between telehealth assessments for each performance-based test score was determined using ICCs, with 95% CIs, calculated using a two-way random-effects model. A Bland-Altman plot 42 of the difference between paired measurements versus their mean was then generated for each test and included the mean difference, 95% CI and the 95% limits of agreement (estimated by mean difference ±2 standard deviations of the difference). Similar analyses were undertaken to assess agreement between telehealth and in-person tests.
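As an illustration of the reliability statistic, the following sketch computes a single-measurement ICC from a two-way random-effects analysis of variance. It assumes the absolute-agreement form usually written ICC(2,1) (the notation used in the Table 3 footnote); the source does not state whether absolute agreement or consistency was specified in Stata. The function name and the example scores are hypothetical, and no confidence interval is computed.

```python
def icc_2_1(scores: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores[i][j] is the score of participant i in session j.
    ASSUMPTION: the absolute-agreement definition is used here.
    """
    n = len(scores)     # number of participants (rows)
    k = len(scores[0])  # number of sessions (columns)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Partition the total sum of squares into participants, sessions, and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-participant mean square
    msc = ss_cols / (k - 1)               # between-session mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical timed up and go scores (seconds) from two telehealth sessions.
example_scores = [[8.1, 8.4], [9.5, 9.2], [7.2, 7.5], [10.8, 10.1], [6.9, 7.0]]
print(round(icc_2_1(example_scores), 2))
```

Because the between-session mean square appears in the denominator, a systematic shift between sessions (for example, a learning effect) lowers this form of the ICC even when participants keep the same rank order.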
Interpretation of ICCs was based on published recommendations of poor (ICC<0.5), moderate (0.5–0.75), good (0.75–0.9), or excellent (>0.9), 43 with an a priori optimal level of agreement of 0.8 32 and minimum acceptable level of 0.7. 32 Lower limits of 95% CIs were inspected to determine whether ICCs met the pre-determined acceptable threshold.
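The interpretation rules above can be expressed as a small sketch. The published bands share their boundary values (0.75 and 0.9), so the handling of scores exactly on a boundary is an assumption, as are the function names. The example uses the agreement result reported for the 5-m fast-paced walk (ICC = 0.55, 95% CI = 0.30–0.72).

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC point estimate to the descriptive bands used in the text.

    ASSUMPTION: values exactly on a boundary fall into the higher band at
    0.5 and 0.75, and into "good" at 0.9 (since excellent is stated as >0.9).
    """
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


def is_acceptable(ci_lower: float, threshold: float = 0.70) -> bool:
    """True when the lower 95% CI limit meets the pre-specified minimum."""
    return ci_lower >= threshold


# 5-m fast-paced walk, agreement between telehealth and in-person scores.
print(interpret_icc(0.55), is_acceptable(0.30))  # moderate False
```

This sketch classifies the point estimate only. The article describes the 5-m fast-paced walk as showing poor agreement because its 95% CI extends below 0.5, which is a judgement on the interval rather than on the estimate.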
Statistical analyses were performed using Stata version 16.1 (StataCorp LLC, College Station, TX, USA).
Results
Participant characteristics
Fifty-seven participants were recruited (Table 2). Mean age was 63.1 (standard deviation (SD) = 9.3) years. Most participants were female (70%), exercised four or more times per week (58%), had not previously used telehealth for physiotherapy (93%), and used videoconferencing software for communication at least once a week (63%). The most commonly painful body part was the knee (72%), followed by the hip (58%). Around half (53%) had received a diagnosis for their pain problems, the most commonly reported of which were osteoarthritis (43%), non-specific arthritis (20%), bursitis (7%), torn meniscus (7%), or plantar fasciitis (7%). Fifty-three (93%) participants were included in the analyses of agreement between telehealth and in-person tests; 54 (95%) in the test–retest reliability analyses. Three participants withdrew from the study after only one testing session and four did not complete both telehealth tests. Average time between in-person and telehealth testing was 2.4 (SD = 1.9) days and between the two telehealth tests was 14.6 (SD = 2.4) days. Missing data for each individual test, and reasons why, are described in Appendix 3. A description of materials used for telehealth and in-person testing sessions is in Appendix 4.
Participant characteristics (n = 57).
IQR = interquartile range (25th to 75th percentile).
Participants were asked ‘do you currently participate in regular exercise and/or physical activity?’
Not mutually exclusive; participants were able to choose all body parts that were affected
Rated on an 11-point scale ranging from 0 (no pain) to 10 (worst pain possible) in response to ‘select the number which indicates the average number of pain felt over the past week in the muscles and/or joints of your leg’
Rated on an 11-point scale ranging from 0 (not at all) to 10 (extremely) in response to ‘select the number which indicates how the problem with the muscles and/or joints of your lower limb have interfered with your physical function over the past week.’
Test–retest reliability of performance-based tests via telehealth
The estimated ICCs for test–retest reliability for four tests were good to excellent (ICC = 0.84–0.91, Table 3), with lower 95% CI limits above the pre-specified acceptable threshold of 0.7 (95% CI lower limit = 0.71–0.79): the stair climb, timed up and go, right leg timed single-leg stance, and calf raise tests. The estimated ICCs for test–retest reliability for four other tests were moderate to good (ICC = 0.69–0.81); however, the lower 95% CI limits did not reach acceptable levels (95% CI lower limit = 0.52–0.69): the 30-s chair stand, 5-m fast-paced walk, step test, and left leg timed single-leg stance.
Test–retest reliability of performance-based tests measured via telehealth (n = 57).
SD = standard deviation; ICC = intra-class correlation coefficient; CI = confidence interval; SEM = standard error of measurement; MDC95 = minimal detectable change at the 95% confidence limits.
Inspection of maximum scores showed a consistent ceiling effect, with 70% and 72% of participants able to perform the test for the maximum of 30 s at the in-person and first telehealth testing sessions, respectively.
Inspection of maximum scores showed a consistent ceiling effect, with 71% and 70% of participants able to perform the test for the maximum of 30 s at the in-person and first telehealth testing sessions, respectively.
SEM = SD in first telehealth session × √(1 – (intra-class correlation coefficient, ICC[2,1]))
MDC95 = SEM × 1.96 × √2
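The two footnote formulas can be written as a short sketch. The function names and the input values (an SD of 2.0 and an ICC of 0.84) are hypothetical and chosen only to show the arithmetic; they are not results from Table 3.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD in the first telehealth session x sqrt(1 - ICC)."""
    return sd * math.sqrt(1 - icc)


def minimal_detectable_change_95(sem: float) -> float:
    """MDC95 = SEM x 1.96 x sqrt(2)."""
    return sem * 1.96 * math.sqrt(2)


# Hypothetical inputs, for illustration only.
sem = standard_error_of_measurement(sd=2.0, icc=0.84)
mdc95 = minimal_detectable_change_95(sem)
print(round(sem, 2), round(mdc95, 2))  # 0.8 2.22
```

The factor √2 reflects that a change score combines the measurement error of two sessions, so a retest difference smaller than the MDC95 cannot be distinguished from measurement error at the 95% level.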
Normality and uniformity assumptions of the mean and SD of the differences between telehealth sessions appeared reasonable (Figure 1) except for the timed single-leg stance test for both the left and right leg, as few participants scored below the maximum of 30 s. Differences between paired telehealth measurements did not increase in magnitude substantially with higher counts or longer times.

Bland–Altman plots of differences between telehealth sessions (session 1 minus session 2) versus averages of paired measurements for each performance-based test.
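The quantities shown in a Bland–Altman plot can be sketched as follows, using the definition given in the data analysis (limits of agreement estimated as the mean difference ±2 SDs of the differences) and the sign convention of the caption (session 1 minus session 2). The function name and the example scores are hypothetical; the 95% CI of the mean difference and the plotting itself are left out.

```python
from statistics import mean, stdev


def bland_altman(first: list[float], second: list[float], multiplier: float = 2.0) -> dict:
    """Per-pair averages and differences, mean difference, and limits of agreement.

    Differences are taken as first minus second. The limits use the sample SD
    of the differences multiplied by `multiplier` (2, as stated in the text).
    """
    differences = [a - b for a, b in zip(first, second)]
    averages = [(a + b) / 2 for a, b in zip(first, second)]
    bias = mean(differences)
    sd_diff = stdev(differences)
    return {
        "averages": averages,        # x-axis values
        "differences": differences,  # y-axis values
        "bias": bias,
        "lower": bias - multiplier * sd_diff,
        "upper": bias + multiplier * sd_diff,
    }


# Hypothetical 30-s chair stand counts from two telehealth sessions.
result = bland_altman([12, 14, 11, 15, 13], [13, 14, 12, 14, 13])
print(result["bias"], round(result["lower"], 2), round(result["upper"], 2))
```

Plotting the differences against the averages, rather than against either session alone, is what allows a visual check of whether the spread of differences grows with higher counts or longer times.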
Agreement between telehealth and in-person performance-based tests
The estimated ICCs for agreement for three tests were good to excellent (ICC = 0.82–0.91, Table 4), with lower 95% CI limits above the pre-specified acceptable threshold of 0.7 (95% CI lower limit = 0.71–0.85): the 30-s chair stand, left leg timed single-leg stance, and calf raise tests. The estimated ICCs for agreement for four tests ranged from moderate to good (ICC = 0.71–0.81); however, the lower 95% CI limits did not reach acceptable levels (95% CI lower limit = 0.52–0.69): the stair climb, timed up and go, step test, and right leg timed single-leg stance. One test, the 5-m fast-paced walk, did not meet our minimum acceptable ICC and showed poor agreement (ICC = 0.55, 95% CI = 0.30–0.72; the 95% CI included values <0.5).
Agreement between performance-based tests when measured in-person and by telehealth (n = 57).
SD = standard deviation; ICC = intra-class correlation coefficient; CI = confidence interval.
Inspection of maximum scores showed a consistent ceiling effect, with 69% and 70% of participants able to perform the test for the maximum of 30 s at the in-person and first telehealth testing sessions, respectively.
Inspection of maximum scores showed a consistent ceiling effect, with 69% and 71% of participants able to perform the test for the maximum of 30 s at the in-person and first telehealth testing sessions, respectively.
Assumptions of normality and uniformity of the mean and SD of the differences again appeared reasonable (Figure 2), except for the timed single-leg stance test for both the left and right legs. Differences between paired measurements did not increase in magnitude substantially with higher counts or longer times.

Bland–Altman plots of differences between methods (in-person minus telehealth) versus averages of paired measurements for each performance-based test.
Clinical utility of testing sessions
Participant confidence, comfort, safety, and ease of the performance-based tests were high and similar across telehealth and in-person sessions (Table 5), ranging from a mean of 8.5–9.8 out of 10 on an 11-point NRS for the in-person session, compared to 8.4–9.4 out of 10 for the telehealth sessions. The in-person testing session was, on average, shorter than the first telehealth session but of similar duration to the second telehealth session.
Clinical utility measures relating to in-person and telehealth testing sessions.
Rated on 11-point Likert scales ranging from 0 (not at all) to 10 (very confident/very comfortable/very safe/very easy).
SD: standard deviation.
Discussion
This study aimed to investigate the test–retest reliability of clinician-administered performance-based tests via telehealth, and agreement between scores obtained via telehealth and in-person, in adults with chronic lower limb musculoskeletal pain. We found that the stair climb test, timed up and go, right leg timed single-leg stance, and calf raise tests demonstrated acceptable test–retest reliability via telehealth. The 30-s chair stand, left leg timed single-leg stance, and calf raise tests demonstrated acceptable agreement between scores obtained via telehealth and in-person.
To our knowledge, no previous studies have evaluated the test–retest reliability, or agreement with in-person assessment, of telehealth-administered stair climb, calf raise, timed single-leg stance, and 5-m fast-paced walk tests. However, a small number of studies have examined the 30-s chair stand, timed up and go, and step tests. Two studies found that the 30-s chair stand test had excellent test–retest reliability via telehealth (ICC = 0.95, 95% CI = 0.92–0.97) 44 and good–excellent agreement with in-person assessment (correlation coefficient = 0.95, 44 Krippendorff's alpha reliability estimate = 0.85 29 (95% CIs not reported in either)) in healthy young adults (without any health condition or musculoskeletal pain) assessed via Zoom 44 and in people after total knee arthroplasty assessed via sophisticated videoconferencing software (wide-angle cameras with remote-controlled panning/tilting). 29 Four studies found that the timed up and go test had excellent test–retest reliability (ICCs = 0.96–0.98, 95% CI lower limit = 0.86–0.98) and excellent agreement with in-person test scores (ICCs = 0.83–0.98, 95% CI lower limit = 0.27–0.96) via both simple (i.e. Zoom, WhatsApp, and Adobe Connect) and sophisticated (i.e. eHAB, which allows real-scale measurement of performance) videoconferencing software in people with chronic heart failure, 45 chronic obstructive pulmonary disease, 46 and in healthy older adults.39,44 However, lower limits of 95% CIs for agreement between telehealth and in-person scores for two of those studies39,45 fell below our pre-determined acceptable threshold. One study found that the step test had excellent agreement with in-person assessment (weighted Cohen's kappa = 0.95–0.97 (95% CIs not reported)) via sophisticated eHAB videoconferencing software in people with Parkinson's disease. 40
Our ICCs appear to be lower than some of those previous studies described above.29,40,44–46 In addition, we found that some tests did not meet our lower 95% CI acceptable threshold for reliability or agreement with in-person scores. This may be because our study utilised pragmatic methods that could be easily implemented in clinical and/or research settings (e.g. utilising non-standardised objects/spaces within people's homes and freely available videoconferencing software on any suitable device). As such, there was variation in equipment and environments used (e.g. chair height for 30-s chair stand in-person and via telehealth differed by a mean of 3 cm, step height for the step test differed by a mean of 2.5 cm, and the number of steps in stair climb test differed by a mean of 1.3 steps (Appendix 4)), and, as participants set up each test themselves, there was likely some imprecision in distances measured, which all likely contributed to variability in test scores. In contrast, most other studies conducted telehealth assessments using standardised equipment,29,40,44–46 conducted tests in a clinical setting (rather than the patient's home) while the assessor was adjacent in another room,29,40,44,45 and/or used sophisticated videoconferencing with in-built measurement tools.29,40 This likely reduced the variability of their test scores but also limits the usefulness and generalisability of their findings. Indeed, one study 39 in healthy older adults utilised a similar pragmatic telehealth approach to ours and observed similar ICCs to ours (ranging from 0.79 to 0.87, with 95% CI lower limits all below our pre-determined acceptable threshold of 0.7) for the timed up and go and tests of balance/gait. Collectively, this suggests that clinicians or researchers who are considering pragmatically assessing performance via telehealth should be aware that there may be increased variability between telehealth test scores and reduced agreement with in-person tests.
Participant satisfaction with our telehealth testing sessions was high, indicating people with chronic lower limb musculoskeletal conditions feel confident, comfortable, and safe performing tests via telehealth, and that test requirements were not perceived as difficult to complete. However, telehealth did present some challenges. Participants' home environments often lacked appropriate space or stairs which meant that some tests could not be performed (e.g. stair climb and 5-m walk test). Although no adverse events were reported, some participants (2–11% of participants) were unable to complete some of the single-leg tests on account of safety/balance concerns. Finally, our first telehealth assessment sessions took approximately 10 min longer than the in-person assessment sessions, which has implications for the viability of administering the full suite of our performance-based tests via telehealth in some healthcare settings (e.g. private physiotherapy clinics that may have limited consultation time). This was likely because physiotherapists were required to guide the participant through the set up and procedure for each test, as well as instruct the participant on the necessary camera angles to ensure they had an appropriate view. However, our second telehealth session was shorter, and of a similar duration to the in-person session, suggesting that experience can help reduce the time required. Our physiotherapists followed a detailed testing manual (freely available online 34 ), which was vital to help them adapt each test to a telehealth setting and instruct patients from afar.
Our study had limitations. Our sample comprised mostly women (70%) who were well-educated (82% had completed a university or tertiary degree or higher) and who were physically active four or more times per week (58%). We also excluded those who were at risk of falls or who did not pass pre-exercise screening. As such, our findings may not be generalisable to men, people with lower levels of education, those who do not engage in regular physical activity, or those with balance issues or other health conditions that may affect their ability to exercise safely. Although we did not collect any data about whether participants sought professional care for their musculoskeletal condition between testing sessions, we assessed changes in clinical status between test sessions to ensure only those whose condition was stable across sessions were included.
In conclusion, the stair climb, timed up and go, right leg timed single-leg stance, and calf raise tests have acceptable reliability for use via telehealth in research and clinical practice. If re-testing via a different mode (telehealth/in-person), clinicians and researchers should consider using the 30-s chair stand test, left leg timed single-leg stance, and calf raise tests.
Abbreviations
ICC = intraclass correlation coefficient
95% CI = 95% confidence interval
Footnotes
Acknowledgements
The author(s) would like to acknowledge Mr Alex Kimp for his assistance in recruiting and phone screening participants.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This work was supported by Arthritis Australia. RSH is supported by a National Health & Medical Research Council Fellowship (#1154217) and KLB by an NHMRC Investigator grant (#1174431).
