Abstract
Solving tasks with others is fundamental in our daily life and requires coordinating actions with other agents in time and space. To manage such real-time interactions, humans must deal with uncertainty caused by noise and delays in sensory and motor signals. One mechanism the sensorimotor system may employ to reduce uncertainty is exploiting information from multiple sensory systems. Here, we review empirical studies examining how visual, auditory, and haptic information contribute to joint actions. A systematic search following PRISMA guidelines yielded 24 eligible studies, which we classified according to the taxonomies by Knoblich et al. (2011) – emergent vs planned coordination – and Jarrassé et al. (2012) – co-activity, cooperation, and collaboration. Across emergent and planned coordination, access to multiple sensory channels generally enhanced interpersonal coordination. The review provides indications that the weighting of sensory signals depends on their reliability and task relevance. However, studies directly testing integration principles are rare, and learning in the context of multisensory integration in joint action remains unexplored. We argue that experimentally testing multisensory integration mechanisms in joint actions and investigating training-related changes offers valuable avenues for further research, advancing theoretical understanding and practical applications across domains such as sports, rehabilitation, and human–robot interaction.
Introduction
Our daily lives are filled with interactions with other people, such as navigating a crowded street, dancing Argentine tango with a partner, or passing a baton to a teammate in a relay race. These and many other sensorimotor interactions are joint actions, described as ‘any form of social interaction whereby two or more individuals coordinate their actions in space and time to bring about a change in the environment’ (Sebanz et al., 2006, p. 70). As illustrated by these examples, there are many different types of joint tasks with different characteristics.
One distinction that is regularly made in the study of joint actions (e.g., Felsberg & Rhea, 2021) is whether the coordination is instructed (or intentional) (e.g., Noy et al., 2017) or spontaneous (or unintentional) (e.g., Richardson et al., 2005). Coordination is called instructed when the agents involved in the joint task are explicitly advised to coordinate with each other. This is, for example, the case during handover tasks, when one participant is supposed to pass an object to another person (e.g., Brand et al., 2022). In contrast, coordination is called spontaneous when agents are required to perform a task together without being explicitly instructed to coordinate, and coordination patterns emerge unintentionally. This emergent phenomenon has often been studied in walking, in particular with a focus on whether two people walking next to each other fall into a pattern of synchronised steps as a function of various manipulated factors (for a review, see Felsberg & Rhea, 2021).
Another – albeit closely related – classification of joint tasks was proposed by Knoblich et al. (2011), who distinguished between emergent and planned coordination (Figure 1). While emergent coordination occurs when several individuals produce coordinated movements due to perception–action couplings without sharing the goal to do so, planned coordination occurs when multiple individuals act according to a joint goal and internally represent their own role in achieving this goal. This classification is permeable insofar as emergent and planned coordination may occur during the same joint task. For example, when participants are instructed to pass an object to somebody else, this is planned coordination insofar as the agents know they must pass or receive the object, respectively. However, for a successful handover action, the passer and receiver must have their hands in the right place at the right time; hence, they need to estimate the unfolding situation in real time and adapt their behaviour accordingly, which might be solely achieved through emergent coordination. Therefore, these two types of coordination, although distinct, should be considered complementary because they typically occur in all joint tasks (Knoblich et al., 2011). However, since planned and emergent coordination generally contribute to varying degrees to the achievement of a task, Knoblich et al.’s (2011) distinction can be used to categorise joint actions according to which type of coordination dominates.
Figure 1. Taxonomies for joint actions based on the type of coordination (adapted from Knoblich et al., 2011) and the type of joint behaviour (adapted from Jarrassé et al., 2012), with the categories used in this review highlighted in black.
Finally, Jarrassé et al. (2012) proposed a more fine-grained classification of joint tasks from a computational perspective (Figure 1). In their taxonomy, a task can first be characterised as either divisible or interactive. In a divisible task, each agent involved can complete its subtask independently of their partner(s) and thus requires no further information about them. For example, when two people are painting a wall together, they do not need to coordinate their movements or constantly know where the other person is currently working to complete their part of the painting. In this case, the agents’ behaviour is described as co-activity. Conversely, a joint task is called interactive when each agent depends on another agent to accomplish the task. For example, moving a sofa to a desired new location naturally takes at least two people who must continuously coordinate their movements. Interactive tasks can appear either as competition, namely, when agents act antagonistically, or, in the case of agonistic behaviour, as collaboration or cooperation, depending on whether the agents play the same or different roles in achieving the task goal. While the case of several people moving a sofa is a good example of collaboration, cooperation takes the form of assistance when another person helps to move objects out of the way or education if another person advises on how best to lift a heavy object, such as a sofa.
In summary, the taxonomy introduced by Jarrassé et al. (2012) offers an alternative way of classifying joint actions, this time with a focus on the type of joint behaviour, which can be classified as co-activity, cooperation, or collaboration. In conjunction with the distinction between different coordination types proposed by Knoblich et al. (2011), namely, emergent vs planned coordination, the categorisation schemes outlined above help to structure existing empirical research on joint action.
Beyond the categorisation issue, it should be considered that different types of joint action tasks place different sensorimotor demands on the agents to achieve successful interpersonal coordination. In a recent review on embodiment research, Maselli et al. (2025, p. 10) highlight that addressing the sensorimotor mechanisms allowing interpersonal coordination in joint actions is one of four key challenges in the field, stating that ‘in psychology and neuroscience, there is a long tradition of studying joint action and dyadic interaction in situated settings … However, few studies focus on sensorimotor aspects of joint decision dynamics’. Accordingly, the question arises of how, exactly, these sensorimotor aspects can and should be taken into account in the context of research on joint actions.
From a sensorimotor perspective, a fundamental challenge of joint actions is that actors must coordinate their movements with co-actors under inherent uncertainty (Leibfried et al., 2015; Pezzulo et al., 2013; Russo et al., 2025) arising from, amongst other factors, noise and delays in sensory and motor signals (Faisal et al., 2008; van Beers et al., 2002). In this context, Beck et al. (2023) outline five sensorimotor mechanisms to handle uncertainty in complex situations: multisensory integration (Ernst & Banks, 2002), prior knowledge integration (Körding & Wolpert, 2004), risk optimisation (Trommershäuser et al., 2003), redundancy exploitation (Scholz & Schöner, 1999; Todorov & Jordan, 2002), and impedance control (Burdet et al., 2001). Both multisensory integration and prior knowledge integration refer to the optimal merging of different sources of information based on Bayesian principles to obtain a more robust estimate of the current situation (Körding & Wolpert, 2006). Under such circumstances, all available sensory information, as well as prior knowledge, is combined and integrated according to its reliability. The most likely estimate of our environment obtained in this way includes the state of our own body and, particularly relevant in the present context, the people present with whom we need to accomplish the joint task. Following the same principle, multisensory integration entails combining information from multiple sensory modalities (e.g., visual, auditory, haptic), each of which may provide complementary or redundant cues about task-relevant properties of the environment (Ernst & Banks, 2002; Ernst & Bülthoff, 2004; Stein & Meredith, 1993). Integrating information across modalities is thus a key mechanism for reducing uncertainty in sensorimotor control (Franklin & Wolpert, 2011). Therefore, it appears to be of great importance for successful interpersonal coordination.
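As an illustration of what combining information "according to its reliability" means formally, the standard maximum likelihood formulation for two cues (as in Ernst & Banks, 2002) can be sketched as follows. The symbols are introduced here for illustration only and assume independent, unbiased single-cue estimates with Gaussian noise: $\hat{s}_V$ and $\hat{s}_A$ denote, for example, a visual and an auditory estimate of the same quantity, with variances $\sigma_V^2$ and $\sigma_A^2$.

```latex
\hat{s}_{VA} = w_V \hat{s}_V + w_A \hat{s}_A,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_A^2},
\qquad
w_A = 1 - w_V,
\qquad
\sigma_{VA}^2 = \frac{\sigma_V^2 \, \sigma_A^2}{\sigma_V^2 + \sigma_A^2}
\le \min\left(\sigma_V^2, \sigma_A^2\right)
```

Under these assumptions, the combined variance is never larger than that of the more reliable single cue, which is why access to several sensory channels is expected to reduce uncertainty.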
However, while there is a rich body of research on multisensory integration for perception, on the one hand (e.g., Alais et al., 2010; Calvert et al., 2004; Ernst & Bülthoff, 2004; Stein et al., 2020), and on joint action, on the other hand (e.g., Pezzulo et al., 2026; Schmidt & Richardson, 2008; Sebanz et al., 2006; Sebanz & Knoblich, 2021), there is, to the best of our knowledge, no review focusing on how humans use information from multiple sensory systems in joint action tasks. Thus, this review aims to address this gap and respond to the call by Maselli et al. (2025) to establish a basis for future research on joint actions with a focus on the underlying sensorimotor mechanisms that deal with uncertainty in complex joint action tasks. We thus provide an overview of all peer-reviewed studies that have investigated the role of multiple sensory systems (visual, auditory, haptic) in joint action. To structure the findings, we group the studies by different types of coordination (Knoblich et al., 2011) and types of joint behaviours (Jarrassé et al., 2012), enabling us to explore how the use of multisensory information may vary across different joint action tasks and to derive promising directions for future research. Doing so, we extend and complement previous reviews in the field, which focused on sensorimotor synchronisation (Repp & Su, 2013) or on spontaneous interpersonal synchronisation of gait (Felsberg & Rhea, 2021), by broadening the scope of joint action tasks while focusing on studies that explicitly examine the roles of multiple sensory systems.
Methods
This review was conducted according to the PRISMA Extension for Scoping Reviews guideline (Tricco et al., 2018). The checklist can be found in the supplementary material (Appendix A).
Inclusion and Exclusion Criteria
To be included, the articles had to report empirical data on a task performed by two or more participants together. At least one participant had to be a human, while their partner could be another human, an avatar, or a robot. All the tasks had to include body movements, from finger movements in laboratory settings to full-body movements in more complex conditions, such as engaging in sport or playing music. As one of the main foci of this review is the effect of multisensory integration on joint behaviour, the included articles had to examine at least two of the sensory systems that are of particular relevance to sensorimotor control, namely, the visual, auditory, or haptic systems (cf. Leib et al., 2023). This effect had to be investigated as an independent variable, meaning that the studies had to encompass at least two experimental conditions, for example, with vs without vision. As we focused on unrestricted human behaviour, participants had to be humans (i.e., no animal studies) and, more specifically, healthy (i.e., free from health restrictions) adults (i.e., between 18 and 65 years old). The reported results had to be quantitative and behavioural, meaning that qualitative studies or those exclusively focusing on neurophysiological data were not considered. Finally, all the included papers needed to contain original data, be written in English, and be published in peer-reviewed journals by the date of the last conducted search.
Identification and Screening
The last search was conducted on 5 March 2025. We searched six databases: PsycINFO, PubMed, ScienceDirect, Scopus, SPORTDiscus, and Web of Science. The following search strategy was used: ((joint AND action) OR (interpersonal AND coordination) OR entrainment OR synchron*) AND (multisensory OR visual OR auditory OR haptic OR tactile OR touch) AND (movement OR sport OR dance). The terms ‘sport’ and ‘dance’ were included to also capture studies involving gross-motor behaviours, which are often represented in the sports and performing arts literature rather than in experimental movement research. The search strategy specified above had to be slightly adapted for ScienceDirect because of its limited support for Boolean operators. Details of the specific search strategies in each of the six databases can be found in the supplementary material (Appendix B).
As shown in the PRISMA flow diagram (Figure 2), 10,068 records were initially identified and imported into Zotero. After 3,750 duplicates were removed, 6,318 papers remained for title screening. The titles were independently screened by two researchers, one being the first author. If at least one researcher did not identify any reasons for exclusion on the basis of the prespecified criteria at title level, the paper was included at the abstract screening stage. The two reviewers then independently screened the remaining 659 abstracts and, in case of disagreement, discussed which to keep for full-text screening. The full-text screening of 181 texts was performed by the first author only and led to the further exclusion of 158 items. Finally, one more article was found as a reference in another thematically related review (Kopnarski et al., 2023) and thus included. At the end of the screening process, a total of 24 papers remained in this review.
Figure 2. PRISMA flow diagram for the literature search.
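The record counts reported for the screening process can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers stated in the text; the variable names are illustrative and are not taken from the review.

```python
# Record counts as reported in the description of the screening process.
identified = 10_068            # records found across the six databases
duplicates_removed = 3_750
abstracts_screened = 659       # records remaining after title screening
full_texts_screened = 181
full_texts_excluded = 158
added_from_other_review = 1    # found via Kopnarski et al. (2023)

# Records entering title screening after de-duplication.
titles_screened = identified - duplicates_removed

# Records excluded at the title and abstract stages.
excluded_at_title = titles_screened - abstracts_screened
excluded_at_abstract = abstracts_screened - full_texts_screened

# Final number of included studies.
included = full_texts_screened - full_texts_excluded + added_from_other_review

print(f"Titles screened:      {titles_screened}")
print(f"Excluded at title:    {excluded_at_title}")
print(f"Excluded at abstract: {excluded_at_abstract}")
print(f"Included studies:     {included}")
```

The derived totals match the 6,318 titles and 24 included papers reported above.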
Results
General Overview
Table 1. Studies of Multisensory Integration in Joint Actions, Sorted by Type of Coordination, Type of Joint Behaviour, and Name of (First) Author and Characterised by Investigated Sensory Systems, Experimental Task, Study Design, and Main Findings
Regarding the type of coordination, Table 1 first shows that the categories of emergent vs planned coordination proposed by Knoblich et al. (2011) correspond perfectly with the more task-related distinction of spontaneous vs instructed coordination, which was also introduced above. Furthermore, the classifications developed by Knoblich et al. (2011) and Jarrassé et al. (2012) are highly congruent. This means that all articles classified as emergent (or spontaneous) coordination were also classified as co-activity (n = 11, 46%). In contrast, all articles falling into the category of planned (or instructed) behaviour qualified as encompassing either cooperation (n = 9, 37.5%) or collaboration (n = 4, 16.7%), representing 13 studies in total (54%).
In terms of the sensory systems studied, Table 1 shows that the overall largest research focus has been on investigating how participants use visual and auditory information (n = 15, 62.5%). Fewer studies have researched the contributions of the visual and haptic systems (n = 5, 21%), the auditory and haptic systems (n = 1, 4%), or the three sensory systems together (n = 3, 12.5%). The number of studies addressing different types of coordination, types of joint behaviour, and sensory systems is illustrated in Figure 3.
Figure 3. Number of studies by type of coordination (instructed/planned vs spontaneous/emergent), type of joint behaviour (co-activity vs cooperation vs collaboration), and the sensory systems investigated (vision vs audition vs haptics).
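The study counts per category and their shares of the 24 included studies can be recomputed directly. This is a minimal sketch based only on the counts reported in the text; the dictionary and function names are illustrative. Shares are computed to one decimal place, so they can differ slightly from the rounded percentages given in the running text.

```python
TOTAL_STUDIES = 24

# Counts by type of joint behaviour (Jarrassé et al., 2012), as reported above.
behaviour_counts = {
    "co-activity": 11,
    "cooperation": 9,
    "collaboration": 4,
}

# Counts by combination of sensory systems investigated, as reported above.
sensory_counts = {
    "visual + auditory": 15,
    "visual + haptic": 5,
    "auditory + haptic": 1,
    "visual + auditory + haptic": 3,
}


def shares(counts, total=TOTAL_STUDIES):
    """Return each count as a percentage of `total`, rounded to one decimal."""
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


for name, counts in [("Joint behaviour", behaviour_counts),
                     ("Sensory systems", sensory_counts)]:
    print(name)
    for label, pct in shares(counts).items():
        print(f"  {label}: n = {counts[label]}, {pct}%")
```

Both groupings sum to the 24 included studies, which serves as a quick consistency check on the reported categories.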
In the following sections, we present the studies grouped by the type of joint behaviour addressed, namely, co-activity, cooperation, or collaboration, the first behaviour reflecting emergent, spontaneous coordination and the latter two reflecting planned, instructed coordination. For each of these categories, we first summarise the primary characteristics of the conducted studies before reporting key findings, particularly in regard to the use of different sensory modalities for successfully performing joint actions.
Emergent, Spontaneous Coordination and Co-Activity
Studies classified as emergent, spontaneous coordination and co-activity (n = 11, 46%) cover various tasks from gross-motor skill performance, such as dancing (Bigand et al., 2024; Dotov et al., 2021), walking (Harrison & Richardson, 2009; Nessler & Gilliland, 2009; Zivotofsky et al., 2012), rocking chairs (Demos et al., 2012), and balancing (Miyata et al., 2021; Reynolds & Osler, 2014; Sofianidis et al., 2015), to fine-motor skill performance, such as hand (Richardson et al., 2005) or finger movements (Nowicki et al., 2013). All but one of these tasks were completed in pairs with another human, with the remaining study investigating dancing in a group (Dotov et al., 2021).
The majority of articles (n = 6) focused on the effect of visual and auditory information, and the sensory systems were mostly manipulated in an ‘on/off’ manner. For example, in the experiment by Harrison and Richardson (2009), two participants had to walk one behind the other while the use of visual and haptic information was manipulated. In one condition, the participants could see each other and were linked by a mechanical coupling, whilst in further conditions, they could either use visual information without additional haptic contact or vision was restricted, but the mechanical coupling was maintained.
In terms of the empirical findings obtained, visual information seems to be particularly important to elicit spontaneous coordination, as more than half of the included studies of emergent coordination agree on that point (Demos et al., 2012; Dotov et al., 2021; Harrison & Richardson, 2009; Miyata et al., 2021; Reynolds & Osler, 2014; Richardson et al., 2005). Consequently, if vision is removed, people lose the coordination they have acquired under full-vision conditions (Miyata et al., 2021). However, two studies found no effect of vision on synchronisation (Nessler & Gilliland, 2009; Nowicki et al., 2013), and Zivotofsky et al. (2012) reported that, for a side-by-side walking task, auditory and haptic information were more important for spontaneous gait synchronisation than vision.
Besides vision, auditory information proved to be important for emergent coordination (Demos et al., 2012; Miyata et al., 2021), but primarily in a specific form. For example, Nowicki et al. (2013) found that having access to one’s own auditory feedback improves synchronisation during a finger-tapping task, while hearing the auditory feedback coming from one’s partner decreased performance. Conversely, in a rocking chair task, receiving auditory information produced by the partner’s movements elicited stronger synchronisation in co-active behaviour, while music competed with the partner’s influence and reduced coordination (Demos et al., 2012). The effect of verbal interaction – which can be understood as a special kind of auditory information – is unclear, as one study found no influence (Richardson et al., 2005), while another revealed a facilitated coordination, but only when the partners could not see each other (Miyata et al., 2021).
Further, three studies showed that haptic information also contributes to emergent coordination (Harrison & Richardson, 2009; Nessler & Gilliland, 2009; Reynolds & Osler, 2014); for example, Nessler and Gilliland (2009) reported that pairs of participants in a side-by-side walking task were more coordinated when they were holding hands than when they were not.
Generally, combining multiple sources of sensory information enhanced coordination compared to enabling the use of only one sensory channel by restricting the others (Harrison & Richardson, 2009; Nessler & Gilliland, 2009). Notably, different sensory modalities seem to induce different, complementary aspects of coordination. For example, Bigand et al. (2024) investigated humans freely dancing in a ‘silent disco’ setting, manipulating both auditory (here: musical) and visual input. Both the auditory and visual information promoted synchrony. However, the music primarily elicited synchronisation in the anteroposterior direction, whereas vision of the partner mainly induced synchronisation in the lateral direction. In contrast, when visual and auditory information are redundant (e.g., both providing information about a step), one cue can substitute for the other: Nessler and Gilliland (2009) showed that when one sensory cue (i.e., vision) was absent, the participants were able to use a redundant source of information (i.e., sound) to generate synchrony that did not differ from the condition in which both vision and sound were unrestricted. Furthermore, studies show that the effect of one source of sensory information can be modulated by another. For example, haptic information elicited spontaneous coordination in a balance task, but this spontaneous coordination was impaired when there were frequency-related changes in the auditory information (Sofianidis et al., 2015). In the same vein, the effect of visual information was stronger or weaker depending on the ‘groove’ and tempo of musical-auditory information in a dance task (Dotov et al., 2021), and haptic information increased coordination when vision was restricted for one or both partners (Reynolds & Osler, 2014).
Planned, Instructed Coordination and Cooperation
Nine of the 24 articles (37.5%) included in the present review were classified as planned, instructed coordination and cooperation. They cover various tasks, such as playing music (Bishop & Goebl, 2015), dancing (Chauvigné et al., 2019), trampolining (Heinen et al., 2014), and walking (Khan et al., 2020; Noy et al., 2017), but primarily tasks that require hand movements (Colomer et al., 2022; Döhring et al., 2020; Hansen et al., 2017; Werner & Gorman, 2023). In one study, the participants were in a group to perform a dance task (Chauvigné et al., 2019); otherwise, the participants were investigated in pairs. In two walking studies, the participants had to synchronise their steps with an avatar (Khan et al., 2020; Noy et al., 2017), while in all the other publications, the participants had to synchronise with another human.
As in the co-activity studies, the sensory systems were primarily manipulated in an ‘on/off’ manner. However, the manipulation was more subtle in four studies, namely, in handover tasks, by adding thick gloves to reduce haptic information (Döhring et al., 2020) or varying the distance between passer and receiver as well as the weight of the object to be passed (Hansen et al., 2017), and in joint walking tasks, by manipulating the velocity of the leading avatar (Khan et al., 2020) or the congruency between the available visual and auditory information (Noy et al., 2017).
In regard to the use of vision, coordination was found to be better when participants could see their partners than when they could not (Chauvigné et al., 2019; Döhring et al., 2020). However, coordination decreased with increasing uncertainty in the available visual information, for instance, induced by an increased velocity of the leading avatar in joint walking (Khan et al., 2020) or by an increased distance between the participants in a dancing task (Chauvigné et al., 2019).
Several studies showed that planned coordination is better when auditory information is available than when it is not, be it music in a dance task (Chauvigné et al., 2019) or the sound of the partner’s footsteps in joint walking (Khan et al., 2020; Noy et al., 2017). For example, Khan et al. (2020) and Noy et al. (2017), who researched how participants synchronise their steps with those of a virtual avatar, showed that auditory information led to better performance than visual-only cues. However, not all tasks reveal an auditory advantage. In a remote driving task, where one participant (the driver) operated a car without seeing it, while another (the spotter) provided visual (hand gestures) or auditory (verbal instruction) cues, Werner and Gorman (2023) showed that the participants performed better with visual cues alone than with auditory cues alone. Studying piano duets, Bishop and Goebl (2015) showed that the effect of visual information (seeing the partner) depends on the access to auditory information (hearing the partner). When audio was available, removing vision had no effect on synchronisation. In contrast, when audio was absent, visual cues facilitated synchronisation, showing that visual information became important only when the reliable auditory cue was missing. In the same vein, Colomer et al. (2022) used a task in which pairs had to move a mobile slider with one hand each, based on either visual or auditory information, and found that the hearing partner took the lead when the auditory information had a frequency of 2.15 Hz or more, but not when the frequency was lower. This outcome can be understood as an effect of varying degrees of uncertainty regarding the available information.
Three studies investigated the role of haptic information in cooperative tasks (Chauvigné et al., 2019; Döhring et al., 2020; Hansen et al., 2017). Using a handover task, Döhring et al. (2020) showed that the quality of haptic information affects coordination because the passer’s grip forces increased, while those of the receiver decreased, when the passer wore thick gloves. Furthermore, Hansen et al. (2017) reported that the duration of the handover increased with the weight of the object to be passed. Finally, coordination in a dance task was better when partners were allowed to hold hands than when they could not (Chauvigné et al., 2019).
Surprisingly, few studies directly addressed the integration mechanism used to combine information from multiple senses. A notable exception is the study by Noy et al. (2017), who tested the maximum likelihood estimation model (e.g., Ernst & Banks, 2002). In their joint walking task, the researchers experimentally created a conflict between visual and auditory information and showed that the participants relied more on auditory signals. As a result of the incongruency between the sources of information, coordination between the participant and the avatar decreased. However, in studies where sensory information from different channels was congruent, coordination quality was generally better when both visual and auditory information about the partner was available (Heinen et al., 2014; Khan et al., 2020).
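To make the logic of such a cue-conflict test concrete, the following minimal sketch computes what maximum likelihood estimation predicts when two cues disagree. It is a generic illustration under the usual assumptions (independent cues with Gaussian noise), not the analysis performed by Noy et al. (2017). The function name, the cue values, and the noise levels are all hypothetical and chosen only for illustration.

```python
def mle_combine(estimates, sigmas):
    """Reliability-weighted average of independent Gaussian cues.

    Each cue is weighted by its reliability (inverse variance), which is
    the maximum likelihood estimate under the stated assumptions.
    Returns the combined estimate, its standard deviation, and the weights.
    """
    reliabilities = [1.0 / sigma ** 2 for sigma in sigmas]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    combined_sigma = (1.0 / total) ** 0.5
    return combined, combined_sigma, weights


# Hypothetical conflict: the visual cue signals the partner's footfall at
# 0 ms, the auditory cue at 100 ms. Noise levels are assumed values in which
# audition is the more precise timing cue.
visual_ms, auditory_ms = 0.0, 100.0
visual_sigma_ms, auditory_sigma_ms = 40.0, 20.0

combined_ms, combined_sigma_ms, (w_visual, w_auditory) = mle_combine(
    [visual_ms, auditory_ms], [visual_sigma_ms, auditory_sigma_ms]
)

print(f"Weights: visual = {w_visual:.2f}, auditory = {w_auditory:.2f}")
print(f"Combined estimate: {combined_ms:.1f} ms (SD {combined_sigma_ms:.1f} ms)")
```

With these assumed noise levels, the auditory cue receives a weight of 0.8, so the predicted estimate lies much closer to the auditory than to the visual signal. Comparing such predictions against observed behaviour under conflict is the kind of test that distinguishes reliability-based weighting from a fixed preference for one modality.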
Planned, Instructed Coordination and Collaboration
Planned, instructed collaboration – which, as a reminder, differs from cooperation in that two or more agents play the same role rather than different roles in achieving a task goal (Figure 1) – is the least represented category in this review, as only four articles address it. Two of the researched tasks involve hand movements (Hessels et al., 2023; Mojtahedi et al., 2017), one finger movements (Masumoto & Inui, 2014), and one playing music (Liebermann-Jordanidis et al., 2021). All these tasks were carried out in pairs with another human.
Notably, the manipulation of the available sensory information was more diverse than in the other categories, since, beyond a mere ‘on/off’ manipulation, studies experimentally varied the congruency of the information (Liebermann-Jordanidis et al., 2021) or the type of information available in the same sensory systems by comparing conditions in which the participants were seated either face to face or side by side (Mojtahedi et al., 2017).
Regarding visual information in planned, instructed collaboration, Hessels et al. (2023) reported that pairs were faster to complete a puzzle when all pieces were visible than when some were hidden. Moreover, in a concurrently assigned joint force-production task, the participants were closer to the target force with access to visual feedback than without it (Masumoto & Inui, 2014). Finally, two people were better coordinated in jointly lifting an object when they were positioned side by side than when they were positioned face to face (Mojtahedi et al., 2017).
With respect to auditory information, Hessels et al. (2023) found no effect of verbal interactions on coordination in the puzzle task, while Liebermann-Jordanidis et al. (2021) reported that the coordination between partners in joint music playing was better when they were required to produce different sounds rather than the same sound.
Only one study investigated the role of haptic information on coordination in collaborative tasks, namely, Mojtahedi et al. (2017), who found that various combinations of dominant and non-dominant hands of both partners do not affect the quality of their coordination in joint object lifting.
As only four articles about collaboration were included in this review, insights on multisensory integration remain limited within this category. Nevertheless, Masumoto and Inui (2014) showed that coordination in an isometric force-production task decreases under verbal dual-task conditions if visual feedback is available, but not if visual feedback is withdrawn. However, Hessels et al. (2023) found that talking with each other does not influence joint performance on the puzzle task, whether visual information is fully accessible or not.
Discussion
This review examined how humans utilise multiple sensory modalities in joint action tasks. In total, we identified 24 empirical studies addressing this question. To structure the findings, we used the taxonomies proposed by Knoblich et al. (2011) and Jarrassé et al. (2012), and the combination of these two classifications proved to be a useful framework to relate empirical findings to each other.
First, the review shows that the classifications by Knoblich et al. (2011) and Jarrassé et al. (2012) are highly congruent. All the articles that addressed emergent (or spontaneous) coordination were also classified as encompassing co-activity as a joint behaviour type (n = 11, 46%). All the studies addressing planned (or instructed) coordination fell into the category of either cooperation (n = 9, 37.5%) or collaboration (n = 4, 16.7%). In terms of sensory systems, research has principally investigated how participants use visual and auditory information (n = 15, 62.5%), while fewer studies have researched the contributions of the visual and haptic systems (n = 5, 21%), the auditory and haptic systems (n = 1, 4%), or the three sensory systems together (n = 3, 12.5%).
Generally, we found that the availability of multiple sensory cues enhances both emergent and planned coordination across joint action behaviour types (co-activity, collaboration, cooperation). For spontaneous coordination phenomena investigated in co-activity tasks, it was found that different sensory cues can elicit different aspects of coordination (Bigand et al., 2024; Miyata et al., 2021). For example, Bigand et al. (2024) reported that music (i.e., auditory information) primarily elicited synchronisation in the anteroposterior direction, while vision of the partner mainly induced synchronisation in the lateral direction, suggesting a complementary function of different sensory cues to synchronise to different aspects of the task.
Aligning with this idea, an overview of the studies suggests that, for both emergent and planned coordination, the relative importance of each sensory modality (visual, auditory, haptic) depends on the demands of the task. In tasks with high timing demands (Bishop & Goebl, 2015; Khan et al., 2020; Nowicki et al., 2013; Noy et al., 2017; Zivotofsky et al., 2012), such as synchronising steps to those of a partner, coordination depends more on audition than on vision. This so-called auditory dominance effect is well documented in the literature (Burr et al., 2009; Repp & Penel, 2002) and can be explained by the higher temporal resolution of the auditory system (Kandel et al., 2013). In contrast, in tasks with high demands for spatial localisation, such as navigating a car through obstacles (Werner & Gorman, 2023), visual information is more important than auditory information for coordinating joint actions. However, while this task-dependent weighting of sensory information provides a principled hypothesis, joint-action experiments directly testing this idea remain lacking.
For planned coordination tasks, a range of studies shows that the effect of sensory cues is modulated by the availability or reliability of another sensory source (Bishop & Goebl, 2015; Colomer et al., 2022; Döhring et al., 2020; Masumoto & Inui, 2014; Noy et al., 2017; Werner & Gorman, 2023). For example, in a piano duet, Bishop and Goebl (2015) showed that the effect of visual cues (seeing the partner) depends on the availability of the auditory cue (hearing the primo performer). When the primo audio was available, removing visual contact did not impair synchronisation. In contrast, when the primo audio was absent, visual cues facilitated synchronisation at critical moments, such as following long pauses, demonstrating that visual information became important only when the reliable auditory cue was missing. These findings are in line with the Bayesian idea that the sensorimotor system combines information from multiple sensory modalities and weights them according to their relative reliability (Ernst & Banks, 2002). However, only one of the reviewed studies, Noy et al. (2017), directly investigated the integration mechanism, namely, maximum likelihood estimation.
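For reference, the reliability-based weighting referred to above is commonly written as follows for independent cues with Gaussian noise. This is the standard textbook formulation, not a result of the reviewed studies; the symbols (single-cue estimates \hat{s}_i, their noise variances \sigma_i^2, and weights w_i) are introduced here for illustration only.

```latex
\hat{s} = \sum_i w_i \, \hat{s}_i ,
\qquad
w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2} ,
\qquad
\sigma_{\hat{s}}^2 = \frac{1}{\sum_j 1/\sigma_j^2} \;\le\; \min_i \sigma_i^2
```

Under these assumptions, the more reliable cue (smaller variance) receives the larger weight, and the combined estimate is never less precise than the best single cue.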
By examining a broader range of coordination types (emergent, planned) and joint action behaviours (co-activity, cooperation, collaboration), the current review complements a previous review on sensorimotor synchronisation by Repp and Su (2013). While their review also included synchronisation tasks beyond interpersonal coordination (e.g., coordination with a metronome), the authors only covered synchronisation tasks. Another related review by Felsberg and Rhea (2021) more narrowly focused on spontaneous interpersonal coordination during gait and pointed out that the role of each sensory system in spontaneous coordination remains unclear. By including a broader range of joint action tasks while more narrowly focusing on multisensory integration, our review identifies patterns in the findings and suggests hypotheses about general principles of multisensory integration in joint actions that can be tested in future studies. In this regard, based on the findings of our review as well as the current literature, we see three promising avenues for future research: (i) investigating multisensory integration in joint action tasks in other classification combinations, (ii) experimentally testing mechanisms of multisensory integration, and (iii) studying the learning processes by comparing interventions to improve multisensory integration in joint action tasks.
(i) Our results surprisingly show that emergent, spontaneous coordination has been exclusively studied in co-activity tasks, and planned, instructed coordination has been exclusively studied in either cooperation or collaboration. However, since only studies of joint tasks with a motor component and with an experimental manipulation of at least two sensory systems were considered in this review, this finding does not rule out the possibility of investigating multisensory integration in joint actions with further classification combinations. For example, Richardson et al. (2007) asked their participants to pick up wooden planks of different sizes either alone or with someone else, which addresses emergent, spontaneous coordination, but as collaboration instead of co-activity. The question of how participants use multiple sensory inputs to solve this joint task could easily be incorporated into this study design. Therefore, further research on multisensory integration in joint action with alternative classification combinations is desirable.
(ii) By selectively suppressing one sensory channel, current studies provide indications of the importance of particular sensory modalities to perform joint action tasks. Yet only one study included in our review (Noy et al., 2017) aimed to test the principles under which information from multiple sensory modalities is integrated. To test those integration mechanisms experimentally, an elegant technique – primarily used in perceptual studies to date – is to create incongruencies between sensory inputs and observe participants’ bias towards one or the other cue as a function of different conditions (Trommershäuser et al., 2011). This logic has been widely applied in size estimation (e.g., Ernst & Banks, 2002), localisation tasks (e.g., Alais & Burr, 2004), and more recently, sensorimotor synchronisation (e.g., Elliott et al., 2010), and is thus readily applicable to scenarios of interpersonal synchronisation as well. Using this approach, testing two hypotheses would particularly advance our functional understanding of multisensory integration in joint actions. First, to test the Bayesian principle that multiple signals are weighted according to their relative reliability (Ernst & Banks, 2002), researchers could experimentally create conflicts between two or more sensory cues and manipulate their respective reliability. If participants integrate the signals in a reliability-based manner, their estimate should shift towards the more reliable source. Second, to test the hypothesis that the sensorimotor system weights signals according to their task relevance, a promising avenue would be to design joint action tasks in which temporal and spatial demands can be manipulated independently of each other. In this case, a task-relevance integration mechanism would predict that auditory information should be weighted higher if the temporal demands of the task increase, while visual information should be weighted higher if the spatial demands increase.
(iii) Strikingly, none of the articles included in this review focused on performance changes and thus learning in joint actions. While learning processes have been addressed in the fields of joint action (e.g., Knoblich et al., 2011) and multisensory integration (e.g., O’Brien et al., 2023) in isolation, the question of how multisensory integration in joint action tasks changes through practice remains unexplored. Thus, addressing learning appears a highly promising avenue for future research both to improve our theoretical understanding of joint actions and to substantiate applications in relevant practical fields, such as sports, dance, rehabilitation, and human–robot interaction (e.g., Chen et al., 2015).
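The reliability-based prediction described under (ii) can be illustrated with a minimal sketch. It assumes independent Gaussian cues combined by maximum likelihood estimation; the function name `mle_combine`, the 40 ms cue conflict, and the noise levels are hypothetical values chosen for illustration and are not taken from any of the reviewed studies.

```python
def mle_combine(estimates, sigmas):
    """Reliability-weighted combination of independent Gaussian cues.

    estimates: single-cue estimates (e.g., perceived event onsets in ms)
    sigmas:    standard deviations of the noise on each cue
    Returns the combined estimate, its standard deviation, and the cue weights.
    """
    reliabilities = [1.0 / s ** 2 for s in sigmas]   # reliability = inverse variance
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]     # weights sum to 1
    combined = sum(w * x for w, x in zip(weights, estimates))
    combined_sigma = (1.0 / total) ** 0.5            # never larger than the smallest sigma
    return combined, combined_sigma, weights


# Hypothetical cue conflict: the auditory cue signals an onset at 0 ms,
# the visual cue signals the same event at 40 ms.
auditory_ms, visual_ms = 0.0, 40.0

# Manipulate relative reliability (assumed noise levels, in ms).
for sigma_a, sigma_v in [(10.0, 30.0), (20.0, 20.0), (30.0, 10.0)]:
    estimate, sd, (w_a, w_v) = mle_combine([auditory_ms, visual_ms],
                                           [sigma_a, sigma_v])
    print(f"sigma_a={sigma_a:>4}, sigma_v={sigma_v:>4} -> "
          f"w_a={w_a:.2f}, w_v={w_v:.2f}, estimate={estimate:.1f} ms, sd={sd:.1f} ms")
```

In this sketch, the predicted estimate lies close to the auditory cue when audition is the less noisy channel and moves towards the visual cue as the visual noise is reduced. This is the pattern a cue-conflict experiment would look for if integration is reliability-based.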
In summary, this review integrates and organises the growing body of research on how humans exploit multiple sensory modalities in joint actions. We propose that combining the classifications from Knoblich et al. (2011) and Jarrassé et al. (2012) creates a useful framework to structure previous and future research in this field. Current research covers emergent and planned coordination phenomena in a wide range of tasks. While the body of research provides indications that the reliability and task-relevance of sensory information might drive their use in joint action tasks, studies that systematically test integration principles are needed. We suggest that testing these integration mechanisms and studying how multisensory integration is learned during joint actions are highly relevant avenues for further research, both in terms of fundamental understanding and applied consideration across fields.
Supplemental Material
Supplemental Material - Multisensory Contributions in Joint Actions: A Scoping Review
Supplemental Material for Multisensory Contributions in Joint Actions: A Scoping Review by Mathilde Truffer and Stephan Zahno in Perceptual and Motor Skills.
Footnotes
Acknowledgements
The authors thank Ellen Straalman for her help with the screening process, Damian Beck for his advice on conducting a systematic review, and Ernst-Joachim Hossner for his valuable contributions to the design and presentation of this research.
Ethical Considerations
No approval of research ethics committees was required as this is a review of existing literature.
Author Contributions
Conceptualisation: MT & SZ; Protocol development: MT & SZ; Systematic literature search: MT; Data extraction and synthesis: MT; Writing: MT & SZ.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The authors confirm that all data are available within the article or the supplementary materials. Data extraction materials are available upon request from the corresponding author. No preregistration was undertaken for this review.
Supplemental Material
Supplemental material for this article is available online.
