Abstract
Karadöller and colleagues propose an interesting analysis of multimodality in spoken and signed language acquisition. In this commentary, we aim to extend the authors’ approach and abandon speech-centred and brainbound perspectives. By considering multimodality as a collage of multiple skills, in which abilities are acquired and exploited with new purposes, we will avoid integrating gestures and signs into pre-existent speech-centred models. This will enable us to move confidently into a future in which embodied dynamic interactions between skills and contexts are analysed in their ability to broaden the child’s world beyond the here and now and mould the language surge.
Karadöller and colleagues (2025) provide a purposeful review of studies highlighting the relevance of pointing and iconicity in children’s first language acquisition. Their endeavour is praiseworthy, as it allows us to better understand the central and diverse roles of multimodality in spoken and signed languages. In particular, this work interestingly shows how spoken and signed languages may share some common ground in their reliance on pointing and iconic components, while maintaining relevant differences in the ways in which these components are exploited (Boyes Braem & Volterra, 2023; Pizzuto et al., 2007). The target article also outlines a viable way in which multimodality may be effectively analysed in spoken and signed language acquisition. But on this point, we are compelled to move beyond the theoretical perspective proposed by the authors and seek a more radical perspective. In particular, we argue here that to fully embrace a multimodal approach to language in childhood, we need to abandon speech-centred and what have also been termed ‘brainbound’ approaches (Clark, 2010). This will involve considering other skills, beyond the visual-perceptual domain, while attempting to capture their dynamic relations and allowing a more embodied and embedded perspective on language and its unfolding in time and space (Capirci et al., 2022; Pouw et al., 2014).
First of all, we need to abandon implicit references to speech-centred theoretical approaches. A viable way is to avoid describing multimodal components playing a part in communicative acts only in their relation to speech-based structures. For example, in spoken languages, pointing is more than an ‘indicator’, ‘anticipator’ or ‘predictor’ of first words or a ‘scaffold’ for narrative recall (Karadöller et al., 2025). Pointing is a deictic gesture with its own developmental trajectory, requiring specific patterns of gaze coordination and motor planning (Masataka, 2003; Sparaci, 2013). Furthermore, while linked to patterns of joint attention and the presence of a referent in its immediate surroundings, pointing can prove extremely flexible, even when used in the absence of speech (e.g., think of the multiple uses of pointing for oneself). Along the same lines, iconic gestures are more than just tools to ‘compensate’ for the lack of verbal knowledge, lexical terminology or missing information; they should not be considered as mere acts ‘supporting’ communication of spatial relations and conceptually-challenging domains or ‘introducing’ character perspectives before speech (Karadöller et al., 2025). On the contrary, iconic gestures extend their roots into motor experiences with objects and actions in the real world; they are a reflection of children’s lived stories, carrying contents and coordination of symbolic skills that may be linked but never limited to speech-based structures (Sparaci & Gallagher, 2024; Sparaci & Volterra, 2017). Finally, iconicity in signed languages should not be considered as leading to an advantage in vocabulary development for signing children, but rather analysed in terms of semantic similarities in first words and signs between speaking and signing children (Volterra & Iverson, 1995; Rinaldi et al., 2014, 2018).
Overall, we need to abandon the tendency to consider multimodal components as the support band to speech, which ends up being the headliner, well consolidated on the central stage of language development. To do so, we are proposing to consider the role of multimodal components in themselves or in relation to skills other than speech, which may even be more closely related to them (e.g., consider the relation between iconic gestures and grasping) (Sparaci & Volterra, 2017). Therefore, rather than calling, as stated in the target article, for a ‘revision’ of current theories on language and cognitive development, we propose a bolder approach, aiming to de-structure current speech-centred theories.
In dethroning speech, we can start by considering the multiple and different dimensions that characterize multimodal communication, starting from early actions through gestures (Volterra et al., 2018). This implies moving beyond the purely visuo-perceptual domain, as justly underscored by Karadöller and colleagues (2025). In fact, vision is not nearly enough, and multiple components (e.g., auditory, motor, proprioceptive, postural) need to be considered, as well as the relations between them (Schroer & Yu, 2023). It is important to recognize that these components may influence language to various degrees. For example, gross motor skills may have an indirect effect: as children acquire independent sitting or walking, they can free their hands and extend their peri-personal spaces, and this will in turn affect both gestures and vocabulary (Clearfield, 2011; Iverson, 2022; Karasik et al., 2011; West et al., 2019; Schroer & Yu, 2023; Slone et al., 2019). In other cases, the relation is more direct and embodied. For example, during communicative acts, we directly experience the physical relation, or entanglement, between body motions, respiration, and vocal activities (Pouw & Fuchs, 2022). Both indirect and direct influences play a role in the here and now of the communication flow, which requires the dynamic interrelation of multiple skills.
In a way, by de-structuring language, we gain new tools, partially shared with dynamic systems approaches: a major emphasis on time (i.e., language unfolds in the here and now and each communicative step is influenced by the ones that preceded it and moulds the ones that follow), the concept of language as multiply determined and softly assembled from the non-linear coordination of multiple skills (e.g., auditory, visual, motor and proprioceptive feedback requires constant flexible adjustments by participants in a communicative exchange) and an emphasis on cognition as essentially embodied and embedded (i.e., no communication can effectively be carried out in a void, but everything occurs in a situated context) (Spencer et al., 2006).
Mentioning contexts leads to our second and final argument: in order to overcome speech-centred theorizing, we should also start abandoning an exclusively brainbound perspective on language, in favour of more extended and situated models (Clark, 2010). Language acquisition is always influenced by extraorganismic environments (i.e., the different social, cultural and material contexts in which it is embedded): in fact, it is an activity-in-the-world (Bonsignori & Sparaci, 2022; Capirci et al., 2022). These contexts are not mere backstages or scaffolds but actively contribute towards shaping multimodal communication (Clark, 2011). For example, during dyadic interactions with toddlers, adults have been observed to exploit modifications in tone and in the use of emotional expressions, to signal the presence of pretend scenarios (Lillard & Witherington, 2004; Nishida & Lillard, 2007). Mothers may change pitch or smile more frequently when pretending to eat than when really eating (Lillard & Witherington, 2004). This is an example of how a specific communicative context may require modulating multimodal cues (e.g., accompanying a specific gesture or action with a smile or a higher tone of voice) to enhance communicative intent. To fully capture how contexts may mould communicative structures, we need, therefore, to accompany our focus on the communicative actors in the here and now with broader considerations of the social, cultural and material world of people and things surrounding the child. This will also allow a better, non-WEIRD understanding of cross-linguistic differences (Henrich, Heine & Norenzayan, 2010; Sparaci & Gallagher, 2023).
Once we move beyond speech-centred and brainbound approaches, we can start grasping functional relations within multiple skills as structuring a communicative act. While a full exploration of this topic is beyond the scope of the present commentary, we can attempt to provide an example of how this may be done. Consider the case in which a child describes the spatial relation of objects using only speech or alternatively using speech and gestures, as mentioned in Karadöller and colleagues (2025). These two forms of communication are both valid and effective, but their use, according to what we stated above, will largely depend on communicative contexts. Ruth Millikan once argued that certain performative acts are only directive (representing what is to be done), while others are both directive AND descriptive (also describing what is the case) (Millikan, 1995). But the choice among them is moulded by the specific contexts or circumstances in which they are used: a bid may sometimes be made by explicitly stating “I bid” in some contexts or by simply raising a finger in others (Millikan, 1995, p. 195). So, an important question, to avoid a brainbound model, may involve exploring what contexts may lead a child to use speech-gesture combinations rather than relying exclusively on speech. Then we can move on to analyse not only spatial and temporal characteristics of the gestures and speech used, but also other multimodal components (e.g., motor planning, object affordances, proprioceptive space, posture) and how they interact dynamically in time and are soft-programmed to allow embodied communicative acts which adapt to constant changes in the communication flow.
In conclusion, Karadöller and colleagues’ (2025) work has the great merit of setting our compass towards taking seriously the inherently multimodal nature of child communication. But to fully capture the importance of multimodality in language acquisition, we need to wander further away from the beaten path, possibly abandoning speech-centred and brainbound models, which still characterize the literature on multimodality in language acquisition.
Footnotes
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
