Abstract
The widespread availability of generative artificial intelligence tools has had a dramatic effect on language education, transforming how we think about and go about writing. This includes ‘multimodal composing’, which combines moving image, soundtrack, speech, and on-screen writing in digital communication. In this article, we focus on the potentials and pitfalls of generative artificial intelligence in a pedagogy of digital multimodal composing. The article aims to explore how the affordances and constraints of generative artificial intelligence tools powerfully shape the multimodal composing process in a way that enables or constrains individual agency, creativity, and criticality; and conceptualize a model of best practice for the use of generative artificial intelligence in digital multimodal composing. As language and literacy educators, we report the findings of a practitioner inquiry study into our own pedagogically motivated digital multimodal composing practices, using generative artificial intelligence tools to collaboratively complete a digital multimodal composing project (a digital video scientific documentary) designed for second language learners of English for science. We recorded this process by keeping copies of input and output as well as maintaining regular research journals. The analysis identified five themes relevant to agency, creativity and criticality in digital multimodal composing, namely: (a) feelings; (b) relating to generative artificial intelligence: the iterative, collaborative loop; (c) ideological positioning; (d) modal interactions: process and product; and (e) voice and authenticity. 
We suggest an integrated critical digital literacies model to inform digital multimodal composing pedagogy that engages with generative artificial intelligence at three key levels, extending previous work by highlighting the role of affect: a macro level focusing on feelings, materialities, and ideology; a meso level focused on the design of a collaborative, iterative loop in generative artificial intelligence digital multimodal composing workflow; and a micro level focused on interaction with generative artificial intelligence.
1. Introduction
[A]ttending to all relevant concerns while simultaneously giving students access and exposure to AI [artificial intelligence] can feel like an impossible line to walk. But as with most thorny topics, progress may be made in wrestling with these sometimes-conflicting impulses. When we ask students and ourselves to examine the risks and benefits, affordances and restrictions, fears and hopes involved in composing with AI, we might be able to find a way forward that is ethical, equitable, and expansive. (Smith et al., 2025, pp. 12–3)
This article examines the potentials and pitfalls of the use of generative artificial intelligence (GenAI) for the purposes of digital multimodal composing (DMC), a pedagogical approach involving ‘activities that engage learners in the use of digital tools to construct texts in multiple semiotic modes, including writing, image, and sound (to name a few)’ (Hafner, 2015, p. 487). In doing so, it addresses not only the potential affordances of generative AI tools to promote multimodal composing but also potential affordances to promote composing more generally. Furthermore, as the quote above suggests, it also engages with a plethora of related pedagogical and ethical issues. The introduction and widespread availability of GenAI tools have had a dramatic effect on how people can now go about processes of reading and writing, and this necessarily raises questions about how those processes can and should be taught. At the same time, these tools present myriad challenges for language teachers and learners, who must somehow balance ‘risks and benefits, affordances and restrictions, fears and hopes’, in the words of Smith et al. (2025). In this article, we draw on practitioner inquiry in order to understand this balancing act and propose a systematic way forward, with particular reference to creative DMC projects, an under-researched area. In particular, we are interested in: (a) exploring how the affordances and constraints of GenAI tools powerfully shape the multimodal composing process in a way that enables or constrains individual agency, creativity, and criticality; and (b) conceptualizing a model of best practice for the use of GenAI in DMC.
2. Theoretical Framework
The overarching theoretical framework for this research is derived from digital literacies scholarship, where digital literacies are defined as ‘practices of communicating, relating, thinking, and being associated with digital media’ (Jones & Hafner, 2021, p. 17). This framework is based on the social practice approach to literacy developed in the new literacy studies (Gee, 2015), which sees practices of reading and writing as fundamentally socially situated in identifiable ‘literacy events’ that engage readers and writers in negotiating meanings and identities (Barton, 2007). When such literacy events involve digital practices, those events take place across an online/offline nexus, as acknowledged by post-digital (Wang & Canagarajah, 2024) and socio-material (Darvin, 2025) approaches that understand online interactions as fundamentally situated in wider material and social contexts. Research on digital literacies emphasizes that mastering digital literacies is not a simple matter of developing technical skills alone, as exemplified by the definition provided above, which includes ways of thinking and being as important elements of digital literacy. Similarly, Lankshear and Knobel (2011) point out that digital literacies involve both ‘new technical stuff’ – the technical abilities demanded by digital literacies – and ‘new ethos stuff’ – the ways of thinking that are involved. This realization is also important to the study of GenAI literacy practices, which involve technical skills like ‘prompt engineering’ but also fundamentally challenge the way that we think about processes of reading and writing, and, on an even deeper level, how such processes implicate our own humanity (McGuire, 2023).
It may be useful to consider AI literacy as a particular kind of digital literacy, while also recognizing that the guiding principles identified below ‘are not exclusive to AI technologies’ but shared with critical digital literacies more generally (Darvin, 2025, p. 4). AI literacy in this sense can be defined as ‘a set of competencies that enable individuals to critically evaluate AI technologies; communicate and collaborate effectively with AI; and use AI as a tool online, at home, and in the workplace’ (Long & Magerko, 2020, p. 1). AI literacy can also be conceived of in terms of ‘knowledge and beliefs about artificial intelligence which aid their recognition, management, and application’ (Deuze & Beckett, 2022, p. 1913), a position that stresses the importance of not only technical skills but also ‘ethos’ – attitudes, beliefs, and ideologies – as a component of AI literacy. Research on AI literacy is not new but has acquired particular importance now that GenAI has become ubiquitous and GenAI literacy practices have become part of many people’s everyday lives. As a result of this change there has been a surge in studies that examine AI education, as well as AI in education, for non-technical learners (where, traditionally, AI literacy was of primary interest to students in various technical computer studies fields, as Long and Magerko (2020) note).
Within this digital literacies framework, concepts of creativity, criticality, agency, and voice assume an important role for the purposes of our study, understood as follows.
2.1. Creativity
‘Creativity’ can be defined as ‘the ability to come up with ideas or artefacts that are new, surprising and valuable’ (Boden, 2004, p. 1). This is not limited to grand acts of historical creativity that transform the way people think. Rather, it includes acts of everyday creativity as well, such as novel and surprising combinations of semiotic resources deployed for purposeful effect in multimodal designs, with the potential to ‘push and break boundaries between the old and the new, the conventional and the original’ (Li, 2011, p. 374).
2.2. Criticality
Creativity can be seen as an expression of ‘criticality’, which ‘examines how ways of thinking are constructed through discourse’ (Darvin, 2020, p. 584), following particular ideological positions. With GenAI, examining discourse in this way also means examining interactions with chatbots, uncovering underlying ideologies and positioning.
2.3. Agency
‘Agency’ refers to ‘people’s ability to make choices, take control, self-regulate, and thereby pursue their goals as individuals’ (Duff, 2012, p. 419). GenAI tends to challenge agency if individuals knowingly or unknowingly surrender control and over-rely on AI-generated output in their literate activity.
2.4. Voice
Finally, ‘voice’, also challenged by GenAI writing practices, can be seen as a property of both discourse and individual identity. It involves ‘feeling-hearing-sensing a person behind the written words, even if that person is just a persona created for a particular text or a certain reading’ (Bowden, 1999, pp. 97–98; as cited in Hirvela & Belcher, 2001).
3. Composing and DMC with GenAI
Research in the higher education context has examined how university writing instruction can benefit from the incorporation of GenAI tools, addressing the important question of how such tools should be conceptualized and how they fit into the writing process. There is the potential for these tools to both enable and constrain students in terms of creativity, agency, and writing proficiency; that is, they can be both empowering and disempowering (Washington, 2023). Current proposals for GenAI use suggest that it can act as a co-author (McGuire, 2023), an assistant (Aguilar, 2024), a partner (Vetter et al., 2024), a ‘machine in the loop’ capable of ‘rhetorical load sharing’ (Knowles, 2024), or as a participant in ‘human-machine teaming’ (Bedington et al., 2024). It is notable that all of this work envisages an approach in which humans are collaboratively ‘writing with’ GenAI and does not conceive of GenAI as a simple replacement for the human writer. In addition, other research has examined possible human/GenAI ‘collaboration’ in terms of stages in the writing process. For example, Su et al. (2023, pp. 6–9) consider that GenAI could support students learning argumentative writing at a number of stages: (a) ‘the preparation stage – facilitating idea generation and providing feedback on outlines’; (b) ‘the editing stage – providing feedback on the draft and supplying different perspectives’; (c) ‘the proofreading stage – providing error corrections’; and (d) ‘the reflection stage – facilitating reflection through the chat history’. Such observations about ‘traditional’ writing are all applicable to digital multimodal composing as well.
As mentioned above, this article focuses on the potential and pitfalls of GenAI tools for DMC. DMC is an approach to language teaching pedagogy that engages language learners with multimodal forms of expression common in digital media, such as the production of infographics or various kinds of videos. As such, DMC attempts to address the contemporary communication needs of language learners by developing multimodal literacy. It is an approach that has attracted considerable attention, as evidenced by recent systematic reviews in the school and university context (Smith et al., 2021; Zhang et al., 2023). Kang and Yi (2023) suggest that multimodal GenAI can be productively employed for the development of multimodal literacy, with students prompting GenAI and evaluating the resulting visuals and how they complement the input text. However, at the present time, relatively few studies have examined how GenAI can be incorporated into DMC processes (in spite of notable contributions: see the special issue of Computers & Composition edited by Lim et al., 2025).
Existing work examining student use of GenAI in DMC has engaged with a range of tasks, including student construction of photo essays (Tan et al., 2025), advocacy posters (J. Jiang, 2024), remix/retelling (Smith et al., 2025), video pitch presentations (D. Li et al., 2025), and English video documentaries (L. Jiang & Lai, 2025). Interventions have promoted the use of GenAI tools at all stages of the composing process, including idea generation, design, and revision phases. While idea generation can involve using GenAI to create sample artefacts that stimulate DMC processes (J. Jiang, 2024), the design phase can involve GenAI for script generation, image generation, video generation, voice generation, and voice ‘filtering’ (see e.g. L. Jiang & Lai, 2025). Studies show that GenAI use tends to be iterative, often shaping creative possibilities by engaging students in overcoming GenAI constraints on multimodal meaning-making, with an important role played by play, humour, and surprise at GenAI outputs (Smith et al., 2025). There is process evidence that suggests that students are developing multimodal literacy and prompt literacy in GenAI-assisted DMC (D. Li et al., 2025; Tan et al., 2025) as well as evidence to suggest that GenAI supports improved multimodal compositions (L. Jiang & Lai, 2025). However, in such interventions, the interplay between creativity and constraint requires further investigation. As Smith et al. (2025, p. 12) conclude: ‘The contradiction between the potential of GenAI to enhance creativity and the risk of it constraining human agency requires careful consideration.’ The study that we report here attempts to fill this gap, drawing on a more peripheral context and examining the perspectives of experienced teacher/researchers.
In this article, we aim to expand on existing research that is specifically concerned with the role that GenAI can play in processes of DMC, in particular video composing. This is a topic that merits detailed attention, given the many processes that are at work in multimodal composing, from the planning phase (research, outlines, mind maps) to the design and sharing phase (scripts, storyboards, drafts, cuts, final products) to the reflection phase (conversations and reports) (Hafner & Ho, 2020). We consider an important goal of DMC to be to engage with digital literacies, in particular as these relate to digital multimodal design, understanding, for example, the affordances and constraints of different semiotic resources and how these can affect meaning-making. In this study, we consider the following questions (outlined above) to be especially important and take these as research questions that guide the study.
(a) How do the affordances and constraints of generative AI tools powerfully shape the multimodal composing process in a way that enables or constrains individual agency, creativity, and criticality?
(b) What are the implications for practice with respect to the use of generative AI in DMC?
4. Methods
4.1. Methodological Approach
In this study, we reflect on the possibilities of GenAI to support a DMC project on an English-for-specific-purposes course that we have designed, making use of practitioner inquiry, the ‘systematic, intentional inquiry by teachers about their own school and classroom work’ (Cochran-Smith & Lytle, 1993, pp. 23–24). In particular, the study involved us in making use of GenAI tools to complete a major student project as if we were students on our own course and examining our own processes of AI-supported digital multimodal composing. Under the broad umbrella of practitioner research, we draw on collaborative autoethnography (CAE) as a research methodology that invites researchers to serve as sites of inquiry, making use of a range of ethnographic techniques as described below. When we talk about ‘ethnography’ or an ‘ethnographic orientation’ in this paper, we are using these terms in a limited sense to refer to what Lillis (2008) calls ‘ethnography as method’, an approach that borrows ethnographic tools without claiming to conduct a full ethnography, as opposed to ‘ethnography as methodology’ or ‘ethnography as “deep theorizing”’.
The ethnographic orientation adopted is in line with other studies of writing that seek to provide highly contextualized accounts of writing processes as socially situated activities (e.g. Paltridge et al., 2016). Such studies typically draw upon a wide range of data sources that can include: observations of the writing process including interactions between collaborators; recordings and field notes taken during such observations; artefacts created in the process, like outlines, drafts, and written feedback from collaborators; and interviews with authors to gain access to their emic perspectives on the writing process. These emic perspectives may shed light on participants’ understandings of the genre that they are constructing but can also record life histories and larger narratives that situate the writing process and provide information on writerly identities and the way that writers orient to the writing task, including how emotion plays a role.
CAE (Chang, 2021), also referred to as duoethnography and coethnography, is an approach that is becoming more common in applied linguistic research. CAE research frequently emphasizes the narratives and life histories of the researchers, whether these are jointly constructed or in dialogue with one another. The present study differs from a typical CAE in that it draws on a wider range of data sources – supporting narratives with other in-depth evidence of writing processes. As noted above, this is consistent with ethnographic approaches to L2 writing studies. In spite of this difference, we align with a number of the guiding principles of such research. For example, among the tenets of CAE that Norris and Sawyer (2012) list we find the following: the notion of ‘currere’ – that is, the self as a site of investigation; the importance of researcher difference, so that researchers are challenged by the perspectives of the ‘Other’; the notion that CAE is polyvocal and dialogic, aiming to make the voice of each researcher explicit; the importance of change and transformation, highlighting also the tensions and conflicts that led to such change; and the importance of trust to foster an environment in which disclosure can be made.
4.2. Background to the Study
The course in question is an English-for-science course designed for students doing a range of science majors at a university in Hong Kong. It was designed by Christoph, and both Christoph and Jenifer have taught on it in the past. It is a site that has been documented at some length and that we have collaboratively written about before (see e.g. Hafner & Ho, 2020, 2024). Our practitioner inquiry took place in January and February 2024, a little over 1 year after ChatGPT had been made available to the public. At that time, we were interested in experiencing and reflecting on the way that GenAI tools like ChatGPT might change multimodal composing processes on a key groupwork task assigned to students on the English-for-science course. We saw this as an important first step before engaging our students with GenAI tools (so, the present study does not involve students but only ourselves). Consequently, we undertook to work together and go through the same multimodal composing process that our students go through. This involves completing a simple scientific study and reporting the results in the form of a digital video scientific documentary, 5 to 7 minutes in length. Wherever possible, we aimed to incorporate GenAI tools into the process, making use of both text-based chatbots and multimodal video creation tools. Even though GenAI tools had been around for a while when we began our study, we were both relative novices and had not had a great deal of experience in using GenAI tools at the time. Early on in the process, we consulted online guides in order to learn the basics of prompting and also asked ChatGPT for advice on the topic.
4.3. Composing Process
Our composing process involved both individual and collaborative work, with 10 meetings over a period of 5 weeks, during which time we had to collect data, develop a script, and produce the 5–7-minute video. We chose a project entitled ‘Rhythm and Beat’, whose aim was to determine whether listening to fast or slow music could affect heart rate. This was determined by measuring the pulse under the two conditions (listening to fast and slow music) and we used ourselves as the participants. At the start of the DMC process, we identified GenAI tools to assist, looking for those that our students would have easy (and free) access to and that could generate full scripts and videos. Applying these criteria, we selected ChatGPT 3.5 accessed for free through Poe and the free version of InVideo, which has the ability to generate full videos from prompts by combining online stock footage with an AI-generated script read by an AI voice generator. At the time of the study, a number of AI-powered video tools were already available on the market. Jenifer conducted a systematic search of the tools available and selected InVideo based on her understanding of the goal of the project at that moment – to identify a tool that could offer the ‘greatest’ assistance to the DMC process. During the search process, other tools such as Synthesia and HeyGen were also identified, but their capabilities were more focused on avatars and voice cloning, which were not considered apt for our project objectives.
We considered that the most effective way to make use of these tools was to use ChatGPT to support brainstorming and script writing, and InVideo to generate the final video. However, the assignment requirements imposed an important constraint: videos must be voiced over by members of the project team. We therefore could not use the voice generation tool on InVideo and had to find a way to incorporate our own audio. Eventually we did so by exporting visuals generated by InVideo and using iMovie to combine them with our own audio narration. It should be noted that using free tools had the disadvantage of potentially limiting functionality and increasing technical constraints; however, these are the constraints that most of our students face.
4.4. Data Sources and Analytical Methods
As we went about the composing process, we collected a range of different kinds of data, summarized in Table 1 below. Each of our 10 (1–2 hour) meetings involved discussions of the composing process, decisions about how to use the GenAI tools, and joint writing activities: for example, AI-supported script writing. During each meeting we entered observations about the process in the form of field notes in a collaborative research journal (abbreviated below as ‘Collaborative FN’). After each meeting, we added field notes and reflections to our own, individual research journals (‘FN’). One year after the initial DMC activity, we expanded our field notes with an additional 1-year reflection (‘1YR’) telling the story of our evolving AI literacy with respect to changes and developments in knowledge, attitudes, and competencies, as well as what contributed to those changes and developments. In addition, we kept ChatGPT chat logs, the resulting video scripts, and video output.
Table 1. Summary of Data Sources.
GenAI = generative artificial intelligence; DMC = digital multimodal composing.
Field notes were analysed using reflexive thematic analysis (Braun et al., 2022), an ‘artfully interpretive’ approach that seeks to ‘capture patterns of shared meaning, clustered around a central concept or idea, and tell stories about what such patterns mean and why they matter’ (p. 27), ‘valuing researcher subjectivity as a resource for research’ (p. 20). Within this qualitative interpretive paradigm, we took the following steps to enhance the credibility and trustworthiness of the analysis. First, the collaborative analysis incorporated the perspectives of both researchers, acting as a check on individual bias. Second, researchers engaged in a process of repeated reading and constant comparison (Richards, 2003), triangulating different data sources to arrive at trustworthy interpretations.
4.5. Researcher Positionality
Our position as former teachers and (in one case) the course designer provided us with a rather unique insider perspective to address questions on the use of GenAI on the course. At the same time, we must acknowledge that our position is best thought of as that of quasi-insiders (Goundar, 2025): our own experiences of DMC described here will not reflect those of our students exactly, as we differ from our students in terms of age, educational attainment, and (in one case especially) ethnic background. The CAE techniques adopted allowed us to draw on very different perspectives owing to our different histories and backgrounds: Jenifer is an ethnic Chinese in her 30s who was born and raised in Hong Kong, completed her doctoral studies overseas, and subsequently returned to work in Hong Kong; Christoph is in his 50s, a Swiss/New Zealand national and Caucasian, who has spent his later life living and working in Hong Kong. The discussions that arose as a result of these different perspectives and our collaborative, dialogic engagement with one another strengthened our analysis by mitigating potential bias that can emerge as a result of limited perspectives.
5. Findings
Here we present findings grouped according to the following themes: (a) feelings; (b) relating to GenAI: the iterative, collaborative loop; (c) ideological positioning; (d) modal interactions: process and product; (e) voice and authenticity. In presenting these findings, data sources are indicated as follows: collaborative research journal field notes (Collaborative FN), field notes (FN), and 1-year reflection (1YR), with position in the data (Pos) provided.
5.1. Feelings
In our journals, we noted conflicting feelings about GenAI tools, ranging from anger at the disruption that they had caused, at one end of the scale, to a sense of awe and joy at the impressive capabilities of the tools and their sometimes quirky responses, at the other. Initially, we noted a range of negative or ambivalent feelings that acted as a barrier to engagement, as these extracts from our research journals show:
I went through a period in which I didn’t even want to learn about it, as I didn’t see it as being helpful, but only a substitute to the human mind. (Jenifer, FN, Pos2)
My feelings have rather gotten in the way of further exploring these AI tools because I still think that a very reasonable response to them is to say, ‘I don’t want to have anything to do with them’. With this project, I have an opportunity to take a look at the tools, experience them in more depth, go beyond the initial sense of despair, frustration, anger, disappointment and hopelessness, and do a more thorough evaluation of what educators need to actually do with these tools. (Christoph, FN, Pos13)
Jenifer’s perception that GenAI is ‘only a substitute to the human mind’ is echoed and amplified by Christoph, who initially sees GenAI in terms of an existential challenge, noting: ‘I really feel as though my humanity is threatened by this: what does it mean to be a creative, thinking, living being, if an inert machine can replace your creative product?’ (Christoph, FN, Pos13). Beyond this, we frequently returned to a range of worries and concerns, including a range of losses that might arise from over-reliance on GenAI, such as loss of control, loss of creativity, and loss of agency, all of which was accompanied by some sense of frustration at times when the GenAI output was disappointing. These concerns led us to discuss whether the impressive capabilities of AI tools were genuinely useful for language teaching and learning.
In spite of these negative feelings, our notes also show a change in feelings over the course of the project, as we began to appreciate what the tools were capable of, finding the process of using them to be intellectually stimulating and the output impressive. Along with this came an openness to considering how GenAI could usefully support English language teaching, as indicated below:
In addition to the serious scepticism that I had experienced, I also began to experience some genuine feelings of fascination, wonder and joy at what the tools were able to output. I began to see some of the potential of these tools going beyond the disruption that they had caused and experienced some of the excitement that had caught the attention of others. (Christoph, 1YR, Pos3)
During my time working closely with Christoph, I unconsciously dismissed these potentials as I didn’t see how allowing students to clone their voices and generate a life-like avatar could be useful for them. Now, I found myself too narrow-minded in such thinking. I recall in one of my discussions with Christoph, he mentioned that he felt the AI-generated voice read the script better than him. This thought lingered in my mind for a long time. I started to appreciate the potential uses of voice cloning and avatars. In my mind, I asked myself: can they be good ‘models’ for students to imitate? Can they practice their pronunciation, tone and pace by following the AI-generated voice? Can they practice their delivery by watching how the avatar does it? (Jenifer, 1YR, Pos6)
From initially holding a sceptical view of GenAI and regarding it as a threat, to developing a feeling of fascination and appreciation, both of us have experienced a dramatic change in our perception of GenAI after repeated exposure individually, and as a team. This change in perception has increased our motivation to further incorporate GenAI in our everyday lives and be more open to exploring how it can be used in ways that were previously unimagined.
5.2. Relating to GenAI: The Iterative, Collaborative Loop
There was a clear change in the way we related to ChatGPT, a change that occurred shortly after we got started on the project. We began our composing process by interacting with ChatGPT to produce a documentary script. We first asked for an outline based on the project prompt, then ideas for an engaging opening, and finally a script for an opening sequence (and subsequently other sections of the documentary). In terms of how we thought about the process, Christoph noted the influence of media depictions of GenAI:
Certainly, much of the discourse surrounding ChatGPT and other generative AI frames these new tools as agents that can take over and do the job for you. (Christoph, FN, Pos12)
It is interesting that the discourse of GenAI seems to portray the tools alternatively as a slave (to do all of our work for us) or as a master (who will take over the world and decide our fate). (Christoph, FN, Pos18)
Consequently, at first we naively hoped that ChatGPT would produce an acceptable script after one or two shots of prompting. In field notes, Jenifer commented on the difficulty of fine-tuning the output and the need to adopt a ‘step by step’ approach in prompting:
Exploring the use of GenAI in creating a script for the English for Science course, my initial idea was to simply input the assignment instructions, and ask ChatGPT to come up with a script for me for further revision, but based on my experiences with ChatGPT, it is difficult to ask it to revise parts of the script and keep the rest of it. In other words, once I prompt it to improve certain parts of the writing, a whole new writing is generated, which is not ideal. Christoph then suggested that based on the outline, we prompt ChatGPT step by step, starting with the introduction. (Jenifer, FN, Pos3)
It became clear that a longer, more interactive and iterative process was required. As a result, we began to think of ChatGPT as a ‘collaborative partner’:
Since the start of the AI project with Christoph last year, I have started to appreciate the need to ‘converse’ and ‘form a relationship’ with AI. In addition to seeing it as simply a search engine that gives me an answer, I started to see it as a critical friend who I can discuss ideas with . . . I want to understand more about this ‘human-AI partnership’ and how to make the best use of it. I still appreciate keeping the human touch and not completely outsourcing everything to AI. (Jenifer, 1YR, Pos7)
We also began to notice that the conversations that we were having with ChatGPT in order to generate a script engaged us in a rich process of evaluation, offering opportunities to learn about the documentary genre, a process that Christoph described as ‘more mindful and intellectually engaging’ (Christoph, FN, Pos12): as one goes about the script generation process, one is indeed confronted with choices between different kinds of output that one could make use of in one’s own script. This is a choice that one has to make using critical abilities – one needs to look at things like register and how engaging the text is or not, in order to critically evaluate whether the text meets the needs of the audience. So, if one has generated some alternatives, choosing between them requires a critical, thoughtful evaluation. (Christoph, FN, Pos22)
The iterative, collaborative ‘loop’ can be seen in extracts of chat logs. Below we reproduce ‘our side’ of the conversation that we engaged in when generating the discussion section of the script. Each prompt produced lengthy responses that had to be evaluated:
Q: Can you please write the script for this interpretation/discussion section? Please also include one or two sentences on the limitations of the study.
ChatGPT: [. . .]
Q: Can you please rewrite this script of the interpretation and discussion but this time avoid the dialogue format and just use one researcher, i.e. create a monologue?
ChatGPT: [. . .]
Q: Can you please rewrite that but make it more engaging and also provide a memorable closing at the very end?
ChatGPT: [. . .]
Q: I think you overdid the engagement. Can you write a version that still engages with the audience but in a more subtle way?
ChatGPT: [. . .]
Q: Can you write a version with less engagement features but with a memorable closing?
ChatGPT: [. . .]
Q: That’s better but the references are missing. Can you please write a version that cites actual studies, is not too engaging, has a memorable closing and lists the references at the end?
(ChatGPT log)
Here, we negotiated aspects of the form of the script after the initial output was generated, asking for a monologue, an appropriate level of audience engagement, a memorable closing, and a list of references. In particular, it is interesting to reflect on Q4 (‘I think you overdid the engagement’). With this interpersonal dimension of register, multiple turns were required until we were satisfied with the output. The chat log also demonstrates the way that we related to ChatGPT, positioning the bot as a collaborative partner and author, making use of polite requests and evaluations (‘Can you please write the script’; ‘That’s better but the references are missing’). In the next section, we consider how we perceived that GenAI positioned us.
5.3. Ideological Positioning
Ideological positioning was most obvious to us when using InVideo. At this point, we had co-constructed a script with ChatGPT and sought to use this as the basis for an InVideo production. The video production process is described in our collaborative field notes, as follows: As you generate the video, you have an option to input a script and then you can make a range of choices about what kind of video it is that you are creating (for example, the gender of the narrator and the accent as well as music choices) and who the audience is (for example, science buffs) and what platform it is designed for (which reveals some interesting understandings about platform ideologies). (Collaborative FN, Pos46)
This process is summarized in Figure 1, which illustrates the sociotechnical structure that guided our video creation. It is worth noting the ability to ‘clone your voice’ as well as the range of ‘workflows’ provided, some of which evoke genres of a sort (‘YouTube explainer’ and ‘YouTube shorts’) and others a process (‘Script to video’, which we selected).

Figure 1. The sociotechnical structure guiding the artificial intelligence (AI) video creation.
Options for the video to be created are provided, as are options for editing once the video has been produced. In spite of the many options provided, we felt that this platform and this process positioned us as a distant master (a notable contrast to the collaborative partnership we experienced with ChatGPT). As a result, we felt disconnected from the composing process, passive and disempowered: The entire interface is premised on the notion that you want to assign the whole video production task to a team that will take care of it for you. As a result, using this tool relieves you of the (productive) burden of creating any part of the video. To do the job in what we would think of as an active way, you need to work around the tools. Put differently, the student becomes passive with respect to the multimodal design or much of it. (Collaborative FN, Pos91)
In addition, the tools seemed to us to be designed for particular kinds of videos and not others. As mentioned earlier, there were many constraints on adding our own voice recording, adding our own footage, and creating a dialogue with, for example, two AI-generated voices interacting with one another (at one stage, our script called for this). What we were able to do was combine the script that we provided with a range of b-roll footage (supplemental footage that complements the main shot and narration), always accompanied by a soundtrack, in more or less effective ways (see below). Although there was an editing option, this was limited, in the free version of InVideo at least. There was also the option of changing the b-roll generated by performing another search, but the extent of customization was still limited to what was available on the Internet. Furthermore, we noted that the kinds of modifications possible seemed to presuppose a particular kind of ideal video output. For example, in the editor, visuals are divided into segments that can be modified but only in limited ways: ‘It is not possible to fully regulate length – segments will be cut off after 5–8 seconds. For us, this indicates an assumption about what makes a good video – i.e. a good video is one with lots of visual variety’ (Collaborative FN, Pos65). Likewise, while automatically generated on-screen text titles can be modified, the scope for modification is constrained: The duration of titles is very short. If you want a longer title then it will appear quickly and disappear and be impossible to read. For us, this reveals a kind of assumption about titles or text overlays – that they ought to be short. All of this is in the context of trends like short video production, where social media content is becoming shorter and this brevity is valued. (Collaborative FN, Pos66)
Although the default design of the platform was to offer convenience and to ‘take over’ the entire production process, such a design was (unexpectedly) seen as a constraint in the DMC process.
5.4. Modal Interactions: Process and Product
When engaging with ChatGPT to develop ideas and a script for a documentary, the chatbot provided suggestions on music and visuals: for example, that we use ‘captivating music’ or ‘inspire viewers with a montage’. In producing scripts, our prompt specified that visual direction should be provided. The following example of output shows how ChatGPT was able to make suggestions for coherent multimodal ensembles (even though there seemed to be an unusual preference for close-up shots, an option that was suggested very frequently): ChatGPT: Script for Opening Sequence: [Scene 1: Close-up shot of a person, eyes closed, swaying to music, with a look of pure joy] Person: When music takes hold, my body responds, moving to its rhythm, becoming one with the beat. (ChatGPT log)
In contrast, we noted that the InVideo production frequently lacked meaningful interaction between modes: Looking at the b-roll, one does not get a sense of progression; where it should progress into methods, results, and discussion, visually it is not progressing sufficiently but cycling through similar kinds of images. Jenifer did notice that in the methods, some kind of new images are appearing (e.g. people exercising in the lab hooked up to machines) but there is a mismatch correspondence between text and image here. Similarly in the results, it was also striking that no graph was displayed when the script mentioned and discussed a graph. (Collaborative FN, Pos51)
An example of this issue of modal interaction is illustrated in Figure 2, which shows a still image of the video and the scripted narration that accompanied it. Although the narration introduces and describes tabulated results, the image depicts a person seated at a desk, upper body dancing in front of a laptop.

Figure 2. Example issue of modal interaction.
As well as limitations in the product, we also noted limitations in the process. Specifically, we felt that steps in the DMC process that involve students in multimodal design, and are therefore useful in promoting multimodal literacy, were being taken over by the video generation tool. These steps include storyboarding, filming, locating footage, and editing. With respect to storyboarding, Christoph noted: By outsourcing the storyboarding, which is essentially what we have done here, we have effectively skipped that step and I think we end up with a video that is much less meaningful than if we did the hard work of finding and editing the b-roll ourselves. (Christoph, FN, Pos52)
We made a similar observation with respect to the soundtrack: ‘There was very little opportunity to select the music and very limited control over music. This would militate against the learning of any kind of “skill” to judiciously match sound and image’ (Collaborative FN, Pos89). The limitations in the product and in the process were particularly felt when comparing our experience of using ChatGPT in the script-writing process with our experience of using InVideo in the video production process. Although it was relatively easy to give ChatGPT a specific prompt and obtain output that reflected the prompt, this was not as easy with the video output generated by InVideo. For students who would like to have some control over the DMC process, using InVideo could at times even complicate the process:
I think this tool [InVideo] could be useful as a search engine to locate suitable b-rolls for the documentary, rather than using it directly to generate a video for submission. I would imagine a lot of effort is needed to create a satisfactory video using this tool, given the complexities involved in the editing process. Perhaps it would take even less effort to simply use the script and film the video ourselves. (Jenifer, FN, Pos32)
In general, I think the time spent on modifying the ‘product’ is way too long and after spending all the time, I could only make small changes. There are many constraints imposed on the user, such as not being able to alter the ‘ideal duration’ of each b-roll, the modality of the b-rolls (e.g. photos vs cartoon, colour vs black and white, etc.). It was simply difficult to be creative and alter the ‘structure’ that InVideo has already set up. Interestingly, it is easier to create a new video than making small changes to an existing video. (Jenifer, FN, Pos35)
5.5. Voice and Authenticity
Considering the product (both script and video) raised questions of voice and authenticity for us. Our evaluation of the GenAI output was mixed: for example, we noted an apparent ‘understanding’ of the genre, as demonstrated by sound organizational structure in a ChatGPT outline; however, we also considered scripted output to miss the main point, to be overly generic, overly formal, fake, or contrived:
When it comes to generating the script, there are distinct shortcomings; our first reaction was that the script was very cliché; we also think that the opening, while highlighting the ‘power of music’ (which while cliché might actually work), fails to get at the question that is central to the research – how does it affect our bodies? (Collaborative FN, Pos17)
One problem is that some of the turns are a bit repetitive (e.g. repeating ‘Indeed’ to pick up on a point – this just sounds fake/contrived). It is very scripted – there seems to be an attempt to make it conversational but words like ‘Indeed’ make it look scripted. (Collaborative FN, Pos35)
Trust in this scripted output was a key concern and, when we asked the chatbot to provide references, we tended to treat the output with suspicion. We remarked that ‘these are studies that actually exist not just hallucinated references but it is difficult for us to verify whether the studies actually say what ChatGPT says that they say’ (Collaborative FN, Pos22). This feeling that output could not be trusted went hand in hand with a concern that students might overly rely on the plausible output generated. Christoph noted: ‘One initial concern that I have is that the resulting outline could be seen as having a kind of “final authority” rather than being seen as a starting point’ (Christoph, FN, Pos10).
With respect to the video output, narration and visuals also raised issues for us. When we reviewed an initial cut of the video, we noted of the AI voice that the ‘reading’ is impressive given that it is not a human but it is somehow off, inexpressive, lacking understanding and appropriate pacing (Collaborative FN, Pos49). As noted above, we found the interactions between speech and moving image to be problematic and noted a lack of complementarity between them. On another level, we also considered that the visuals generated by InVideo seemed ‘fake’, noting that the stock footage generated failed to evoke a meaningful sense of connection to the multimodal composers behind the video because our own bodies, voices, and the settings that we typically inhabit were absent from the video generated: ‘This raises an issue of authenticity – the video seems disconnected from us as authors. It seems fake. It’s hard to imagine that people wouldn’t notice and respond to that fakeness’ (Collaborative FN, Pos57). This, as Jenifer pointed out, was compounded by bias in the selection of visuals. The initial cut had ‘a lack of representation of Asian faces’ (Jenifer, FN, Pos34), an issue that could be addressed by adjusting prompts in an editing phase. Given that our own students are overwhelmingly of Asian ethnicity, this bias caused us to reflect further on the effect of the generated visuals on discourse identity, voice, and authenticity: There are no Asian faces in the b-roll footage that has been provided and this is something that we would expect to see in the [students’] video. This seems to have the effect of taking the video out of context and reflecting the student identity poorly – there is not a very good match between the images selected and the actual context for the video production. (Collaborative FN, Pos77)
Taken as a whole, these observations point to considerable constraints on authentic expression that were introduced by the tools, especially in terms of limitations on the visual representations selected.
6. Discussion
It is important to note that the study is limited to a single pedagogical context and so its findings must be interpreted with caution. Nevertheless, a number of potential insights emerge.
The study has examined how GenAI tools could be applied to an existing DMC task, one that has been established on our English-for-science course for many years. Our goal was to use GenAI in the DMC process as much as we possibly could, in order to determine how it could potentially shape the DMC process, as well as processes of teaching and learning. Our narratives and supporting data drew attention to a range of issues in the AI-supported DMC process, including: the importance of feelings and emotions; the composing process as an iterative, collaborative loop; ideological positioning, both how we positioned the tools and how they positioned us; modal interactions, which we judged to work well with ChatGPT but poorly with the InVideo software; and, finally, the issue of voice and authenticity, including potential for bias, in the final video product. Some of these issues (such as modal interactions, and voice and authenticity) pick up on questions that have been raised in the literature on DMC. Existing research identifies students’ use of stock images as a potential ‘hindrance’ to authorial voice (Nelson, 2006), with ‘remixed’ visuals (rather than students’ own visuals) capable of overpowering and ‘compromising learner voice’ (Hafner, 2015). Yet the use of GenAI, which makes stock image/footage selections on behalf of students, would seem to exacerbate these issues. Furthermore, other issues related to the composing process with GenAI appear novel to AI-supported DMC, requiring rethinking of the DMC activity.
It is interesting to note that the CAE methodology employed highlighted an aspect of the AI-supported DMC process that we did not initially consider important when designing the study: the feelings and emotions that emerged in our narratives. Both Jenifer’s and Christoph’s research journals revealed an emotional dimension to the experience of AI-supported DMC, linked to our histories with AI. Jenifer reported ambivalence with respect to GenAI, while Christoph reported strong feelings including an ‘initial sense of despair, frustration, anger, disappointment and hopelessness’. These feelings developed over time until we both felt more positive, seeing value in and enjoying our interactions with GenAI tools. This emotional dimension is picked up in existing research on AI-supported writing. For example, Stornaiuolo et al. (2024) comment on the potential playfulness of AI-supported learning activities, a finding echoed by Tan et al. (2025). In a slightly different vein, Smith et al. (2025) note ‘polarized’ attitudes among students, ‘where some see it as a creative ally and others view it with skepticism’. Going forward, we would suggest that attempts to incorporate AI-supported DMC take this emotional dimension into account in discussions with students. Our narratives suggest that attitude to AI, as well as related beliefs and values, should be treated as an important component of a critical AI literacy, as is acknowledged by some existing AI literacy models (e.g. Ng et al., 2021).
A primary goal of this study was to determine how GenAI powerfully shapes the multimodal composing process, paying special attention to its effects on agency, creativity, and criticality. One important observation here relates to ideological positioning, both the way that we positioned the tools and the way that they positioned us, making certain actions possible and others impossible. Here, it is interesting to compare the text-based ChatGPT with the video generation tool InVideo, as a clear contrast emerged in our narratives. In interacting with ChatGPT, we positioned this tool as a ‘collaborative partner’ and this positioning was facilitated by the dialogic presentation of the interface. In essence, the ChatGPT interface as presented on Poe mimics a text messaging interface, positioning the human and machine as conversation partners. This is in contrast with the interface of InVideo, which offered a range of choices for us to select from and which, once those selections were made, we considered to position us as passive recipients of the generated video product. Such a positioning has an important effect on learning opportunities. As noted in the collaborative research journal, ‘using this tool relieves you of the (productive) burden of creating any part of the video’. We also noted a built-in preference for short video clips in the InVideo tool, which we interpreted as ideological: the tool actively promotes particular kinds of video and not others. We suggest working with students to foster a ‘conscious stance’ (Jones & Hafner, 2021, p. 135) to the ideological positioning inherent in these tools, in order to raise awareness of assumptions that are built into their sociomaterial architecture (see Darvin, 2025).
As noted earlier, DMC – and in particular video production – is a multifaceted process that involves a greater range of creative activities when compared with traditional writing, largely because students not only need to write a script or storyboard but also need to film and produce video that incorporates writing, sound, and moving image. Our narratives show that engaging in AI-supported DMC could be both empowering and disempowering. An important question for language educators to consider, when designing AI-supported DMC activities, is how to design activities that are empowering for learners. For us, empowering experiences included: (a) evaluating the tools as either fit for purpose or not; (b) developing an iterative process of prompting; (c) developing a collaborative relationship with ChatGPT; and (d) emotionally growing and developing as we reevaluated our emotional responses to GenAI. Disempowering experiences included: (a) losing control of fine-grained editing processes when using ChatGPT; (b) losing control of the video editing process when using InVideo; (c) noticing problematic issues of modal interaction when using InVideo; (d) noticing bias; and (e) noticing a lack of authenticity with respect to voice, not only in visual selections but also in register/tone of the script produced. Some of the constraints were overcome: for example, reprompting InVideo to select footage featuring ‘Asian faces’ was effective. However, others were not.
It is worth reflecting on the process that we went through, as this could provide insights into the critical digital literacies that students and their teachers may need in order to engage effectively with AI-supported DMC. Because of the complexity of the DMC process, it was necessary to break the whole task down and consider all of the sub-tasks involved in designing and producing a video. It was then necessary to evaluate these sub-tasks in order to determine which ones would be amenable to AI support and what kind of AI support. In our case, the project process involved reading and researching, collecting data, developing a script, locating stock images/assets, filming and recording voiceover, editing, and finally producing the 5–7-minute video. In our process, we resorted to text-based generative AI in order to assist with the writing of the script, while at the same time testing the tool to see if it was able to generate images, charts, and so on. We also tested the video generation capacities of the tool chosen (InVideo) with a view to generating the kind of video that we had in mind. In doing all of this, we engaged different critical capacities of task evaluation and tool evaluation, and, ultimately, text evaluation as well. The sense of empowerment that we describe above frequently derived from exerting agency in the process, as well as over the process. That is to say that although it was sometimes possible to be an agentive ‘human in the loop’, we also derived a sense of agency from crafting the loop itself and deciding which machines were going to be in it and why.
Existing scholarship in L2 writing has drawn attention to the need to develop critical digital literacies with respect to GenAI in two main ways. Focusing on issues of materiality, indexicality, and ideology, Darvin (2025, p. 4) calls attention to a sociomaterial perspective of critical digital literacies, one that ‘highlights how interactions with GenAI tools are always sites of struggle – a negotiation of platform design and user design that indexes relations of power, ideologies, and colluding or colliding interests’. Focusing on the challenges of integrating GenAI in instruction to best meet students’ goals, Warschauer et al. (2023) outline the need for students to understand, access and navigate, prompt, corroborate, and integrate GenAI tools in their writing. Building on these insights, we suggest that teachers interested in developing an AI-supported approach to DMC can do so by addressing issues at three levels, as illustrated in Figure 3. The figure also illustrates a process that students can apply independently.

Figure 3. Integrated critical digital literacies for artificial intelligence (AI)-supported digital multimodal composing (DMC).
We suggest that teacher guidance is beneficial at every level of this model. At the macro level, the focus is on the development of a critical (self-)awareness with respect to AI and one should attend to feelings, ideology, and materiality, which are all factors that shape and frame the AI-assisted DMC activity. At the meso level, the focus is on the design of the collaborative, iterative loop, which we can see as a DMC ‘workflow’ that incorporates different kinds of AI tools through brainstorming, design, and revision phases. Here, one must attend to tasks and tools, strategically selecting apt AI tools, distributing the rhetorical load, and achieving rhetorical effects. At the micro level, students get involved in interaction, evaluation, and integration and must attend to issues of authenticity and voice. Future research may examine how this model can be applied and what teacher guidance might look like in the context of AI-supported DMC activities with language learners.
7. Conclusions
In summary, this study has highlighted some of the key issues that teachers and students making use of GenAI to support processes of DMC are likely to encounter. These will include not only technical issues associated with understanding, selecting, and operating various GenAI tools but also issues that relate to teachers’ and students’ emotional dispositions to the tools and their understanding of ideological positionings realized by the sociomaterial structures designed into interfaces that mediate interaction, input, and output. In addition, it will be necessary to discuss with students changing processes of writing and conceptions of authorship and what this all means in terms of ethics and accountability. GenAI has the potential to greatly enhance students’ experience of DMC, offering, for example, the ability to flexibly design visuals. At the same time, it is important that teachers and students remain clear about the goals of interacting with GenAI and retain agency over the process. Our experience suggests that an important component of this is the careful design of an iterative, collaborative loop that purposefully pairs GenAI tools with particular tasks in the complex DMC process. Drawing on these observations, the study makes a particular contribution through the development of a model of integrated critical digital literacies for AI-supported DMC, which expands on existing work to draw attention to feelings as an important factor in shaping GenAI engagement. The model envisages student involvement in: the development of a critical (self-)awareness; the design of an apt iterative, collaborative loop; and the interaction with, evaluation of, and integration of GenAI tools and output. This is a model that can be applied in curriculum design and pedagogy in L2 writing classrooms in order to support critical, ethical, and productive use of AI in DMC.
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
