Abstract
The release of ChatGPT in November 2022 has sparked great interest in Generative Artificial Intelligence (Gen AI). Discussions have arisen regarding whether Gen AI has passed the Turing Test, a measure of a machine’s ability to mimic human intelligence. There are also concerns about the potential threats posed by advanced AI to humanity. Using the lens of multimodal grammar, Cope and Kalantzis’s work offers a balanced analysis of Gen AI’s capabilities and limitations. This paper builds on their work, examining Gen AI’s strengths and weaknesses, and proposes trans-semiotizing the Turing Test to benchmark Gen AI.
Introduction
The introduction of the Generative Artificial Intelligence (GenAI) model, ChatGPT3.5, in November 2022, has sparked discussions from various fields. An important part of this dialogue is Cope and Kalantzis’s well-balanced analysis of GenAI’s strengths and weaknesses from a multimodal grammar perspective. In response to their call for greater attention to multimodality, we organize our paper into two sections. The first section discusses the evolution of AI research, emphasizing the need to understand Gen AI’s current capabilities and limitations as indicators of both the progress and remaining challenges in AI research since the 1950s. This historical perspective is crucial to contextualize Gen AI’s current performance in responding to multimodal input as its bottleneck. In the second section we propose the need to focus on trans-semiotizing (Lin, 2015, 2019; Turner and Lin, 2023) the Turing test as the way forward in benchmarking and evaluating Gen AI’s performance. The Turing Test, also called the “Imitation Game,” is a thought experiment proposed by Alan Turing (1950) to assess a machine’s ability to exhibit human-like intelligence. In AI research, it is generally taken as a benchmark for measuring how well a machine can hold a conversation indistinguishable from a human.
Important breakthroughs in the development of AI: Symbolic approach and neural-net/deep learning approach
The field of AI generally recognizes two main approaches to AI: symbolic (known as rule-based or good old-fashioned AI) and statistical (known as statistical, neural-net, machine learning and deep learning AI). Symbolic AI, which dominated AI research from the 1950s until the 1980s, involves the programming of rules that the AI system uses to make inferences about the world (Newell and Simon, 1956). John McCarthy (1960)’s work and his development of the concept of artificial intelligence itself, are seminal in this field. Symbolic AI excels in domains where clear, unambiguous rules can be defined, such as mathematical theorem proving.
Statistical AI, on the other hand, is focused on machine learning from huge sets of training data scraped from the Internet. This approach uses artificial neural networks (ANN), which are computing systems inspired by the biological neural networks constituting animal and human brains. The concept of neural networks has been around since the 1940s, with researchers like Warren McCulloch and Walter Pitts (1943) laying the groundwork. However, it was not until the development of backpropagation and gradient descent by researchers like Geoffrey Hinton, Yann LeCun and Yoshua Bengio in the 1980s and 1990s that ANN really took off (Rumelhart et al., 1986; LeCun et al., 1989; Bengio et al., 1994).
Another breakthrough in the development of statistical/neural-net AI was marked by Vaswani et al. (2017) at Google Brain’s 2017 paper, “Attention is All You Need,” presenting the Transformer architecture. This architecture incorporates an encoder and a decoder, each comprising multiple layers, for processing text input and generating task-related predictions. These multiple layers in the model are inspired by the human brain’s hierarchical information processing system, allowing the model to extract features and their association patterns by combining different features at various layers. The standout feature of the Transformer architecture is its ability to handle long-range dependencies in the data, leading to better context capture in sentences, thereby improving results in natural language processing (NLP) tasks like question answering, reading comprehension, translation, summarization, and sentiment analysis. This is exemplified by the performance of large language models (LLM) like ChatGPT (Radfort et al., 2019).
Symbolic AI and deep learning AI each have their own unique strengths and weaknesses. Symbolic AI using explicit rules and logic excels at tasks requiring logical reasoning and decision making, and it does not require large datasets. However, it struggles with ambiguity and situations outside its preprogrammed rules. On the other hand, deep learning AI can “learn” from huge datasets, handle ambiguity, and extrapolate to handle new tasks. However, these systems are often seen as “black boxes” and can hallucinate or confabulate.
Current AI research actively seeks to develop hybrid models that combine symbolic rules with deep learning. One example of a hybrid model that combines both symbolic and deep learning approaches is the Neuro-Symbolic Concept Learner (NS-CL) developed by researchers at MIT (Mao et al., 2019). The researchers demonstrated that this hybrid approach could outperform pure neural networks on tasks that required both visual perception and commonsense reasoning, such as answering questions about images or generating descriptions of scenes.
The current bottleneck in deep learning AI research
Despite the impressive performance of LLMs like GPT4, LeCun (2022) points out that GenAI does not have the intelligence of a house cat who can plan a series of actions to achieve a goal in the multimodal world. Current GenAI models are still limited to its textual world without any embodied intelligence. This resonates with Cope and Kalantzis’s delineation of the lack of multimodal grammatical attention in current GenAI models and their injunction to pay attention to multimodal grammar, especially in attending to Reference, Agency, Structure, Context and Interest.
Can this limitation be overcome by current Deep Learning AI with the unveiling of the multimodal GenAI chatbot, Bard, by Google in December 2023? When prompted, Bard has the ability to extract information from the Internet and formulate a response and provide the source links. It can describe an image and generate output in audio, visual, image.
We evaluated Bard’s analytical and descriptive capabilities by assigning it the task of describing two cartoon strips created by the first author (Figures 1 and 2) and one picture we are authorized to use (Figure 3). Cartoon strip A.
While the cartoon strip is open to multiple interpretations, Bard’s responses were disappointing when the Turing test required trans-semiotizing, that is, drawing on multiple semiotics to make meaning. Analyzing Bard’s responses through Cope and Kalantzis’s lens of multimodal grammar illuminates some limitations in Bard’s response:
Reference: Bard exhibited visual mis-recognition—it mis-recognized the green flower bush as “puddle”; it described four panels (when there were three panels); it said the duck was chasing the bee with a mirror, which was obviously inaccurate. It also failed to construct the continuity of the duck character across its transformation into a peacock. This suggests a potential limitation in its ability to track character development and transformation.
Context & Interest: Bard seemed to fail to draw on the implicit cultural reference to the “Ugly Duckling” fable, which serves as an inspiration for the cartoon’s transformation theme. The lack of cultural resources prevents it from seeing the possibility of the address term “kid” as referring to the transformed duck, hindering its comprehension of the overall interest (values, ideological message) conveyed in the cartoon, although Bard can still derive some elements of the message from the title, “Vanity.”
Structure: Bard’s analysis could benefit from improvement in recognizing and understanding the spatial configuration of the objects/characters. This might have prevented it from connecting the transformed peacock to the initial duck character as both of them were located at the flower bush, a crucial element in understanding the continuity of character development and the structure of the narrative (ugly duckling turned into a beautiful peacock). It seems that without achieving accurate visual recognition, GenAI cannot derive any sensible interpretations (Figure 2). Cartoon strip B.
Analyzing Bard’s performance through the lens of Cope and Kalantzis’s multimodal grammar, we can say that Bard successfully recognized the owl in each panel but failed to construe the owls as the same character across the day/night shift (reference, structure). This might have resulted from its limitation in coordinating visual cues with storytelling conventions—an important trans-semiotizing act: for example, in storytelling, there is usually a continuity of the main character. In Cartoon Strip B, a human storyteller is likely to trans-semiotize by interweaving the visual temporal change cues into the story following some narrative conventions: for example, A long, long time ago, in a remote forest, lived Lady Owl; she guarded the Tree of Life day in day out, awaiting her partner to come home. The intended contrast between day and night, achieved through color and light variations were not grasped by Bard, leading Bard to perceive the owls as two different entities and hindering its ability to infer emotions or moods based on contextual understanding of the character.
While the limited information in this two-panel cartoon might pose equal challenges for human interpretation, Bard’s responses raise concerns about potential distortions stemming from its lack of a multimodal grammar. Humans readily draw on cultural experiences in constructing narratives based on a few images in cartoon strips. For instance, Figure A depicting a transformed duck and a parent’s anguished cry, “Kid, where are you?” can evoke emotions related to family disintegration and question the impact of beauty transformation. Without the capacity for such multi-faceted cognitive and affective engagement with a dynamic multimodal grammar (Reference, Agency, Structure, Context and Interest), which arguably guides human sense-making in the world, Bard may struggle to generate convincing coherent stories. On the other hand, humans can often “fill in the gaps” when presented with limited information in the cartoon strips, crafting coherent storylines like a lady owl patiently awaiting her partner’s return based on personal experiences of waiting for loved ones. GenAI can of course also fill in the gaps, as in predicting the missing words or the next word in a text. However, Bard seems to lack the multimodal world knowledge as well as the emotional experiences to “fill in the gaps” in construing a coherent story based on images, requiring human intervention to supply additional information. Furthermore, effective interpretation of multimodal texts often hinges on understanding and incorporating relevant cultural references. Currently, Bard exhibits limitations in drawing upon diverse cultural resources and traditions. This may be attributed to its primarily English-language training, potentially limiting its exposure to broader cultural nuances. To further evaluate this aspect, we conducted an additional analysis involving an image infused with some Eastern cultural significance that Bard might not have been trained on.
If Bard could not recognize the more or less globalized cultural reference of the “Ugly Duckling” in the earlier example, the chances of Bard recognizing a more specific cultural reference are even slimmer. Figure 3 depicts a widely recognized motif in Chinese culture: the tale of monkeys mistaking the moon’s reflection in water for the actual moon itself. Convinced that the moon has fallen, the monkeys engage in a futile attempt to retrieve it, expending considerable effort without ever achieving its goal. This image served as a test prompt for Bard to evaluate its ability to interpret cultural references. Moneys catching the moon.
Bard’s response suggests a lack of readily accessible cultural references necessary to fully comprehend the narrative behind the image. However, it is noteworthy that Bard acknowledges its limitations in this specific context, potentially indicating a need for learning from human feedback on specific cultural nuances. Deep Learning researchers are aware of these limitations. Self-supervised learning using video input, for example, by masking some parts of videos in training AI models to predict the missing parts, has not met with success yet; the reason is there are so many different possibilities to happen in the next part of the video (LeCun, 2022). As multimodal researchers, we propose the need to dynamically trans-semiotize the Turing test to evaluate future AI models.
Trans-semiotizing the turing test
Trans-semiotizing refers to the dynamic process of coordinating different meaning-making resource systems (e.g., gestures, facial expressions, intonation, body movements, sounds, visuals) to make sense of/to each other and the world (Lin, 2015, 2019; Turner and Lin, 2023). While the issue of multimodality is generally recognized by Deep Learning AI researchers, without a model of human intelligence which is necessarily trans-semiotic, dynamic, and infused with human interest (values, ideologies), it is difficult to develop a benchmarking test to evaluate GenAI models such as ChatGPT or Bard (now renamed as Gemini)’s performance. AI researchers indeed understand the importance of interaction with the physical world and talk about the need for incorporating “world models” in developing future AI models, as LeCun puts it: Large language models (LLMs) seem to possess a surprisingly large amount of background knowledge extracted from written text. But much of human common-sense knowledge is not represented in any text and results from our interaction with the physical world. Because LLMs have no direct experience with an underlying reality, the type of common-sense knowledge they exhibit is very shallow and can be disconnected from reality. (2022, p. 45)
Indeed, GenAI models trained merely on texts, even with multimodal texts such as images, audios, and videos would still prove to be limited as what is missing seems to be interactions with the physical world. However, the physical world is not just physical; it is infused by human actors with social and cultural meanings (Lemke, 2005). Michael Halliday describes it as the simultaneously semiotic and material world (1993). The “world model” that AI developers talk about should therefore comprise dynamic, semiotic-material processes, what we call trans-semiotizing processes. To illustrate this point, let’s turn to a discussion of how babies learn about the world. We shall quote from Michael Halliday’s observation of features of child language development; below we cite the first feature:
A human infant engages in symbolic acts, which I have referred to as acts of meaning. Children are predisposed, from birth, (a) to address others, and be addressed by them (i.e., to interact communicatively); and (b) to construe their experience (i.e., to interpret experience by organizing it into meanings). Signs are created at the intersection of these two modes of activity. Signs evolve (a) in mediating or, better, in enacting interaction with others, and (b) in construing experience into meaning; specifically, in exploring the contradiction between inner and outer experience (between what is perceived as going on “out there” and what is perceived as going on “in here,” within the child’s own consciousness). Thus, typically, at 0;3 to 0;5 (years;months) babies are “reaching and grasping,” trying to get hold of objects in the exterior domain and to reconcile this with their awareness of the interior domain (they can see the objects). Such an effort provokes the use of a sign, which is then interpreted by the adult caregiver, or an older child, as a demand for explanation; the other responds in turn with an act of meaning. There has been “conversation” before; but this is a different kind of conversation, in which both parties are acting symbolically. A typical example from my own data would be the following, with the child at just under 0;6…: There is a sudden loud noise from pigeons scattering. Child [lifts head, looks around, gives high-pitched squeak] Mother: Yes, those are birds. Pigeons. Aren’t they noisy! (Halliday, 1993, pp. 94-95)
We can perhaps liken this to the mother “training” the baby to “label” (or use signs, “birds”, “pigeons” to describe what they are both attending to multimodally (sound, sight—multisensory experiences). But the mother does not stop at “labelling” (or describing) these experiences only; she is also sharing a stance: “Aren’t they noisy!”. In this short mother-child conversation, the child is guided by the mother to construe their shared experience with a multimodal grammar: the child learns to reference the phenomenon as birds/pigeons making noise with a logical causal structure (X causing Y); they share a stance and emotions as active agents with evaluative attitudes and affective responses towards what’s happening in a context with human (social, cultural, ideological) interest (e.g., a specific cultural value/ideology: making loud noise is possibly annoying).
We can imagine hundreds of thousands of these “iterations” of child-care-giver conversations happening in a semiotic-material world as the child incrementally develops and adapts her/his constantly evolving “world model”. Now, what does it take to train an AI model to develop this dynamic semiotic-material world model? This will perhaps constitute a worthwhile research question for AI researchers and trans-semiotizing researchers to collaborate on in the future.
Sources of visuals
The “Vanity” and “Lady Owl” cartoon strips were authored by Angel M. Y. Lin and taken from Lin (2012). The “Monkeys Catching the Moon” visual was authored by 雅羽(Yayu). This picture is made available for non-commercial use by the artist.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
