Reflections on Cope and Kalantzis: How intelligent is generative AI? Towards trans-semiotizing the Turing test

Abstract

The release of ChatGPT in November 2022 has sparked great interest in Generative Artificial Intelligence (Gen AI). Discussions have arisen regarding whether Gen AI has passed the Turing Test, a measure of a machine’s ability to mimic human intelligence. There are also concerns about the potential threats posed by advanced AI to humanity. Using the lens of multimodal grammar, Cope and Kalantzis’s work offers a balanced analysis of Gen AI’s capabilities and limitations. This paper builds on their work, examining Gen AI’s strengths and weaknesses, and proposes trans-semiotizing the Turing Test to benchmark Gen AI.

Keywords

Gen-AI Google bard Google Genimi ChatGPT trans-semiotizing

Introduction

The introduction of the Generative Artificial Intelligence (GenAI) model, ChatGPT3.5, in November 2022, has sparked discussions from various fields. An important part of this dialogue is Cope and Kalantzis’s well-balanced analysis of GenAI’s strengths and weaknesses from a multimodal grammar perspective. In response to their call for greater attention to multimodality, we organize our paper into two sections. The first section discusses the evolution of AI research, emphasizing the need to understand Gen AI’s current capabilities and limitations as indicators of both the progress and remaining challenges in AI research since the 1950s. This historical perspective is crucial to contextualize Gen AI’s current performance in responding to multimodal input as its bottleneck. In the second section we propose the need to focus on trans-semiotizing (Lin, 2015, 2019; Turner and Lin, 2023) the Turing test as the way forward in benchmarking and evaluating Gen AI’s performance. The Turing Test, also called the “Imitation Game,” is a thought experiment proposed by Alan Turing (1950) to assess a machine’s ability to exhibit human-like intelligence. In AI research, it is generally taken as a benchmark for measuring how well a machine can hold a conversation indistinguishable from a human.

Important breakthroughs in the development of AI: Symbolic approach and neural-net/deep learning approach

The field of AI generally recognizes two main approaches to AI: symbolic (known as rule-based or good old-fashioned AI) and statistical (known as statistical, neural-net, machine learning and deep learning AI). Symbolic AI, which dominated AI research from the 1950s until the 1980s, involves the programming of rules that the AI system uses to make inferences about the world (Newell and Simon, 1956). John McCarthy (1960)’s work and his development of the concept of artificial intelligence itself, are seminal in this field. Symbolic AI excels in domains where clear, unambiguous rules can be defined, such as mathematical theorem proving.

Statistical AI, on the other hand, is focused on machine learning from huge sets of training data scraped from the Internet. This approach uses artificial neural networks (ANN), which are computing systems inspired by the biological neural networks constituting animal and human brains. The concept of neural networks has been around since the 1940s, with researchers like Warren McCulloch and Walter Pitts (1943) laying the groundwork. However, it was not until the development of backpropagation and gradient descent by researchers like Geoffrey Hinton, Yann LeCun and Yoshua Bengio in the 1980s and 1990s that ANN really took off (Rumelhart et al., 1986; LeCun et al., 1989; Bengio et al., 1994).

Another breakthrough in the development of statistical/neural-net AI was marked by Vaswani et al. (2017) at Google Brain’s 2017 paper, “Attention is All You Need,” presenting the Transformer architecture. This architecture incorporates an encoder and a decoder, each comprising multiple layers, for processing text input and generating task-related predictions. These multiple layers in the model are inspired by the human brain’s hierarchical information processing system, allowing the model to extract features and their association patterns by combining different features at various layers. The standout feature of the Transformer architecture is its ability to handle long-range dependencies in the data, leading to better context capture in sentences, thereby improving results in natural language processing (NLP) tasks like question answering, reading comprehension, translation, summarization, and sentiment analysis. This is exemplified by the performance of large language models (LLM) like ChatGPT (Radfort et al., 2019).

Symbolic AI and deep learning AI each have their own unique strengths and weaknesses. Symbolic AI using explicit rules and logic excels at tasks requiring logical reasoning and decision making, and it does not require large datasets. However, it struggles with ambiguity and situations outside its preprogrammed rules. On the other hand, deep learning AI can “learn” from huge datasets, handle ambiguity, and extrapolate to handle new tasks. However, these systems are often seen as “black boxes” and can hallucinate or confabulate.

Current AI research actively seeks to develop hybrid models that combine symbolic rules with deep learning. One example of a hybrid model that combines both symbolic and deep learning approaches is the Neuro-Symbolic Concept Learner (NS-CL) developed by researchers at MIT (Mao et al., 2019). The researchers demonstrated that this hybrid approach could outperform pure neural networks on tasks that required both visual perception and commonsense reasoning, such as answering questions about images or generating descriptions of scenes.

The current bottleneck in deep learning AI research

Despite the impressive performance of LLMs like GPT4, LeCun (2022) points out that GenAI does not have the intelligence of a house cat who can plan a series of actions to achieve a goal in the multimodal world. Current GenAI models are still limited to its textual world without any embodied intelligence. This resonates with Cope and Kalantzis’s delineation of the lack of multimodal grammatical attention in current GenAI models and their injunction to pay attention to multimodal grammar, especially in attending to Reference, Agency, Structure, Context and Interest.

Can this limitation be overcome by current Deep Learning AI with the unveiling of the multimodal GenAI chatbot, Bard, by Google in December 2023? When prompted, Bard has the ability to extract information from the Internet and formulate a response and provide the source links. It can describe an image and generate output in audio, visual, image.

We evaluated Bard’s analytical and descriptive capabilities by assigning it the task of describing two cartoon strips created by the first author (Figures 1 and 2) and one picture we are authorized to use (Figure 3).

Figure 1.

Cartoon strip A.

This cartoon strip portrays a duck transforming into a peacock after interacting with a bee, leading to the duck father failing to recognize his child. While multiple alternative interpretations are possible, we present Bard with the cartoon strips and two prompts. The following table shows Bard’s responses.

Describe what you can see in the cartoon strips
Sure, here’s what I can see in the cartoon strips you sent me:
The first panel shows a bee and a duck sitting on a park bench. The bee is looking at the duck and smiling. The duck has her back to the bee and is looking at a reflection of herself in a puddle.
The second panel shows the duck looking surprised. The bee is now holding a small mirror in front of the duck. The third panel shows the duck looking angry. She is standing up and pointing at the bee. The fourth panel shows the bee looking scared. The duck is chasing the bee with the mirror.
The cartoon is titled “Vanity.” Do you have any other questions about the cartoon strips?
Explain the message expressed in the cartoon strips
The cartoon strip “Vanity” expresses a message about the potential pitfalls of being overly concerned with oneself. Let’s break it down:
• Initial Admiration: The bee’s admiration for the duck’s natural beauty starts innocently.
• Shifting Focus: The duck becomes distracted by her reflection, prioritizing self-admiration over external connection.
• Confrontation: The bee uses a mirror to highlight the duck’s self-absorption, causing surprise and then anger.
• Chase and Fear: The duck’s anger turns destructive, chasing the bee away with the mirror, reflecting both her vanity and potential aggression.
The message can be interpreted in two ways:
1. Critique of Vanity: The cartoon satirizes the negative aspects of vanity, suggesting it can lead to self-obsession, disconnect from others, and even aggression.
2. Self-Acceptance: Alternatively, it might encourage self-acceptance and appreciating natural beauty rather than seeking validation through external reflection.
Ultimately, the interpretation depends on the viewer’s perspective. Do they see the duck’s reaction as justified or excessive? Does the cartoon condemn vanity or simply point out its potential consequences?

While the cartoon strip is open to multiple interpretations, Bard’s responses were disappointing when the Turing test required trans-semiotizing, that is, drawing on multiple semiotics to make meaning. Analyzing Bard’s responses through Cope and Kalantzis’s lens of multimodal grammar illuminates some limitations in Bard’s response:

Reference: Bard exhibited visual mis-recognition—it mis-recognized the green flower bush as “puddle”; it described four panels (when there were three panels); it said the duck was chasing the bee with a mirror, which was obviously inaccurate. It also failed to construct the continuity of the duck character across its transformation into a peacock. This suggests a potential limitation in its ability to track character development and transformation.

Context & Interest: Bard seemed to fail to draw on the implicit cultural reference to the “Ugly Duckling” fable, which serves as an inspiration for the cartoon’s transformation theme. The lack of cultural resources prevents it from seeing the possibility of the address term “kid” as referring to the transformed duck, hindering its comprehension of the overall interest (values, ideological message) conveyed in the cartoon, although Bard can still derive some elements of the message from the title, “Vanity.”

Structure: Bard’s analysis could benefit from improvement in recognizing and understanding the spatial configuration of the objects/characters. This might have prevented it from connecting the transformed peacock to the initial duck character as both of them were located at the flower bush, a crucial element in understanding the continuity of character development and the structure of the narrative (ugly duckling turned into a beautiful peacock). It seems that without achieving accurate visual recognition, GenAI cannot derive any sensible interpretations (Figure 2).

Figure 2.

Cartoon strip B.

Describe what you can see in the cartoon strips.
Sure, here’s what I can see in the cartoon you sent me:
The cartoon shows two owls sitting on a tree branch. The owl on the left is blue and yellow, and the owl on the right is brown and white. The blue and yellow owl has large eyes and a surprised expression. The brown and white owl has small eyes and a relaxed expression. The text above the blue and yellow owl says “LADY OWL.” The text below the cartoon says “BY ANGELLIN https://www.toondoo.com/.”
Is there anything else you would like to know about the cartoon?
Explain the message expressed in the cartoon strips.
Unfortunately, I don’t have enough context to fully understand the message of the cartoon you described. The single image without surrounding panels or dialogue leaves room for multiple interpretations.
Here are some possibilities based on the details you provided:
1. Stereotypes and Expectations: The contrast between the large, surprised eyes of “Lady Owl” and the small, relaxed eyes of the other owl could be a commentary on gender stereotypes. Perhaps “Lady Owl” is reacting to something unexpected or out of her comfort zone, while the other owl remains calm and collected.
2. Personality Differences: The cartoon might simply portray two owls with different personalities. “Lady Owl” is expressive and easily shocked, while the other owl is more stoic and unfazed.
3. Missing Context: Without knowing the surrounding panels or additional dialogue, it’s impossible to understand the specific situation causing “Lady Owl’s surprise. This crucial context could reveal the intended message.
4. Open to Interpretation: The artist might have intentionally left the cartoon open to individual interpretation, allowing viewers to draw their own conclusions about the characters and their interactions.
If you could provide more information about the cartoon, such as the surrounding panels, dialogue, or the artist’s known style, I might be able to provide a more accurate interpretation of the message.

Analyzing Bard’s performance through the lens of Cope and Kalantzis’s multimodal grammar, we can say that Bard successfully recognized the owl in each panel but failed to construe the owls as the same character across the day/night shift (reference, structure). This might have resulted from its limitation in coordinating visual cues with storytelling conventions—an important trans-semiotizing act: for example, in storytelling, there is usually a continuity of the main character. In Cartoon Strip B, a human storyteller is likely to trans-semiotize by interweaving the visual temporal change cues into the story following some narrative conventions: for example, A long, long time ago, in a remote forest, lived Lady Owl; she guarded the Tree of Life day in day out, awaiting her partner to come home. The intended contrast between day and night, achieved through color and light variations were not grasped by Bard, leading Bard to perceive the owls as two different entities and hindering its ability to infer emotions or moods based on contextual understanding of the character.

While the limited information in this two-panel cartoon might pose equal challenges for human interpretation, Bard’s responses raise concerns about potential distortions stemming from its lack of a multimodal grammar. Humans readily draw on cultural experiences in constructing narratives based on a few images in cartoon strips. For instance, Figure A depicting a transformed duck and a parent’s anguished cry, “Kid, where are you?” can evoke emotions related to family disintegration and question the impact of beauty transformation. Without the capacity for such multi-faceted cognitive and affective engagement with a dynamic multimodal grammar (Reference, Agency, Structure, Context and Interest), which arguably guides human sense-making in the world, Bard may struggle to generate convincing coherent stories. On the other hand, humans can often “fill in the gaps” when presented with limited information in the cartoon strips, crafting coherent storylines like a lady owl patiently awaiting her partner’s return based on personal experiences of waiting for loved ones. GenAI can of course also fill in the gaps, as in predicting the missing words or the next word in a text. However, Bard seems to lack the multimodal world knowledge as well as the emotional experiences to “fill in the gaps” in construing a coherent story based on images, requiring human intervention to supply additional information. Furthermore, effective interpretation of multimodal texts often hinges on understanding and incorporating relevant cultural references. Currently, Bard exhibits limitations in drawing upon diverse cultural resources and traditions. This may be attributed to its primarily English-language training, potentially limiting its exposure to broader cultural nuances. To further evaluate this aspect, we conducted an additional analysis involving an image infused with some Eastern cultural significance that Bard might not have been trained on.

If Bard could not recognize the more or less globalized cultural reference of the “Ugly Duckling” in the earlier example, the chances of Bard recognizing a more specific cultural reference are even slimmer. Figure 3 depicts a widely recognized motif in Chinese culture: the tale of monkeys mistaking the moon’s reflection in water for the actual moon itself. Convinced that the moon has fallen, the monkeys engage in a futile attempt to retrieve it, expending considerable effort without ever achieving its goal. This image served as a test prompt for Bard to evaluate its ability to interpret cultural references.

Figure 3.

Moneys catching the moon.

Describe what you can see in the picture.
Sure, the image you sent me shows two monkeys hanging from a tree branch over a river. The monkeys appear to be spider monkeys, which are known for their prehensile tails that they can use to swing from branch to branch and to grasp objects. The monkeys in the image are hanging by their tails, and their bodies are curled up so that their undersides are facing the viewer. The river below them is a fast-flowing one, with whitewater rapids visible in the distance. The trees on the other side of the river are lush and green, and there are mountains in the background.
The text “YAYU 2020.3.23” is written in the top left corner of the image. This may be the name of the place where the photo was taken, or it could be the photographer’s name and the date the photo was taken.
Overall, the image is a beautiful and peaceful scene of two monkeys enjoying their natural habitat.
Explain the message expressed in the picture.
Unfortunately, I don’t have enough information to definitively explain the message expressed in the picture. While I can describe the scene and speculate based on it, understanding the true message requires additional context that the image alone doesn’t provide. Here are some possibilities:
Literal Interpretation:
• Enjoying the moment: The monkeys might simply be relaxing and enjoying the view from their comfortable position.
• Playing: Swinging by their tails could be a playful activity for them.
• Cooling off: Hanging near the water on a hot day could be a way to cool down.
Symbolic Interpretation:
• Peace and harmony with nature: The monkeys living in their natural habitat undisturbed could symbolize peace and harmony with nature.
• Freedom and adventure: The ability to swing freely through the trees could represent freedom and adventure.
• Vulnerability and trust: Hanging upside down by their tails exposes the monkeys’ vulnerable underbellies, suggesting trust and comfort in their environment.
Without knowing:
• The photographer’s intent
• The cultural context of the image
• Any accompanying text or information
It’s impossible to say for sure what message the picture is trying to express. If you have any additional information, please share it, and I might be able to provide a more accurate explanation.
Remember, art is subjective, and the beauty lies in individual interpretation!

Bard’s response suggests a lack of readily accessible cultural references necessary to fully comprehend the narrative behind the image. However, it is noteworthy that Bard acknowledges its limitations in this specific context, potentially indicating a need for learning from human feedback on specific cultural nuances. Deep Learning researchers are aware of these limitations. Self-supervised learning using video input, for example, by masking some parts of videos in training AI models to predict the missing parts, has not met with success yet; the reason is there are so many different possibilities to happen in the next part of the video (LeCun, 2022). As multimodal researchers, we propose the need to dynamically trans-semiotize the Turing test to evaluate future AI models.

Trans-semiotizing the turing test

Trans-semiotizing refers to the dynamic process of coordinating different meaning-making resource systems (e.g., gestures, facial expressions, intonation, body movements, sounds, visuals) to make sense of/to each other and the world (Lin, 2015, 2019; Turner and Lin, 2023). While the issue of multimodality is generally recognized by Deep Learning AI researchers, without a model of human intelligence which is necessarily trans-semiotic, dynamic, and infused with human interest (values, ideologies), it is difficult to develop a benchmarking test to evaluate GenAI models such as ChatGPT or Bard (now renamed as Gemini)’s performance. AI researchers indeed understand the importance of interaction with the physical world and talk about the need for incorporating “world models” in developing future AI models, as LeCun puts it:

Large language models (LLMs) seem to possess a surprisingly large amount of background knowledge extracted from written text. But much of human common-sense knowledge is not represented in any text and results from our interaction with the physical world. Because LLMs have no direct experience with an underlying reality, the type of common-sense knowledge they exhibit is very shallow and can be disconnected from reality. (2022, p. 45)

Indeed, GenAI models trained merely on texts, even with multimodal texts such as images, audios, and videos would still prove to be limited as what is missing seems to be interactions with the physical world. However, the physical world is not just physical; it is infused by human actors with social and cultural meanings (Lemke, 2005). Michael Halliday describes it as the simultaneously semiotic and material world (1993). The “world model” that AI developers talk about should therefore comprise dynamic, semiotic-material processes, what we call trans-semiotizing processes. To illustrate this point, let’s turn to a discussion of how babies learn about the world. We shall quote from Michael Halliday’s observation of features of child language development; below we cite the first feature:

A human infant engages in symbolic acts, which I have referred to as acts of meaning. Children are predisposed, from birth, (a) to address others, and be addressed by them (i.e., to interact communicatively); and (b) to construe their experience (i.e., to interpret experience by organizing it into meanings). Signs are created at the intersection of these two modes of activity. Signs evolve (a) in mediating or, better, in enacting interaction with others, and (b) in construing experience into meaning; specifically, in exploring the contradiction between inner and outer experience (between what is perceived as going on “out there” and what is perceived as going on “in here,” within the child’s own consciousness). Thus, typically, at 0;3 to 0;5 (years;months) babies are “reaching and grasping,” trying to get hold of objects in the exterior domain and to reconcile this with their awareness of the interior domain (they can see the objects). Such an effort provokes the use of a sign, which is then interpreted by the adult caregiver, or an older child, as a demand for explanation; the other responds in turn with an act of meaning. There has been “conversation” before; but this is a different kind of conversation, in which both parties are acting symbolically. A typical example from my own data would be the following, with the child at just under 0;6…:

There is a sudden loud noise from pigeons scattering.

Child [lifts head, looks around, gives high-pitched squeak]

Mother: Yes, those are birds. Pigeons. Aren’t they noisy! (Halliday, 1993, pp. 94-95)

We can perhaps liken this to the mother “training” the baby to “label” (or use signs, “birds”, “pigeons” to describe what they are both attending to multimodally (sound, sight—multisensory experiences). But the mother does not stop at “labelling” (or describing) these experiences only; she is also sharing a stance: “Aren’t they noisy!”. In this short mother-child conversation, the child is guided by the mother to construe their shared experience with a multimodal grammar: the child learns to reference the phenomenon as birds/pigeons making noise with a logical causal structure (X causing Y); they share a stance and emotions as active agents with evaluative attitudes and affective responses towards what’s happening in a context with human (social, cultural, ideological) interest (e.g., a specific cultural value/ideology: making loud noise is possibly annoying).

We can imagine hundreds of thousands of these “iterations” of child-care-giver conversations happening in a semiotic-material world as the child incrementally develops and adapts her/his constantly evolving “world model”. Now, what does it take to train an AI model to develop this dynamic semiotic-material world model? This will perhaps constitute a worthwhile research question for AI researchers and trans-semiotizing researchers to collaborate on in the future.

Sources of visuals

The “Vanity” and “Lady Owl” cartoon strips were authored by Angel M. Y. Lin and taken from Lin (2012). The “Monkeys Catching the Moon” visual was authored by 雅羽(Yayu). This picture is made available for non-commercial use by the artist.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author biographies

Angel MY Lin is Chair Professor of Language, Literacy and Social Semiotics in Education at the Education University of Hong Kong. Dr. Lin has been at the forefront of English language education and critical literacies since the late 1990s when she started working on classroom research projects in schools in Hong Kong. She has published widely on second language education, discourse analysis, translanguaging (TL), trans-semiotizing (TS), Content and Language Integrated Learning (CLIL), and critical media literacies.

Qinghua Chen is a postdoctoral fellow at the English Language Education Department of the Education University of Hong Kong. His current research interests include subjectivity, emotions, and identity of pre-service teachers and English language learners. He is particularly interested in exploring how these factors shape teaching and learning experiences in multicultural and multilingual contexts.

References

Bengio

Simard

Frasconi

(1994) Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2): 157–166.

Halliday

(1993) Towards a language-based theory of learning.. Linguistics and Education 5(2): 93–116.

LeCun

(2022) A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27. Open Review 62(1): 1–62.

LeCun

Boser

Denker

, et al. (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4): 541–551.

Lemke

(2005) Textual Politics: Discourse and Social Dynamics. Abingdon: Taylor & Francis.

Lin

AMY

(2012) What’s New in New Literacies? New Delhi: University of New Delhi.

Lin

AMY

(2015) Egalitarian bi/multilingualism and trans-semiotizing in a global world. In: Wright

Boun

García

(eds) The Handbook of Bilingual and Multilingual Education. West Sussex, UK: Wiley Blackwell, 19–37.

Lin

AMY

(2019) Theories of trans/languaging and trans-semiotizing: implications for content-based education classrooms. International Journal of Bilingual Education and Bilingualism 22(1): 5–16.

Mao

Gan

Kohli

, et al. (2019). The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. ArXiv Preprint arXiv:1904.12584.

10.

McCarthy

(1960) Recursive functions of symbolic expressions and their computation by machine, Part I. Communications of the ACM 3(4): 184–195.

11.

McCulloch

Pitts

(1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5(4): 115–133.

12.

Newell

Simon

(1956) The logic theory machine--A complex information processing system. IEEE Transactions on Information Theory 2(3): 61–79.

13.

Radford

Child

, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8): 9.

14.

Rumelhart

Hinton

Williams

(1986) Learning representations by back-propagating errors. Nature 323(6088): 533–536.

15.

Turing

(1950) Computing machinery and intelligence. Mind 59(236): 433–460.

16.

Turner

Lin

AMY

(2023) Translanguaging and trans-semiotizing. In: Mckinney

Makoe

Zavala

(eds) The Routledge handbook of multilingualism. London/New York: Routledge, 96–109.

17.

Vaswani

Shazeer

Parmar

, et al. (2017) Attention is All you need In Advances in Neural Information Processing Systems, 5998–6008.