Music teachers’ labeling accuracy and quality ratings of lesson plans by artificial intelligence (AI) and humans

Abstract

This study explored the potential of artificial intelligence (ChatGPT) to generate lesson plans for music classes that were indistinguishable from music lesson plans created by humans, with current music teachers as assessors. Fifty-six assessors made a total of 410 ratings across eight lesson plans, assigning a quality score to each lesson plan and labeling if they believed each lesson plan was created by a human or generated by AI. Despite the human-made lesson plans being rated higher in quality as a group (p < .01, d = 0.44), assessors were unable to accurately label if a lesson plan was created by a human or generated by AI (55% accurate overall). Labeling accuracy was positively predicted by quality scores on human-made lesson plans and previous personal use of AI, while accuracy was negatively predicted by quality scores on AI-generated lesson plans and perception of how useful AI will be in the future. Open-ended responses from 42 teachers suggested assessors used three factors when making evaluations: specific details, evidence of classroom knowledge, and wording. Implications provide suggestions for how music teachers can use prompt engineering with a GPT model to create a virtual assistant or Intelligent Tutor System (ITS) for their classroom.

Keywords

Artificial intelligence ChatGPT content generation education technology intelligent tutor system virtual assistant

Introduction

The information technology company OpenAI launched a consumer version of their Artificial Intelligence Technology (AIT), known as ChatGPT, in November 2022. It demonstrated a massive leap in the ability of Artificial Intelligence (AI) to interact with humans while appearing human-like. Built on a Generative Pretrained Transformer (GPT) model, ChatGPT rapidly utilizes hundreds of millions of pretrained decision points for evaluating, responding to, and generating human language, including “grammar, syntax, semantics, and some degree of world knowledge” (van den Berg & du Plessis, 2023, p. 2). An exciting component of GPT models is that they can be trained for specific applications, which leads to Intelligent Tutor Systems (ITS) for specific educational contexts (Clancey & Hoffman, 2021). However, as more sophistication is trained into GPT models and the results are increasingly human-like, educators report being concerned about an increase in plagiarism and cheating among students (Fütterer et al., 2023). While there is potential conflict when adopting AI in educational settings, forward-thinking scholars have already begun theorizing on the usefulness of AIT in the field of music education (Rohwer, 2023). It is likely GPT models and future expansions of AIT will only increase in usefulness for music educators as time goes on.

Artificial intelligence to generate musical products

A long-held goal for developers of AI has been to create a machine that is imperceptible to humans as being a machine. Various adaptations and interpretations of the Turing imitation game (Turing, 1950) are commonly used to test if this goal has been achieved. The original test entails a human “interrogator” (labeled an “assessor” in the present study) will interact with a second party through text. The second party will be either a machine pretending to be a human or an actual human. The interrogator must decide after 5 minutes if they were interacting with a machine or a human. For a machine to pass the test, the interrogators should be correct in labeling the interaction as with a human or with a machine no more than 70% of the time.

Perhaps to appear more than a hyper-speed data retrieval system, AI scientists have explored the notion of creativity to demonstrate the robustness of content-generation beyond text-based applications. Within music, it has been suggested a Turing-like test could be “altered by making aesthetic artifacts, music or other creative forms, the medium of the test” (Ariza, 2009, p. 55). Boden (2010) recommends two criterions for an artistic creation to pass the test: (1) the artifact must be indistinguishable from one made by a human and (2) the artifact must be seen as having as much esthetic value. A lesson plan for music education is uniquely a text-based musical product which could be evaluated using similar criteria. Boden notes successful attempts to pass this test, such as Cope’s EMMY program (Cope, 2005; Cope & Hofstadter, 2001). Other examples include the album Hello World by SKYGGE, a human and AI collaboration with Sony’s Flow Machines (Avdeeff, 2019), and the Dartmouth-based tests of AI creativity, which has produced AI dance music deejays (Chen, 2020). Additional research has shown AI can be used to generate piano music from algorithms that is indistinguishable from human-made music (Schubert et al., 2017) and that algorithms can be used to manipulate sound bites of acoustic instruments to make virtual instruments and recordings that are indistinguishable from live recordings of real musicians (Ruiz et al., 2020). While these examples are not specifically GPT models, they do underscore the power of computer assistance to generate human-like musical products.

Artificial intelligence to assist with music education

Teachers are expected to understand lesson planning for group instruction and differentiated instruction for individuals as part of their professional competencies. Research has begun investigating the usefulness of AI in music education for assisting with individual differentiation through the development of an Intelligent Tutor System (ITS). It has been suggested an ITS is potentially more effective in helping students than traditional teacher-to-student tutoring (Clancey & Hoffman, 2021) and can increase interest in certain school subjects (Jauhiainen & Guerra, 2023). Empirical research has shown an ITS can lead to growth for individual piano students (Li & Wang, 2023) and is a cost-effective solution to providing high quality music education in rural settings (Zhang & Song, 2023), presuming the rural setting has access to high-speed internet and reliable electricity. Detailed frameworks have been proposed for developing AI for preschool music classrooms (Lin & Ding, 2020) and self-paced, online music learning systems (Yu et al., 2023). However, these systems are not standalone solutions to music learning. It has been suggested teachers are still needed to assist students in using AI tutors for music (Della Ventura, 2019) and that a course on Music and AI would be a worthy component of a music teacher training program to help prepare teachers to use AI to their advantage (Yuan, 2020). In theory, an ITS for music teachers could act as a virtual assistant, employed strategically and creatively to serve them and their students. A component not studied has been the ability of AI to assist music teachers in developing lesson plans for group instruction.

Present study

The purpose of the present study was to explore the potential of AI to generate lesson plans for group music classes that are indistinguishable from lesson plans created by humans with current music teachers as assessors. This was done by having music teachers evaluate eight lesson plans, four created by professional music education organizations and four generated by AI (ChatGPT). Evaluations were based on the two criterions suggested by Boden (2010). Assessors assigned a quality score to each lesson plan and labeled if they thought the lesson was created by human music teachers or generated by AI. Prior research has shown music teachers are reliable expert judges on the quality of musical products using a consensual assessment technique (Cooper, 2016; Kang & Yoo, 2021). The secondary purpose was to learn about the thought process of the assessors when making their judgements. This was done through open-ended qualitative response.

Primary research questions

To what extent will music teachers accurately label if a lesson plan was created by humans or generated by AI?

What factors predict the ability of a music teacher to correctly label if a lesson plan was created by humans or generated by AI?

To what extent will ratings of quality differ between lesson plans created by humans or generated by AI and why?

Secondary research questions

4. What can be discovered about the thought used by music teachers when rating the overall quality of a lesson plan and labeling if it was created by humans or generated by AI?

5. How can the findings from this study be used to assist music teachers in using AI as a virtual assistant or an ITS?

Method

Participants

The participants were 56 music teachers in the United States (Table 1). Participants were recruited by placing an advertisement in the newsletter of a music education professional development business. The recruitment advertisement specified an expectation of 10 to 20 minutes completion time. The average time spent from consent to finish was 14 minutes and 33 seconds for those that completed 100% of the study, suggesting participants interacted with text-based input for at least 5 minutes (Turing, 1950). All participants were offered a free course on the site whether they completed the survey or not. A total of 49 teachers completed every rating, with an additional seven teachers supplying at least two ratings of lesson plans but not completing the demographics section of the study, for a total of 410 ratings. The seven teachers making partial ratings had their data included with missing data being excluded pairwise. Participants were a majority general music teachers (57.4%) with over 10 years of teaching experience (59.2%). A little over half the sample had used AI for professional use (53%) and personal use (51%).

Table 1.

Population characteristics.

	Number of responses	Percentage of responses (%)
Years teaching (years)	49
0–3	5	10.2
4–9	15	30.6
10+	29	59.2
Assignment	47
Choral	14	29.8
General^a	27	57.4
Instrumental	6	12.8
Professional use of AI?	49
Yes	26	53
No	23	47
Personal use of AI?	49
Yes	25	51
No	24	49

General music (including theory, composition, and technology).

Procedure for data collection

Participants first reviewed an informed consent document approved by the Florida International University Institutional Review Board (IRB) stating the nature of the study, including that they would be reviewing lesson plans created by human music teachers and generated by AI, so that there was no intentional deception. After agreeing to be in the study, participants were shown one lesson plan at a time and asked to rate the overall quality of the lesson plan (1 = Poor, 10 = Excellent), followed by if they believed the lesson plan was created by a human music teacher or generated by AI. Randomization was utilized to guard against order effects and to present the eight lesson plans in a random order to every assessor, with 40,320 possible display combinations (8! = 40,320). Participants could hit a back button to change their responses as they progressed.

After completing the ratings for eight lesson plans, participants were taken to a final page with one open-ended question and six potential mediator questions. The open-ended question stated, “(optional) Feel free to explain y thinking for why you felt lessons plans were created by music teachers or by artificial intelligence (AI).” 42 of the 49 teachers that completed every rating chose to supply qualitative data through the open-ended response. The additional mediator questions included (1) years teaching, (2) teaching assignment, if they had used AI for (3) professional or (4) personal use, and how useful do they think is AI for music teachers (5) now and (6) in the future (1 = not at all useful, 10 = extremely useful).

Description of lesson plans and content-generation process

Four lesson plans (labeled Human #1–4) were taken from publicly posted websites by two professional music education organizations. The headquarters for both organizations were in the United States. The names of the contributing authors and their parent organizations were removed from the text to protect anonymity. Human #1 and Human #2 were taken from the same source and included a middle school general music lesson (#1) and a high school choir lesson (#2). Human #3 and Human #4 were taken from the same source and included a beginning band lesson (#3) and elementary general music lesson (#4). Human lesson plans all included a brief title, essential question(s), references to national standards, explicit learning objectives, a list of materials needed, a detailed procedure, a detailed assessment, and differentiation or extension ideas.

Four additional lesson plans (AI #1–4) were created using the free version of ChatGPT (3.5). The four lesson plans were created to mimic the real lesson plans, such that the topics were middle school general music (AI #1), high school choir (AI #2), beginning band (AI #3), and elementary general music (AI #4). Two different input methods were used. AI #1 and AI #2 were created by copying and pasting Human #1 and Human #2 into the chat box prompt and requesting to “Please create a similar lesson plan.” This represents a minimal degree of pre-training an existing GPT model. AI #3 and AI #4 were created without any input, using only a prompt based on the focus of the Human #3 and Human #4 lesson plans:

AI #3: “Write a lesson plan for a beginning band class. It is 40 minutes long. They should be rehearsing a specific song for beginning band students. The standard should be about learning to perform individual parts together.

AI #4: “Write a lesson plan for 3^rd grade music class. It is 30 minutes long and there are 25 students. They should sing and play an instrument in this lesson. The standard should be about performing with a steady beat.”

The four AI lesson plans all included a brief title, learning objectives, a detailed procedure, a detailed assessment, and adaptations or extension ideas. Only AI #1 generated an essential question and no AI lesson plan contained references to national standards for music. AI lesson plans did not reference specific songs or standards even when instructed to do so. These were left unchanged to express a potential limitation in the current ability of untrained content-generating AI to create readymade lesson plans for music teachers without additional human intervention.

Results

All data were analyzed using SPSS 29.0.1.1. The reliability of responses was evaluated using Cronbach’s alpha. The quality ratings of lesson plans were reliable (α = .78), and all skewness and kurtosis values fell within absolute 2.0, suggesting normally distributed data.

Lesson plan labeling accuracy and quality ratings

The accuracy of music teachers to label if a lesson plan was created by humans or generated by AI, and the quality score assigned to each lesson plan, is displayed in Table 2. Overall, music teachers were 55% accurate when labeling if a lesson plan was created by human music teachers or generated by AI. A 50% success rate would reflect the expected outcome for guessing between two choices. The success rate was not statistically different from 50% overall (p > .05) and on seven of eight individual lesson plans, with only Human #4 labeled correctly by assessors, t(50) = 6.67, p < .001. Of the 49 assessors to rate all eight lesson plans, zero correctly labeled all eight. Only two assessors correctly labeled seven out of eight lesson plans.

Table 2.

Labeling accuracy and quality score ratings by teachers.

	N	Accuracy^b M (SD)	Quality score (out of 10) M (SD)
Grand means	410^a	55% (21%)	6.50 (1.43)
Human lessons overall	204	56% (25%)	6.79 (1.27)
Human #1	52	62%	7.35 (1.74)
Human #2	51	38%	6.39 (2.00)
Human #3	50	42%	6.15 (1.98)
Human #4	51	84%^c	7.88 (1.37)
AI lessons overall	206	56% (27%)	6.22 (1.82)
AI #1	51	59%	6.30 (2.00)
AI #2	50	59%	6.48 (1.99)
AI #3	52	56%	6.49 (1.88)
AI #4	53	55%	6.08 (2.17)

56 unique assessors. A total of 49 assessors rated all eight lesson plans. No rater got 8/8 (100%).

All accuracy ratings not significantly different than 50% (one-sample t-test) except human #4.

Significantly different from 50%, p < .001.

Paired-samples t-tests were used to look at quality scores and labeling accuracy between Human and AI lesson plans. The ability to correctly label a lesson plan did not differ by if the lesson plan was created by humans or generated by AI, t(53) = 0.248, p = 81, d = 0.03 (Table 3). Despite not being able to reliably tell if a lesson plan was created by humans or generated by AI, the quality score for the Human lesson plans was higher overall, with a medium to large effect size, t(53) = 3.21, p < .01, d = 0.44. Achieved power for this comparison was .90, suggesting a satisfactory degree of probability the results are not due to sample error.

Table 3.

Paired-samples t-tests.

	M (SD)	df	t	p	r	d
Labeling ability (human vs. AI)		53	.248	.805	.20	.03
% Correct human lessons	56% (25%)
% Correct AI lessons	56% (27%)
Quality ratings (human vs. AI)		53	3.21	.002	.70*	.44
Quality of human lessons	6.79 (1.27)
Quality of AI lessons	6.22 (1.82)
Usefulness of AI for music teachers		46	5.787	<.001	.65*	.84
Now	6.02 (2.14)
In the Future	7.43 (1.68)

p < .001 for r.

Predictors of labeling accuracy by assessors

Multiple regression was used to determine if the accuracy of correctly labeling a lesson plan as created by humans or generated by AI could be predicted by (1) years teaching, (2) teaching assignment, if assessors had used AI for (3) professional or (4) personal use, how useful assessors think AI will be for music teachers (5) now and (6) in the future, or the mean quality scores by each assessor for (7) human-made lesson plans and (8) AI-generated lesson plans. A backward stepwise regression method was used with (1) years teaching, (2) teaching assignment, (3) AI for professional use, and (5) current usefulness of AI for music teachers ultimately being removed from the equation (Models 1–4). Model 5 was significant with four predictors and a large effect size, F = 5.03, p < .01, adjusted-R² = .268, f² = 0.366 (Table 4). Multicollinearity was not present (Spearman’s rho, ρ < .70 between all variables). The two positive predictors of accuracy on the labeling task were “Yes” to AI for personal use (b = 0.139, β = .362, p < .05; “No” coded as 0, “Yes” coded as 1) and the quality score assigned to Human lesson plans (b = 0.083 β = .549, p < .01; continuous scale). The two negative predictors of accuracy on the labeling task were the rating of AI as useful to music teachers in the future (b = -0.044, β = -.379, p < .05; continuous scale) and the quality score assigned to AI lesson plans (b = -0.062, β = -.585, p < .01; continuous scale). Achieved power using adjusted-R² to calculate f² for effect size showed the results were highly reliable for this regression (power = 0.98).

Table 4.

Regression model for teacher’s ability to label lesson plans as human or AI.

	Adjusted R²	Unstandardized beta	β	p	f ²
Model 5, F = 5.029	.268			.002	0.366
Constant		.619		<.001
Predictors^a
AI quality rating		−.062	−.585	.003
Human quality rating		.083	.549	.006
Usefulness—future		−.044	−.379	.018
Personal use of AI		.139	.362	.018

Non-significant predictors: professional use of AI, years teaching, usefulness—now, teaching assignment (choral, general, and instrumental).

Open-ended responses on music teacher thought process

To assist in explaining why music teachers made their judgements, optional open-ended responses were collected with 42 assessors making comments. Many assessors included multiple reasons for their ratings. A total of 38 assessors made comments about why they felt lesson plans were generated by AI and 13 assessors made comments about why they felt lesson plans were created by humans. Responses were first analyzed through inductive coding. Three overarching themes emerged from applying axial coding to the results of the inductive coding: specific details, evidence of classroom knowledge, and wording (Table 5).

Table 5.

Emergent themes from coding of open-ended question.

Theme	Why lessons were AI	Why lessons were Human	Total
Specific details	19	6	25
Classroom knowledge	16	7	23
Wording	17	2	19

Overall, music teachers did not find the task to be easy and lacked confidence in their ratings. Several reported being unsure and that the task was difficult:

“I am not very confident in my choices” (assessor #12, 75% accurate).

“This was challenging” (assessor #40, 50% accurate).

“I had a really hard time differentiating between the music teacher and AI lessons” (assessor #41, 62.5% accurate)

“I did not find it easy. . . It was harder than I had anticipated” (assessor #18, 37.5% accurate).

“I am sure I missed many” (assessor #36, 62.5% accurate).

Specificity was mentioned several times as reasons a lesson plan was thought to be created by humans or generated by AI. This was explicitly stated through comments such as, “The lessons that were very general in the descriptions, or extremely open-ended seemed like AI to me” (assessor #3, 75% accurate), and “I feel that some activities were too vague for students to meet expectations with fidelity” (assessor #17, 75% accurate). The most common comment on specificity was looking to see if lesson plans “included more specifics such as the songs that would be used for listening” (assessor #8, 87.5% accurate), ascertaining that “the ones by teachers had specific songs listed and standards” (assessor #26, 50% accurate). These teachers accurately noticed that the Human lesson plans contained specific musical content information that was often lacking in the AI lesson plans and used this as a determining factor when assigning labels.

Various types of classroom knowledge were perceived as noticeably different between Human and AI lesson plans. One teacher noticed incorrect information about a composer in a lesson plan (assessor #11, 75% accurate) and another noticed some materials were listed but never used (assessor #12, 75% accurate). Time was a common theme, such that “there was so much in the one lesson that I am guessing an actual teacher would know better than to place too much into one lesson” (assessor #16, 50% accurate). Several teachers pointed out they felt AI lesson plans “were created by someone who does not have insight of actually living in a music classroom day in and day out” (assessor #23, 75% accurate), while lesson plans perceived to be human “included more information about why the concept was important to learn” (assessor #9, 75% accurate). This type of insider knowledge was evaluated by some teachers from the perspective of “whether or not it would make sense for these activities to be done with my students” (assessor #18, 37.5% accurate). It was important for teachers when making their ratings that they picked up on realistic sequencing and scaffolding of activities.

The final theme to emerge was related to the wording of the lesson plans. Teachers who focused on the wording appeared to make more errors. If a teacher felt uncomfortable with the “style of language” (assessor #42, 37.5% accurate), that “terminology got mixed up” (assessor #38, 37.5% accurate), or that there was “awkward wording” (assessor #14, 12.5% accurate), they tended to be less accurate. Some teachers felt the language used was “more formal. . .than what real people would use to write lesson plans” (assessor #19, 50% accurate) or “rigid and wordy” (assessor #22, 50% accurate). One teacher reported that they “found a few typos and associated that with music teachers” (assessor #40, 50% accurate). While many teachers appeared to have been influenced by the wording of lesson plans, looking for nuance in the grammatical wording and phrasing of lesson plans did not appear to help teachers be more accurate.

Discussion

Conclusions

Music teachers could not accurately label if a lesson plan was created by humans or generated by AI despite AI lesson plans being rated lower in quality than human-made lesson plans. The quality scores of Human #1 and Human #4 likely contributed to the significant difference in overall quality between groups. The ability of music teachers to correctly label each lesson plan generated by AI was never statistically different than 50%, far below Turing’s suggested threshold of 70%. Based on labeling accuracy and qualitative responses, it appears there was a lot of guessing by assessors. The open-ended responses would indicate assessors were most confident a lesson plan was made by AI when they noticed a lack of specificity or if the musical or pedagogical knowledge didn’t meet the needs they expected in their own classrooms. A limitation of the results is that there is not an explicit measurement of confidence attached to each assessor, although qualitative responses indicated confidence to be accurate was not high as a group. No qualitative response indicated an assessor was confident in their ability to be accurate on the labeling task.

The regression equation showed the stronger a music teacher felt AI would be useful in the future for music teachers, the worse they were at telling the difference between the source of the lesson plans. The other negative predictor of labeling accuracy was the quality score assigned to lesson plans generated by AI. When assessors were more accurate on the labeling task, they rated AI lesson plans lower in quality. This may suggest that music teachers intentionally rated known AI products worse in quality. Boden (2010) noted that people withdraw value from creative products that are known to be generated by AI. This reflects a phenomenon known as the Sociology of Expectations and how workers have “Positive and Negative fictional expectations” for the role of AI in their future workplaces (Vicsek, 2021, p. 851). Those with Negative Effects ideal types have “high expectations for technological development; fast diffusion of the technology, and as a consequence, predict great change” and “expect huge changes in the labor market, with robots taking over the human workforce” (Ford, 2015; Frey & Osborne, 2017; Harari, 2018; Tegmark, 2017, as cited in Vicsek, 2021, p. 844). The most obviously human-made lesson plans based on labeling accuracy (Human #1 and Human #4) were rated higher than the other two Human lesson plans. Human #2 and Human #3 were labeled accurately less than 50% of the time, therefore perceived as generated by AI by the assessors, with quality scores very similar to the actual AI lesson plans. It could be that those with overly optimistic future expectations are fearful of becoming obsolete and wish to assign value to human products to safeguard their own status as essential to the workforce. This relationship has not been explicitly researched with teachers.

Exposure did seem to affect success rate, such that those who had previous experience with AI for personal use were 14% more accurate in labeling the lesson plans as created by humans or generated by AI. For what it is worth, participating in the study led to experimentation with AI by at least one participant: “I played around with AI and was blown away by how well it breaks things down and explains concepts” (assessor #39, 50% accurate). It is logical to expect these features to improve as fine-tuning specific models for educational contexts becomes more cost-effective. Immediate improvement and application is possible through teacher-led refinement of an existing GPT model.

Implications

ChatGPT was used to generate music lesson plans with limited input from a human controller. Addressing the issues of specificity and mismatched classroom content raised by assessors would be needed for successfully developing a GPT model for music teachers. Engineering a model to produce better lesson plans is both feasible and worthwhile. The basic input methods used in this study resulted in a situation where AI-generated content was indistinguishable from human-made content, although, the AI lessons did not pass the test of being assigned the same value by experts in the field. Human input could address the deficiencies found in this study to make an untrained model a more useful tool.

Toward a virtual assistant for music teachers and their students

To be most useful to a music teacher, a GPT model needs additional input. Software companies, such as OpenAI, provide technical documentation online for how to approach refining specific GPT models. A commercial application to share with others would require fine-tuning a model, which is expensive, time consuming, and beyond the technical expertise of a layperson. A more accessible and practical alternative for music teachers is to explore prompt engineering and prompt chaining as forms of model training.

The description for how lesson plans were generated by AI in this study explains the basic process for prompt engineering. AI Lesson Plans #1 and #2 were created by using an existing lesson plan as input and asking the GPT model to output a new lesson plan. The unrefined model was not proficient at incorporating standards, selecting repertoire, and deciding how long certain activities should last in the classroom. GPT models are iterative and can incorporate feedback from the user. The next step for a music teacher wishing to engineer a better output using this process would be to give the model specific feedback on the output that was generated. Additional refinement can occur by providing specific details prior to submitting preexisting lesson plans as input. Creating a catalog system gives the model more specific information from which to build new lesson plans. This information helps the model to understand what is contained in each section of a lesson plan and how they are interrelated.

Learning objectives

Describe to the model what is a learning objective. Provide the model examples of objectives you have previously used. Specify which standards can be met with each learning objective.

“Learning Objective #1: [Write out the learning objective.]

Standards for Learning Objective #1: [List the standard(s) addressed through this learning objective.]”

Addressing standards

National standards in the U.S. are delineated by age, type of instruction (e.g. general music, ensembles) and use the categories Creating, Performing, Responding, and Connecting. A music teacher could input the prompt:

“All lesson plans need two to four standards. There are four categories of standards. Below is each standard I use in my teaching. Only use the exact standard as written when adding them to my lesson plans. Do not make up new standards.

Standard #1: [Write out the full standard.]

Standard #1 Category: [Write out the category of the standard.]

Activities in Standard #1: [List classroom activities (e.g., singing) which align with this standard.]”

This will allow you to ask the model to use specific standards, and group standards by category and activity. Brainstorming can occur by asking the model to retrieve a list of activities based on a standard or category.

Adding repertoire

Create a catalog of songs and activities used in your classroom. This effort could be most worthwhile if multiple teachers contributed.

“Song #1 Title “AND” Artist: [List the song title, write the word “AND” followed by the artist(s). This will allow you to ask for your model to list all “titles” by a certain “artist,” for example.]

Song #1 Activity: [List the classroom activities (e.g., singing) which accompany this song.]

Song #1 Materials: [List the materials used for the activity (e.g., recordings, instruments.]

Song #1 Activity Time: [Specify how long the activity usually lasts.]

Song #1 Features: [List features of tempo, timbre, mood, genre, or other identifiers which you think are important. This could include class size considerations.]”

This will allow you to ask your model to find or use “titles” that are “folk music,” are a “slow tempo,” or use “woodwinds,” for example.

Prompt chaining

With previously written lesson plans and/or a catalog used as input, prompt chaining can begin. This process, known as “giving time to think,” has the GPT model generate a lesson plan over a series of smaller steps. Music teachers can use their preferred method of planning and logic. Be specific and tell the model what you want it to create and how.

“Step 0: Please write me a lesson plan. I will list the steps below.

Step 1: My class has [X] number of students. We meet for [X] minutes. Our objective for this class is to [specify the goal or objective.]

Step 2: Choose three standards from the standards I provided which can incorporate the learning objective. List the three standards exactly as written. Write a specific learning objective for each standard.

Step 3: Choose an activity for each learning objective. Write the procedure for teaching each activity. [Specify if activities should be specific, such as playing drums.]

Step 4: [Specify if any objectives need an assessment and ask for a rubric.]

Step 5: [Request extensions or differentiation based on the individual needs of your students.]”

This process will need to be refined for the individual teacher and teaching situation. All information generated needs to be checked by a music teacher to ensure fidelity and accuracy. The teacher can specify any parts of the task that need improvement. Each step is likely to need a certain amount of refinement and requires the expert knowledge of the teacher to refine the output. This does not need to be undertaken by an individual. Prompt engineering and refinement could be a worthy activity for a professional learning community of music teachers with varying expertise. If a large collection of teachers were willing to supply high-quality lesson plans, the catalog of standards, repertoire, activities, and assessment strategies could be more useful to help music teachers brainstorm ways to engage their students and potentially write immediately useable lesson plans.

Intelligent tutor systems (ITS)

Individual tutoring can be supported through a GPT model chatbot. A teacher can use prompt engineering to train a model to know how to help students facing common problems. For example, a strings teacher might list common issues facing their students and provide an answer to the model. This example is brief, and it is encouraged that teachers provide more detail in their input.

“Q: Why does the bow squeak when I play? A: There are a few likely reasons you hear squeaking when you bow. The pressure applied to the strings could be too strong. The bow could be too close to the bridge or over the fingerboard. The bow could be moving too slow or too fast. Do you think it could be one of these issues?”

By creating a catalog of common student needs, a music teacher could input several aspects of instrumental or vocal pedagogy, how to read and perform standard notation on a specific instrument, or other foundational musical knowledge. After supplying the input, a music teacher would ask questions to the model and see how it performs. The teacher would provide suggestions for improvement and clarify errors so that each future output would be more useful to their students. This task could be a worthwhile project as part of a collaborative learning community of diverse musicians and teachers.

Limitations

The wording of the lesson plans seemed to influence assessors. The Human lesson plans posted online by professional music education organizations were highly formal. Previous studies have shown misidentification of humans as machines can result when there is an absence of natural language in the text (Warwick & Shah, 2015). The results could be different if “everyday” lesson plans created by a random assortment of music teachers were used and not exemplars from professional organizations that tended to have jargon or complex wording. The lesson plans used were chosen because it was assumed they would be high-quality due to being shared online by respected music education organizations. However, this may have obscured the “humanness” in the Human lesson plans. The AI lesson plans were intentionally left unedited to give a fair comparison.

The sample size is a limitation of the study despite strong reliability and achieved power throughout. Results were based on 410 ratings but by 56 unique assessors. Most assessors were general music teachers with over 10 years of experience but only four of the eight lesson plans were general music lesson plans. Results could have been different if music teachers only rated lesson plans for their specific teaching assignment, especially for the labeling task. Regression analysis suggested the ability to accurately label a lesson plan did not change by teaching assignment, but there was a lack of instrumental teachers compared to general music and choral ensemble teachers.

Footnotes

Author contributions

Patrick K Cooper: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Resources; Software; Supervision; Validation; Visualization; Writing—original draft; Writing—review & editing.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Patrick K. Cooper

Statement of AI in Research Study

The author of this paper confirms that the use of artificial intelligence (AI) in this research study was limited solely to the creation of four lesson plans for evaluation by music teacher assessors. No AI has been used in the preparation of the manuscript, data, tables, or references in any way.

References

Ariza

(2009). The interrogator as critic: The Turing test and the evaluation of generative music systems. Computer Music Journal, 33(2), 48–70. https://doi.org/10.1162/comj.2009.33.2.48

Avdeeff

(2019). Artificial intelligence & popular music: SKYGGE, flow machines, and the audio uncanny valley. Arts (Basel), 8(4), 130. https://doi.org/10.3390/arts8040130

Boden

M. A.

(2010). The Turing test and artistic creativity. Kybernetes, 39(3), 409–413. https://doi.org/10.1108/03684921011036132

Chen

(2020). Imagination machines, Dartmouth-based Turing-tests, & a potted history of responses, AI & Society, 35, 283–287. https://doi.org/10.1007/s00146-018-0855-3

Clancey

W. J.

Hoffman

R. R.

(2021). Methods and standards for research on explainable artificial intelligence: Lessons from intelligent tutoring systems. Applied AI Letters, 2(4), e53. https://doi.org/10.1002/ail2.53

Cooper

P. K.

(2016). Examining correlations when using amabile’s consensual assessment technique to support validity of teachers as expert judges. In Bugos

J. A.

(Ed.), Contemporary research in music learning across the lifespan: Music education and human development (pp. 137–150). Routledge.

Cope

(2005). Computer models of musical creativity. MIT Press.

Cope

Hofstadter

D. R.

(2001). Virtual music: Computer synthesis of musical style. MIT Press.

Della Ventura

. (2019). Exploring the impact of artificial intelligence in music education to enhance the dyslexic student’s skills. In Uden

Liberona

Sanchez

Rodríguez-González

(Eds.), Learning technology for education challenges (pp. 14–22). Springer. https://doi.org/10.1007/978-3-030-20798-4_2

10.

Ford

(2015). The rise of the robots: Technology and the threat of mass unemployment. Oneworld Publications.

11.

Frey

C. B.

Osborne

M. A.

(2017). The future of employment: How susceptible are jobs to computerisation? Technological Forecasting & Social Change, 114, 254–280. https://doi.org/10.1016/j.techfore.2016.08.019

12.

Fütterer

Fischer

Alekseeva

Chen

Tate

Warschauer

Gerjets

(2023). ChatGPT in education: Global reactions to AI innovations. Scientific Reports, 13(1), 15310–15310. https://doi.org/10.1038/s41598-023-42227-6

13.

Harari

Y. N.

(2018). 21 lessons for the 21st century. National Geographic Books.

14.

Jauhiainen

J. S.

Guerra

A. G.

(2023). Generative AI and ChatGPT in school children’s education: Evidence from a school lesson. Sustainability (Basel, Switzerland), 15(18), 14025. https://doi.org/10.3390/su151814025

15.

Kang

Yoo

(2021). Elementary students’ music compositions with notation-based software and handwritten notation assisted by classroom instruments. Bulletin of the Council for Research in Music Education, 227, 29–44. https://doi.org/10.5406/bulcouresmusedu.227.0029

16.

Wang

(2023). Artificial intelligence in music education. International Journal of Human-Computer Interaction. Advance online publication. https://doi.org/10.1080/10447318.2023.2209984

17.

Lin

Ding

(2020). Application of music artificial intelligence in preschool music education. IOP Conference Series. Materials Science and Engineering, 750(1), 012101. https://doi.org/10.1088/1757-899X/750/1/012101

18.

Rohwer

(2023). Research-to-resource: ChatGPT as a tool in music education research. Update: Applications of Research in Music Education. Advance online publication. https://doi.org/10.1177/87551233231210875

19.

Ruiz

J. V.

Cooper

P. K.

Muhammed

J. N.

(2020). Can they hear a difference? Professional digital composition and the ability of music students to discriminate deep-sampled vs. acoustic instrumental performance recordings. Journal of Popular Music Education, 4(1), 81–99. https://doi.org/10.1386/jpme_00015_1

20.

Schubert

Canazza

De Poli

Rodà

(2017). Algorithms can mimic human piano performance: The deep blues of music. Journal of New Music Research, 46(2), 175–186. https://doi.org/10.1080/09298215.2016.1264976

21.

Tegmark

(2017). Life 3.0: Being human in the age of artificial intelligence (1st ed.). Alfred A. Knopf.

22.

Turing

A. M.

(1950). I.—Computing machinery and intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433

23.

van den Berg

du Plessis

. (2023). ChatGPT and generative AI: Possibilities for its contribution to lesson planning, critical thinking and openness in teacher education. Education Sciences, 13(10), 998. https://doi.org/10.3390/educsci13100998

24.

Vicsek

(2021). Artificial intelligence and the future of work: Lessons from the sociology of expectations. International Journal of Sociology and Social Policy, 41(7/8), 842–861. https://doi.org/10.1108/IJSSP-05-2020-0174

25.

Warwick

Shah

(2015). Human misidentification in Turing tests. Journal of Experimental & Theoretical Artificial Intelligence, 27(2), 123–135. https://doi.org/10.1080/0952813X.2014.921734

26.

Zheng

Wang

(2023). Developments and applications of artificial intelligence in music education. Technologies (Basel), 11(2), 42. https://doi.org/10.3390/technologies11020042

27.

Yuan

(2020). Application and study of musical artificial intelligence in music education field. Journal of Physics. Conference Series, 1533(3), 32033. https://doi.org/10.1088/1742-6596/1533/3/032033

28.

Zhang

Song

(2023). Design of an online interactive teaching platform for rural music education based on artificial intelligence. Applied Mathematics and Nonlinear Sciences, 9(1). https://doi.org/10.2478/amns.2023.2.00944