Quality evaluation of AI-generated diabetes-related health education texts from different generative models

Abstract

Background

With the increasingly widespread application of artificial intelligence technology, generative artificial intelligence has become an important tool for obtaining health information because of its convenience and flexibility in health education and health promotion. However, the readability and accuracy of such AI-generated materials still need to be evaluated.

Objective

To comprehensively evaluate and compare the quality and readability of health education texts about diabetes generated by different generative artificial intelligence (AI) models.

Methods

We presented a fixed list of ten questions, without modification, to seven generative AI models and exported their responses into predefined forms. Five experts were invited to evaluate the texts based on five criteria. A readability formula was used to evaluate the texts’ readability. Kendall’s coefficient of concordance was used to assess inter-rater reliability. Linear mixed models were used to compare the five quality dimensions and readability among the health education texts generated by the different AI models.

Results

Kimi-K1.5 and Doubao attained the highest overall scores in scientific accuracy, whereas iFlytek Spark-V3.5 received lower scores compared to other models. In terms of practical value and logical clarity, Kimi-K1.5 received the highest scores, while iFlytek Spark-V3.5 scored the lowest. In the dimension of reference basis, Kimi-K1.5 and ERNIE Bot-3.5 received relatively high scores, while iFlytek Spark-V3.5 and Doubao scored lower. In the assessment of text readability, higher R-value scores indicate poorer readability. The health education text generated by Doubao had the highest R-value, while iFlytek Spark-V3.5 had the lowest R-value.

Conclusions

Kimi-K1.5 performed better across multiple assessment parameters in the overall evaluation of diabetes-related health education texts created by different generative AI models. Notably, among all the models tested, iFlytek Spark-V3.5 showed the best readability.

Keywords

generative artificial intelligence model, health education texts, diabetes

Introduction

Diabetes mellitus (DM), including both type 1 and type 2 diabetes, is widely recognized as a threat to public health worldwide because of its heavy disease burden.^1,2 The number of adults with diabetes worldwide reached approximately 589 million in 2024. Without effective prevention, this number is projected to reach 853 million by 2050, with over 90% of affected individuals estimated to have type 2 diabetes.³ In China, the number of people with diabetes already exceeds 118 million, representing 22% of the global total.⁴ In addition, limited public awareness of diabetes makes the prevention and control of DM more difficult.^5,6 Therefore, improving the public’s diabetes-related literacy is crucial for diabetes prevention.

With the continuous advancement of artificial intelligence (AI), its applications in healthcare have become increasingly widespread, spanning areas such as early disease diagnosis and medical image analysis.^7–9 In recent years, the application scope of AI has gradually expanded from clinical practice to health education, with generative AI emerging as an important tool for disseminating medical knowledge.^10–12 More and more people are using generative AI models, with their advanced learning abilities, to access and share health information.^13,14 Healthcare professionals can use these tools to systematically organize knowledge about specific diseases, whereas non-specialists may gain preliminary insights into medical conditions through interactive dialogues. The use of generative AI in health education not only enhances efficiency but also increases accessibility and equity in information dissemination.^15,16 Several studies have assessed the quality and readability of health education materials produced by generative AI on cardiac disease, and the results showed that the texts were either overly complex or varied in quality.^17–19 Although some studies have assessed the readability and quality of health education texts produced by generative AI for specific diseases, existing research predominantly focuses on individual models or on limited comparisons among a few mainstream models, and it lacks systematic evaluation across multiple models, particularly in the Chinese-language context. Given the rapid evolution of generative AI models, the ongoing need for extensive research across diverse medical themes,^20–22 and the expanding influence of these models, the systematic evaluation of health education texts generated by such technologies remains critically important.

Currently, few studies have systematically evaluated generative AI-produced health education texts in terms of medical professionalism, factual accuracy, logical coherence, readability, and alignment with user needs. Most existing research focuses either on model generation capabilities or on isolated quality features,^17,18 lacking a comprehensive assessment framework that integrates multiple metrics such as the reliability of scientific evidence, clarity of expression, potential biases, and ethical standards. Hence, a comprehensive evaluation of the quality and readability of AI-generated health education texts is imperative. Additionally, generative AI models vary significantly in the accuracy of their specialized knowledge and in their application skills. A systematic assessment focusing on the accuracy, professionalism, and readability of health education texts produced by different AI models could help the public recognize the differences and limitations among these models and select appropriate models based on individual needs. Such research can therefore guide users in selecting more appropriate AI tools and provide constructive feedback to developers, ultimately enhancing the value of AI in promoting health literacy.

Hence, this study chose diabetes as a representative case and systematically evaluated the quality of diabetes education texts generated by various generative artificial intelligence models, employing the multi-dimensional evaluation framework for AI-generated health education content developed by Yang X et al.²³ and the text readability calculation formula. The findings aim to furnish practical guidance for both professionals and non-specialists in selecting appropriate models, while also providing critical insights for optimizing model performance and effectively utilizing such tools to produce high-quality, highly accurate, and easily comprehensible health education texts.

Methods

Selection of generative AI large models

This study selected seven widely used generative AI large models: ERNIE Bot-3.5, iFlytek Spark-V3.5, Kimi-K1.5, ChatGPT-4o, Tiangong-AI2.2.0, Doubao Large Model, and Deepseek-R1. All of these models had received official approval and were freely accessible to users. To ensure objectivity and reliability and to reduce potential subjective bias, this study adopted a blinding process in which the selected models were numbered sequentially as Model 1 to Model 7.

Generation of diabetes-related health education texts

After discussion within the research team, we selected the ten most prevalent questions in diabetes health education, encompassing risk factors, initial symptoms, complications, prevention, self-monitoring, and food and activity recommendations. Details of the questions are provided in Supplemental Table 1. We presented these 10 questions sequentially to each generative AI model in January 2025. During data collection, the 10 questions were posed in the same dialogue window for each model; all inquiries were delivered without supplementary prompts and strictly followed the preset question list to avoid leading questions, thereby ensuring the objectivity of responses. All inquiries and their responses were transcribed into a Word document to form the health education text for each generative AI model. During the design and implementation phases, we upheld anonymity, equity, and transparency in content generation. Furthermore, the generated texts were used exclusively for research purposes and did not involve the processing of sensitive or personal information, which alleviated possible privacy concerns.²⁴

Table 1.

Inter-rater reliability of expert assessments.

Scoring category	Kendall’s W	Chi-square	P
Scientific Accuracy	0.477	133.668	<0.001***
Logical Clarity	0.590	165.073	<0.001***
Practical Value	0.495	138.733	<0.001***
Reference basis	0.388	106.993	<0.001***
Stance & values	0.888	248.663	<0.001***
Total Score	0.530	148.423	<0.001***

Note. ***,P<0.001.

Quality evaluation of the generated diabetes-related health education texts

Five experts were invited to evaluate the quality of the generated texts, including one chief nurse and one chief physician from the endocrinology department of a tertiary hospital, two registered staff nurses, and one PhD in Nursing Science. All experts had more than five years of clinical experience and professional background in diabetes health education and clinical practice guidelines. During the evaluation, assessors were blinded to the model sources and conducted independent evaluations without consultation. Five criteria, including scientific accuracy, logical clarity, practical value, quality of references, and stance & values, were assessed. Each criterion was scored on a scale of 1 to 20 points,²³ with specific evaluation criteria presented in Supplemental Table 2.

Readability evaluation of the generated diabetes-related health education texts

Text readability denotes the degree to which a text can be easily read and comprehended. This serves as a fundamental metric for evaluating the complexity of a text and is a primary focus in graded reading research.²⁵ This study used the formula R = 17.5255 + 0.0024X₁ + 0.04415X₂ − 18.3344(1 − X₃) to calculate readability,²⁶ where X₁ is the total number of characters in the text, excluding punctuation; X₂ is the average sentence length (X₂ = total number of characters/number of sentences); X₃ represents the proportion of medical professional terms (X₃ = total number of professional terms/total number of characters). A smaller R value indicates that the text is easier to read, and the range of R values corresponds to the required grade level for reading. This readability formula was created by Jing for evaluating the readability of textbook texts. Li adapted it and applied it to the field of company financial statements. Subsequently, it gradually expanded to the evaluation of readability in health education texts.²⁷ This study used the readability formula adapted by Li. The total character count and sentence number were counted in WPS documents. Chinese medical terms were identified using exact string matching with forward maximum matching to prevent redundant counting from partial substrings. English terms and abbreviations were extracted using case-insensitive regular expression matching with word boundary constraints to ensure whole-word recognition. All terms were matched against the Common Clinical Medical Terms (2023 Edition) and SinoMed Medical Subject Headings. The extracted terms were subsequently manually reviewed to exclude common daily health expressions (e.g., diabetes, malnutrition, infection) based on team consensus. Although these terms appear in medical dictionaries, they were classified as plain health vocabulary suitable for readers with primary school-level education.²⁸
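The readability calculation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' actual workflow (the study counted characters and sentences in WPS documents): the tiny term lexicons, the punctuation set, the sentence delimiters, and all function names are hypothetical stand-ins for the dictionaries and counting rules used in the study.

```python
import re

# Hypothetical mini-lexicons for illustration only. The study matched terms
# against Common Clinical Medical Terms (2023 Edition) and SinoMed MeSH.
ZH_TERMS = {"胰岛素", "胰岛素抵抗", "糖化血红蛋白", "酮症酸中毒"}
EN_TERMS = {"HbA1c", "BMI"}

SENTENCE_END = "。！？!?"  # assumed sentence delimiters
PUNCT = set("，。！？；：、,.!?;:()（）“”‘’\"' \n\t")  # assumed punctuation set


def count_zh_terms(text, lexicon):
    """Forward maximum matching: at each position take the longest lexicon
    entry, then jump past it so its substrings are not counted again."""
    max_len = max(map(len, lexicon))
    count, i = 0, 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                count += 1
                i += size
                break
        else:  # no entry starts at position i
            i += 1
    return count


def count_en_terms(text, lexicon):
    """Case-insensitive whole-word matching for English terms and abbreviations."""
    total = 0
    for term in lexicon:
        pattern = r"(?<![A-Za-z0-9])" + re.escape(term) + r"(?![A-Za-z0-9])"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def readability(text, zh_lexicon=ZH_TERMS, en_lexicon=EN_TERMS):
    """R = 17.5255 + 0.0024*X1 + 0.04415*X2 - 18.3344*(1 - X3)."""
    x1 = sum(1 for c in text if c not in PUNCT)  # characters, no punctuation
    sentences = [s for s in re.split("[" + SENTENCE_END + "]", text) if s.strip()]
    x2 = x1 / len(sentences)  # average sentence length
    terms = count_zh_terms(text, zh_lexicon) + count_en_terms(text, en_lexicon)
    x3 = terms / x1  # proportion of professional terms
    return 17.5255 + 0.0024 * x1 + 0.04415 * x2 - 18.3344 * (1 - x3)


sample = "糖尿病患者应定期监测糖化血红蛋白。胰岛素抵抗是常见原因。"
print(round(readability(sample), 2))
```

In this sketch, forward maximum matching counts 胰岛素抵抗 once rather than also counting its substring 胰岛素. The English matcher uses explicit ASCII lookarounds instead of `\b` because Python's `\b` treats Chinese characters as word characters, so an abbreviation written directly next to Chinese text would otherwise be missed.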

Statistical analysis

Statistical analyses were performed using R software (version 4.3.1) and SPSS statistical software (version 27.0). Residuals of the linear mixed-effects models were approximately normally distributed, supporting the validity of parametric analyses. Linear mixed-effects models were used to compare expert ratings across 5 dimensions and readability scores of texts generated by different AI models. For expert ratings, dimension scores were set as dependent variables, with AI models as fixed effects (main effect) and questions/experts as crossed random intercepts. For text readability analysis, the readability R value served as the dependent variable, AI models as fixed effects, and questions as random effects. Post-hoc pairwise comparisons for statistically significant results were conducted using Bonferroni-corrected P-values. Statistical significance was defined as P < 0.05 for all analyses. Forest plots were generated using the ‘ggplot2’ package in R to visualize the estimated marginal means (EMMs) and 95% confidence intervals (CIs) of each AI model across different dimensions, with error bars representing 95% CIs for intuitive comparison. Inter-rater reliability of expert evaluations was evaluated using Kendall’s coefficient of concordance (W) with SPSS statistical software.
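As an illustration of the inter-rater reliability measure named above, Kendall's coefficient of concordance can be computed from a raters-by-items score matrix as sketched below. The study used SPSS for this step; the Python functions here are hypothetical, use the standard tie-corrected formula, and are shown only to make the calculation concrete.

```python
def average_ranks(scores):
    """Rank scores in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(ratings):
    """ratings: one list per rater, each with that rater's scores for the same n items.

    Returns (W, chi_square), where chi_square = m * (n - 1) * W is referred
    to a chi-square distribution with n - 1 degrees of freedom.
    """
    m, n = len(ratings), len(ratings[0])
    ranked = [average_ranks(r) for r in ratings]
    rank_sums = [sum(r[i] for r in ranked) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    # Tie correction: sum of (t^3 - t) over each group of tied scores per rater.
    ties = 0
    for r in ratings:
        for value in set(r):
            t = r.count(value)
            ties += t ** 3 - t
    w = 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)
    return w, m * (n - 1) * w


# Hypothetical scores from three raters for four texts (not study data).
w, chi2 = kendalls_w([[15, 17, 18, 20], [14, 16, 19, 20], [15, 16, 18, 19]])
print(w, chi2)
```

W ranges from 0 (no agreement in rankings) to 1 (identical rankings across raters), which is the scale on which the values in Table 1 are reported.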

Results

Overall evaluation of diabetes-related health education texts generated by AI models

The Cronbach’s alpha coefficient for the 5 rating dimensions was 0.601. The educational texts generated by the seven models were assessed by five experts. Kendall’s W coefficient for the total score was 0.530 (P < 0.001). Detailed results are presented in Table 1. The mean scores given by the five experts for each generative AI model across the different dimensions are shown in Figure 1. For the total score, the linear mixed-effects model revealed a significant main effect of AI model (F = 12.10, df = 6, 339, P < 0.001). Detailed results are presented in Table 2. Among all AI models, Kimi-K1.5 achieved the highest estimated marginal mean total score of 89.24, while iFlytek Spark-V3.5 obtained the lowest score of 77.44. The marginal means and 95% confidence intervals are shown in Table 3. Post hoc pairwise comparisons with Bonferroni correction showed that Kimi-K1.5, Tiangong-AI2.2.0, and ERNIE Bot-3.5 scored significantly higher than iFlytek Spark-V3.5 (P < 0.05). Similarly, Doubao, ChatGPT-4o, and Deepseek-R1 also had significantly higher total scores than iFlytek Spark-V3.5 (P < 0.05), but no significant differences were observed among these three models, nor between them and Kimi-K1.5, Tiangong-AI2.2.0, or ERNIE Bot-3.5 (P > 0.05). The specific differences between models are visualized in Figure 2(a).
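To make the Bonferroni-corrected pairwise comparisons concrete, the sketch below enumerates all pairs among the seven models and adjusts a set of raw P values. The raw P values are hypothetical placeholders, not the study's results, and the helper name is illustrative; the study's own post hoc tests were run in R and SPSS.

```python
from itertools import combinations

MODELS = [
    "ERNIE Bot-3.5", "iFlytek Spark-V3.5", "Kimi-K1.5", "ChatGPT-4o",
    "Tiangong-AI2.2.0", "Doubao", "Deepseek-R1",
]


def bonferroni(raw_p):
    """Multiply each raw P value by the number of comparisons, capped at 1."""
    k = len(raw_p)
    return {pair: min(1.0, p * k) for pair, p in raw_p.items()}


pairs = list(combinations(MODELS, 2))  # 7 * 6 / 2 = 21 pairwise comparisons

# Hypothetical raw P values for illustration only.
raw = {pair: 0.5 for pair in pairs}
raw[("iFlytek Spark-V3.5", "Kimi-K1.5")] = 0.001

adjusted = bonferroni(raw)
significant = [pair for pair, p in adjusted.items() if p < 0.05]
print(len(pairs), significant)
```

With 21 comparisons, a raw P value must fall below 0.05 / 21 (about 0.0024) to remain significant after correction, which is why some pairs of models with visibly different marginal means may still be reported as not significantly different.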

Figure 1.

Radar charts of expert ratings for health education texts generated by each model.

Table 2.

Results of linear mixed-effects models for each evaluation dimension.

Dimension	F-statistic	Numerator df	Denominator df	p value
Total Score	12.10	6	339	<0.001^***
Scientific Accuracy	8.85	6	330	<0.001^***
Logical Clarity	4.22	6	330	<0.001^***
Practical Value	4.86	6	330	<0.001^***
References	25.61	6	339	<0.001^***
Stance & Values	6.14	6	339	<0.001^***

Note. ***,P<0.001.

Table 3.

Marginal means and 95% confidence intervals of different artificial intelligence models across all dimensions.

Dimension	AI model	Marginal mean	Standard error (SE)	95%CI
Total Score	Kimi-K1.5	89.24	3.97	78.73–99.75
	Tiangong-AI2.2.0	86.22	3.97	75.71–96.73
	ERNIE Bot-3.5	85.68	3.97	75.17–96.19
	Deepseek-R1	83.36	3.97	72.85–93.87
	Doubao	82.44	3.97	71.93–92.95
	ChatGPT-4o	81.90	3.97	71.39–92.41
	iFlytek Spark-V3.5	77.44	3.97	66.93–87.95
Scientific Accuracy	Kimi-K1.5	17.70	0.78	15.65–19.75
	Doubao	17.42	0.78	15.37–19.47
	Deepseek-R1	17.24	0.78	15.19–19.29
	Tiangong-AI2.2.0	17.00	0.78	14.95–19.05
	ERNIE Bot-3.5	16.90	0.78	14.85–18.95
	ChatGPT-4o	16.76	0.78	14.71–18.81
	iFlytek Spark-V3.5	15.54	0.78	13.49–17.59
Logical Clarity	Kimi-K1.5	17.94	0.87	15.62–20.26
	Deepseek-R1	17.66	0.87	15.34–19.98
	ChatGPT-4o	17.36	0.87	15.04–19.68
	Tiangong-AI2.2.0	17.30	0.87	14.98–19.62
	Doubao	17.22	0.87	14.90–19.54
	ERNIE Bot-3.5	16.96	0.87	14.64–19.28
	iFlytek Spark-V3.5	16.60	0.87	14.28–18.92
Practical Value	Kimi-K1.5	17.98	0.84	15.77–20.19
	Doubao	17.62	0.84	15.41–19.83
	ChatGPT-4o	17.46	0.84	15.25–19.67
	Deepseek-R1	17.44	0.84	15.23–19.65
	Tiangong-AI2.2.0	17.38	0.84	15.17–19.59
	ERNIE Bot-3.5	17.02	0.84	14.81–19.23
	iFlytek Spark-V3.5	16.26	0.84	14.05–18.47
References	Kimi-K1.5	17.14	2.02	11.79–22.49
	ERNIE Bot-3.5	17.06	2.02	11.71–22.41
	Tiangong-AI2.2.0	16.74	2.02	11.39–22.09
	Deepseek-R1	12.22	2.02	6.87–17.57
	ChatGPT-4o	12.06	2.02	6.71–17.41
	Doubao	11.90	2.02	6.55–17.25
	iFlytek Spark-V3.5	11.36	2.02	6.01–16.71
Stance & Values	Deepseek-R1	18.80	1.02	16.03–21.57
	Kimi-K1.5	18.48	1.02	15.71–21.25
	Doubao	18.28	1.02	15.51–21.05
	ChatGPT-4o	18.26	1.02	15.49–21.03
	Tiangong-AI2.2.0	17.80	1.02	15.03–20.57
	ERNIE Bot-3.5	17.74	1.02	14.97–20.51
	iFlytek Spark-V3.5	17.68	1.02	14.91–20.45

Figure 2.

Performance comparison of generative AI models across different dimensions and total scores.

Scientific accuracy of diabetes-related health education texts generated by AI models

Significant differences were also observed among models in scientific accuracy (F = 8.85, df = 6, 330, P < 0.001). Kimi-K1.5 again achieved the highest score of 17.70, while iFlytek Spark-V3.5 scored the lowest at 15.54. Post hoc tests confirmed that all models performed significantly better than iFlytek Spark-V3.5 (P < 0.05), as shown in Figure 2(b).

Logical clarity of diabetes-related health education texts generated by AI models

Logical clarity also varied significantly across AI models (F = 4.22, df = 6, 330, P < 0.001). Kimi-K1.5 and Deepseek-R1 achieved relatively high scores of 17.94 and 17.66, respectively, while iFlytek Spark-V3.5 scored the lowest at 16.60. Post hoc pairwise comparisons with Bonferroni correction revealed that Kimi-K1.5, Deepseek-R1, ChatGPT-4o, Tiangong-AI2.2.0, and Doubao all scored significantly higher than iFlytek Spark-V3.5 (P < 0.05). ERNIE Bot-3.5 also scored significantly higher than iFlytek Spark-V3.5 (P < 0.05). No significant differences were observed among Kimi-K1.5, Deepseek-R1, ChatGPT-4o, Tiangong-AI2.2.0, and Doubao (P > 0.05). These differences are visualized in Figure 2(c).

Practical value of diabetes-related health education texts generated by AI models

Practical value also differed significantly across AI models (F = 4.86, df = 6, 330, P < 0.001). Kimi-K1.5 achieved the highest score of 17.98, while iFlytek Spark-V3.5 scored the lowest at 16.26. Post hoc pairwise comparisons with Bonferroni correction revealed that Kimi-K1.5, ERNIE Bot-3.5, Tiangong-AI2.2.0, Doubao, ChatGPT-4o, and Deepseek-R1 all scored significantly higher than iFlytek Spark-V3.5 (P < 0.05). No significant differences were observed among Kimi-K1.5, ERNIE Bot-3.5, Tiangong-AI2.2.0, Doubao, ChatGPT-4o, and Deepseek-R1 (P > 0.05). These differences are visualized in Figure 2(d).

Reference basis of diabetes-related health education texts generated by AI models

Significant differences in reference quality were also detected across AI models (F = 25.61, df = 6, 339, P < 0.001). Kimi-K1.5 achieved the highest score of 17.14, while iFlytek Spark-V3.5 scored the lowest at 11.36. Kimi-K1.5, ERNIE Bot-3.5, and Tiangong-AI2.2.0 all scored significantly higher than iFlytek Spark-V3.5, Doubao, ChatGPT-4o, and Deepseek-R1 (P < 0.05). No significant differences were observed among iFlytek Spark-V3.5, Doubao, ChatGPT-4o, and Deepseek-R1 (P > 0.05), nor among Kimi-K1.5, ERNIE Bot-3.5, and Tiangong-AI2.2.0 (P > 0.05). These differences are visualized in Figure 2(e).

Stance & values of diabetes-related health education texts generated by AI models

“Stance & Values” refers to the model’s ability to avoid subjective or commercial biases, as well as exaggerated praise or criticism. AI models also varied significantly in stance and values (F = 6.14, df = 6, 339, P < 0.001). Deepseek-R1 and Kimi-K1.5 achieved the highest scores of 18.80 and 18.48, respectively, while iFlytek Spark-V3.5 and ERNIE Bot-3.5 scored the lowest at 17.68 and 17.74, respectively. Pairwise post hoc analyses with Bonferroni correction revealed that Kimi-K1.5, Doubao, Deepseek-R1, and ChatGPT-4o all scored significantly higher than iFlytek Spark-V3.5 and ERNIE Bot-3.5 (P < 0.05). Tiangong-AI2.2.0 also scored significantly higher than iFlytek Spark-V3.5 and ERNIE Bot-3.5 (P < 0.05), but did not differ significantly from Kimi-K1.5, Doubao, Deepseek-R1, or ChatGPT-4o (P > 0.05). No significant differences were observed between iFlytek Spark-V3.5 and ERNIE Bot-3.5 (P > 0.05). These differences are visualized in Figure 2(f).

Readability of diabetes-related health education texts generated by different models

AI models exhibited significant differences in readability (F = 6, df = 54, 12, P < 0.001). Notably, higher R values indicate poorer readability. Thus, Doubao, with the highest R score of 4.76, showed the poorest readability, while iFlytek Spark-V3.5, with the lowest R score of 2.22, demonstrated the best readability. The marginal means and 95% confidence intervals of readability for different AI models were presented in Table 4. Pairwise post hoc analyses with Bonferroni correction revealed that Doubao scored significantly higher than ERNIE Bot-3.5 (P < 0.05), as well as iFlytek Spark-V3.5, ChatGPT-4o, and Deepseek-R1 (P < 0.05). Kimi-K1.5 and Tiangong-AI2.2.0 also scored significantly higher than iFlytek Spark-V3.5, ChatGPT-4o, and Deepseek-R1 (P < 0.05). No significant differences were observed among iFlytek Spark-V3.5, ChatGPT-4o, and Deepseek-R1 (P > 0.05). These differences were visualized in Figure 3.

Table 4.

Marginal means and 95% confidence intervals in readability for different artificial intelligence models.

AI model	Marginal mean	Standard error (SE)	95%CI
Doubao	4.76	0.34	4.07–5.46
Kimi-K1.5	3.91	0.34	3.21–4.60
Tiangong-AI2.2.0	3.66	0.34	2.97–4.35
ERNIE Bot-3.5	3.30	0.34	2.61–3.99
Deepseek-R1	2.27	0.34	1.58–2.97
ChatGPT-4o	2.27	0.34	1.58–2.96
iFlytek Spark-V3.5	2.22	0.34	1.53–2.91

Figure 3.

Readability and parameter comparison of health education texts generated by the seven models.

Discussion

This study thoroughly assessed the quality and readability of diabetes education materials produced via interactive conversations with seven popular generative AI models. Although the overall quality of the texts generated by these models was generally satisfactory, significant differences were found across several aspects, including scientific accuracy, logical clarity, practical value, reference basis, and stance & values. Notable differences in readability were also found among the health education texts.

The accuracy and completeness of diabetes-related health education texts generated by generative AI models still need to be improved

The accuracy of health education texts is very important to the audience. In this study, although all AI-generated texts showed high accuracy, accuracy did not reach 100%, and inaccurate responses could mislead inquirers. This finding was in line with that of Giuliano Lo Bianco et al.,²⁹ who assessed ChatGPT’s responses to 13 health-related questions about spinal cord stimulation and found that 95% of answers were sufficiently accurate. At the same time, we observed that some models had shortcomings in knowledge accuracy and completeness. For example, when offering dietary advice for people with diabetes, ERNIE Bot-3.5 listed foods to avoid but did not stress the importance of managing total calorie intake. Similarly, Doubao, when recommending fruits for people with diabetes, did not classify them by glycemic index. Aside from ERNIE Bot-3.5, all models failed to account for special situations (such as hypoglycemia or post-medication adjustments) when explaining home blood glucose monitoring, missing key details about timing and technique. Additionally, some models’ responses to certain questions did not align with existing guidelines. For instance, when answering the question “How do diabetic patients test their blood sugar at home?”, ERNIE Bot-3.5 responded in part, “Before blood collection, wash your hands with water and soap, and disinfect the blood collection area with an alcohol cotton ball or alcohol swab, etc.”, but did not emphasize that the fingers should be dried before blood collection, an omission that might affect the accuracy of the blood glucose results. When answering “What should diabetic patients pay attention to when injecting insulin?”, ERNIE Bot-3.5’s response regarding the storage of insulin was: “Insulin should be stored in a refrigerated environment, avoiding freezing and high temperatures. Before use, check the expiration date and appearance of the insulin to ensure the drug has not expired.” This response differed from the insulin storage requirements in the latest Chinese Diabetes Drug Injection Technology Guidelines,³⁰ which state that opened vials of insulin or insulin pen cartridges can be stored at room temperature for up to 1 month after opening and must not be used beyond the expiration date.

Regarding practical value, all generative AI models provided exercise recommendations that lacked specificity, failing to clarify necessary precautions and conditions for physical activity. This finding aligned with the research of John Grundy et al.³¹ and Gokbulut et al.,³² who also identified accuracy and personalization limitations in ChatGPT’s handling of diabetes-related health inquiries. This might also be related to how questions are posed to generative AI. Some researchers have pointed out that adding certain qualifiers to a question might help users obtain the answers they want.³³ However, even when specific qualifiers are added, a model that cannot find an answer meeting the stated requirements might still generate an inaccurate response.³⁴

The logicality of diabetes-related health education texts generated by generative AI

Clear logic helps people better understand the content of a text. We found that both Kimi-K1.5 and Deepseek-R1 received high scores in logical clarity, indicating that they possess strong logical reasoning abilities. In contrast, iFlytek Spark-V3.5 scored lower in this dimension, possibly because it excelled more in linguistic expression. These differences also reflected the inherent characteristics and strengths of each model.

Reference basis for diabetes-related health education texts generated by generative AI

When a generative AI model produces relevant text and lists the webpages or literature it used as references, readers can assess the professionalism of the content by examining the credibility of those sources. However, significant disparities emerged in the reference basis dimension, where Doubao, iFlytek Spark-V3.5, ChatGPT-4o, and DeepSeek-R1 provided no supporting references for their answers. These findings aligned with prior research, which reported that both ChatGPT-4o and DeepSeek-R1 lacked relevant references in generating patient education materials for spinal surgery.³⁵ Furthermore, the references cited by other models predominantly originated from platforms such as Weibo, Quark, and Baidu, whose credibility remains unverified.

The readability of diabetes-related health education texts generated by generative AI

The readability of health education texts plays a critical role in how people access and utilize health information. In this study, the generated texts generally demonstrated good accessibility. This finding was consistent with the study by Luo Y et al²⁸ and Ozduran et al.³⁶ However, it was inconsistent with the results of Stephenson-Moe et al.,³⁷ who compared patient education materials generated by different artificial intelligence models. They found that the overall readability of all AI-generated texts reached the reading level of college graduates. This discrepancy can be attributed to the varying tools employed for measuring readability. This study identified that specific models produced texts with disproportionately high reading scores for certain questions. Doubao’s response to the question, “How can diabetic patients self-monitor their blood glucose at home?” achieved a readability level of 8, suitable for an audience with higher education. This was similar to the results observed by Ozcivelek et al³⁸ when comparing different artificial intelligence models in responding to prosthodontic patient inquiries, where AI-generated texts required a reading level equivalent to 8th-9th grade or higher, potentially hindering comprehension among elderly patients. The science education texts generated by Kimi-K1.5, Doubao, Tiangong AI2.2.0, and ERNIE Bot-3.5 tended to be excessively lengthy. While detailed explanations can enhance understanding, excessive verbosity may negatively impact reading engagement and effectiveness.³⁹ Additionally, Doubao incorporated recommendations for related health videos on Douyin, though the quality of these videos required further verification. In contrast, ERNIE Bot-3.5 enhanced its texts with colorful supplementary pages, which might stimulate reading interest and improve knowledge comprehension. 
Overall, Kimi-K1.5 appears more suitable for medical professionals, whereas iFlytek Spark-V3.5 and DeepSeek-R1 appear more suitable for non-professionals.

Implications

Although generative AI models can deliver highly accurate responses to certain health-related inquiries, the absence of cited sources frequently undermines their perceived reliability. Therefore, we suggest that large language models integrate references to authoritative web resources. When providing health-related information, these models should explicitly indicate the specific sources consulted, enabling users to evaluate the credibility of the content by examining the original materials. Furthermore, users engaging with such models may benefit from specifying their identity, for example, by indicating whether they are healthcare professionals or lay individuals. In the initial stage of our study design, given the diverse user demographics, we did not predetermine user roles or include specific qualifiers in the prompts. Consequently, the model-generated responses tended to be generalized and tailored to a broad, non-specialist audience. Regardless of the specific large language model employed, prompt formulation is a critical skill, because variations in how a question is phrased can lead to substantially different outputs. Thus, it is imperative to enhance public awareness and understanding of effective strategies for interacting with generative AI systems.

Strengths and limitations

We comprehensively assessed the scientific accuracy, logical clarity, practical value, quality of references, stance & values, and readability of diabetes-related health education texts created by several popular generative AI models. This study also has limitations. First, although all five assessors had extensive experience, human judgments could still be subject to bias; we therefore calculated the inter-rater agreement coefficient for their assessments of the seven models. Second, there was a delay between text generation and final evaluation, despite professionals being asked to analyze the texts immediately after they were generated. Some of the models might have been updated over time, which could have affected our findings. However, even after new versions are introduced, older models are likely to remain in use in practical applications. Finally, this study explored the quality and readability of health education texts generated by different AI models in the Chinese context. Further research is needed to investigate quality and readability in other languages.

Conclusion

This study used seven widely used generative artificial intelligence models to generate health education texts on diabetes by posing common questions in interactive dialogues, and then evaluated the resulting texts. The findings indicated that the overall quality of the health education texts generated by these models was moderate. In the quality assessment, Kimi-K1.5 scored significantly higher, while iFlytek Spark-V3.5 performed notably worse. In terms of readability, the health education texts generated by iFlytek Spark-V3.5 and Deepseek-R1 were more readable, while those generated by Doubao had the lowest overall readability. The Kimi-K1.5 model was more suitable for healthcare professionals, while the iFlytek Spark-V3.5 and Deepseek-R1 models were more suitable for non-professionals or those with limited reading ability.

Supplemental material

Supplemental material for Quality evaluation of AI-generated diabetes-related health education texts from different generative models by Xueping Jiao, Xingyu Liu, Shuhan Yang, Yueting Wang, Chenxia Wang, Yunfang Wang, Fanghong Yan, Yuhuan Xie, Yufang Guo, Yuxia Ma, Yanan Zhang in DIGITAL HEALTH

Footnotes

ORCID iDs

Xueping Jiao

Yanan Zhang

Ethical considerations

As this study did not involve any material or data related to human participants or animals, ethical approval was not required.

Author contributions

YZ conceived the study. YZ and XJ designed the study and drafted the manuscript. XJ, XL, and SY collected all relevant data and assisted in results interpretation. XJ and YW carried out data analysis. YW, CW, YG, YX, FY, and YM participated in the design and coordination. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by the 2025 Research Project of the Chinese Nursing Association [No. ZHKYQ202516] and the General Project of the Gansu Provincial Department of Science and Technology [No. 26JRRA195].

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated or analyzed during this study are available from the corresponding author on reasonable request.

Supplemental material

Supplemental material for this article is available online.
