Artificial Intelligence Applications Versus Manual Methods For Literature Retrieval: A Comparative Analysis

Abstract

Background:

Artificial intelligence (AI), particularly generative and large language models, is being used in nursing education, practice, and scholarly writing. Generative AI applications have been specifically examined for their use in conducting literature reviews, with evidence supporting reduced production time of scholarly work. However, there has been limited investigation of their accuracy in identifying references for a literature review.

Objective:

The purpose of this study was to compare citations from human-written literature reviews with citations generated by AI literature review applications.

Methods:

Using a comparative exploratory design, references from 4 human-written literature reviews, 2 published and 2 unpublished, on 4 different topics, were compared to references derived from 2 AI literature applications, Consensus and Elicit. Three prompting strategies were utilized, including prompts generated using ChatGPT-4. Agreement between the AI and human references was evaluated.

Results:

The percentage of agreement between AI-generated and human-generated reference lists ranged from 0% to 63.6%. The Consensus application had a greater overall mean rate of match (21.3%) than Elicit (3.7%). The use of a ChatGPT-4 prompt did not substantially change results, and there were no differences based on published or unpublished literature reviews.

Conclusion:

The 2 literature-based applications examined in this study offered a glimpse of their potential use and limitations. The use of an AI literature review application may support but not replace human work.

Keywords

artificial intelligence, large language models, literature search, nursing, nursing research

Introduction

The use of artificial intelligence (AI), particularly generative AI and large language models (LLMs), has grown by an estimated 540 000% over the last decade.¹ Personal and professional use cases can appear limitless: Reuters reported in February 2025 that OpenAI had surpassed 400 million weekly active users.² In recent surveys of student and faculty use, an estimated 90% of college-age students reported regular use³ and 94% of faculty working in higher education had used AI in the past 6 months.⁴

Applications like ChatGPT-4, a commonly used generative AI application, have access to approximately 3000 trillion words, pulling data from various sources such as books, articles, websites, and social media posts.⁵ The vast array of retrievable data and sophistication of pattern logic has driven the utilization of LLMs for nursing education,⁶ practice,⁷ as well as scholarly research and writing support.⁸ In a 2024 systematic review of 24 studies of AI use in academic writing and research, the authors concluded that there was a strong potential for the use of AI in idea generation, content structuring, literature synthesis, and data management but recommended further investigation of how the tools can be used to support human work.⁹ According to a 2025 scoping review of AI use in nursing education, primary areas of use included content and curricular development, educational support, and simulation training.⁶ The authors commented on a lack of rigorous evaluation of AI use in nursing education.

Literature reviews are a common type of scholarly writing conducted by nurse educators and researchers. Writing literature reviews requires a certain level of expertise coupled with methodological soundness.¹⁰ High-quality and effective reviews require time and resources, with scoping and systematic reviews estimated to take between 6 and 18 months from initiation to submission for publication.¹¹,¹² A primary step of any literature review is to conduct a search of the literature.¹³ This is an iterative process where the user starts with search terms, often based on a combination of terms using Boolean operators, and as results are produced, the search is expanded or narrowed, often using filters. A rigorous search of the literature includes multiple bibliographic databases and a trial-and-error approach, with the user considered the content expert to determine when the search is complete.

Generative AI applications now include applications specific for conducting literature reviews, including the step of literature searching. A major advantage of these applications is the significant decrease in time spent, from finding citations to summarizing the findings and writing the final article.¹⁴⁻¹⁶ A report published in the United Kingdom cited a 23% reduction in time spent conducting a full literature review, including article scanning, selection of articles, and synthesis, when AI was utilized.¹⁵ Recent literature across diverse fields of study cites mixed results with respect to the quality of literature searches using AI. Kacena et al¹⁶ conducted a study to determine if ChatGPT-4 could assist with writing a credible, peer-reviewed, scientific review article on 3 topics related to musculoskeletal conditions. They found that only 30% of references were accurate. Similarly, Mostafapour et al¹⁴ conducted a literature review on the topic of physician and patient relationships and then conducted that same review using ChatGPT-4. They noted that 24% of the AI-produced references were only somewhat related to the topic and 7.5% were completely irrelevant. The authors also found that when using iterative prompt strategies in ChatGPT, to mirror the human process, the more contextual factors that were asked of ChatGPT-4, the greater the risk for hallucinations, or fake citations.

The 2 previously cited examples represent the use of generative AI applications, such as ChatGPT-4, to conduct literature reviews. More recently, several generative AI literature review-specific applications have emerged. These differ in a variety of ways including the reference databases they have access to, the LLMs they use for their algorithms, the outputs they allow for, and other functions such as filtering options. For example, vended applications like Elicit, Consensus, and Perplexity utilize Semantic Scholar as a primary resource, while others, like Evidence Hunt, utilize PubMed.¹⁷ According to Bolaños et al,¹⁷ the benefit of literature-review-specific AI applications is the use of natural language rather than keyword searching and the incorporation of information from a collection of documents, which creates greater opportunity to reduce inaccuracies and hallucinations.

Apata et al¹⁸ reviewed Consensus and its effectiveness in conducting a literature review, including identifying citations. They found that there is a lack of studies evaluating the search quality of the Consensus application. Bernard et al¹⁹ evaluated whether Elicit strengthens a systematic review process by comparing results of an umbrella review on older adult living environments with and without the use of AI support. They found that Elicit missed 82% of the articles from the original review. The use of Elicit among nursing students has also been explored.²⁰ In a 2024 study of graduate nursing students (n = 323), students were asked to conduct a search of the literature on their chosen concept using PubMed, Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Elicit. While the accuracy of the 3 applications was not explored, students reflected on their preference for and perception of the 3 methods. Preference was similar, with students selecting CINAHL (31.6%) and PubMed (30.7%) just slightly over Elicit (26%). Primary advantages of the Elicit application were the ability to have an abstract produced, user friendliness, and the number of relevant articles retrieved. Primary disadvantages were the infinite number of citations that resulted, limited search filters, and low accuracy/trustworthiness of the sources. The authors recommended the use of AI literature applications in helping students develop relevant terminology and restructure their research queries.

Despite the proliferation of AI applications for scholarly literature searching, there has been limited investigation of their levels of accuracy and overall quality compared to human experts. While AI can work more efficiently, humans must ensure the accuracy, relevance, and comprehensiveness of AI outputs.⁸

Purpose

Understanding the benefits and limitations of AI literature applications is important as the nursing profession looks to understand and guide the preparation of nurses to responsibly apply AI in education, research, and practice.²¹ The purpose of this study was to compare citations from human-written literature reviews with citations generated by AI literature review applications.

Methods

A comparative exploratory design was utilized to determine the rate of agreement between AI-generated references and human-generated references. References derived from 2 different AI literature review platforms, Consensus and Elicit, were compared with references from 4 human-completed literature reviews on 4 different topics. Data analyses were conducted in April 2025.

Human Literature Reviews

A total of 4 human-written literature reviews were utilized in this study. Literature reviews conducted by members of the author team of this study were purposefully selected so that it could be determined whether any of the human-generated literature reviews missed citations that the AI applications found. All authors were nursing faculty engaged in research who had previously published. The 4 articles, which used different types of literature review methods and addressed nursing topics not specific to one specialty area, were selected to minimize the bias of homogeneity of the citations.

Article 1 was a scoping review that used a wide search of any research-based articles about interruptions and medication administration errors from database inception to 2024.²² Article 1 was submitted and published in a high-index journal in 2025, following data analysis for this article. Article 2, not yet published, was an integrative review of the use of objective-structured clinical examinations in nurse practitioner students that focused on a 10-year timeframe of research and quality improvement articles. Article 3 was a published 2025 concept analysis of technostress and compassion fatigue in remote healthcare and social service workers since COVID-19.²³ Article 4 was a published integrative review, covering a 5-year timeframe of studies and quality improvement articles on the topic of interruptions and distractions during nurse-to-nurse handoff communication.²⁴ Further details of the 4 articles are included in Table 1.

Table 1.

Characteristics of Human Generated Literature Reviews.

Characteristic	Article 1	Article 2	Article 3	Article 4
Type of review	Scoping	Integrative	Evolutionary concept analysis	Integrative
Inclusion criteria	Database inception to October 2024; English language; Quantitative	2015-2024; English language; Qualitative, Quantitative, or Quality Improvement	2019-2025; English language; Qualitative or Quantitative	2017-2022; English language; Qualitative, Quantitative, or Quality Improvement
Exclusion criteria	Review article; Qualitative	Review article	Review article; Intervention studies	Review article; Dissertations
Databases searched	Ovid Medline, EMBASE, CINAHL, Web of Science, and Scopus	PubMed, CINAHL, PsycINFO, Scopus, and ERIC	PubMed and 33 Ovid resources	CINAHL, PubMed, Scopus, and PsycINFO
Articles included in final review	22	15	28	17

AI Applications and Prompting Strategies

Consensus and Elicit, 2 AI literature review applications, were utilized in this study. Both applications have been commonly cited in the last 2 years specifically as academic research and literature review tools designed to focus on biomedicine and the social sciences.¹⁷ Both applications also draw from the same primary database, Semantic Scholar, which is considered an original AI library with an index of over 233 million academic articles. In comparison to Semantic Scholar, Consensus and Elicit are specific for academic research, as they pull from Semantic Scholar and add workflow and data extraction functions. We also selected Consensus and Elicit because of free access to both applications, which we felt was important as an aspect of accessibility for faculty and students. It is important to note that at the initial conception of this work, both applications were free; however, when the searches were run in April 2025, Elicit required a fee to download the results, so the author team purchased a Plus package to perform the analysis. Table 2 gives features of the 2 applications as of April 2025, when the searches were run.

Table 2.

Features of AI Applications.

Feature	Consensus	Elicit
Website description of database sources (as published April 2025)	200 million articles from Semantic Scholar database	25 million academic articles from the Semantic Scholar corpus
Years searchable	Searches by 1-year range, any year to 2025	Searches by 1-year range, any year to 2025
Study type	Meta-analysis, systematic review, RCT, non-RCT, observational study, literature review, case report, animal study, and in vitro study	Review, meta-analysis, systematic review, RCT, and longitudinal

Abbreviations: AI, artificial intelligence; RCT, randomized controlled trial.

A pilot test of AI-prompting strategies was first conducted using a research topic unrelated to this project. This was done to determine whether inclusion and exclusion criteria should be specified as part of a prompt or through the AI application filters. Filters available in both AI applications included publication date range, study design, and journal ranking based on an external metric. It was determined that including search filters in a prompt did not produce beneficial results. The authors thus used the filters of publication year and study design in both Consensus and Elicit.

Three separate generative AI-prompting strategies were utilized for each of the review topics to determine how different prompts affected search results, as it has been suggested that the quality of AI responses depends on the quality of the prompt.²⁵ The authors first independently created a prompt question for the review articles that they were the primary author of (prompt 1). Next, the author team came together, after reading each of the 4 articles, and created a consensus prompt for the literature search (prompt 2). Lastly, ChatGPT-4 was used to create a prompt for each search (prompt 3). To produce a ChatGPT-4 prompt, a standard formula with the following phrase was used: “I am a researcher conducting a literature review using the research question, XXXXX. Generate one prompt that can be used to search in an LLM.” Table 3 includes the prompts and filters applied.

Table 3.

Prompt Strategies and Filters Applied.

Prompt	Article 1	Article 2	Article 3	Article 4
Prompt 1 (expert)	Find peer-reviewed quantitative research articles that investigated the association between interruptions to nurses in hospital settings and medication administration errors. Focus on studies that report inferential statistics.	Find peer reviewed studies that evaluate the use of OSCEs in nurse practitioner student education.	Investigate the relationship between technostress and compassion fatigue, focusing on the emergence of digital compassion fatigue as a distinct phenomenon for remote healthcare and social service workers.	Find peer-reviewed studies and quality improvement projects that examine distractions and interruptions in nurse handoff.
Prompt 2 (consensus)	Find peer-reviewed quantitative research articles that investigated the correlation between interruptions to nurses in hospital settings and medication administration errors.	Is there evidence to support the use of OSCEs to improve learning outcomes among graduate nurse practitioner students?	Investigate the relationship between technostress and compassion fatigue, focusing on the emergence of digital compassion fatigue as an emerging concept among remote healthcare and social service workers published since the COVID-19 pandemic.	Is there evidence about nurses’ perceptions of distractors and interruptions during nurse-to-nurse handoff?
Prompt 3 (ChatGPT-4)	Identify and summarize quantitative research studies that report inferential statistical analyses (eg, regression, correlation, hypothesis testing) examining the association between interruptions and medication administration errors among nurses in hospital settings. Focus on study design, statistical methods, sample characteristics, and key findings.	Identify and summarize peer-reviewed studies that examine the impact of OSCEs on learning outcomes among graduate nurse practitioner students.	Identify and summarize research studies examining the relationship between technostress and compassion fatigue, with specific attention to the emergence of digital compassion fatigue as a distinct phenomenon among remote healthcare and social service workers.	Summarize qualitative and quantitative research studies that explore nurses’ perceptions of distractors and interruptions during nurse-to-nurse handoff in hospital settings, including impacts on communication quality, patient safety, and handoff effectiveness.
Filters Applied Application A (Consensus)	Search only randomized and non-randomized controlled trials, and observational studies; any year to 2024.	Search only randomized and non-randomized controlled trials, and observational studies; 2015 to 2024.	No open access; Search only randomized and non-randomized controlled trials, observational studies, and systematic reviews; 2019 to current.	Search only randomized and non-randomized controlled trials, and observational studies; 2017 to 2022.
Filters Applied Application B (Elicit)	Search only RCT or longitudinal studies; any year to 2024.	Search only RCT or longitudinal studies; 2015 to 2025 (with manual removal of 2025 citations).	Search only systematic reviews, randomized controlled trials, and longitudinal studies; 2019 to current.	Search only RCT or longitudinal studies; 2017 to 2025 (with manual removal of citations after 2022).

Abbreviation: OSCE, Objective Structured Clinical Exam.
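
The standard formula used to request prompt 3 from ChatGPT-4 amounts to a fixed template with one substituted value. The following is a minimal sketch of that substitution, not the authors' tooling: the function name `build_meta_prompt` and the example research question are hypothetical, and only the template wording comes from the Methods.

```python
# Template wording quoted from the Methods; {question} stands in for "XXXXX".
PROMPT_TEMPLATE = (
    "I am a researcher conducting a literature review using the research "
    "question, {question}. Generate one prompt that can be used to search in an LLM."
)


def build_meta_prompt(question: str) -> str:
    """Fill the template with one research question (hypothetical helper)."""
    cleaned = question.strip().rstrip(".")  # avoid a doubled period after the question
    if not cleaned:
        raise ValueError("A research question is required.")
    return PROMPT_TEMPLATE.format(question=cleaned)


# Placeholder question for illustration only; it is not taken from the study.
example_prompt = build_meta_prompt(
    "how are interruptions associated with medication administration errors"
)
print(example_prompt)
```

Keeping the wording fixed and changing only the research question means that differences in the resulting prompts reflect the topic rather than the phrasing of the request.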

All searches were conducted during the same week in April 2025 to ensure consistency in the search process given that the AI application search capabilities, data access, and underlying models are noted to change regularly. Each author independently ran the 3 prompts related to their review topic in sequential order through the 2 AI applications. The first generation of 10 citations and a sequential generation of outputs up to a total of 100 citations were downloaded for analysis. This approach was utilized as both AI applications indicated that they generate the most relevant citations in order of fit. All citations were downloaded into an Excel spreadsheet for review. Once downloaded, the citations from the matched human-written article were entered into the same Excel file so that a simple sorting function could be used to make a comparison. The primary author had access to all searches to assess the accuracy of the comparison.
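
The comparison described above was done by sorting citations in a spreadsheet. As a rough programmatic analogue, the sketch below matches two reference lists on normalized titles. It is an assumption-laden illustration rather than the study's procedure: the helper names and the sample titles are invented, and title normalization is only one of several possible matching rules (a DOI comparison would be another).

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def match_references(ai_titles, human_titles):
    """Return the human-review titles that also appear in the AI output."""
    ai_keys = {normalize_title(t) for t in ai_titles}
    return [t for t in human_titles if normalize_title(t) in ai_keys]


# Hypothetical titles, invented for illustration only.
human_list = [
    "Interruptions and Medication Errors: A Cohort Study",
    "Handoff Quality in Acute Care",
]
ai_list = [
    "Interruptions and medication errors - a cohort study.",
    "An Unrelated Article",
]
matches = match_references(ai_list, human_list)
print(matches)
```

Exact matching on normalized titles is deliberately conservative. Retitled preprints or truncated titles would be missed, so a manual check of near matches, as the primary author performed here, remains necessary.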

Data Analysis

A matrix of the 4 articles across the 3 prompts as well as the search returns for the first 10 and 100 citations, across both AI applications, was used to complete the analysis. Forty-eight individual searches were conducted across a total of 1320 citations. For each search, the percent of agreement was calculated using the number of matched AI-derived references and the human-derived references included in each review. A simple percent match was calculated as the number of AI references in agreement divided by the total number of studies included in the human-searched literature review.
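
The percent match described above can be written as a one-line calculation. The sketch below is illustrative only: the function name is hypothetical, and the counts are those reported for article 1 (22 included articles) with prompt 3 in Consensus.

```python
def percent_agreement(matched: int, human_total: int) -> float:
    """Matched AI references as a percentage of references in the human review."""
    if human_total <= 0:
        raise ValueError("human_total must be a positive count.")
    return 100.0 * matched / human_total


# Article 1 included 22 articles; Consensus with prompt 3 matched 6 of them
# in the top 10 results and 14 in the top 100 results.
top_10 = percent_agreement(6, 22)
top_100 = percent_agreement(14, 22)
print(f"Top 10: {top_10:.1f}%  Top 100: {top_100:.1f}%")  # Top 10: 27.3%  Top 100: 63.6%
```

Because the denominator is the size of the human review rather than the number of AI results returned, this measure reflects how much of the human reference list the application recovered, not how many of the AI results were relevant.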

To determine if the AI applications discovered any articles that the human searches did not identify, each author manually reviewed the AI reference lists related to their review topic. If an AI-derived reference appeared to fit the inclusion criteria of the completed review article, the author screened the article abstract and full text to decide fit. The reference was then presented to and discussed by the entire author group to make a final determination of fit.

Ethical Considerations

Ethical approval was not required as this study did not involve human participants or human data.

Results

The top 10 and top 100 results across the 2 literature review AI applications were examined. Across all 3 prompts and both AI applications, the overall agreement with the 4 human-searched references ranged from 0% to 63.6% as shown in Tables 4 and 5. The Consensus application had a greater overall mean agreement rate with human searching (21.3%) than the Elicit application (3.7%). Agreement improved from the top 10 results to the top 100.

Table 4.

Consensus and Human Reviews Rate of Agreement.

Prompt	Article 1, # matched (%)	Article 2, # matched (%)	Article 3, # matched (%)	Article 4, # matched (%)
Prompt 1 Top 10	3 (13.6)	3 (20)	0 (0)	3 (18)
Prompt 1 Top 100	13 (59)	5 (33)	1 (3.6)	7 (41)
Prompt 2 Top 10	3 (13.6)	2 (13)	0 (0)	2 (12)
Prompt 2 Top 100	13 (59)	5 (33)	1 (3.6)	4 (23)
Prompt 3 Top 10	6 (27.3)	2 (13)	0 (0)	4 (23)
Prompt 3 Top 100	14 (63.6)	5 (33)	0 (0)	9 (53)

Table 5.

Elicit and Human Reviews Rate of Agreement.

Prompt	Article 1, # matched (%)	Article 4, # matched (%)
Prompt 1 Top 10	0 (0)	2 (12)
Prompt 1 Top 100	0 (0)	2 (12)
Prompt 2 Top 10	1 (4.5)	2 (12)
Prompt 2 Top 100	2 (9)	2 (12)
Prompt 3 Top 10	1 (4.5)	2 (12)

Results by Review Article and Prompts

Articles 1 and 2 were unpublished when data analysis occurred in April 2025. The Consensus application reached a maximum agreement rate of 63.6% for article 1 and 33% for article 2 across the 3 prompt strategies. Articles 3 and 4 were published in indexed journals at the time the AI platforms were searched. Article 3 had very low agreement rates (range 0%-3.6%) across the prompts and both applications. Article 4 had an agreement rate of up to 53% in Consensus and 12% in Elicit. The agreement rate for Elicit varied between published and unpublished articles but was consistently low.

In comparing the 3 prompting strategies, the ChatGPT-4 prompt (prompt 3) performed as well as or slightly better than the other prompts in Consensus. Elicit performed equally poorly across the 3 prompts (see Tables 4 and 5).

New AI References Identified

The Consensus application identified 1 reference not identified in the human search for article 3 that was a fit with inclusion criteria for the article. No new references for articles 1, 2, or 4 were identified in either AI platform.

Discussion

There has been limited investigation of the accuracy and overall quality of AI applications compared to humans for literature identification. This study provides insights into the performance of 2 AI literature review applications compared with human literature searches. The findings can be used to guide nurse educators, researchers, and practitioners who are considering using AI to search for literature.

Low agreement between references from human-written literature reviews and AI applications was found. A primary reason is likely the limitations on the data that AI applications have access to. For example, both Elicit and Consensus have access to Semantic Scholar, which has partnerships with academic publishers and academic journals, including PubMed.²⁶ However, AI applications can only access full-text downloads of articles that are open access and often cannot get access to databases that require commercial licenses or sit behind paywalls. In this study, Consensus had a higher rate of agreement with the 4 original human literature reviews compared to Elicit. At the time of data collection, the Consensus application cited having access to 200 million articles from the Semantic Scholar database as compared to Elicit, which reported access to 25 million academic articles from Semantic Scholar. As of March 2026, Consensus reported access to over 250 million articles with additional article access to OpenAlex and a crawl of the scholarly web. Elicit reported access to 138 million articles and access to OpenAlex, indicating a rapid increase in access for both applications. Thus, it is likely that results would differ if the same study methods were applied in March 2026.

In 2017, PubMed deployed a new relevance sort option (labeled as Best Match) to improve searches with a more relevant sort order of references displayed by topic.²⁷ In this study, the researchers similarly expected that the top 10 results for each AI application would be the most relevant sets of references. This was not supported, as shown by the low number of matches in the top 10 results from both applications. This is one example of how traditional literature searching, a learned skill, may not translate directly to AI search applications, and thus nurses need to understand the uses and limitations of AI. It has been suggested that the extent of a human’s AI knowledge has a potential influence on the results of an AI search process. Kacena et al¹⁶ reported that a researcher with more advanced AI knowledge may be more efficient and effective in guiding a generative AI application and altering search settings to produce more relevant results. Improving knowledge of how to create and use AI prompts can result in improved outputs, as prompting is a learnable skill.²⁸

The American Association of Colleges of Nursing (AACN) is integrating AI as a component of foundational informatics competencies (domain 8),²¹ and the 2026 American Academy of Nursing position statement on AI in healthcare endorses AI as a transformative force that requires nurses to be equipped with AI literacy.²⁹ The N.U.R.S.E.S framework, which stands for Navigate AI basics, Utilize AI strategically, Recognize AI pitfalls, Skills support, Ethics in action, and Shape the future, supports AI literacy across all levels of nursing.³⁰ With the anticipation of nurses integrating AI in their work as educators, researchers, or clinicians, nursing faculty, who report limited knowledge and skill with AI use,³¹ will need to quickly become proficient to role model AI competency.

Limitations

The study included a small sample of literature review applications (n = 2) and review articles (n = 4), thus limiting generalizability. The choice of 2 AI applications was purposeful as Consensus and Elicit were common AI literature review tools at the time of data extraction, both having been in existence for several years at the time of our search, and noted for their access to biomedical and social science data. Since our search, additional applications like Paperguide and Research Rabbit have been released and the features of Consensus and Elicit have expanded. The rapid growth of these applications makes it a challenge to do widespread comparisons. Comparing outcomes of 4 articles is acknowledged as a limitation. Most generative AI applications like those examined in this study are heavily reliant on the prompt utilized, especially in a single-turn prompt application. Expanded experimentation with different prompts and filtering strategies may have yielded improved results. Lastly, these types of applications evolve quickly with updated models, filtering options, and other system enhancements that could significantly improve performance over time.

Conclusion

Literature review applications like those examined in this study may be suitable to support human literature work but should not replace human work. If AI is used for a literature search, the use of more than one AI application is recommended, as different applications examine and access varied datasets, and vary in functional options (eg, filters) and underlying algorithms.

Footnotes

ORCID iDs

Jenny O’Rourke

Ginger Schroers

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

AI Use

AI applications were used in the data collection for this article as described. AI was not used for any of the writing or conceptualization of the article.

References

1. Trends in artificial intelligence. Epoch AI. Updated February 5, 2026. Accessed March 5, 2026. https://epoch.ai/trends

2. Reuters. OpenAI’s weekly active users surpass 400 million. February 20, 2025. Accessed June 8, 2025. https://www.reuters.com/technology/artificial-intelligence/openais-weekly-active-users-surpass-400-million-2025-02-20/

3. Legatt. 90% of college students use AI: higher Ed needs AI fluency support now. Forbes. September 18, 2025. Accessed March 5, 2026. https://www.forbes.com/sites/avivalegatt/2025/09/18/90-of-college-students-use-ai-higher-ed-needs-ai-fluency-support-now/

4. Robert. The impact of AI on work in higher education. EDUCAUSE. January 12, 2026. Accessed March 6, 2026. https://www.educause.edu/research/2026/the-impact-of-ai-on-work-in-higher-education

5. OpenAI. GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. ND. Accessed June 8, 2025. https://openai.com/index/gpt-4/

6. Doston, Fontenot, Morris, Hebert. The use of artificial intelligence in nursing education: a scoping review. J Nurs Educ. 2025;64(8):479-488. doi:10.3928/01484834-20250313-03

7. Rony MKK, Das, Khalil, et al. The role of artificial intelligence in nursing care: an umbrella review. Nurs Inq. 2025;32(2):e70023. doi:10.1111/nin.70023

8. Oermann, Owens, Carter-Templeton, Peterson, Bailey HE. Using artificial intelligence for scholarly writing. Am J Nurs. 2025;125(11):52-55. doi:10.1097/AJN.0000000000000179

9. Khalifa, Albadawy. Using artificial intelligence in academic writing and research: an essential productivity tool. Comput Methods Programs Biomed Update. 2024;5:100145. doi:10.1016/j.cmpbup.2024.100145

10. Chetwynd. Critical analysis of reliability and validity in literature reviews. J Hum Lact. 2022;38(3):392-396. doi:10.1177/08903344221100201

11. Tricco, Lillie, Zarin, et al. A scoping review on the conduct and reporting of scoping reviews. BMC Med Res Methodol. 2016;16:15. doi:10.1186/s12874-016-0116-4

12. Borah, Brown, Capers, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017;7(2):e012545. doi:10.1136/bmjopen-2016-012545

13. Oermann, Knafl KA. Strategies for completing a successful integrative review. Nurse Author Ed. 2021;31(3-4):65-68. doi:10.1111/nae2.30

14. Mostafapour, Fortier, Pacheco, Murray, Garber. Evaluating literature reviews conducted by humans versus ChatGPT: comparative study. JMIR AI. 2024;3:e56537. doi:10.2196/56537

15. Egan, Leak-Smith, Hanna-Amodio, et al. AI-assisted vs human-only evidence review: results from a comparative study. April 23, 2025. Accessed June 25, 2025. https://www.gov.uk/government/publications/ai-assisted-vs-human-only-evidence-review/ai-assisted-vs-human-only-evidence-review-results-from-a-comparative-study#key-results-how-the-2-reviews-compared

16. Kacena, Plotkin, Fehrenbacher JC. The use of artificial intelligence in writing scientific review articles. Curr Osteoporos Rep. 2024;22(1):115-121. doi:10.1007/s11914-023-00852-0

17. Bolaños, Salatino, Osborne, Motta. Artificial intelligence for literature reviews: opportunities and challenges. Artif Intell Rev. 2024;57:259. doi:10.1007/s10462-024-10902-3

18. Apata, Kwok, Lee YH. The use of generative artificial intelligence (AI) in academic research: a review of the consensus app. Cureus. 2025;17(7):e87297. doi:10.7759/cureus.87297

19. Bernard, Sagawa Jr, Bier, Lihoreau, Pazart, Tannou. Using artificial intelligence for systematic review: the example of elicit. BMC Med Res Methodol. 2025;25(1):75. doi:10.1186/s12874-025-02528-y

20. Fenske, Otts JAA. Incorporating generative AI to promote inquiry-based learning: comparing elicit AI research assistant to PubMed and CINAHL complete. Med Ref Serv Q. 2024;43(4):292-305. doi:10.1080/02763869.2024.2403272

21. 2025 THOUGHT LEADERS ASSEMBLY of AI to Transform Nursing Education. Accessed March 10, 2026. https://www.aacnnursing.org/Portals/0/PDFs/Reports/Thought-Leadership/AACN-2025-Thought-Leaders-Assembly-Summary.pdf

22. Schroers, Huggins, Sasangohar, O’Rourke. Associations between interruptions and medication administration errors among nurses in hospital settings: a scoping review of quantitative studies. J Adv Nurs. 2026;82(4):2551-2569. doi:10.1111/jan.70032

23. Byrne. Digital compassion fatigue as an emerging phenomenon for registered nurses experiencing technostress. Appl Clin Inform. 2025;16(3):708-717. doi:10.1055/a-2564-8809

24. Vanderzwan, Kilroy, Daniels, O’Rourke. Nurse-to-nurse handoff with distractors and interruptions: an integrative review. Nurse Educ Pract. 2023;67:103550. doi:10.1016/j.nepr.2023.103550

25. Park, Choo. Generative AI prompt engineering for educators: practical strategies. Journal of Special Education Technology. 2025;40(3):411-417. doi:10.1177/01626434241298954

26. Semantic Scholar Publishers. Semanticscholar.org. Published 2024. Accessed March 10, 2026. https://www.semanticscholar.org/about/publishers

27. Fiorini, Canese, Starchenko, et al. Best match: new relevance search for PubMed. PLoS Biol. 2018;16(8):e2005343. doi:10.1371/journal.pbio.2005343

28. Sloan. Study: generative AI results depend on user prompts as much as models | MIT Sloan. MIT Sloan. Published August 4, 2025. Accessed March 11, 2026. https://mitsloan.mit.edu/ideas-made-to-matter/study-generative-ai-results-depend-user-prompts-much-models

29. Position Statement: Artificial Intelligence in Health Care–American Academy of Nursing. Aannet.org. Published 2026. Accessed March 17, 2026. https://aannet.org/page/AI-position-statement-2026

30. Hoelscher, Pugh. N.U.R.S.E.S. embracing artificial intelligence: a guide to artificial intelligence literacy for the nursing profession. Nurs Outlook. 2025;73(4):102466. doi:10.1016/j.outlook.2025.102466

31. Ehmke, Bridges, Patel SE. Self-perceived knowledge, skills, and attitude of nursing faculty on generative artificial intelligence in nursing education: a descriptive, cross-sectional study. Teach Learn Nurs. 2025;20(3):222-227. doi:10.1016/j.teln.2025.01.029