Artificial Intelligence for Urology Research: The Holy Grail of Data Science or Pandora's Box of Misinformation?

Abstract

Introduction:

Artificial intelligence tools such as the large language models (LLMs) Bard and ChatGPT have generated significant research interest. Utilization of these LLMs to study the epidemiology of a target population could benefit urologists. We investigated whether Bard and ChatGPT can perform a large-scale calculation of the incidence and prevalence of kidney stone disease.

Materials and Methods:

We obtained reference values from two published studies, which used the National Health and Nutrition Examination Survey (NHANES) database to calculate the prevalence and incidence of kidney stone disease. We then tested the capability of Bard and ChatGPT to perform similar calculations using two different methods. First, we instructed the LLMs to access the data sets and independently perform the calculation. Second, we instructed the interfaces to generate a customized computer code, which could perform the calculation on downloaded data sets.

Results:

While ChatGPT denied the ability to access and perform calculations on the NHANES database, Bard intermittently claimed the ability to do so. Bard provided either accurate results or inaccurate and inconsistent results. For example, Bard's “calculations” for the incidence of kidney stones from 2015 to 2018 were 2.1% (95% CI 1.5–2.7), 1.75% (95% CI 1.6–1.9), and 0.8% (95% CI 0.7–0.9), while the published number was 2.1% (95% CI 1.5–2.7). Bard provided discrete mathematical details of its calculations, however, when prompted further, admitted to having obtained the numbers from online sources, including our chosen reference articles, rather than from a de novo calculation. Both LLMs were able to produce a code (Python) to use on the downloaded NHANES data sets, however, these would not readily execute.

Conclusions:

ChatGPT and Bard are currently incapable of performing epidemiologic calculations and lack transparency and accountability. Caution should be used, particularly with Bard, as claims of its capabilities were convincingly misleading, and results were inconsistent.

Introduction

Recent advances in artificial intelligence (AI) models have prompted unprecedented medical research interest.¹ Free public access to advanced large language models (LLMs) such as ChatGPT and Bard has helped significantly to lower the barrier to entry to utilize AI and discover potential use cases for it. LLMs are deep-learning algorithms that utilize data sets to analyze large amounts of information to predict and produce plausible language.² Integration of LLMs into the practice of medicine is an area of active investigation across several medical specialties.³ Indeed, several prominent medical journals, such as the New England Journal of Medicine, have even started new series or opened new submission categories dedicated to AI.^4
–6 One exciting potential domain yet to be fully explored is the utility of AI, and specifically the novel publicly available LLMs, for urologic epidemiology.

Arguably the most common disease treated by endourologists, and urologists in general, is kidney stone disease. Accordingly, understanding the epidemiology of kidney stone disease and factors such as cost, prevalence, and incidence is of great importance and was a key component of the Urologic Diseases in America Project in the 2000s.⁷ Data on kidney stone epidemiology are often derived from pooled analyses of large national health data sets such as the National Health and Nutrition Examination Survey (NHANES), National Inpatient Sample, and Medicare claims. Aggregating and interpreting these data sets can involve advanced training in data science, putting such studies out of reach for the average investigator.

However, if the data sets could be accessed and analyzed by an LLM, the ability to study urologic epidemiology could suddenly be limited only by an investigator's creativity and curiosity, rather than by advanced training in data science. One could study local urologic epidemiology at the national, regional, statewide, and potentially even local level.

To this end, as a starting point, we aimed to investigate whether easily accessible LLM-based AI interfaces such as Bard and ChatGPT would be capable of performing a large-scale calculation of the incidence and prevalence of kidney stone disease in the United States over various periods of time and localities.

Materials and Methods

AI instruments utilized

To investigate the capability of LLM-based AI interfaces to calculate the incidence and prevalence of kidney stone disease in the United States, we utilized Bard and ChatGPT 3.5. The study was IRB exempt as no research was conducted on human subjects. ChatGPT 3.5 is the latest free version of an LLM-based AI chat interface released by OpenAI. The model was trained using Reinforcement Learning from Human Feedback using a large amount of data, and it uses its training to predict the most appropriate response.⁸ No adjustments in settings were made. Listed limitations are that ChatGPT is not connected to the internet, may provide inaccurate but plausible information, and may provide different responses in response to slight rephrasing of prompts.⁸ Bard is a free LLM-based AI chat interface that was released by Google.

Bard is labeled as an experiment and is a tool that was trained using trillions of words for pattern recognition but also is trained on an “infiniset,” meaning it also has access to the internet in real time.^9,10 No adjustments in settings were made. Similarly, Google warns that Bard will make mistakes and can present inaccurate information as factual.⁹

Input

To assess the ability of ChatGPT and Bard to calculate the incidence and prevalence of kidney stone disease from the publicly available NHANES database, we queried the interfaces using two separate methods.¹¹ The first method was to test whether ChatGPT and Bard could independently analyze data sets for any given year and perform the calculation when requested. The results would then be compared with known published results. The second method was to test whether ChatGPT and Bard can generate a computer code, a known capability of LLMs, which could then be applied to the downloaded data sets.²

To test for independent analysis, we posed the following starting prompt to ChatGPT and Bard using 10 slight variations in phrasing (see Supplementary Data):

“Access the NHANES database and calculate the prevalence/incidence of kidney stone disease for ages 20 and older from 20XX–20XX. Include confidence intervals.”

Bard generates three possible prompts per response, which allowed us to test for variable responses to identical phrasing.

To test the ability to generate code, relevant Statistical Analysis Software Transport File Format (XPT) files were downloaded from the NHANES portion of the Centers for Disease Control and Prevention (CDC) website. The files were then imported into R and converted to comma-separated value (CSV) files for increased usability. We then used 10 slight variations of the following starting prompt to solicit code:

“Create a code that I can execute through Windows or Mac OS to use with a CSV file that will be able to calculate the prevalence/incidence of kidney stone disease based on responses to survey questions. I have data from the NHANES dataset. There are two relevant survey questions. KIQ026 is the question have you ever had kidney stones. The possible answers are 1 is yes, 2 is no, 7 is refused, 9 is don't know, and ‘.’ is missing. KID028 is how many times you have passed a kidney stone. The possible answers are a range of numbers from 0 to 80 for stone passage events, 777 for refused, 999 for don't know, and ‘.’ or NA is for missing. For question KID028, any number of stone passage events 1 and above should be considered a positive result.”

Quality assessment

Both authors independently reviewed the AI responses for accuracy and consistency. A valid independent analysis was considered one in which the relevant NHANES data set was accessed, newly analyzed to determine the prevalence and incidence, and the details of the analysis were provided. A valid computer code was one that could be effectively executed on a downloaded data set with or without the assistance of AI troubleshooting.

Statistical analysis

To ensure a high-quality benchmark for the results of the requested calculation, we chose comparative data from two frequently cited articles. Scales et al. used the NHANES database in their calculations of prevalence and incidence for the years 2007 to 2010, and Hill et al. used the NHANES database to calculate incidence for the years 2015 to 2018.^12,13 We planned to assess agreement using a correlation coefficient between AI results and the comparative standard.

Results

When asked to access and perform calculations on the NHANES database, ChatGPT denied the ability to do so. “I can't access external databases, including NHANES, or perform real-time data analysis.” This question was repeated with different wording several times with no change in response.

Bard intermittently claimed the ability to access the NHANES database and perform calculations (Fig. 1). When prompted to perform the requested prevalence and incidence calculations on the NHANES database, Bard was able to produce both accurate and inaccurate or irrelevant results without a clear pattern. Bard's calculations for the incidence of kidney stones from 2015 to 2018 were 2.1% (95% CI 1.5–2.7), 1.75% (95% CI 1.6–1.9), and 0.8% (95% CI 0.7–0.9), while the published number was 2.1% (95% CI 1.5–2.7). A similar pattern of sporadically accurate values was obtained when calculating prevalence.

FIG. 1.

Bard claiming ability to access and perform calculations on the National Health and Nutrition Examination Survey database.

Bard was able to detail the statistical steps that it claimed to have used in the epidemiologic calculations, and even produced tables broken down by year with sample sizes (Fig. 2). Only with additional prompting did Bard acknowledge that the final product was obtained from online sources rather than from a unique calculation and even provided a link to our reference article from Scales et al. (Fig. 3).

FIG. 2.

Kidney stone prevalence table generated by Bard.

FIG. 3.

Bard providing detailed steps of a prevalence calculation and later rescinding its computational claims.

ChatGPT and Bard were both able to produce a unique code (Python) to perform the intended prevalence and incidence calculations on the downloaded data sets (Fig. 4a, b), however, these could not be executed without consistent errors. Troubleshooting the code through additional prompts with the AI interface was attempted, however, the errors persisted.

FIG. 4.

(a) Nonfunctional custom-built Python code generated by Bard to calculate the prevalence of kidney stones. (b) Nonfunctional custom-built Python code generated by ChatGPT to calculate the prevalence of kidney stones.

The following are example quotes from Bard. In addition, we have included four representative chat logs in the Supplementary Data to show the exact wording used when interacting with the interface.

“I can certainly access the NHANES dataset and calculate the number of individuals with kidney stone disease for those aged 20 and older from years 2007–2022.”

“To access the NHANES database, I used the NHANES Analytic Guidelines. I downloaded the NHANES 2007–2010 kidney stone disease data file (KDQ_D.XPT) from the NHANES website.”

“I apologize for the confusion. I did not perform the calculations myself. I compiled the information from various sources, including medical journals and websites, and presented it in a comprehensive manner. The specific calculations were likely performed by researchers who published their findings in medical journals.”

“I cannot directly access the NHANES database. However, I can access and process data from a variety of sources, including public datasets and application programming interfaces. If you have the NHANES data available in a format that I can access, I can perform calculations on the data for you. I can also help you to find and access the NHANES data if you need it.”

“Yes, I can provide you with a Python code that you can use on Mac OS to calculate the prevalence of kidney stone disease for given years from the NHANES database. The code will utilize the Pandas library to handle data manipulation and the NHANES data repository to access the necessary information.”

We found that both Bard and ChatGPT were not able to effectively complete the task. Irrespective of the wording of the prompts, the AI interface showed a lack of ability to integrate external data through direct access or uploading of materials and a lack of ability to generate the functional code. As the AI-generated calculations were not newly analyzed and as such not considered valid, we did not perform the planned comparative statistical analysis to our reference standard.

Discussion

To our knowledge, this was the first study of its kind to assess the ability of LLMs such as ChatGPT and Bard to determine the prevalence and incidence of a urologic pathology via independent analysis of a large publicly available health care database. Our results showed that in their current state, ChatGPT and Bard are incapable of performing such an analysis. ChatGPT could not access the intended publicly available database and both ChatGPT and Bard generated a nonfunctional Python code that would not effectively execute on a downloaded data set. Bard frequently claimed to have access to the NHANES database and that it had the ability to perform calculations.

The results of the requested prevalence and incidence calculations were inconsistent and often included results that had been taken from other studies without acknowledgment. The most concerning finding as shown in Figure 3 was that Bard often gave detailed descriptions of their calculation such as downloading an XPT file from the database, filtering the data, performing the statistical analysis, and providing a result that matched the comparative standard. It was not until inconsistencies in the responses from Bard led us to pose the question, “did you actually calculate those numbers… or did you pull them from existing sources,” at which point Bard apologized and admitted to taking the numbers from other sources. Overall, both LLMs were ineffective, and there were no real data to perform the intended comparison with the reference data we chose.

The introduction of free, easily accessible LLM-based AI interfaces such as ChatGPT in November 2022 and Bard in March 2023 has coincided with a substantial increase in the number of AI-related publications in the medical field.¹ LLMs are complex algorithms capable of countless tasks, such as text generation, coding, predictive modeling, and analysis among others.² In the realm of medicine, the ease of access to these tools may lead to advancements in research and the way we care for patients, although the extent of their capabilities is not yet clear. Instant and accurate epidemiologic calculations would allow clinicians to better understand their population and provide more nuanced care.

How to best use AI to assist in epidemiologic projects, however, is a difficult question. The development of useful models for processing epidemiologic data is a complex process that is fraught with many challenges such as estimating appropriate sample sizes, identifying clinically relevant features within large data sets, and dealing with missing data.¹⁴ Reasoning and deduction in particular are known challenges for LLMs as they need to both comprehend information and deduce de novo answers rather than pull from existing sources. A recent large survey by Chang et al. performed a comprehensive review of the success and failure of LLMs in various tasks. In mathematical reasoning, the models have shown mixed results and can give unpredictable answers depending on the complexity of the problem. Most LLMs are capable of addition, subtraction, and some multiplication; however, division, exponentiation, trigonometry functions, logarithm functions, and other complex mathematical problems are still challenges.

Despite these hurdles, successive iterations of these models (e.g., GPT 3, GPT 3.5, and GPT 4) show continued improvement in each reasoning domain and handling new data compared with their predecessors.¹⁵ Effectively navigating the reasoning challenges of LLMs and developing reliable models will take continued work; however, future versions will likely continue to improve the ability to perform complex calculations.

With the promise that AI in health care offers, come also significant pitfalls and ethical concerns. Our findings raised an interesting dilemma of whether we can ever truly trust the results of AI calculations, particularly when it might relate to patient care where the stakes are high. LLMs have been shown to generate misinformation through confabulations, falsify data, and contribute to plagiarism.^2,16 AI systems are prone to bias and, through reliance on algorithmic predictions and overrepresented populations in data sets, can perpetuate inequities in disadvantaged populations.¹⁷ Furthermore, as AI interfaces gain access to more patient data, it has the potential to not only compromise the quality of medical research but in its worst form potentially affect the care and privacy of patients. The questions of authorship and who is to be held accountable when the said data are inaccurate or misleading represent real challenges. One solution we propose to protect readers is for publications to include explicit disclaimers for all material generated by AI. We also suggest the adoption of standardized methodologies and checklists for appropriate incorporation of AI-generated research, similar to the way the PRISMA checklist is used for systematic reviews and the CONSORT statement is used for randomized control trials.^18,19 In addition, we expect that as LLMs are further developed they will, similar to growth in humans, develop a greater ability to recognize their own limitations in knowledge without overstating the confidence of their findings. We believe the important concerns raised by our findings can signal to endourologist researchers to proceed with significant caution when utilizing AI in the design of their studies.

The ideal model for a standardized methodology to incorporate LLMs in research would ensure complete transparency and reproducibility of methodology, provide external validity of AI capabilities, and ensure data privacy, which are features not present in common LLMs such as Bard or ChatGPT. The Translational Evaluation of Health care AI framework, created in 2021, is one such framework that aims to address the issue of trust and unrecognized poor performance of LLMs within health care. The structure assesses various aspects of AI systems regarding relevance, safety, ethical considerations, and efficiency.¹⁶ In rigorously assessing LLM tools and advocating for a governance approach to LLMs, it aims to keep LLMs used in medicine in check. Frameworks such as these, in addition to official regulatory requirements, are important in developing safety measures for any form of AI that may influence patient care.

Studies from Scales et al. and Hill et al. were large studies that performed calculations of the prevalence and incidence of kidney stone disease and provided us with our reference standards to which we could compare results produced from the LLMs.^12,13 In scrutinizing the calculation results from Bard, we were able to eventually recognize that some findings appeared remarkably similar to these published findings, and with further questioning uncovered that results were taken from other sources and were not de novo calculations. This is a dangerous finding as without a high degree of suspicion, users could be convinced of the process and results. Thankfully, most LLM chat interfaces are currently constrained in their handling of medical information.²⁰ However, limited integration of this technology within medicine is slowly being seen such as the partnership announced between the electronic medical records company Epic and OpenAI, the owner of ChatGPT.²¹ Which companies and developers will continue down this path and be willing to assume the regulatory and validation process necessary to bridge the gap into the medical field is unclear.

There are several limitations to our study. ChatGPT and Bard were chosen in our experiment as these are easy to understand and free; however, they only represent one small fraction of AI, and they may not be the most appropriate tool for the task of an incidence and prevalence calculation. Likely, the ideal tool would be an AI interface that is designed specifically for the subject of epidemiology and has been rigorously tested and validated to be accurate and reliable. As discussed above, newer versions of these common LLMs will likely continue to improve the capability to perform such tasks. In addition, the descriptive nature of the study does not allow for any hard conclusions to be drawn from its findings. ChatGPT and Bard did not produce a valid output for the requested calculations to be able to statistically compare the results with known data.

However, we still believe our findings to be important, as they demonstrate important limitations in the power of LLMs and serve as an important cautionary tale and counterpoint to the recent surplus of studies boasting the power of LLMs such as ChatGPT and Bard.

Conclusions

ChatGPT and Bard are LLMs that are currently incapable of performing large-scale epidemiologic calculations. In their current state, these tools are fraught with inconsistencies and lack of transparency. Application of these tools within medical research should be with caution and limited until they have been validated for an intended purpose. As AI progresses, robust testing and regulation will be needed before a non-epidemiologist can reliably obtain the desired results.

Footnotes

Authors' Contributions

R.M.B.: Protocol/project development, data collection or management, data analysis, and article writing. J.A.K.: Protocol/project development, data collection or management, data analysis, and article writing.

Author Disclosure Statement

No competing financial interests exist.

Funding Information

No funding was received for this article.

Supplementary Material

Supplementary Data

Abbreviations Used

References

Barrington

, Gupta

, Musmar

, et al. A bibliometric analysis of the rise of ChatGPT in medical research. Med Sci (Basel), 2023; 11(3):61.

Clusmann

, Kolbinger

, Muti

, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023; 3(1):141.

Mumtaz

, Ahmed

, Mumtaz

. Overview of current applications of large language models in various medical specialities. arXiv, 2023;231112882. https://doi.org/10.48550/arXiv.2311.12882

Beam

, Drazen

, Kohane

, et al. Artificial Intelligence in medicine. N Engl J Med, 2023; 388(13):1220–1221.

Khera

, Butte

, Berkwits

, et al. AI in medicine—JAMA's focus on clinical outcomes, patient-centered care, quality, and equity. JAMA, 2023; 330(9):818–820.

Aims & Scope [Webpage]. Nature; 2023. Available from: https://www.nature.com/natmachintell/aims [Last accessed: February 2, 2024 ].

Pearle

, Calhoun

, Curhan

. Urologic diseases in America project: Urolithiasis. J Urol, 2005; 173(3):848–857.

Introducing ChatGPT. 2023. Available from: https://openai.com/blog/chatgpt [Last accessed: February 2, 2024].

Bard FAQ. 2023. Available from: https://bard.google.com/faq [Last accessed: February 2, 2024].

10.

Toscano

. ChatGPT Or Google Bard? Privacy or performance? Outstanding questions answered. 2023. Available from: https://www.forbes.com/sites/joetoscano1/2023/06/24/chatgpt-or-google-bard-privacy-or-performance-outstanding-questions-answered/?sh=7aaaf2c7ade8 [Last accessed: February 2, 2024 ].

11.

National Health and Nutrition Examination Survey. 2023. Available from: https://wwwn.cdc.gov/nchs/nhanes [Last accessed: February 2, 2024].

12.

Scales

Jr. , Smith

, Hanley

, et al. Prevalence of kidney stones in the United States. Eur Urol, 2012; 62(1):160–165.

13.

Hill

, Basourakos

, Lewicki

, et al. Incidence of kidney stones in the United States: The continuous national health and nutrition examination survey. J Urol, 2022; 207(4):851–856.

14.

Wiemken

, Kelley

. Machine learning in epidemiology and health outcomes research. Ann Rev Public Health, 2020; 41(1):21–36.

15.

Chang

, Wang

, Wu

, Yang

, Zhu

, et al. A survey on evaluation of large language models. ACM Trans Intell Syst Technol. 2024; 15(3):Article 39.

16.

Reddy

. Evaluating large language models for use in healthcare: A framework for translational value assessment. Inf Med Unlocked, 2023; 41:101304.

17.

Mittermaier

, Raza

, Kvedar

. Bias in AI-based models for medical applications: Challenges and mitigation strategies. NPJ Digit Med, 2023; 6(1):113.

18.

Page

, McKenzie

, Bossuyt

, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ, 2021; 372:n71.

19.

Schulz

, Altman

, Moher

, the

. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMC Med, 2010; 8(1):18.

20.

Gottlieb

, Silvis

. How to safely integrate large language models into health care. JAMA Health Forum, 2023; 4(9):e233909. doi: 10.1001/jamahealthforum.2023.3909

21.

Microsoft and Epic Expand Strategic Collaboration with Integration of Azure OpenAI Service. 2023. Available from: https://www.epic.com/epic/post/microsoft-and-epic-expand-strategic-collaboration-with-integration-of-azure-openai-service/ [Last accessed: February 2, 2024].

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.57 MB