A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation With an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines for Bipolar Disorder: Un dialogueur pour la prise en charge du trouble bipolaire : utiliser la génération augmentée par récupération avec un grand modèle de langage (GML) à poids ouverts pour répondre aux questions cliniques fondées sur les lignes directrices de 2018 de CANMAT et de l'ISBD relatives au trouble bipolaire

Abstract

French

Objective

Clinical practice guidelines support evidence-based care but are often underused due to complexity, time constraints, and navigation challenges. We investigated whether a conversational agent (chatbot) using an open-weight large language model (LLM) with retrieval-augmented generation (RAG) could provide guideline-consistent answers for bipolar disorder management based on the full 2018 Canadian Network for Mood and Anxiety Treatments (CANMAT) and ISBD guidelines, comparing against a system using only the base LLM.

Method

We developed a multi-step RAG-based chatbot that retrieves relevant guideline sections and generates responses using Llama 3.3 70B. Twenty-one clinical vignettes spanning all guideline sections were created. Six expert psychiatrists generated queries and were presented with paired responses without labels from 2 systems: one using the base Llama 3.3 70B model, the other RAG-enhanced. Responses were rated for guideline consistency on a 3-point scale, and were analyzed using mixed-effects ordinal logistic regression.

Results

Experts evaluated 126 responses, of which 110 (87.3%) were rated as correct as or more correct than the baseline system. The RAG system produced 80 answers (63.5%) rated fully consistent with the guidelines versus 24 (19.0%) for baseline, and only 10 answers with major deviation (7.9%) versus 48 (38.1%) for baseline. Ordinal regression showed RAG responses were significantly more likely to be more correct (OR = 9.1, 95% CI [5.3-16.3], P < 0.001), which was consistent across all raters. Preference ratings favoured RAG answers in 78.7% of cases. Performance varied by vignette, with some errors in both retrieval and reasoning.

Conclusion

The use of RAG with an open-weight model helped produce answers consistent with the CANMAT guidelines across vignettes that required adapting or combining guideline text, suggesting a proof-of-concept of a bipolar guideline chatbot. We identified areas to improve results and evaluation. Future work should explore additional retrieval strategies and LLMs, and test in more naturalistic settings.

Plain Language Summary Title

Evaluation of an AI system to answer treatment questions from bipolar disorder medical guidelines.

Plain Language Summary

We evaluated the effectiveness of an AI system at answering treatment questions from the bipolar disorder medical guidelines.

Keywords

bipolar disorder NLP machine learning artificial intelligence

Introduction

Bipolar disorder is a prevalent and chronic psychiatric condition that presents a significant public health challenge globally,¹ with a lifetime prevalence rate of bipolar I at 0.87% and bipolar II at 0.57%.^2,3 Its management is complicated by the illness's recurrent, multi-phased course, characterized by periods of wellness interrupted by episodes of mania, hypomania, and depression, along with its lifelong nature and high rates of comorbidities.⁴ Clinical guidelines, such as those developed by the Canadian Network for Mood and Anxiety Treatments (CANMAT) and the International Society for Bipolar Disorders (ISBD),⁵ aim to enhance care through evidence-based recommendations. These guidelines were developed using a rigorous overview of evidence, combined with expert clinical opinion, to produce pragmatic, hierarchical recommendations organized by treatment lines and levels of evidence. Previous evidence has shown that physicians' adherence to the CANMAT/ISBD guidelines can improve patient outcomes.⁶ Despite the availability of clinical guidelines, their use in practice remains inconsistent,^7,8 leading to suboptimal outcomes such as ineffective medications and inadequate consideration of comorbid conditions. Multiple barriers may play a role, including limited time and clinician skill,⁹ guideline complexity,¹⁰ challenges in applying recommendations to specific clinical scenarios,^11,12 and difficulty navigating the documents.¹³

To address these barriers, previous studies have examined digital tools such as C-IMPACT BD, a web application that provides rule-based recommendations from the CANMAT bipolar guidelines and has been shown to increase use of first-line therapies.¹⁴ Recent advances in artificial intelligence offer additional opportunities to overcome barriers to guideline implementation. In particular, conversational agents, or “chatbots,” allow users to pose questions in natural language and receive support for treatment decisions.¹⁵ These chatbots use large language models (LLMs) to answer questions. These models have been trained on language content from across the internet and may include knowledge relevant to clinical decision making,¹⁶ though this information can be incomplete or incorrect, or drawn from more general sources such as internet forums. However, a challenge limiting their direct application in clinical settings is the generation of inaccuracies. These errors may arise from incorrect, biased, or outdated information, or as “hallucinations,” where the model fabricates content.¹⁷

Retrieval-augmented generation (RAG)¹⁸ is a technique that can improve the accuracy of chatbot systems.^19,20 When a query is submitted, the RAG framework retrieves relevant information from domain-specific sources such as clinical guidelines, then passes the query, retrieved content, and system instructions to the LLM to generate a response. Initial studies have shown that this technique can reduce hallucinations and improve factual relevance.^21,22 Examples of this approach in clinical settings are emerging, with LLMs being integrated with specialized texts and guidelines for managing specific conditions, such as thyroid disease,²³ liver disease^24,25 and anticoagulation protocols for gastrointestinal procedures.²⁶ Because LLMs have a limited context window (the maximum text they can process at once), they cannot always include an entire guideline in a single prompt.²⁷ When context windows are smaller, RAG or a related technique is needed for retrieving relevant sections from large documents. Even with larger context windows, RAG helps focus responses on the most pertinent content rather than overwhelming the model with unnecessary text.

While RAG-based applications are emerging in related areas—including mental health support,²⁸ psychiatric documentation,²⁹ and depression detection³⁰—to our knowledge, RAG systems have not yet been studied for providing treatment recommendations in mental health using clinical guidelines. We identified one study that examined a chatbot for decision support in bipolar disorder treatment.³¹ This study did not use RAG; it augmented the chatbot's prompt with an excerpt from the U.S. Veterans Administration 2023 guidelines on bipolar depression,³¹ and only tested recommendations for that mood state. Their augmented model prioritized expert-recommended medications 50.8% of the time, compared with 23.0% for the unaugmented model. The study used a proprietary, closed-weight model, GPT4-turbo, through OpenAI.

In this work, we investigate the use of RAG to provide treatment recommendations for bipolar disorder, covering all sections of the 2018 CANMAT guidelines, the world's most cited guideline for bipolar disorder. Our question set spans all mood states, both bipolar disorder types I and II, and addresses complex considerations such as age, comorbidities, side effects, special populations, and medication history. These vignettes often required reasoning that could not be satisfied by returning guideline text verbatim, as they involved synthesizing information from multiple sections and reconciling competing factors. We believe this reflects the complexity of treating mental illnesses like bipolar disorder and distinguishes our approach from prior guideline-based RAG systems, which have largely focused on simple knowledge retrieval.^25,32,33 As opposed to prior work³¹ that used proprietary models, we examined the use of an open-weight LLM, which offers advantages in privacy, interpretability, and reproducibility. We hypothesized that our RAG system would outperform the base LLM for all expert raters, despite raters generating their own queries from vignettes, introducing variation in wording. This work aims to inform future applications of RAG systems in psychiatry and other areas of medicine where recommendations must integrate multiple data points from guidelines and patient-specific queries.

Methods

Vignette Evaluation

We designed our evaluation based on recent assessments of clinical chatbots,^{25,26,31–33} asking expert raters to generate queries from clinical vignettes and score responses for consistency with the CANMAT 2018 guidelines. Six psychiatrists were initially recruited through the CANMAT network. All were guideline experts, either as co-authors or through regular clinical and educational use, and each had at least 10 years of experience treating bipolar disorder (range: 10-30 years). Raters received an orientation that included a walkthrough of a warm-up vignette, which was not scored. Due to an error, one participant did not receive most of the introduction and did not complete the warm-up case; this individual was excluded, and a seventh rater was subsequently recruited.

To avoid the risk of overfitting, where questions might be tailored to produce optimal answers for a fixed set of questions, we asked raters to write their own queries based on the provided vignettes rather than using the original wording. This approach also enabled evaluation of a more naturalistic and broader range of queries, as even minor differences in phrasing can significantly influence responses from systems using LLMs.^34,35

To minimize courtesy bias, particularly because participants were recruited within the same network, raters were asked to blindly evaluate responses from 2 versions of the chatbot: one using RAG and one without, which we will refer to as the base model. We did not disclose how the 2 versions differed, only informing participants that they would be comparing 2 chatbot systems. The order in which responses appeared (left or right) was randomized for each vignette.

We created our primary scoring criterion by adapting recently used scoring criteria^26,33:

Correct: Fully aligns with the 2018 CANMAT guidelines and is clinically actionable.

Inaccurate: Largely correct but includes minor omissions or deviations that could lead to suboptimal decisions.

Wrong: Contains major errors or contradictions that could compromise patient safety.

As a secondary outcome, we also asked participants to choose between the 2 responses by judging overall preference, considering readability, contextual applicability, and clarity in expressing complex clinical concepts, also based on a recent study.³⁶ The purpose of this was to provide an initial investigation as to whether a RAG system adversely affected these usability aspects.

Clinical Vignettes Development

We developed 21 vignettes simulating clinical scenarios encountered by clinicians treating bipolar disorder, adapting the format from Perlis (2024)³¹ to include patient demographics, diagnostic history, presenting symptoms, medication history, and social/functional context. Three vignettes were created for each of the 7 sections of the 2018 CANMAT guidelines (excluding the introduction and conclusion). Vignettes are summarized in Table 1, with an example provided in Supplemental Table S1. To ensure clarity in age-related cases, older adult patients were assigned ages in the 70s, adolescents between 13 and 17 years, and all other cases between 20 and 59 years, with a mix of genders. The complexity of each vignette was intended to reflect questions that a community psychiatrist might reasonably pose in practice. Scenarios were designed to require integration of multiple guideline components (e.g., combining recommendations for bipolar I maintenance with strategies to mitigate weight gain) or to adapt the guidelines to the clinical scenario, such as excluding previously failed medications. However, the degree of complexity and reasoning varied depending on how raters formulated their queries, particularly for cases 1 and 4.

Table 1.

Clinical Vignettes.

Vignette	Section	Primary Subsection	Summary of Vignette	Multi-Chunk Retrieval	Multi-Hop Reasoning
1	Section 2	2.2.4. Screening and diagnosis of bipolar disorder	Features that differentiate bipolar and unipolar, besides presence of mania or hypomania, in an adult woman with past antidepressant trials and recurrent depressive episodes.	Depends^a	Yes
2	Section 2	2.3. Suicide risk	Risk factor for suicide attempt in an adult man who lives alone with bipolar I disorder presenting with suicidal ideation, suboptimal medication response, and depressive symptoms.	Depends^a	Yes
3	Section 2	2.4. Chronic disease management	Enhance adherence and prevent relapse in an adult man with bipolar I disorder who is euthymic but who recently stopped quetiapine, no longer wanting to attend appointments or take medication.	Yes	No
4	Section 3	3.2. Management of agitation	Pharmacological management of agitation for an adult woman with bipolar I disorder presenting with mania. They are medication non-adherent, refusing to take oral medication, previously failed quetiapine and recently stopped lithium.	Depends^b	Depends^b
5	Section 3	3.3. Pharmacological treatment of manic episodes	Pharmacological management of an adult man with bipolar I disorder presenting with mania who has failed quetiapine, lithium, carbamazepine, and paliperidone	No.	Yes
6	Section 3	3.3.8. Clinical features that help direct treatment choices	Pharmacological management of an adult woman with bipolar I disorder presenting with mania with mixed features, who is already on lithium.	Yes	Yes
7	Section 4	4.4. Pharmacological treatment for acute bipolar depression	Pharmacological management of an adult man with bipolar I disorder who is on a therapeutic dose of lithium and did not tolerate lamotrigine and quetiapine due to side effects.	Yes	Yes
8	Section 4	4.4. Pharmacological treatment for acute bipolar depression	Pharmacological management of an adult woman with bipolar I disorder who is on quetiapine and did not tolerate lamotrigine, lurasidone, and lithium due to impaired kidney function.	Yes	Yes
9	Section 4	4.5. Clinical features that help direct treatment choices—anxious distress	Pharmacological management of an adult man with bipolar I disorder presenting with depression as well as intense anxiety, recently on lithium but he discontinued.	Yes	Yes
10	Section 5	5.2. Treatment adherence 5.3. Psychosocial interventions for maintenance therapy	Psychosocial interventions to support treatment adherence in an adult woman with bipolar I disorder who is nonadherent to lithium due to forgetfulness and has not tolerated valproate and quetiapine.	Yes	No
11	Section 5	5.5. Pharmacological treatments for maintenance therapy	Pharmacological management for an adult man with bipolar I disorder requiring maintenance treatment, currently on lithium and quetiapine but having frequent relapses of both mania and depression, who has previously failed divalproex, asenapine, and aripiprazole.	No	Yes
12	Section 5	5.5. Pharmacological treatments for maintenance therapy	Pharmacological management for an adult woman with bipolar I disorder requiring maintenance treatment currently on divalproex with recurrent depressive relapses.	No	Yes
13	Section 6	6.2.3. Acute management of bipolar II depression	Pharmacological management for an adult man with bipolar II disorder, currently depressed with mixed features, has impaired kidney function and a history of sertraline-induced hypomania, currently on no medication.	Yes	Yes
14	Section 6	6.2.4. Maintenance treatment	Pharmacological management for an adult woman with bipolar II disorder, frequent depressive relapses and problems with quetiapine, lamotrigine, and lithium.	No	Yes
15	Section 6	6.2.2. Acute management of hypomania	Pharmacological management in an adult man with bipolar II disorder currently on venlafaxine and failed trials of lamotrigine, lithium, and quetiapine, presenting with hypomanic symptoms.	No	Yes
16	Section 7	7.1.3. Pharmacological management of bipolar disorder during pregnancy	Pharmacological management for an adult woman with bipolar I disorder who has mania in the post-partum period, previously on lithium but discontinued during pregnancy.	Yes	Yes
17	Section 7	7.2. Management of bipolar disorder in children and adolescents 7.2.2. Pharmacological management	Pharmacological management for an adolescent male presenting with mania with no prior psychiatric history or medications.	Yes	No
18	Section 7	7.3. Management of bipolar disorder in older age	Pharmacological management for an older adult man with bipolar I disorder and hypertension presenting with mania who has failed lithium, aripiprazole, quetiapine, divalproex, lamotrigine, lurasidone, asenapine, risperidone, and paliperidone.	Yes	Yes
19	Section 8	8.3.1. Weight gain	Pharmacological management for an adult man with bipolar I disorder and metabolic syndrome requiring maintenance treatment who has experienced weight gain on his current olanzapine and has failed previous trials of lithium and quetiapine.	Yes	Yes
20	Section 8	8.3.7. Cognition	Pharmacological management for an adult woman with bipolar II disorder who requires maintenance treatment but who is currently experiencing cognitive side effects on their current lithium.	Yes	Yes
21	Section 8	8.3.9. Neurological effects, including EPS	Pharmacological management for an adult male with bipolar I disorder requiring maintenance treatment who previously experienced extrapyramidal symptoms on risperidone.	Yes	Yes

“Multi-chunk” indicates answers requiring information from more than one part of the guidelines. “Multi-hop” refers to cases where non-trivial reasoning is needed, so responses cannot simply restate retrieved text verbatim.

For Vignettes 1 and 2, information could be retrieved from both tables and the text; raters may or may not have checked that information was consistent with both.

For Vignette 4, one rater asked for treatment for “agitated mania” which required consideration of both mania and agitation treatment, while others varied in whether they specified the failure of lithium.

Chatbot Design

Document Preparation and Retrieval

We created a database of guideline content by scraping the CANMAT 2018 bipolar disorder website, converting paragraphs, tables, and figures into text. Each chunk was annotated with its section, subsection, and any referenced tables or figures to support downstream filtering.

Our system uses a multi-step RAG pipeline (Figure 1). First, user queries are classified into 1 of 5 categories: bipolar I mania, bipolar I depression, bipolar I maintenance, bipolar II, or general query. This classification, performed by a lightweight Mistral 7B model,³⁷ restricts the subsequent vector search to relevant guideline sections.

Figure 1.

Diagram of our multi-step retrieval-augmented generative system.

To improve retrieval accuracy, we implemented a hybrid approach that combines dense and sparse methods: embedding-based search (Qwen/Qwen3-Embedding-0.6B)³⁸ for semantic similarity and BM25³⁹ for exact keyword matching. Section headers are boosted during BM25 scoring. When a query includes header terms such as “cognition,” “treatment,” or “adverse effects,” scores for chunks in that section are multiplied by 2. References to tables or figures in retrieved sections are appended to the context for final response generation.

Response Generation Using an LLM

For answer generation in our RAG system, we employed Llama 3.3 70B, which was the most capable open-weight model available at the study's inception. Llama 3.3 70B has demonstrated performance comparable to leading commercial models released around the same time, such as GPT-4^40,41 showing similar scores on benchmarks such as GPQA, Math 500 and BFCL,⁴² though results vary by task.

Prompt design was refined through collaboration between team members with expertise in computer science and psychiatry (Supplemental Table S2). The system prompt instructed the model to use only the retrieved context when reasoning and answering questions, reducing hallucinations. We used the Chain-of-Thought⁴³ technique in prompt engineering to improve LLM reasoning. The prompt guides the LLM to first gather patient attributes, such as disorder subtype, mood state, population, and then apply hierarchical reasoning to decide the treatment line needed, exclude medications based on comorbidities, side effects, evidence level, or user preferences, reflecting a clinician's decision-making process.

For our evaluation, we maintained nearly identical prompts for the RAG and baseline versions to isolate the effect of retrieval augmentation; the only modification was removing instructions related to utilizing the retrieved context in the baseline model, as it operated without access to retrieved documents. The complete prompts are provided in Supplemental Table S2.

Cloud Deployment and User Interface

The user interface of the chatbot consists of a standard chat window with the CANMAT bipolar guidelines displayed alongside (Figure 2(a)). Responses include the retrieved “chunks,” linking to the relevant subsections of the guidelines.

Figure 2.

(a) Our chatbot in standard mode, linking to relevant parts of the guidelines which can be viewed. (b) Our Chatbot as seen during our evaluation.

For the interface of the chatbot used in the evaluation, links to the subsections were not included so as not to differentiate the responses further from the base model. Our evaluation interface displays responses side by side and lists 3 rating options for each answer, as well as a button to choose the preferred response (Figure 2(b)). During the evaluation, response generation latency was approximately 2 to 3 seconds per query, which raised no concerns among reviewing clinicians.

We deployed our system on Amazon Web Services (AWS) through our institution's AWS account. The chatbot itself is hosted on AWS Amplify, and it is built using Node.js. The vector search runs directly in the user's web browser. All LLM inference tasks are handled by AWS Bedrock and embedding model inference by a 6 GB GPU SageMaker endpoint. Further architectural details are available on our GitHub repository.

Statistical Analysis

We analyzed rating outcomes using a mixed-effects ordinal logistic regression model implemented in the R package brms.⁴⁴ This approach was selected because the ratings (i.e., wrong, inaccurate, and correct) are inherently ordered, and the model accommodates this structure without collapsing categories into binary outcomes. Including random intercepts for participants and vignettes accounts for variability in rater tendencies and vignette difficulty, thereby improving the generalizability of the findings. The cumulative logit link estimates the effect of rating method (RAG vs. baseline) on the odds of being in a higher category, under the proportional odds assumption. We summarized these predictions to characterize individual-level tendencies, providing mean posterior probabilities and 95% credible intervals for each rater under both conditions. This approach allows us to assess whether the observed group-level effect of RAG versus baseline is consistent across participants—details in Supplemental Material S1.

Data Availability

The code used for our chatbot and evaluation will be available on a public GitHub repository upon publication. Given the potential for automatic scraping systems to use any publicly available data to train LLMs, which may impact planned immediate short-term work, the vignettes and question–answer pairs will be made available upon reasonable request to the corresponding author within 2 years of publication.

Results

Our RAG system performed better than the base model on the clinical vignettes. In total, raters evaluated 126 responses, 80 (63.5%) of them from the RAG system were marked as fully consistent (correct), compared to 24 (19%) from the base model. Thirty-six RAG responses (28.6%) were marked as having minor deviations or omissions (inaccurate), compared to 54 (42.9%) from the base model. Ten RAG responses (7.9%) were marked as having major deviations (wrong), compared to 48 (38.1%) from the base model (Supplemental Table S3 and Supplemental Figure S1). The RAG system provided answers that were statistically significantly more likely to be more correct than the baseline model (OR = 9.1, 95% CI [5.3-16.3], P < 0.001). Across 84 ratings (66.7%), RAG provided an answer that was more correct than the base model, 26 (20.6%) had the same level of rating, and 16 (12.7%) less correct. In total, 110 (87.3%) of the RAG responses were rated as correct as or more correct than the baseline system. Mixed-effects ordinal logistic regression provided strong evidence that this advantage was consistent across individual raters, with posterior probabilities of superiority exceeding 0.99 for all 6 raters.

For our secondary analysis, overall, raters preferred the RAG response in 96 (78.7%) ratings out of 122 preference ratings. Raters did not provide preferences for 4 ratings. In the 26 ratings where the experts rated both answers as the same level of correctness, raters preferred 18 (69.2%) of responses from RAG. In the 16 ratings where the baseline was correct, the baseline responses were preferred.

For the RAG system, performance varied across vignettes. Some cases, such as Vignettes 1 and 4, had all 6 responses rated correct, whereas Vignette 9 had none marked correct and 5 rated inaccurate, with the base model performing better by producing 2 correct answers (Table 2). Vignettes 15 and 18 showed mixed ratings, with 3 responses marked correct, 2 inaccurate, and 1 wrong. In several cases (e.g., Vignettes 2, 3, and 19), both models achieved the same number of correct responses.

Table 2.

Chatbot Ratings for Each Vignette.

	Vignette 1		Vignette 2		Vignette 3		Vignette 4		Vignette 5		Vignette 6
	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG
Correct	4	6	3	3	3	3	0	6	1	4	0	3
Inacc.	2	0	2	3	3	2	4	0	2	2	2	3
Wrong	0	0	1	0	0	1	2	0	3	0	4	0
	Vignette 7		Vignette 8		Vignette 9		Vignette 10		Vignette 11		Vignette 12
	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG
Correct	0	4	0	5	2	0	2	4	0	4	0	4
Inacc.	2	1	4	1	2	5	2	2	4	1	3	2
Wrong	4	1	2	0	2	1	2	0	2	1	3	0
	Vignette 13		Vignette 14		Vignette 15		Vignette 16		Vignette 17		Vignette 18
	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG	Base	RAG
Correct	2	5	0	5	1	3	0	3	1	3	0	3
Inacc.	1	0	2	1	2	2	3	3	3	1	3	2
Wrong	3	1	4	0	3	1	3	0	2	2	3	1
	Vignette 19		Vignette 20		Vignette 21		All
	Base	RAG	Base	RAG	Base	RAG	Base	RAG
Correct	3	3	0	5	2	4	24	80
Inacc.	2	3	3	1	3	1	54	36
Wrong	1	0	3	0	1	1	48	10

Note. RAG = retrieval-augmented generation; Inacc. = Inaccurate

We also observed variability among raters in how many responses they marked as correct. The strictest rater judged only 8 RAG responses as correct and marked 6 as wrong, whereas the most lenient rater marked 18 responses as correct and none as wrong (Figure 3).

Figure 3.

Comparison of ratings across 6 expert raters between the baseline and retrieval-augmented generation responses.

Of the 10 RAG-generated responses rated as wrong, 6 appeared to result from inadequate retrieval context—that is, the relevant sections retrieved from the clinical guidelines did not provide sufficient information to support a correct answer. Two of these errors occurred in Vignette 17, which involved managing mania in an adolescent patient, while the remaining errors were distributed across other vignettes. We have added more information on the errors in Supplemental Table S4, so that readers can evaluate their clinical safety implications.

Discussion

This is the first study to evaluate the use of RAG for generating treatment recommendations for bipolar disorder based on the 2018 CANMAT guidelines. RAG combines an LLM with the retrieval of relevant guideline content to support answers that require synthesizing information rather than simple text lookup. Our goal was to design a system that could provide just-in-time answers to clinical questions across all mood states, bipolar disorder subtypes, and comorbidities, reflecting the complexity of real-world decision-making. Our findings indicate that the multi-step RAG system, built on open-weight models, produces answers that align with CANMAT guidelines at a substantially higher rate than the base model without RAG, which achieved only 19.0% correct responses and 38.1% with major errors. In contrast, RAG yielded 63.5% fully guideline-consistent responses and only 7.9% with major errors, demonstrating numerically better performance than a prior study assessing treatment decisions from an augmented GPT-4 Turbo model for bipolar depression alone,³¹ though differences in study design limit direct comparison. We believe our study provides an initial demonstration of how RAG can be used with clinical guidelines in psychiatry for scenarios involving multiple considerations, such as age, previous medication responses, special populations, and comorbidities, providing directions for further improvement when seeking to design systems to help in complex clinical scenarios.

Our results provide an example of how RAG can support mental health treatment recommendations. The best performing vignette, Vignette 4, had all responses marked correct, likely because it required straightforward reasoning; the answer was simply the first-line treatment options for agitation without modification. However, one rater's query illustrated how phrasing can vary by asking for treatment for “agitated mania” rather than just agitation. The system interpreted this composite query and returned preferred treatments for both mania and agitation, demonstrating its flexibility in handling nuanced language. Other high-performing vignettes, such as Vignette 8 and Vignette 20, demonstrated the system's ability to generate recommendations based on scenario-specific factors. The former required adapting guideline tables on bipolar I depression treatment after excluding multiple failed medications, while the latter required integrating content from multiple sections discussing maintenance treatment for bipolar II and strategies to reduce cognitive symptoms. One rater judged the RAG system's performance as particularly strong, marking 85.7% of its responses as correct and the remainder as inaccurate, which underscores the potential of this approach, although outcomes may depend on query formulation and interpretation.

Evaluation of the RAG-based chatbot revealed 3 key challenges: variability in retrieval accuracy, limitations in multi-step reasoning, and inconsistencies between expert scoring and the system's design objectives. While most RAG responses were correct, the 7% with major errors seemed to have been related to retrieval issues, particularly in Vignette 17. This case required recommendations specific to the subsection on treating adolescents and youths. Despite the explicit instruction in the prompt for LLM to consider age factors, the retrieval component often failed to interpret numeric ages as indicators of youth, instead relying on explicit keywords such as “youth” or “adolescent.” When these were absent, the model defaulted to adult-specific recommendations. A substantial portion of errors appeared related to LLM weaknesses in multi-step reasoning. For example, in Vignette 9, the RAG chatbot appeared to struggle with balancing information retrieved from chunks on acute bipolar I depression, anxious distress, and comorbid anxiety. This sometimes led to incomplete responses, such as failing to recommend olanzapine–fluoxetine even though it is listed in the comorbid-anxiety subsection, likely because the acute-depression table designates it as a second-line option. In contrast, the baseline model, which was not constrained by specific retrieved chunks, recommended this combination more consistently, contributing to its superior performance on this vignette. Additionally, some responses were rated inaccurate or wrong because experts applied different rating standards or clinical judgment beyond the guideline scope, despite instructions to evaluate solely on consistency with the guideline. For instance, in Vignette 17, identical responses from the baseline model were rated “correct” by one expert and “inaccurate” by another, underscoring variability in scoring. Similarly, a rater judged a RAG response to Vignette 5 (Supplemental Table S1) as inaccurate because it suggested risperidone after paliperidone failure, noting that paliperidone is the active metabolite of risperidone—a clinically reasonable interpretation but not specified in the guidelines. This underscores a misalignment between system design and evaluation criteria: the chatbot was built for document-grounded reasoning, whereas experts often applied a physician standard, contributing to lower-rated accuracy. Future improvements could include integrating external clinical knowledge bases to better emulate clinical reasoning.

Future improvements should focus on 2 key limitations that can sometimes lead to errors in RAG systems: incomplete retrieval of relevant guideline sections and challenges with multi-step reasoning. Advanced RAG techniques and newer models could help address these issues. Our current search approach retrieves information in a single step, which makes it difficult to answer questions that require combining details from different sections of the guidelines.⁴⁵ Approaches such as PlanRAG⁴⁶ and self-correction⁴⁷ mechanisms, along with retrieval-enhanced embedding models, warrant exploration. To improve reasoning, 2 directions appear promising.

First, testing newer LLMs, including open-weight models such as gpt-oss⁴⁸ and state-of-the-art commercial systems like OpenEvidence,⁴⁹ GPT,⁵⁰ given its widespread adoption and direct relevance to the prior study.³¹ This approach, combined with a structured clinical reasoning framework for prompt design, could enhance performance. Second, hybrid architectures that integrate rule-based components with LLMs may offer greater reliability. This strategy aligns with existing tools such as the C-IMPACT web application,¹⁴ which provides rule-based recommendations for the CANMAT bipolar guidelines. Although C-IMPACT does not currently cover all scenarios tested in this study, it could be expanded. However, the upcoming update to the CANMAT bipolar guidelines would require ongoing rule development, highlighting the adaptability advantage of systems that minimize manual rule creation.

Further evaluation will be needed to test iterative improvement. Our methodology highlighted challenges related to rater availability and variability. Most raters required more than 2 hours to complete 21 ratings, making it impractical to compare different system outcomes and continuous optimization. It also reflects a structural trade-off: standardizing questions across raters would improve comparability but reduce ecological validity, while cross-rating design for inter-rater reliability would further strain rater availability. Future studies should explore ways to provide robust evaluations while reducing the time burden on guideline experts, who are often clinician-scientists with limited availability. One option is to develop marking rubrics for vignettes, allowing less experienced raters, such as psychiatry residents, to complete evaluations with an expert available to resolve disagreements. Rubrics could also support automated evaluation, although human oversight would likely still be necessary. To avoid overfitting, rubrics should accommodate variability in queries and could be created after external clinicians generate the queries rather than the chatbot development team.

Beyond the limitations noted above, other limitations should be considered. Examining the impact of gender, stage of training, and other equity considerations on responses was beyond the scope of our study, but will be an important future direction. Future work should also examine how applying the current approach to guidelines of varying quality, precision, and structure might affect accuracy. While we believe our study was based on scenarios with more complexity than many others in RAG studies so far, they still represent simplifications compared to real-world patients, and our evaluation solely evaluated single-turn question answering, deferring the evaluation of real-world, dynamic multi-turn conversations to future work. Future studies could evaluate multi-turn interactions using simulated clinical dialogues in which information is disclosed incrementally, reflecting the iterative nature of real psychiatric consultations. Additionally, our system was only designed to achieve guideline concordance, not clinical optimality; future iterations could be improved to approach clinical optimality by integrating post-2018 evidence, off-guideline clinical nuances, and updated pharmacological data.

Despite these limitations, our study provides a foundation for further use of RAG chatbots to support decision-making in mental health, while highlighting multiple directions for improvement. Although our results showed that responses still contained minor and occasional major deviations from the guidelines, future work could examine how a chatbot with this level of performance functions in more naturalistic settings, such as with clinicians at varying career stages or trainees who lack guideline expertise. A prior study in bipolar depression found that community-based clinicians selected optimal medication choices for only 23.1% of vignettes, whereas a simple prompt-augmented chatbot using GPT-4 Turbo achieved 50.8%.³¹ This preliminary evidence suggests that even imperfect systems may meaningfully improve care, though direct comparison with human clinicians remains an important empirical question warranting further study. In the future, explainable AI⁵¹ could enhance such tools by highlighting relevant guideline text and reasoning, enabling error checking and supporting clinician education. Ultimately, we believe this work advances efforts to integrate AI into clinical workflows, supporting evidence-based care for individuals with bipolar disorder and other conditions.

Supplemental Material

sj-docx-1-cpa-10.1177_07067437261457232 - Supplemental material for A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation With an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines for Bipolar Disorder: Un dialogueur pour la prise en charge du trouble bipolaire : utiliser la génération augmentée par récupération avec un grand modèle de langage (GML) à poids ouverts pour répondre aux questions cliniques fondées sur les lignes directrices de 2018 de CANMAT et de l'ISBD relatives au trouble bipolaire

Supplemental material, sj-docx-1-cpa-10.1177_07067437261457232 for A Chatbot for the Management of Bipolar Disorder: Using Retrieval-Augmented Generation With an Open-Weight Large Language Model to Answer Clinical Questions Based on the CANMAT and ISBD 2018 Guidelines for Bipolar Disorder: Un dialogueur pour la prise en charge du trouble bipolaire : utiliser la génération augmentée par récupération avec un grand modèle de langage (GML) à poids ouverts pour répondre aux questions cliniques fondées sur les lignes directrices de 2018 de CANMAT et de l'ISBD relatives au trouble bipolaire by Yash Mali, Zejiao Zeng, Kayoung Heo, Grace Zhang, Antarip Kashyap, Jincheng Chen, Kamyar Keramatian, Gayatri Saraf, Marco Solmi, Edwin Tam, Sagar V. Parikh, Ayal Schaffer, Serge Beaulieu, Raymond Ng, Lakshmi N. Yatham and John-Jose Nunez in The Canadian Journal of Psychiatry

Footnotes

Acknowledgments

Technical guidance, project coordination, and computing resources were provided by the University of British Columbia Cloud Innovation Centre.

ORCID iDs

Yash Mali

Zejiao Zeng

Kamyar Keramatian

Gayatri Saraf

Marco Solmi

Edwin Tam

Ayal Schaffer

Serge Beaulieu

Lakshmi N. Yatham

John-Jose Nunez

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the AWS.

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Marco Solmi has received honoraria/has been a consultant for AbbVie, Angelini, Bausch Health, Boehringer Ingelheim, Pharmascience, Lundbeck, Otsuka, Teva. Marco Solmi holds shares of MESAS, S2M. Serge Beaulieu reports research grants, personal fees, and/or nonfinancial support from Abbvie, Boehringer-Ingelheim, CIHR, DiaMentis, Idorsia, Eisai, Janssen, and Otsuka-Lundbeck. Sagar V. Parikh is the Medical Director for the National Network of Depression Centers and reports research grants paid to his institution from Aifred, Compass Pathways, and Janssen; and consulting fees from Boehringer Ingelheim and Otsuka. Ayal Schaffer reports consulting fees and honoraria from AbbVie and Otsuka. Lakshmi N. Yatham reports research grants and/or personal fees from Abbvie, Alkermes, Allergan, Dainippon Sumitomo, Gedeon Richter, GSK, Intracellular Therapies, Merck, Sanofi, and Sunovion. All other authors have no conflicts of interest to declare.

Supplemental Material

Supplemental material for this article is available online.

References

Singh

Swartz

Cuellar-Barboza

, et al. Bipolar disorder. Lancet. 2025;406(10506):963–978. doi:10.1016/S0140-6736(25)01140-7

McDonald

Bulloch

AGM

Duffy

, et al. Prevalence of bipolar I and II disorder in Canada. Can J Psychiatry. 2015;60(3):151–156. doi:10.1177/070674371506000310

Schaffer

Cairney

Cheung

Veldhuizen

Levitt

. Community survey of bipolar disorder in Canada: lifetime prevalence and illness characteristics. Can J Psychiatry. 2006;51(1):9–16. doi:10.1177/070674370605100104

Fountoulakis

Vieta

Siamouli

, et al. Treatment of bipolar disorder: a complex treatment for a multi-faceted disorder. Ann Gen Psychiatry. 2007;6(1):27. doi:10.1186/1744-859X-6-27

Yatham

Kennedy

Parikh

, et al. Canadian Network for Mood and Anxiety Treatments (CANMAT) and International Society for Bipolar Disorders (ISBD) 2018 guidelines for the management of patients with bipolar disorder. Bipolar Disord. 2018;20(2):97–170. doi:10.1111/bdi.12609

Gomes

Dumay

Fagen

Palma

Milev

Brietzke

. Does the ranking matter? A retrospective cohort study investigating the impact of the 2018 CANMAT and ISBD guidelines for the management of patients with bipolar disorder treatment recommendations for acute mania on rehospitalization rates. Can J Psychiatry. 2023;68(8):605–612. doi:10.1177/07067437231156235

Baek

Yatham

, et al. Pattern of pharmacotherapy by episode types for patients with bipolar disorders and its concordance with treatment guidelines. J Clin Psychopharmacol. 2014;34(5):577. doi:10.1097/JCP.0000000000000175

Paterniti

Bisserbe

. Pharmacotherapy for bipolar disorder and concordance with treatment guidelines: survey of a general population sample referred to a tertiary care service. BMC Psychiatry. 2013;13(1):211. doi:10.1186/1471-244X-13-211

Heselmans

Donceel

Aertgeerts

Van De Velde

Ramaekers

. The attitude of Belgian social insurance physicians towards evidence-based practice and clinical practice guidelines. BMC Fam Pract. 2009;10(1):64. doi:10.1186/1471-2296-10-64

10.

Wolf

Hubbard

Faraday

Forrest

. Clinical practice guidelines to inform evidence-based clinical practice. World J Urol. 2011;29(3):303–309. doi:10.1007/s00345-011-0656-5

11.

Mansfield

. Attitudes and behaviors towards clinical guidelines: the clinicians’ perspective. BMJ Qual Saf. 1995;4(4):250–255. doi:10.1136/qshc.4.4.250

12.

Carlsen

Bringedal

. Attitudes to clinical guidelines–do GPs differ from other medical doctors? BMJ Qual Saf. 2011;20(2):158–162. doi:10.1136/bmjqs.2009.034249

13.

Foy

. Clinical guidelines: must we follow them? Bone & Joint 360. 2016;5(4):42–43. doi:10.1302/2048-0105.54.360450

14.

Kozicky

Schaffer

Beaulieu

McIntosh

Yatham

. Use of a point-of-care web-based application to enhance adherence to the CANMAT and ISBD 2018 guidelines for the management of bipolar disorder. Bipolar Disord. 2022;24(4):392–399. doi:10.1111/bdi.13136

15.

Liu

Wright

Patterson

, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc. 2023;30(7):1237–1245. doi:10.1093/jamia/ocad072

16.

Kung

Cheatham

Medenilla

, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198

17.

Augenstein

Baldwin

Cha

, et al. Factuality challenges in the era of large language models. arXiv. Preprint posted online October 11, 2023. doi:10.48550/arXiv.2310.05189.

18.

Lewis

Perez

Piktus

, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv. Preprint posted online April 13, 2021. doi:10.48550/arXiv.2005.11401.

19.

Gan

Zhang

Tong

Liu

. Evaluation of retrieval-augmented generation: a survey. In: Vol 2301. 2025:102-120. doi:10.1007/978-981-96-1024-2_8.

20.

Retrieval-augmented generation for large language models: a survey. Accessed November 13, 2025. https://arxiv.org/html/2312.10997v5

21.

Liu

McCoy

Wright

. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc. 2025;32(4):605–615. doi:10.1093/jamia/ocaf008

22.

Woo

Yang

Olsen

, et al. Custom large language models improve accuracy: comparing retrieval augmented generation and artificial intelligence agents to noncustom models for evidence-based medicine. Arthroscopy. 2025;41(3):565–573.e6. doi:10.1016/j.arthro.2024.10.042

23.

Shin

Song

Kim

Choe

Chai

. Thyro-GenAI: a chatbot using retrieval-augmented generative models for personalized thyroid disease management. J Clin Med. 2025;14(7):2450. doi:10.3390/jcm14072450

24.

Sun

Owens

, et al. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology. 2024;80(5):1158–1168. doi:10.1097/HEP.0000000000000834

25.

Kresevic

Giuffrè

Ajcevic

Accardo

Crocè

Shung

. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit Med. 2024;7(1):1–9. doi:10.1038/s41746-024-01091-y

26.

Malik

Kharel

Dahiya

, et al. Assessing ChatGPT4 with and without retrieval-augmented generation in anticoagulation management for gastrointestinal procedures. Ann Gastroenterol. 2024;37(5):514–526. doi:10.20524/aog.2024.0907

27.

Ping

, et al. Retrieval meets long context large language models. arXiv. Preprint posted online January 23, 2024. doi:10.48550/arXiv.2310.03025.

28.

Yang

Zhu

Tan

Yang

. CascadeRCG: retrieval-augmented generation for enhancing professionalism and knowledgeability in online mental health support. Companion Proceedings of the ACM on Web Conference 2025. WWW ‘25. Association for Computing Machinery; 2025:1465-1469. doi:10.1145/3701716.3715466.

29.

Boggavarapu

Srivastava

Varanasi

Bhaumik

. Evaluating enhanced LLMs for precise mental health diagnosis from clinical notes. medRxiv. Preprint posted online March 10, 2025:2024.12.16.24317648. doi:10.1101/2024.12.16.24317648.

30.

Hou

Teng

Liu

, et al. Retrieval-augmented multimodal depression detection. arXiv. Preprint posted online October 29, 2025. doi:10.48550/arXiv.2511.01892.

31.

Perlis

Goldberg

Ostacher

Schneck

. Clinical decision support for bipolar depression using large language models. Neuropsychopharmacol. 2024;49(9):1412–1416. doi:10.1038/s41386-024-01841-2

32.

Wada

Tanaka

Nishizawa

, et al. Retrieval-augmented generation elevates local LLM quality in radiology contrast media consultation. npj Digit Med. 2025;8(1):395. doi:10.1038/s41746-025-01802-z

33.

Masanneck

Meuth

Pawlitzki

. Evaluating base and retrieval augmented LLMs with document or online support for evidence based neurology. npj Digit Med. 2025;8(1):137. doi:10.1038/s41746-025-01536-y

34.

de Hond

Leeuwenberg

Bartels

, et al. From text to treatment: the crucial role of validation for generative large language models in health care. The Lancet Digital Health. 2024;6(7):e441–e443. doi:10.1016/S2589-7500(24)00111-0

35.

Rahmani

Siro

Aliannejadi

, et al. Report on the 1st workshop on large language model for evaluation in information retrieval (LLM4Eval 2024) at SIGIR 2024. arXiv. Preprint posted online August 9, 2024. doi:10.48550/arXiv.2408.05388.

36.

Zakka

Shad

Chaurasia

, et al. Almanac — retrieval-augmented language models for clinical medicine. NEJM AI. 2024;1(2):AIoa2300068. doi:10.1056/AIoa2300068.

37.

Jiang

Sablayrolles

Mensch

, et al. Mistral 7B. arXiv. Preprint posted online October 10, 2023. doi:10.48550/arXiv.2310.06825.

38.

Yang

, et al. Qwen3 Technical Report. arXiv. Preprint posted online May 14, 2025. doi:10.48550/arXiv.2505.09388.

39.

Robertson

Zaragoza

. The probabilistic relevance framework: BM25 and beyond. Found Trends Inf Retr. 2009;3(4):333–389. doi:10.1561/1500000019

40.

Zhong

Liu

, et al. Performance of ChatGPT-4o and four open-source large language models in generating diagnoses based on China’s rare disease catalog: comparative study. J Med Internet Res. 2025;27(1):e69929. doi:10.2196/69929

41.

Perlis

Collins

. Can open-source AI models diagnose complex cases as well as GPT-4? JAMA. 2025;333(17):1473–1475. doi:10.1001/jama.2025.2806

42.

LLM leaderboard 2025. Accessed November 15, 2025. https://www.vellum.ai/llm-leaderboard

43.

Wei

Wang

Schuurmans

, et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv. Preprint posted online January 10, 2023. doi:10.48550/arXiv.2201.11903.

44.

Bürkner

. Brms: an R package for Bayesian multilevel models using stan. J Stat Softw. 2017;80:1–28. doi:10.18637/jss.v080.i01

45.

Jiang

Sun

Liang

Zhang

. Retrieve, summarize, plan: advancing multi-hop question answering with an iterative approach. arXiv. Preprint posted online January 30, 2025. doi:10.48550/arXiv.2407.13101.

46.

Lee

Kim

. PlanRAG: a plan-then-retrieval augmented generation for generative large language models as decision makers. In: Duh

Gomez

Bethard

, editor. Proceedings of the 2024 conference of the North American chapter of the Association for Computational Linguistics: human language technologies (volume 1: long papers). Mexico City: Association for Computational Linguistics; 2024:6537–6555. doi:https://doi.org/10.18653/v1/2024.naacl-long.364.

47.

Asai

Wang

Sil

Hajishirzi

. Self-RAG: learning to retrieve, generate, and critique through self-reflection. arXiv. Preprint posted online October 17, 2023. doi:10.48550/arXiv.2310.11511.

48.

Introducing gpt-oss. March 24, 2026. Accessed March 31, 2026. https://openai.com/index/introducing-gpt-oss/

49.

OpenEvidence. OpenEvidence. Accessed March 25, 2026. https://www.openevidence.com/

50.

Open AI,

Achiam

, Adler

, Agarwal

et al. GPT-4 technical report. arXiv.org. March 15, 2023. Accessed March 31, 2026. https://arxiv.org/abs/2303.08774v6.

51.

Rizzo

Veneri

Albarelli

Lucchese

Nobile

Conati

. A theoretical framework for AI models explainability with application in biomedicine. 2023 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). 2023:1-9. doi:10.1109/CIBCB56990.2023.10264877.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.05 MB

0.00 MB