Agree to Disagree: Reliability of Seizure Frequency Assessments

Abstract

Epilepsia. 2026 Mar;67(3):1317-1331. doi: 10.1111/epi.70031. Epub 2025 Nov 30.

Objective: Although vagus nerve stimulation (VNS) is a well-established neuromodulation therapy for drug-resistant epilepsy, treatment outcomes remain heterogeneous. One possible source of variability lies in differing interpretations of seizure frequency ratings (SFRs). This study examined interrater reliability (IRR) in SFRs between (1) retrospective clinician–clinician chart reviews and (2) prospective caregiver–clinician reports, and explored sources of disagreement. Methods: Data were collected from the CONNECTiVOS database. In the retrospective cohort (n = 254), 2 clinicians independently reviewed medical records and rated seizure frequency across multiple timepoints. In the prospective cohort (n = 214), caregivers and clinicians independently reported SFR in children treated with VNS. IRR was assessed across different measurement thresholds, and potential causes of disagreement were analyzed. Results: Clinician–clinician agreement in retrospective chart reviews was excellent (intraclass correlation coefficient [ICC] > .90, Cohen κ > .80), with 18.8% divergent ratings and 4.8% exceeding the reliable change index. Disagreement was significantly associated with higher mean seizure frequency at baseline (P = .004) and at postoperative timepoints (P < .001). In the prospective caregiver–clinician comparison, agreement for absolute seizure frequency was poor (ICC < .50), with discrepancies in 86.5% of cases, although only 1.8% were statistically significant. When rating pairs diverged, clinicians more often reported lower absolute seizure frequencies (P = .002) and greater relative seizure reductions (P = .023) and were more likely to classify patients as achieving a 90% reduction (P = .043). Significance: This study highlights interrater variability in both retrospective and prospective SFR assessments, a finding systematically related to baseline seizure frequency. Coarser classifications (eg, 50% or 90% seizure reduction) may improve agreement but reduce clinical nuance. Future efforts should focus on structured, patient-centered documentation and the development of objective outcome measures in VNS evaluation, particularly for children with high seizure burden.

Commentary

“Garbage in, garbage out” is a central problem in research and clinical care. Even with the most compelling research question, or the most advanced candidate therapy, our ability to understand effects is only as good as our ability to reduce noise and reliably measure key outcomes.

In epilepsy care, we clearly have a problem. Seizure reduction is a critical goal of antiseizure therapeutics. Yet substantial sources of random error and systematic bias could easily obscure the truth. Seizure frequency varies widely over time within individual patients. Beyond this natural variation, measurement challenges arise at every downstream step: patients do not recognize many of their seizures,¹ eyewitnesses may later poorly recall basic seizure characteristics,² patients or caregivers may not communicate all episodes to the clinician, clinicians may not adequately document seizure frequencies even with an accurate history, and clinicians may come to different conclusions regarding the likelihood that any given event was in fact a seizure. These pitfalls raise testable hypotheses about the reliability of seizure frequency assessments, such as whether the major source of variability occurs at the clinician or the caregiver level, and whether reliability depends on which exact seizure metric is used.

Dinger et al³ recently studied this important problem. Their objective was to assess interrater reliability of seizure frequency ratings both between clinician raters and between clinician and caregiver raters. They used a large multicenter dataset of children who underwent vagus nerve stimulation for drug-resistant epilepsy.⁴

First, they conducted a retrospective chart review at a single center in Toronto covering 2007 to 2022 and including 254 patients (1524 rated pairs). They compared 2 independent clinicians’ seizure frequency ratings at several timepoints (baseline, and then 3, 6, 12, 36, and 60 months). Agreement was generally strong, even though Rater 1 tended to score seizure frequencies higher than Rater 2. For example, intraclass correlation coefficients between clinician raters, pooling all timepoints, were 0.94 for absolute seizure frequency and 0.91 for relative seizure frequency (both excellent; perfect agreement would be 1). Agreement beyond chance for 2 dichotomized measures of change (50% or 90% seizure reduction compared to baseline) was also very good (kappa = 0.94 and 0.88 for 50% and 90% reduction, respectively). The major source of disagreement was incomplete or nonspecific chart data, namely documented seizure frequencies spanning a wide range, which led each clinician to choose a different value. Lesser sources of disagreement were errors in reading the chart and variability in summarizing multiple seizure types.
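To make "agreement beyond chance" concrete, the following is a minimal sketch of Cohen's kappa for two raters who each classify patients as responders or nonresponders. The function name and the ten rating pairs are hypothetical illustrations, not data or code from Dinger et al, and the study's own software and kappa variant are not described here.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning categorical labels to the same cases."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical 50% responder calls (True = responder) for 10 patients.
rater_1 = [True, True, False, False, True, False, True, False, False, True]
rater_2 = [True, True, False, False, True, False, False, False, False, True]

kappa = cohen_kappa(rater_1, rater_2)
print(round(kappa, 2))
```

In this toy example the raters agree on 9 of 10 patients (90%), but because about half of that agreement would be expected by chance given their marginal rates, kappa comes out lower than the raw agreement. That gap is why kappa, rather than percent agreement, is reported for the dichotomized outcomes.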

Second, they conducted prospective evaluations across multiple centers from 2018 to 2025 including 140 patients (540 rated pairs). They compared ratings from one clinician versus the patient's caregiver at each timepoint. Agreement was notably worse. For example, the corresponding intraclass correlation coefficients were 0.17 and 0.03 (almost no correlation), and kappas were 0.28 and 0.26 (weak). Clinicians tended to report a somewhat rosier seizure frequency reduction than caregivers did.

What can we learn from these data?

A major roadblock was not necessarily between-clinician unreliability, but rather total absence of the data needed to assess reliability in the first place. For example, in the retrospective cohort, 27% of desired timepoints could not be scored due to missing baseline or follow-up values. Worse, in the prospective cohort only 65% of patients had any assessments, and 49% of absolute seizure frequencies were missing. This presents a serious threat to validity to the degree that data are missing not at random. This story has also played out elsewhere, in work using natural language processing to extract seizure frequencies from clinical notes.⁵ In that case, algorithms accurately identified portions of text describing seizure frequencies … but only when such text was present in the medical record. They say an ounce of prevention is worth a pound of cure. Documenting seizure frequency represents a quality metric set forth by the American Academy of Neurology.⁶ Thus, these findings emphasize the need for standardized note templates to ensure that clinician documentation enables seizure tracking for both research and clinical purposes. The high rate of missingness in prospective data collection, by contrast, requires a more complex set of solutions regarding retention of human subjects to improve research quality.

Reliability was far superior between clinicians using chart data than between clinicians gathering prospective data and caregivers. This suggests that clinicians largely agreed when provided the same information. The major unresolved question to me from this observation is whether the unreliability between caregivers and clinicians arose from inadequate history taking, such that the clinician did not learn everything the caregiver knew, or from differing interpretation, such that the clinician understood the caregiver's account but believed the caregiver was over- or undercalling seizures. Distinguishing inadequate history taking from differences in interpretation would be a fascinating follow-up study of variation in the process by which clinicians take seizure histories.

What metric to use when judging “success” of an antiseizure therapeutic remains an open question, and the answer will assuredly differ across patients. Historically, the U.S. Food and Drug Administration has relied more upon median seizure frequency reduction whereas the European Medicines Agency has relied more upon the 50% responder rate. One could have hypothesized that the 50% responder rate would be more reliable due to categorization, at the expense of less granularity and an arbitrary threshold.⁷ However, Dinger et al's work showed that absolute and relative changes were about as (un)reliable as the 50% responder rate. This seems like yet one more argument against relying solely upon the 50% responder rate: it is arbitrary, withholds potentially useful distinctions, and in this case was found no more reliable than more granular measures.
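The relationship between absolute frequency, relative reduction, and responder status can be sketched in a few lines. The function names and the seizure counts below are hypothetical and chosen only for illustration; they are not taken from the study. The sketch assumes relative reduction is computed as (baseline minus follow-up) divided by baseline, the usual definition.

```python
def relative_reduction(baseline, follow_up):
    """Fractional seizure reduction from baseline (0.5 means a 50% reduction)."""
    if baseline <= 0:
        raise ValueError("baseline seizure frequency must be positive")
    return (baseline - follow_up) / baseline


def is_responder(baseline, follow_up, threshold=0.5):
    """True if the relative reduction meets the responder threshold."""
    return relative_reduction(baseline, follow_up) >= threshold


# Hypothetical patient: both raters accept a baseline of 40 seizures per month,
# but report different follow-up frequencies.
baseline = 40
clinician_follow_up = 18  # clinician's rating (hypothetical)
caregiver_follow_up = 22  # caregiver's rating (hypothetical)

for label, follow_up in [("clinician", clinician_follow_up),
                         ("caregiver", caregiver_follow_up)]:
    print(label,
          round(relative_reduction(baseline, follow_up), 2),
          is_responder(baseline, follow_up))
```

Here a difference of only 4 seizures per month places the two ratings on opposite sides of the 50% threshold. Dichotomizing hides small disagreements that fall far from the cutoff but turns small disagreements near it into outright discordant classifications, which is one reason a categorical measure is not guaranteed to be more reliable than a continuous one.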

The question “how reliable are seizure frequencies” is essentially nonsensical for certain “uncountable” seizure types, at least by clinical means. Examples include absence seizures so subtle that they are easily missed, or myoclonic seizures in which mild jerks blend into one another. Technology, such as EEG- or EMG-based methods, may be the only way around this problem. However, FDA-cleared seizure tracking devices are sensitive only for tonic–clonic seizures and produce many false alerts during periods of activity,⁸ so considerable work remains.

Change in seizure frequency is only one of many outcomes by which to judge the success of antiseizure therapeutics; it is well known that epilepsy is a multidimensional condition in which seizures are only one of many contributors to biopsychosocial well-being.⁹ Still, measuring seizure frequency is a key input for tracking response to treatment in both research and clinical settings. Hopefully, concerted efforts to systematically document seizure frequency in the medical record, alongside advances in tracking technologies, could improve assessments.

Footnotes

ORCID iD

Samuel W Terman

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Institutional Affiliation

University of Michigan Department of Neurology, Ann Arbor, MI 48105.

References

1. Hoppe, Poepel, Elger. Accuracy of patient seizure counts. Arch Neurol. 2007;64(11):1595-1599.

2. Noble, Lane, New, et al. How accurate are witnesses of first suspected seizures in recalling semiology at clinically relevant timepoints? A UK experimental study with a pilot intervention. Epilepsia. 2025;66(12):4795-4808.

3. Dinger, Mithani, Al-Hasan, et al. Do we agree on seizure reduction after vagus nerve stimulation? Interrater reliability of retrospective and prospective seizure frequency ratings from the CONNECTiVOS database. Epilepsia. 2026;67(3):1317-1331.

4. Siegel, Yan, Warsi, et al. Connectomic profiling and vagus nerve stimulation outcomes study (CONNECTiVOS): a prospective observational protocol to identify biomarkers of seizure response in children and youth. BMJ Open. 2022;12(4):e055886.

5. Xie, Terman, Gallagher, et al. Generalization of finetuned transformer language models to new clinical contexts. JAMIA Open. 2023;6(3):1-9.

6. Patel, Baca, Franklin, et al. Quality improvement in neurology epilepsy quality measurement set 2017 update. Neurology. 2018;91(18):829-836.

7. van der Kop. The need for an individualized approach to what is considered a clinically significant reduction in seizure frequency: a patient’s perspective. Epilepsia. 2023;64(6):1469-1471.

8. Beniczky, Wiebe, Jeppesen, et al. Automated seizure detection using wearable devices: a clinical practice guideline of the International League Against Epilepsy and the International Federation of Clinical Neurophysiology. Clin Neurophysiol. 2021;132(5):1173-1184.

9. Kheder. More than seizure control: Multidimensional Outcome Reporting in Epilepsy (MORE) as a patient-centered framework redefining success in treatment. Epilepsia. 2025;66(9):3105-3117.