Abstract
Background
Migraine is a complex neurological disorder characterized by recurrent, often debilitating headaches. Current evidence suggests that autonomic nervous system (ANS) alterations play a significant role in migraine pathophysiology, affecting sensory, limbic, and homeostatic processing. Heart rate variability (HRV), a well-established, noninvasive marker of ANS function, is associated with migraine severity and treatment efficacy.
Objective
In this study, we aimed to evaluate the use of wearable sensor technology in predicting migraine attacks by monitoring changes in the ANS during the prodrome phase.
Methods
We recruited 23 migraine sufferers and analyzed HRV during nocturnal sleep using wearable biosensors and machine learning, extracting HRV features from BVP signals and applying feature engineering to predict migraine episodes.
Results
The analysis of HRV provides an important approach to migraine attack prediction, revealing significant individual variability in physiological responses. Overall, these results lay the groundwork for developing more effective and personalized migraine prediction models, which could lead to earlier interventions and improved participant outcomes. Future research should consider controlled inclusion of post-migraine nights, potentially leveraging additional statistical or machine learning techniques to mitigate misclassification risks while capturing these transitional dynamics.
Keywords
Introduction
Migraine is a complex neurological disorder characterized by recurrent, often debilitating headaches, frequently accompanied by symptoms such as nausea, photophobia, and phonophobia. 1 It progresses through four distinct phases: prodrome, aura, headache, and postdrome, each marked by unique physiological and neurological changes. 2 Current evidence suggests that autonomic nervous system (ANS) alterations play a significant role in migraine pathophysiology, affecting sensory, limbic, and homeostatic processing.3,4 Heart rate variability (HRV) is a well-established, non-invasive marker of ANS function that reflects autonomic modulation through variations in the time intervals between consecutive heartbeats. Studies have shown that migraine participants have lower HRV values, particularly during migraine attacks.5,6
Beyond its role in migraine, HRV has been extensively studied in sleep analysis, offering valuable insights into sleep architecture and autonomic balance across different sleep stages. 7 Notably, migraine and sleep disturbances share a strong bidirectional relationship, with migraineurs frequently experiencing sleep alterations. 8 Polysomnographic analysis has further highlighted autonomic dysregulation during sleep, revealing that individuals with migraines experienced significantly reduced sleep efficiency, prolonged sleep onset latency, decreased stage 4 and NREM sleep, and a higher number of total awakenings. 9
Given that night-time HRV signals are minimally influenced by external factors such as physical activity and stress, 10 sleep-based HRV monitoring provides a reliable window into autonomic dysfunction in migraine.
A previous study 11 analysed HRV characteristics in participants with vestibular migraine, participants with migraine without vestibular symptoms, and healthy control subjects. Ambulatory ECG monitoring was used to assess HRV, analysing both time-domain and frequency-domain parameters during daytime and nighttime. Participants with vestibular migraine and migraine exhibited autonomic dysfunction, characterized by sympathetic hyperactivity and reduced vagal activity, particularly at night.
The findings of study 12 reveal a negative correlation between HRV parameters and pain intensity during the ictal period (p = 0.04), indicating that lower HRV is associated with more severe migraine attacks. This aligns with the results of the study 13 on flunarizine treatment for chronic migraine, which demonstrated that participants with preserved HRV (SDNN > 30 ms) showed a better response to preventive therapy, experiencing a significantly greater reduction in monthly headache days compared to those with lower HRV (p = 0.026). Taken together, these findings reinforce the role of HRV not only as a marker of autonomic dysfunction in migraine but also as a potential predictor of both migraine severity and treatment efficacy. However, most studies have been cross-sectional or treatment-focused, rather than predictive, and few have examined HRV specifically during the prodrome or pre-migraine sleep period. This leaves an important gap: whether nocturnal HRV changes can be leveraged for early migraine prediction.
This underscores HRV's potential as a novel biomarker for tracking migraine activity and severity, offering opportunities for early detection and improved management. Further research is needed to establish standardized HRV-based metrics for reliable migraine monitoring.
In our previous studies,14,15 we have analyzed ANS changes during pre-migraine nights using wearable biosensors and machine learning, identifying electrodermal activity and skin temperature as key predictive features. Building on this, the present study focuses specifically on HRV as a predictive modality, rather than combining multiple signals. By targeting night-time HRV in the prodrome phase, we aim to provide novel insights into the feasibility of using sleep-based HRV changes for migraine prediction. This represents a step beyond existing literature by shifting from descriptive HRV–migraine associations to predictive modeling, and testing feasibility with wearable sensors in naturalistic, per-participant settings.
Methods
Participants
In this study, 23 migraine sufferers were recruited to evaluate the use of wearable sensor technology in predicting migraine attacks by monitoring changes in the ANS during the prodrome phase.
Participants wore an Empatica Embrace Plus device on their non-dominant wrist until at least three migraine episodes were recorded. To be included, participants had to be at least 18 years old, diagnosed with episodic migraine with or without aura according to ICHD-3 criteria, experience at least four migraine attacks per month, and be proficient in Lithuanian. Exclusion criteria were pregnancy or lactation, a diagnosis of chronic or hemiplegic migraine, other headache syndromes (except episodic tension-type headache occurring no more than four days per month), use of preventive migraine treatment or medications affecting the ANS, and other chronic pain diagnoses. Data from six participants were discarded because of data quality or labelling issues, leaving 17 participants for analysis.
Migraine labelling
In this study, we adopted the labelling approach used in our previous research on migraine detection and physiological signal changes using wearable devices14,16,17 to label migraine episodes. Migraine episodes and post-migraine nights were identified based on self-reported start and end dates, recorded by study participants through migraine diaries and the Migraine Buddy application. Each timestamp was labelled as migraine if it fell on the day of a migraine episode. The night following a migraine episode was labelled as a post-migraine night if no migraine episode occurred on that day. All other timestamps were labelled as non-migraine nights. These labels were added directly to the feature dataset.
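The labelling rule can be illustrated with a minimal sketch. It assumes that a post-migraine night is the first migraine-free day directly after a migraine day; the function name `label_night` and the example dates are hypothetical and are not taken from the study code.

```python
from datetime import date, timedelta

def label_night(day, migraine_days):
    """Return the label for one calendar day.

    migraine_days: set of dates covered by self-reported episodes.
    """
    if day in migraine_days:
        return "migraine"
    # Assumption: a post-migraine night is the first migraine-free day
    # directly after a migraine day.
    if (day - timedelta(days=1)) in migraine_days:
        return "post-migraine"
    return "non-migraine"

# Hypothetical diary: one episode spanning 3 and 4 May 2024.
migraine_days = {date(2024, 5, 3), date(2024, 5, 4)}
days = [date(2024, 5, 2) + timedelta(days=i) for i in range(5)]
labels = {d.isoformat(): label_night(d, migraine_days) for d in days}
```

In this sketch the post-migraine label is kept as a separate class, so the corresponding rows can be dropped before training.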
All post-migraine night entries were removed from the dataset to avoid misclassifications caused by the physiological changes of the migraine postdrome. The number of labels per participant is given in Table 1. In total, there were 3760 migraine labels and 13793 non-migraine labels. Averaged across participants, 25.4% of labels were migraine positive.
Distribution of positive (migraine) and negative (non-migraine) labels for each participant.
For the HRV data preparation, raw data from the Empatica Embrace Plus devices was used. The data is stored in AVRO files, a row-based data serialization framework developed within the Apache Hadoop ecosystem. It uses a JSON-based schema to define data types and a compact binary format to store serialized data, ensuring efficient data exchange and supporting schema evolution. This design makes AVRO well-suited for handling large datasets.
The AVRO files store data streams from each sensor, with each file storing up to 30 min of data. If the data stream is interrupted, a new file is generated. Each file contains a starting timestamp, the sampling rate, participant and device identification data, and the data streams for each sensor. For the blood volume pulse (BVP) data, which is the sensor data used in this study, the device's default sampling rate of 64 Hz was used.
For the study presented in this paper, the raw data was read and concatenated into per-day files for easier processing later on. They were saved as CSV files that contain the calculated timestamp (the initial timestamp provided in the original AVRO file plus the offset derived from the sampling rate) and the value for the corresponding timestamp. The files were then combined into a full dataset covering the entire monitoring period for each participant.
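The timestamp calculation described above amounts to adding a per-sample offset to the file's starting timestamp. The sketch below assumes the starting timestamp is expressed in microseconds; the function name `sample_timestamps` and the example start value are hypothetical.

```python
def sample_timestamps(start_us, sampling_rate_hz, n_samples):
    """Timestamps (in microseconds) for evenly sampled data.

    Assumption: the file stores one starting timestamp in microseconds
    and the samples follow at a constant rate.
    """
    step_us = 1_000_000 / sampling_rate_hz  # 15625 microseconds at 64 Hz
    return [round(start_us + i * step_us) for i in range(n_samples)]

# Hypothetical start time, 64 Hz BVP stream, first four samples.
ts = sample_timestamps(1_700_000_000_000_000, 64, 4)
```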
In this experiment, the full BVP dataset was filtered down to sleep-time data only, as the BVP signals captured by wearable devices such as the Empatica Embrace Plus can be unreliable during exercise. 5 Sleep periods were detected by the Empatica Embrace Plus device itself, and all awake periods, during both daytime and nighttime, were filtered out. During feature extraction, the BVP data was further filtered with a 0.5–8 Hz bandpass filter to remove irrelevant artifacts.
Feature extraction
The feature extraction method comprises a multi-stage pipeline designed to process sensor data for analysis. The pipeline is implemented in Python and consists of three primary stages: (1) sleep data extraction and interval grouping, (2) blood volume pulse (BVP) data processing and filtering, and (3) photoplethysmography (PPG) feature extraction using the NeuroKit2 Python library. Sleep data is obtained from a series of digital biomarker CSV files from the device. For each participant, the method searches through a predefined directory structure to locate files containing sleep-detection records. Each file is read into a pandas DataFrame, where the Unix timestamps (provided in milliseconds) are converted into datetime objects. The method retains only those records where the sleep-detection stage is either 101 or 102, which indicate rest intervals. This filtering step ensures that only intervals when participants were asleep are preserved for further analysis. After filtering, consecutive sleep events are grouped into intervals. The grouping algorithm calculates the time difference between successive timestamps and starts a new sleep interval whenever the gap exceeds one minute.
For each group, the start and end times are recorded, and the interval duration is calculated. BVP data is collected from daily CSV files. Each file is read and concatenated into a single data frame per participant, with timestamps converted from microseconds to datetime format. The BVP data is then filtered based on pre-determined sleep intervals, retaining only records with timestamps within each interval. This step ensures that subsequent analyses focus only on data recorded during sleep periods.
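The interval grouping step can be sketched as follows. This is a minimal standard-library illustration of the one-minute gap rule, not the pandas-based implementation used in the study; `group_intervals` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def group_intervals(timestamps, max_gap=timedelta(minutes=1)):
    """Group timestamps into (start, end, duration) sleep intervals.

    A new interval starts whenever the gap to the previous timestamp
    exceeds max_gap.
    """
    intervals = []
    start = prev = None
    for ts in sorted(timestamps):
        if start is None:
            start = prev = ts
        elif ts - prev > max_gap:
            intervals.append((start, prev, prev - start))
            start = prev = ts
        else:
            prev = ts
    if start is not None:
        intervals.append((start, prev, prev - start))
    return intervals

# Hypothetical sleep-detection records at minutes 0, 1, 2, 10 and 11.
base = datetime(2024, 5, 2, 23, 0)
stamps = [base + timedelta(minutes=m) for m in (0, 1, 2, 10, 11)]
intervals = group_intervals(stamps)
```

With these example records, the eight-minute gap between minute 2 and minute 10 splits the data into two intervals.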
Finally, the filtered BVP data is processed to extract PPG features. The data is resampled into fixed 10-min windows. For each window containing a sufficient number of data points (at least 20), NeuroKit2, an open-source Python package for neurophysiological signal processing, 8 is used to analyze the raw BVP signal. This process includes pre-processing the PPG signal, extracting relevant features and metadata using the ppg_process function, and applying the ppg_analyze function to derive a broad set of heart rate, HRV, and PPG metrics. The extracted features from each window are then aggregated into a single-row representation that includes the corresponding window start timestamp. These windowed features provide a time-resolved summary of cardiovascular dynamics during sleep. In this way, a total of 92 features were extracted; the exact list of features is given on the NeuroKit website (https://neuropsychology.github.io/NeuroKit/functions/hrv.html#).
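The windowing rule (fixed 10-min windows, at least 20 data points per window) can be sketched with the standard library alone. The feature extraction itself is done by NeuroKit2 and is not reproduced here; `window_samples` and the toy samples are hypothetical.

```python
from collections import defaultdict

WINDOW_S = 600    # 10-min window length in seconds
MIN_POINTS = 20   # minimum number of samples for a window to be analysed

def window_samples(samples, window_s=WINDOW_S, min_points=MIN_POINTS):
    """Bucket (unix_seconds, value) pairs into fixed windows.

    Returns {window_start_seconds: [values]} and drops windows with
    fewer than min_points samples.
    """
    buckets = defaultdict(list)
    for t, v in samples:
        buckets[int(t // window_s) * window_s].append(v)
    return {start: vals for start, vals in sorted(buckets.items())
            if len(vals) >= min_points}

# Toy data: 25 samples in the first window, only 5 in the second.
samples = [(i, 0.0) for i in range(25)] + [(600 + i, 0.0) for i in range(5)]
windows = window_samples(samples)
```

Each retained window would then be passed to the feature extraction step, and the window start would be stored alongside the resulting feature row.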
All extracted features, along with their corresponding timestamps, were stored in a compressed Python .pkl format for each participant, allowing for faster loading in subsequent analyses.
Classifiers and feature selection
Two classifiers, XGBoost and Random Forest, were used for training. The hyperparameters for each model are shown in Table 2; however, as this is an initial study, most parameters were set to their default values and will be tuned in future work. XGBoost was configured with logloss as the evaluation metric, while Random Forest used the Gini split criterion and a “balanced” class weight, ensuring automatic weight adjustment based on class frequencies in the input data.
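As an illustration of the “balanced” class weight, the sketch below applies the usual heuristic, weight = n_samples / (n_classes × class count), to the pooled label counts reported above. Because the models were trained per participant, the actual weights differ for each dataset; the pooled numbers are used here only as an example, and `balanced_class_weights` is a hypothetical helper.

```python
def balanced_class_weights(counts):
    """counts: {label: n}. Returns n_samples / (n_classes * n) per label."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Pooled label counts from the text (illustrative only).
weights = balanced_class_weights({"migraine": 3760, "non-migraine": 13793})
```

With these counts the minority (migraine) class receives a weight of roughly 2.33 and the majority class roughly 0.64, so errors on migraine windows are penalized more heavily.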
Hyperparameters of machine learning models.
Feature selection was performed using two methods: analysis of variance (ANOVA) and embedded feature selection. In the ANOVA variant, the f_classif function from the sklearn.feature_selection module was used to perform a one-way ANOVA F-test. It compares the variance between and within classes for each feature, returning an F-statistic that quantifies class separability and a p-value that indicates its statistical significance.
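To make the F-statistic concrete, the following sketch computes the one-way ANOVA F-value for a single feature by hand. It is a standard-library illustration of the ratio of between-class to within-class variance and does not compute the p-value; `anova_f` and the toy values are hypothetical.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F-statistic for a list of groups of feature values."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-class sum of squares: spread of class means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-class sum of squares: spread of values around their class mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy feature values for non-migraine and migraine windows.
f_stat = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]])
```

A larger F-value means the class means are further apart relative to the scatter within each class, which is why features can be ranked by it.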
In the embedded approach, the classifier was first trained on all features. The top 20 features were then selected, and the classifier was retrained on that subset only. For each classifier and each feature selection method, the top 20 features were used.
Feature ranking
Feature ranking was performed per participant, i.e., each dataset had its own set of top features. The ranking process aimed to select 20 features for training the final classifiers, and was performed using two methods: (1) embedded feature selection, where the classifier was first trained on all features, and then the top 20 features were selected (based on gain for XGBoost and impurity reduction for Random Forest), and (2) ANOVA. Figures 1–4 show the top 20 most frequently selected features across all participants, where selection frequency indicates the number of participants for which each feature was selected.
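The per-participant ranking and the selection frequency can be sketched as follows. The importance values are invented for illustration, k is reduced to 2 to keep the example small, and `top_k` is a hypothetical helper rather than part of the study code.

```python
from collections import Counter

def top_k(importances, k):
    """Names of the k features with the largest importance score."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Invented importance scores for two participants.
per_participant = {
    "P01": {"HRV_MeanNN": 0.30, "HRV_pNN20": 0.25, "HRV_LF": 0.05},
    "P02": {"HRV_MeanNN": 0.10, "HRV_pNN20": 0.40, "HRV_LF": 0.35},
}
selected = {pid: top_k(imp, k=2) for pid, imp in per_participant.items()}

# Selection frequency: number of participants for which a feature was chosen.
frequency = Counter(name for names in selected.values() for name in names)
```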

Top 20 features selected by the XGBoost classifier using embedded feature selection.

Top 20 features selected by the XGBoost classifier using ANOVA.

Top 20 features selected by the Random Forest classifier using ANOVA.

Top 20 features selected by the Random Forest classifier using embedded feature selection.
Figure 1 shows the top 20 feature frequencies per participant for the XGBoost classifier using embedded feature selection. Among the top features, 10 were time-domain metrics, with 6 of them ranking in the top 10 most frequently selected features: HRV_MedianNN, the median of all NN (normal-to-normal) intervals, representing the central tendency of the interbeat intervals; HRV_MeanNN, the mean of all NN intervals; HRV_Prc80NN, the 80th percentile of NN intervals; HRV_pNN20, the percentage of absolute differences in successive NN intervals greater than 20 ms; HRV_SDNN1, the mean of standard deviations of NN intervals from 1-min segments of time-series data; and HRV_CVNN, the standard deviation of NN intervals (SDNN) divided by their mean (MeanNN).
In addition, nine metrics were nonlinear, including four of the remaining top 10 most frequently selected features: HRV_IALS, the inverse of the mean length of acceleration/deceleration segments; HRV_PIP, the percentage of inflection points in the NN interval series; HRV_CD, the total contribution of heart rate decelerations to HRV; and HRV_PSS, the percentage of short segments.
A frequency domain metric, HRV_LF (low-frequency spectral power), was also among the top selected features.
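Several of the time-domain definitions above can be written out directly. The sketch below follows those textual definitions using the standard library; library implementations such as NeuroKit2 may differ in details (for example, the denominator used for pNN20), so the values are illustrative rather than a reproduction of the extracted features. The NN series is invented.

```python
from statistics import mean, median, stdev

def time_domain(nn_ms):
    """Basic time-domain HRV statistics from NN intervals in milliseconds."""
    # Absolute differences between successive NN intervals.
    diffs = [abs(b - a) for a, b in zip(nn_ms, nn_ms[1:])]
    mean_nn = mean(nn_ms)
    return {
        "MeanNN": mean_nn,
        "MedianNN": median(nn_ms),
        # CVNN: standard deviation of NN intervals divided by their mean.
        "CVNN": stdev(nn_ms) / mean_nn,
        # pNN20: share of successive differences larger than 20 ms.
        "pNN20": 100 * sum(d > 20 for d in diffs) / len(diffs),
    }

# Invented NN intervals (ms).
features = time_domain([800, 810, 850, 845, 900])
```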
XGBoost with ANOVA, shown in Figure 2, exhibited a similar distribution of selected features, with 9 time-domain metrics, 10 nonlinear metrics, and 1 PPG metric (PPG_Rate_Mean, representing the mean heart rate within the analysis window).
Notably, 11 features overlapped between the two selection methods, including key time-domain metrics (HRV_Prc20NN, HRV_pNN20, HRV_MedianNN, HRV_Prc80NN, HRV_CVNN, HRV_CVSD, HRV_MeanNN) and nonlinear metrics (HRV_PSS, HRV_CD, HRV_IALS, HRV_PIP).
Among the time-domain metrics, the most frequently selected were HRV_Prc20NN, HRV_pNN20, HRV_MedianNN, HRV_pNN50, HRV_Prc80NN, and HRV_HTI (the HRV triangular index, which measures the total number of NN intervals relative to the height of the NN interval histogram).
In comparison, while the Random Forest classifier with ANOVA (Figure 3) had a feature distribution similar to XGBoost with ANOVA, the embedded feature selection approach (Figure 4) resulted in a higher frequency of nonlinear metrics (12) than the other three configurations. Of these, five metrics - HRV_PI (heart rate asymmetry measure), HRV_IALS, HRV_PAS (percentage of NN intervals in alternation segments), HRV_PIP, and HRV_PSS - were among the most frequently selected.
As with XGBoost with ANOVA, PPG_Rate_Mean was among the most selected features in both Random Forest configurations.
While the top features vary slightly, nearly all are derived from heart rate variability. Across participants, both embedded selection and ANOVA mostly chose time-domain features (e.g., MeanNN, MedianNN, pNN20/pNN50), with a smaller subset of nonlinear segmentation features (PSS, IALS, PIP, CD). Frequency-domain features were selected less often.
In sleep windows, the NN-interval series is relatively stationary with good signal-to-noise, so simple time-domain statistics computed over 10-min windows are stable and discriminative. Because many of these statistics are strongly correlated, different selectors often choose a similar representative from the same feature domain.
PSS (percentage of short segments) measures the proportion of short monotonic runs, and IALS (inverse average length of acceleration/deceleration segments) reflects the typical run length of monotonic increases/decreases in NN intervals. Higher PSS or IALS indicates more frequent switching between acceleration and deceleration, i.e., more fragmented beat-to-beat dynamics. These local-dynamics measures differed consistently between positive and negative nights, so both selectors choose them.
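The intuition behind these fragmentation measures can be shown with a simplified run-based sketch. It assumes a run is a maximal sequence of successive NN differences with the same sign and that a run is "short" when it contains fewer than three differences; this is an approximation for illustration and not the exact NeuroKit2 computation. The function name `fragmentation` and both series are hypothetical.

```python
def fragmentation(nn_ms, short_len=3):
    """Simplified IALS and PSS from sign runs of successive NN differences."""
    diffs = [b - a for a, b in zip(nn_ms, nn_ms[1:])]
    signs = [(d > 0) - (d < 0) for d in diffs]  # +1, 0 or -1
    # Collect lengths of maximal runs with a constant sign.
    lengths = []
    previous = None
    for s in signs:
        if s == previous:
            lengths[-1] += 1
        else:
            lengths.append(1)
            previous = s
    total = sum(lengths)
    return {
        # Inverse of the average run length.
        "IALS": len(lengths) / total,
        # Percentage of differences that fall in short runs.
        "PSS": 100 * sum(n for n in lengths if n < short_len) / total,
    }

alternating = fragmentation([800, 820, 810, 830, 815])  # switches every beat
smooth = fragmentation([800, 810, 820, 830, 840])       # one long run
```

The alternating series yields the maximum values of both measures, while the smooth series yields low values, matching the interpretation of higher PSS or IALS as more fragmented dynamics.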
Although the sensor samples the BVP signal at evenly spaced times, the NN intervals derived from it are not evenly spaced. Spectral HRV analysis therefore needs resampling and longer windows to be stable. With 10-min windows, spectra can be noisier, so time-domain and short-run features can be more reliable and are thus selected more often.
Classifiers were trained separately for each participant, not on the full dataset, to evaluate individual performance. The data was split, per participant, with an 80/20 split for training and testing, similar to a previous study. 7 The stratify parameter was used to ensure the same ratio of labels in both the training and the test set. The tables in this section present the mean, maximum, minimum, and standard deviation of accuracy, F1 score, precision, and recall.
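What stratification achieves can be shown with a small standard-library sketch. The study presumably relied on a library splitting routine with a stratify option; the function `stratified_split`, the seed, and the toy labels below are assumptions made for illustration.

```python
import random

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with the class ratio preserved."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        # Shuffle the indices of one class and move a fixed share to the test set.
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

# Toy participant dataset: 25 migraine and 75 non-migraine windows.
labels = [1] * 25 + [0] * 75
train_idx, test_idx = stratified_split(labels)
```

Both subsets keep the 25% share of positive labels, which matters here because a purely random split of a small, imbalanced dataset could leave very few migraine windows in the test set.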
Table 3 shows the accuracy of each classifier and feature selection method configuration. On average, all models achieved accuracy above 50%, with XGBoost reaching just over 60%. Given the class imbalance, accuracy alone can be misleading; nonetheless, XGBoost consistently outperformed Random Forest by 3–5%. Although XGBoost with ANOVA had a slightly lower mean accuracy (0.602) than the embedded feature selection variant (0.606), it achieved the highest maximum and minimum scores, making it the most consistent configuration overall.
Accuracy of each model configuration across all participants.
The F1 scores, shown in Table 4, balance precision and recall and are particularly useful for imbalanced data. With, on average, 25.4% positive and 74.6% negative labels, a random classifier would be expected to achieve an F1 near 0.254; all models exceeded this baseline. XGBoost again performed better overall, with a mean F1 around 0.39, maxima of 0.80–0.91, and lower variability (SD < 0.19). The ANOVA variant was the strongest configuration. Random Forest models achieved lower mean scores (0.23–0.25), failed to reach 0.9 in any participant, and showed higher variability (SD ≈ 0.22–0.24).
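The random-classifier baseline can be checked with a short calculation. The sketch assumes a classifier that predicts "migraine" independently of the input with a fixed probability; its expected precision then equals the prevalence and its expected recall equals that probability. The function name `expected_f1` is hypothetical, and the result is an approximation based on expected values.

```python
def expected_f1(prevalence, positive_rate):
    """Approximate F1 of a classifier guessing positive at a fixed rate."""
    precision = prevalence     # share of random positive guesses that are correct
    recall = positive_rate     # share of true positives hit by random guesses
    return 2 * precision * recall / (precision + recall)

# Guessing positive at the same rate as the mean prevalence (25.4%).
baseline = expected_f1(0.254, 0.254)
```

When the guessing rate equals the prevalence, the approximate F1 equals the prevalence itself, which gives the 0.254 reference value; other guessing rates would give a different baseline.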
F1-score of each model configuration across all participants.
Precision tells us how many of the episodes predicted as migraines were correct, while recall tells us how many of the actual migraine episodes were detected. Precision values are shown in Table 5. Mean values were similar across models (≈0.43–0.48). Random Forest occasionally reached perfect precision (max = 1) but also produced zero precision in some cases, reflected in its high variability (SD = 0.22–0.35). XGBoost had slightly lower mean precision (∼0.43) and lower maxima (∼0.80), but its minimum remained above zero (≈0.05–0.09), and variability was lower (SD = 0.18–0.19).
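The definitions of precision, recall, and F1 can be made concrete with a small example. The labels and predictions below are invented, and returning 0 when a denominator is zero is a convention assumed in this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = migraine)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    # Convention assumed here: a metric is 0 when its denominator is 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented test set: 3 migraine windows, 5 non-migraine windows.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Under this convention, a model that never predicts a migraine for a participant scores zero on all three metrics, which is one way the zero minima reported for some configurations can arise.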
Precision of each model configuration across all participants.
Recall, shown in Table 6, was generally lower than precision, with means of 0.19-0.37. Random Forest tended to underperform (mean ≈0.19, minimum = 0) and showed high variability (SD ≈0.24). XGBoost achieved higher average recall (∼0.37) and more consistent performance. Notably, both ANOVA-based models occasionally reached perfect recall (1.0), whereas embedded-feature variants peaked at 0.80 (XGBoost) and 0.95 (Random Forest). Overall, XGBoost, especially with ANOVA feature selection, provided the strongest balance between sensitivity and stability.
Recall of each model configuration across all participants.

XGBoost classifier performance for each metric by participant.

Random Forest classifier performance for each metric by participant.
Overall, each approach shows a wide gap between minimum and maximum values, along with moderate-to-high standard deviations. This is not unexpected, as the models were trained for each individual participant, and it is consistent with findings from our previous work. The XGBoost classifier is generally more stable across all metrics than the Random Forest model, even though the Random Forest classifiers can achieve higher maximum results in some splits.
If accuracy alone is considered, all models achieve average results above 50%, with the XGBoost models slightly above 60%. Precision of about 0.4–0.5 means that when a model predicts migraine, it is correct only around half the time on average; however, for some participants both the XGBoost and Random Forest classifiers achieved precision of 0.8–1, meaning that most or all of their migraine predictions were correct. Recall is generally in the 0.2–0.4 range, meaning the models catch only approximately 20–40% of migraine episodes on average. Again, the best runs reach a recall of 0.95 or 1.0, so performance varies with the participant dataset. A mean F1 score in the 0.24–0.40 range is modest for a two-class problem. The large spread suggests the models can sometimes do very well but are inconsistent, likely because of the small dataset available for each participant.
Both feature selection methods yield similar average performance, though the ANOVA approach occasionally reaches a higher maximum accuracy or F1. Embedded methods also have some good runs but can show extreme lows. A different feature selection approach could be explored: as noted previously, the two methods capture partly overlapping features, and more research is needed into which HRV and PPG features should be used, as these methods alone are not enough to produce good, stable prediction results.
We formally compared XGBoost and Random Forest using paired Wilcoxon signed-rank tests across participants (n = 17). With embedded feature selection, XGBoost improved recall (median Δ = 0.194, 95% CI [0.158, 0.264]; p = 0.0002) and F1 (median Δ = 0.159, 95% CI [0.082, 0.238]; p = 0.0005), with a modest gain in balanced accuracy (median Δ = 0.041, 95% CI [−0.007, 0.070]; p = 0.023). Precision did not differ reliably (median Δ = −0.071, 95% CI [−0.167, 0.053]; p = 0.517). These results indicate that XGBoost's advantage is driven primarily by higher sensitivity, without a significant change in precision.
Using ANOVA-based feature selection (n = 17), XGBoost improved recall (median Δ=0.194; p = 0.0002) and F1 (median Δ=0.159; p = 0.0005) over Random Forest, with a small balanced-accuracy gain (median Δ=0.041; p = 0.023) and no reliable precision difference (p = 0.517).
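The paired comparison can be illustrated with a sketch of the signed-rank statistic and the median paired difference. The per-participant scores below are invented (expressed in percent) and are not the study data; the sketch computes only the rank sums, not the p-value or the confidence interval, and `signed_rank` is a hypothetical helper.

```python
from statistics import median

def signed_rank(x, y):
    """Wilcoxon signed-rank sums (W+, W-) and the median paired difference."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the block of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, median(a - b for a, b in zip(x, y))

# Invented recall values (%) for five participants.
xgb_recall = [40, 35, 50, 20, 45]
rf_recall = [20, 30, 25, 25, 15]
w_plus, w_minus, median_delta = signed_rank(xgb_recall, rf_recall)
```

A large imbalance between the two rank sums indicates that one model is consistently better across participants, which is what the test evaluates.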
A full metric performance breakdown for each participant for the classifiers using ANOVA can be seen in Figure 5 and Figure 6. While the XGBoost classifier is more stable and performs better on average, the Random Forest classifier can achieve better results for some participants; therefore, choosing one model as the best-performing model is currently difficult. Both models may benefit from more data or a different approach, such as fine-tuning. Further research is also needed on why the models perform much better on some participants than on others, as this may be due to the selected features, different physiological patterns, or other factors.
Regarding the limitations, the exclusion of post-migraine nights, while aimed at avoiding misclassification, may overlook valuable transitional physiological changes that could provide insights into recovery patterns and lingering ANS imbalances. Post-migraine phases are known to involve gradual autonomic stabilization, 18 and analyzing HRV trends during this period could help refine predictive models by distinguishing between pre-migraine and recovery states. Moreover, incorporating post-migraine nights might allow for a more comprehensive understanding of the entire migraine cycle, potentially revealing biomarkers indicative of susceptibility to subsequent attacks. Without this data, the study may miss important physiological fluctuations that could enhance the accuracy and generalizability of migraine prediction models. Future research should consider controlled inclusion of post-migraine nights, potentially leveraging additional statistical or machine learning techniques to mitigate misclassification risks while capturing these transitional dynamics.
In conclusion, this study demonstrates the potential of wearable sensor technology in predicting migraine episodes by monitoring changes of the ANS during the prodrome phase. In this pilot study, our goal is an assistive, non-autonomous early-warning aid for already diagnosed users (human-in-the-loop), rather than a clinical diagnostic tool. The analysis of HRV during nocturnal sleep provides an important approach to migraine attack prediction, revealing significant individual variability in physiological responses. The balanced accuracy of prediction models varied across participants, highlighting the need for personalized prediction strategies.
The consistent selection of HRV features such as the median of all NN (normal-to-normal) intervals, the percentage of absolute differences in successive NN intervals greater than 20 ms, and the inverse of the mean length of acceleration/deceleration segments by both embedded and ANOVA methods indicates stable, frequently informative descriptors in our setting. Their repeated appearance across different feature selection strategies highlights these metrics as key candidates for further investigation.
That said, the average performance levels observed (F1 ≈ 0.24–0.40) remain modest and are not sufficient for direct clinical deployment or use as a stand-alone early warning system. These results should instead be viewed as a proof-of-concept: they demonstrate that physiological signals contain predictive information, but larger datasets, multimodal features, and longitudinal validation will be required before clinically useful accuracy can be achieved.
Overall, these results lay the groundwork for developing more effective and personalized migraine prediction models, aimed at practical, threshold-tuned early warning for diagnosed users. Further studies with larger and more diverse participant groups are recommended to validate these findings and refine the predictive models, including analyses to explain variability and the addition of complementary signals (e.g., EDA, skin temperature).
Footnotes
Ethical considerations
The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Vilnius University Hospital Santaros Klinikos (No. 2024/4-1569-1041) on 9 April 2024.
Consent to participate
Written confirmation has been obtained from all participants whose data are included in this study; they consent to the use of their medical information for scientific purposes.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The datasets generated and analysed during the current study are not publicly available due to ethical board approval restrictions but are available from the corresponding author upon reasonable request.
