Abstract
Objective
To introduce a highly innovative imaging method to study the complex velopharyngeal (VP) system and introduce the potential future clinical applications of a VP atlas in cleft care.
Design
Four healthy adults participated in a 20-min dynamic magnetic resonance imaging scan that included a high-resolution T2-weighted turbo-spin-echo 3D structural scan and five custom dynamic speech imaging scans. Subjects repeated a variety of phrases while in the scanner as real-time audio was captured.
Setting
Multisite institution and clinical setting
Participants
Four adult subjects with normal anatomy were recruited for this study.
Main Outcome
Establishment of 4-D atlas constructed from dynamic VP MRI data.
Results
Three-dimensional dynamic magnetic resonance imaging was successfully used to obtain high quality dynamic speech scans in an adult population. Scans were able to be re-sliced in various imaging planes. Subject-specific MR data were then reconstructed and time-aligned to create a velopharyngeal atlas representing the averaged physiological movements across the four subjects.
Conclusions
The current preliminary study examined the feasibility of developing a VP atlas for potential clinical applications in cleft care. Our results indicate excellent potential for the development and use of a VP atlas for assessing VP physiology during speech.
Introduction
Static velopharyngeal (VP) magnetic resonance imaging (MRI) is an established area of research that continues to gain clinical interest and routine use.1–5 Static MRI provides reliable and valid quantitative data regarding key VP structures including pharyngeal depth, velar length, relative position of the levator veli palatini muscle compared to the posterior hard palate, and size and location of the velopharyngeal gap, when present.1,6–12 These variables have been shown to be predictors of improved outcomes, highlighting the value of static MRI in pre-surgical planning.1,13,14 Static MRI has also been used during sustained phonation tasks to capture the maximal movements of the VP mechanism, indicating what the mechanism is capable of during sustained phonation. These insights provide details about the dynamic or hypodynamic nature of the VP muscles, which have been documented as causes of VP dysfunction.15–18 Comparing these measures from patient anatomy with those from normal anatomy allows one to determine whether differences are clinically meaningful. Taken together, these findings demonstrate the value of static VP MRI in cleft care.
Dynamic MRI provides data related to the physiology of the VP mechanism on a more habitual level (i.e., during connected speech), as is often done with other imaging modalities such as nasopharyngoscopy and videofluoroscopy. However, dynamic VP MRI during connected speech is not used extensively in the clinical setting, largely due to two primary barriers. First, published high-speed (30 frames per second or higher) dynamic methods19–21 have used research scanners, and the sequences have not been translated to the clinical setting. Second, dynamic imaging methods are associated with a large data output, and clinical interpretation is often time-intensive and requires significant amounts of manual labor. 22 There is currently no time-efficient clinical tool that compares patient data to normalized data so that clinicians can generate a meaningful, visual diagnostic representation of patient-specific movements and deficits (e.g., decreased lateral pharyngeal wall movements, poor velar movements, etc.). An ideal VP MRI protocol should include static VP MRI, dynamic VP MRI during speech, and a tool to compare a patient's anatomy and physiology to that of normal anatomy so that clinicians can easily identify where the dysfunction in movement is occurring.
MRI atlases have been created for numerous areas of the body (e.g., hip, ankle, shoulder, elbow, tongue, vocal tract, and brain) to create a normalized representation of anatomy and physiology.23–30 As such, MRI scans from several individuals are first used to create a data bank. An atlas then draws a set of images from this data bank to create a group-wise single representation of anatomy and physiology. Atlases can provide clinical value by serving as a reference volume to identify and define a disorder or abnormality. Applying this same process of atlas creation to examine the VP system during speech could offer tremendous insight into the abnormalities of speech, VP dysfunction, and potentially improve our understanding of the impact of surgical changes by providing an improved diagnostic tool for VPI assessment. The purpose of this paper is to present preliminary data using four adult subjects to (1) demonstrate a high-speed dynamic MRI sequence that can be used in VP assessment and (2) construct a four-dimensional (4-D) dynamic MRI VP atlas. Lastly, we aim to highlight the potential innovations by creating a theoretical framework for the use of VP MRI atlases in cleft care.
Methods
Subjects & Data Acquisition
In accordance with the local Institutional Review Board at East Carolina University, four adult subjects with normal anatomy were recruited for this study. Race and age were variable. Subject demographics are as follows: one White male aged 44, one Asian male aged 26, one White female aged 25, and one Black female aged 22. All subjects were native speakers of English and reported no history of neurological, musculoskeletal, craniofacial, or hearing disorders. Articulation and resonance capabilities of all subjects were judged by a speech-language pathologist and deemed to be within normal limits.
High-resolution MRI datasets were acquired using a Siemens 3 T Prisma MRI system (Siemens Healthcare Inc., Malvern, PA) with a 20-channel head coil. Each subject was imaged in the supine position and instructed to keep their head still during all scans. All subjects underwent a 3D structural scan (scan time: 3 min and 54 s) and five 3D dynamic speech scans. All dynamic speech imaging scans used a customized dynamic MRI sequence 20 with a field of view of 256 × 256 mm, 10 slices 6 mm thick, voxel size of 2 × 2 × 6 mm, echo time TE of 2.44 ms and repetition time TR of 5.16 ms. The temporal information for the dynamic scan was acquired through a 3D spiral-in/spiral-out cone navigator with 5.8 ms TR. Each navigator was interlaced with four k-space lines of the imaging sequence, creating one temporal frame. 7 This gave a temporal frame length of 25.2 ms per 3D image frame (ms/frm) in the reconstructed 3D dynamic image. The effective temporal resolution per 3D volume of this scan was 39.67 frames per second (fps) for a total of 28,800 volumes acquired across all five 3D dynamic speech scans.
Audio from the speech scans was simultaneously recorded using an MR-compatible noise-cancellation microphone (Optoacoustics, FORMI-III), which was used to preserve the integrity of speech but significantly reduce noise from the MR scanner. The audio receiver was placed in the control room so that study personnel could ensure that subjects were stating the appropriate phrases for the required length. During dynamic speech scans, subjects repeated a series of phrases and sentences including: “buy baby a bib,” “get a cookie,” “mom ‘n bob are happy,” “hamper, hamper, hamper,” and a counting sample of “60–66”. A metronome was not used because we aimed to collect natural self-paced repetitions from the subjects. All scans for this study were obtained in less than 20 min.
The speech stimuli were carefully selected by three speech-language pathologists, each with over 20 years of experience in VP assessments, and a linguist to ensure that the stimuli captured contributions and movements of the velum in the following contexts: lowered velum (i.e., nasal consonants like [m] and [n]); a fully elevated velum with closure on the posterior pharyngeal wall (i.e., oral pressure consonants like [b] and [k]); adjustments between the lowered and elevated positions (i.e., stimuli with both oral and nasal consonants present); velar contribution during the production of sibilants (i.e., repeated [s] in counting sample); and the consonant [h] where the position of the velum is not strictly specified as either elevated or lowered. The selected stimuli are consistent with those often used by speech-language pathologists for clinical assessments of children with repaired cleft palate.
Results
Dynamic MRI Sequence
The dynamic speech MRI methods used here allow for simultaneous visualization of midsagittal and oblique coronal image planes, as resampled from the 39.67 fps 3D data (Supplemental Video 1). The total scan time for each speech stimulus was 2 min and 25 s. The custom acquisition images a spatiotemporal model that provides high temporal resolution but requires sufficient frames to be sampled to identify the model. 7
High spatiotemporal resolution is achieved with our dynamic imaging sequence through an acquisition and image reconstruction method based on the partial separability model. 31 In this framework, data are captured with a specialized MRI pulse sequence that interleaves a dynamic navigator acquisition to acquire high temporal resolution with an imaging acquisition that acquires lines of k-space, the data space of the MRI. The dynamic navigator data are then used to extract a small number of temporal waveforms, using a singular value decomposition (SVD), that describe the dynamic information in the scan. The imaging lines of k-space are then used to fit spatial maps corresponding to those temporal waveforms. Finally, the spatial maps and temporal waveforms are mixed back together, similar to a principal components analysis, to form the 3D dynamic image time series. This results in significant overall accelerations in the imaging, achieving high spatiotemporal resolutions that are not possible through standard serial imaging with MRI that might be found in clinical scanners.
By using a separate, but interleaved, acquisition of spatial and temporal information, the acquisition optimizes the capturing of imaging data that provides a high-quality image that is robust to the magnetic field inhomogeneity that exists in the oropharyngeal region. Although the method does not require explicit repetitions, the desired physiological movements from the speech samples must represent a significant amount of energy in the temporal signal for the SVD to maintain this motion in the final represented image series.
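The three reconstruction steps described above (extract temporal waveforms from the navigator by SVD, fit spatial maps from the sparsely sampled imaging lines, and recombine) can be illustrated with a minimal sketch on synthetic data. This is not the authors' reconstruction code. The function names, the rank of 3, the real-valued data, the choice of the first eight k-space locations as the navigator, and the 20% random sampling pattern are all assumptions made for illustration, and the final inverse Fourier transform to image space is omitted.

```python
import numpy as np


def simulate_low_rank_series(n_locations=64, n_frames=200, seed=0):
    """Synthetic rank-3 data matrix: rows are k-space locations, columns are frames."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_frames)
    temporal = np.stack([np.ones_like(t),
                         np.sin(2 * np.pi * 3 * t),
                         np.cos(2 * np.pi * 5 * t)])
    spatial = rng.standard_normal((n_locations, temporal.shape[0]))
    return spatial @ temporal


def partial_separability_recon(navigator, samples, mask, rank):
    """Rebuild the full data matrix from navigator data plus sparse samples."""
    # Step 1: temporal waveforms are the leading right singular vectors
    # of the navigator data, which is sampled at every frame.
    _, _, vh = np.linalg.svd(navigator, full_matrices=False)
    temporal_basis = vh[:rank]                      # (rank, n_frames)

    # Step 2: for each k-space location, least-squares fit of its spatial
    # coefficients using only the frames at which it was actually sampled.
    n_locations = samples.shape[0]
    spatial_maps = np.zeros((n_locations, rank))
    for i in range(n_locations):
        observed = mask[i]
        design = temporal_basis[:, observed].T      # (n_observed, rank)
        spatial_maps[i] = np.linalg.lstsq(design, samples[i, observed], rcond=None)[0]

    # Step 3: mix spatial maps and temporal waveforms back together.
    return spatial_maps @ temporal_basis


truth = simulate_low_rank_series()
navigator = truth[:8]                               # assumed navigator locations
mask = np.random.default_rng(1).random(truth.shape) < 0.2
samples = np.where(mask, truth, 0.0)                # unsampled entries are unknown
recon = partial_separability_recon(navigator, samples, mask, rank=3)
rel_error = np.linalg.norm(recon - truth) / np.linalg.norm(truth)
print(f"relative reconstruction error: {rel_error:.2e}")
```

Because the toy data are exactly low rank and noise-free, the reconstruction error is at numerical precision. With real data, the rank would be a tuning choice, and motion that carries little energy in the navigator signal would be lost at the SVD truncation, which is the limitation noted above.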
Temporal Alignment of Data & Atlas Construction
Preprocessing of all datasets was necessary to construct a 4-D dynamic atlas of each speech task. Since each atlas was a combination of all subject data to represent their collective shape, both tissue motion and tissue anatomy captured in the volume sequences had to be synchronized among all subjects. This synchronization necessitated a spatiotemporal image volume alignment process 26 as described below.
Spatial alignment was intrinsically integrated into the atlas construction process through diffeomorphic image registration. 32 The diffeomorphic image registration algorithm that was adopted in this work specifically produced an invertible, continuous, and geometrically smooth (diffeomorphic) deformation field. 33 This diffeomorphic field transformed a source volume to closely resemble a target reference volume. In alignment with our previously proposed atlas construction methods, 32 an unbiased 4-D dynamic atlas was created by jointly aligning all anatomical structures in this study population. This method was followed instead of designating one specific subject as a target reference volume. Further details on the specific method steps can be found in Woo et al., 26 which also includes a comprehensive evaluation of atlas quality.
For temporal alignment, the method presented by Xing and colleagues, 34 which addresses temporal alignment of multi-subject dynamic MRI data for 4-D atlas construction, was applied. In brief, a temporal mapping from each subject to the target reference subject was determined by using audio waveform matching. 34 The audio waveforms simultaneously recorded during the image scanning sessions were decomposed into one-dimensional (1D) signals, which were treated as input to a 1D image registration algorithm. Image registration aligns corresponding structure between two inputs; for 1D signals, this structure is represented by the temporal positions of peaks and valleys within the signal waveform. How well these positions agree between two signals can be quantified by cross-correlation, a metric that measures the similarity between the two signals. One subject's audio was manually selected to serve as the target reference signal to which all other subjects’ signals were matched. For this task, diffeomorphic demons image registration was applied 35 to match each subject's waveform to the reference waveform using cross-correlation as the similarity metric. It is important to note that this registration algorithm was the same one used to align the MRI volumes, with only its input signal's dimension reduced to one for waveform alignment. As a result, each subject's audio time frames were aligned with those in the target reference's temporal space, creating a temporal map. After all waveforms in the study population were matched and all temporal maps were computed, a modified sequence of image volumes, maintaining a consistent temporal speech rhythm within the same target reference space, was computed for each subject. 34 Further details on the specific method steps, along with a comprehensive evaluation of the accuracy of waveform-guided temporal alignment, can be found in Xing et al. 34
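The role of cross-correlation as a similarity metric between two 1D waveforms can be illustrated with a short sketch. Note that this is a deliberately simplified stand-in: it estimates only a single global time shift, whereas the diffeomorphic demons registration described above yields a smooth, nonlinear temporal map. The function name `estimate_lag` and the toy Gaussian envelopes are assumptions made for illustration.

```python
import numpy as np


def estimate_lag(reference, moving):
    """Return the shift (in samples) that maximizes cross-correlation.

    A positive value means `moving` is delayed relative to `reference`.
    """
    ref = np.asarray(reference, dtype=float)
    mov = np.asarray(moving, dtype=float)
    ref = ref - ref.mean()
    mov = mov - mov.mean()
    # Entry k of the full cross-correlation is sum_n mov[n + k] * ref[n],
    # for k running from -(len(ref) - 1) to len(mov) - 1.
    xcorr = np.correlate(mov, ref, mode="full")
    lags = np.arange(-(len(ref) - 1), len(mov))
    return int(lags[np.argmax(xcorr)])


# Toy amplitude envelopes with two peaks (e.g., two syllables).
t = np.arange(400)
reference = (np.exp(-((t - 100) ** 2) / (2 * 4.0 ** 2))
             + 0.7 * np.exp(-((t - 250) ** 2) / (2 * 6.0 ** 2)))
moving = np.roll(reference, 12)       # same utterance, delayed by 12 samples

lag = estimate_lag(reference, moving)
aligned = np.roll(moving, -lag)       # undo the estimated delay
print(lag)
```

A global shift cannot correct for a speaker who produces one syllable faster and another slower than the reference, which is why a deformable registration is needed for self-paced speech.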
Finally, after temporal alignment of all subjects, at each frame all subjects’ image volumes were combined to construct an unbiased 4-D dynamic atlas. In brief, the atlas construction step involves creating a common spatial space using groupwise diffeomorphic registration 32,33 with volumes from the first time frame. Next, volumes from all remaining time frames and subjects are transported into this common space using a single transformation for each subject, reducing potential bias caused by anatomical differences while preserving temporal correspondence. Atlases are constructed for each time frame using volumes deformed to the target reference time frame space using groupwise diffeomorphic registration (Supplemental Video 2).
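The final combination step can be sketched as a frame-by-frame average of temporally resampled volumes. This sketch assumes that the volumes have already been brought into a common spatial space and that each temporal map is available as an array of frame indices; the groupwise diffeomorphic registration itself is not shown. The function name and the toy data are hypothetical.

```python
import numpy as np


def average_aligned_volumes(sequences, temporal_maps):
    """Average subjects frame by frame in the reference time base.

    sequences:     list of arrays shaped (n_frames_subject, z, y, x),
                   assumed to be spatially aligned already.
    temporal_maps: list of integer arrays; entry k of a map is the
                   subject frame matched to reference frame k.
    """
    resampled = [np.asarray(seq)[np.asarray(idx)]
                 for seq, idx in zip(sequences, temporal_maps)]
    return np.mean(np.stack(resampled, axis=0), axis=0)


# Two toy subjects with different numbers of frames (4 and 6).
subject_a = np.full((4, 2, 3, 3), 1.0)
subject_b = np.full((6, 2, 3, 3), 3.0)
maps = [np.array([0, 1, 2, 3]),       # subject A is the reference
        np.array([0, 2, 3, 5])]       # subject B spoke more slowly

atlas = average_aligned_volumes([subject_a, subject_b], maps)
print(atlas.shape)
```

After resampling, every subject contributes the same number of frames, so the atlas has the length of the reference sequence.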
In this study, a VP 4-D MRI spatiotemporal atlas (Figure 1) was successfully created based on data from four adult subjects. The authors aimed to determine whether atlas construction methods 26 could be applied to the VP system during standard connected speech samples. Supplemental Video 2 demonstrates the atlas constructed from the four subjects, which represents the statistically averaged data volume from the subjects while speaking the phrase “get a cookie”.

Figure 1. A flowchart of the dynamic atlas construction process. The recorded audio waveforms associated with each subject are aligned to provide information for temporally aligning all subjects’ image sequences, which all have the same length after temporal alignment. Deformable registration is used to spatially align images’ anatomical structures, which are then averaged into a unified dynamic atlas sequence.
Discussion
This study demonstrates the application of a dynamic speech MRI sequence for capturing and assessing VP anatomy and physiology and the successful creation of a 4-D MRI atlas for the VP system. The dynamic MRI methods for this study were established on a Siemens research scanner (3 T Prisma) and can be translated onto clinical Siemens MRI scanners, as the authors have successfully acquired high quality data with the same frame rate on a Siemens 1.5 T Aera. Because the customized pulse sequence used to acquire the data is implemented in the Siemens proprietary pulse sequence development environment, expansion of the sequence to other vendor platforms is not trivial and has not been undertaken. However, the proposed techniques for dynamic VP MRI outlined in this study can be used on various Siemens clinical scanners.
Individual static MRI images have been shown to provide valuable information relating to the anatomical structures and musculature of the VP port.3,36–43 There are many clinical questions related to VP anatomy that can and should be derived from static MRI (at rest and during sustained phonation). For example, the cohesiveness and relative location of the levator veli palatini muscle are clinical factors that might dictate surgical decisions.3,7,37,44 Additionally, details related to pharyngeal depth, velar length, and VP gap size can be captured using simple two-dimensional static imaging, which offers the advantage of rapid scan times (less than 8 s). There is no doubt that static VP MRI has an important place in evaluating VPI. However, dynamic MRI during speech expands the clinical value of MRI in cleft care because it provides an understanding of the complex and multi-faceted nature of connected speech. Specifically, dynamic MRI assessment can highlight individual physiological variations across sentence-level stimuli and identify the specific place of dysfunction. For example, using directional displacement mapping or heat mapping of differences from the average normal patient, dynamic MRI paired with atlases can show the specific region of the tongue that is contributing to the speech error. Features related to timing and rate of VP closure, coordination of the VP mechanism during speech, and an overall view of the true capabilities of the VP system during connected speech can only be seen using dynamic MRI methods. For these reasons, clinical VP MRI protocols should include rapid (< 8 s) static images at rest, during sustained phonation,1,11,14 and dynamic MRI at the word- and/or sentence-level. However, the difficulty remains in how to interpret and use the dynamic VP MRI data for clinical cleft care.
The application of VP atlasing, as described in this paper, creates a powerful and highly innovative method to assess VP physiology during speech. Additionally, it offers the potential to provide a clinical method for comparing a patient's anatomy and physiology to that of a normalized population. In this study, the VP MRI atlas determined the average anatomy and physiology during speech across only four individuals. The four subjects used in this study demonstrated highly variable anatomy due to their varied ages, races, and genders. Despite this variability, the atlas was successfully constructed. For clinical use, atlas construction should be based on a large, homogeneous population. When using an atlas for comparison to a patient's VP MRI data, the selected individuals used in the atlas should be of similar age, race, sex, and possibly even regional dialect. Future research is needed to define which factors should be considered when comparing a VP MRI atlas to a patient.
It should also be noted that in the current atlas construction and temporal alignment process, the waveforms were mathematically treated as 1D inputs. Another potential direction for future research is to enhance alignment accuracy by integrating spectrogram analysis into the diffeomorphic registration process. Since waveforms only account for voice onset, nasalization also should be considered as an extra input source in the future. Variables that would impact nasalization onset such as rate, intonation, stress, and prosody need to be accurately acquired and considered as additional information to further refine the alignment process. Additionally, sentences used for VP assessments using MRI should be carefully selected to ensure a balance across nasal and oral sounds.
Similar to atlases constructed for other speech structures, any patient-specific differences in the VP movements could be captured using the dynamic MRI and atlas techniques. Our results indicate excellent potential and feasibility of a VP atlas for understanding VP physiology during speech. With further research developments, this can be used to compare disordered speech among children born with cleft palate who present with persistent speech and resonance disorders. Future research is needed to define and evaluate methods for applying VP MRI atlases to cleft care. The authors propose two potential applications as a theoretical framework for future investigations.
Highlight Subject-Level Differences
VP MRI atlases could be used to compare a patient's anatomy and physiology to that of a normalized population. To do so, the patient's anatomy is aligned to the atlas and then the spatial and temporal transforms are inverted to place the atlas in the individual subject's space. The data output would be a video showing the patient's speech movements overlaid onto a normalized atlas. In places where the patient's anatomy and/or physiology differ from the atlas, the region on the patient (e.g., the tongue, velum, adenoid, etc.) could be highlighted in a color to clarify the precise area of difference.
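As a rough illustration of the highlighting idea, the sketch below computes a per-pixel absolute difference between a patient frame and the corresponding atlas frame and flags pixels above a threshold. It assumes that both frames are already in the same spatial and temporal space. The function name, the use of a simple absolute intensity difference, and the threshold value are assumptions made for illustration; how a clinically meaningful difference should be defined is an open question.

```python
import numpy as np


def highlight_differences(patient, atlas, threshold):
    """Return the absolute difference map and a boolean highlight mask."""
    patient = np.asarray(patient, dtype=float)
    atlas = np.asarray(atlas, dtype=float)
    difference = np.abs(patient - atlas)
    return difference, difference > threshold


# Toy 4 x 4 frames that differ at a single pixel.
atlas_frame = np.zeros((4, 4))
patient_frame = atlas_frame.copy()
patient_frame[1, 2] = 0.8

diff_map, highlight = highlight_differences(patient_frame, atlas_frame, threshold=0.5)
print(np.argwhere(highlight))
```

The mask could then be rendered as a color overlay on each frame of the patient's video, and the difference map as a heat map.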
Label and Track Anatomical Structures
VP atlases could also be used to automate measurements of key anatomic structures in a dynamic dataset. The atlas can be used to identify VP structures such as the velum and palatal muscles (levator veli palatini, palatoglossus, tensor veli palatini), create labels on the patient's dynamic data (e.g., during the production of a speech stimulus), and clearly demonstrate and track structural deformations and timing variables during speech production. The output of this approach would be labeled (via different color fields) structures that can be tracked throughout the dynamic speech event. The labeling function can be beneficial when educating families or those not familiar with key VP structures on any anatomical or physiological differences that may exist. Figure 2 and Supplemental Video 3 demonstrate in the midsagittal view how the first frame of the atlas is used to segment the key structures. Once a manual segmentation is performed in 3D over all the slices of the volume at the first frame, the labels can be automatically propagated by deformation to the volumes in all successive frames to find the corresponding structure positions. Specifically, a deformable registration method such as diffeomorphic demons 45 is performed between each frame and the first frame, obtaining the deformation field connecting the structures between the two frames. Each deformation field is then applied to the manually segmented labels in the first frame to automatically yield the deformed labels in the successive frame.
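The label propagation step can be sketched in 2D as follows. The deformation field is assumed to have been computed elsewhere (e.g., by a deformable registration between the first frame and a later frame) and is expressed here as a pull-back displacement: for each pixel of the later frame, the offset to its source location in the first frame. The function name, the displacement convention, and the nearest-neighbor sampling are assumptions made for illustration.

```python
import numpy as np


def warp_labels(labels, displacement):
    """Carry first-frame labels into a later frame.

    labels:       (H, W) integer label image from the first frame.
    displacement: (2, H, W) row and column offsets; output pixel (y, x)
                  takes the label found at (y + dy, x + dx) in `labels`.
    Nearest-neighbor lookup keeps the label values discrete.
    """
    h, w = labels.shape
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(yy + displacement[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xx + displacement[1]).astype(int), 0, w - 1)
    return labels[src_y, src_x]


# Toy first-frame segmentation: one structure labeled 1.
first_frame_labels = np.zeros((8, 8), dtype=int)
first_frame_labels[2:4, 3:5] = 1

# Hypothetical field: every pixel looks two rows up for its source,
# so the structure appears two rows lower in the later frame.
field = np.zeros((2, 8, 8))
field[0] = -2.0

later_frame_labels = warp_labels(first_frame_labels, field)
print(later_frame_labels[4:6, 3:5])
```

Applying one such field per frame to the same first-frame segmentation yields a labeled sequence without further manual tracing.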

Figure 2. Demonstration of the segmentation of variable VP structures using the first frame of the atlas.
Labeling and tracking of anatomic structures may be particularly important in understanding how surgery in cleft palate alters the anatomy and physiology of key VP structures within the mechanism. For example, atlasing could be used to compare the pre- and post-surgery MRI data of patients to normalized atlases to visualize how a surgical method used to treat velopharyngeal inadequacy (VPI) alters the anatomy and physiology to approximate (or deviate from) normal VP physiology during speech.
Conclusion
This study demonstrated the successful creation of an innovative VP atlasing tool using dynamic VP MRI techniques. Dynamic MRI was captured at 39.67 fps with high resolution using unique imaging protocols. Protocol use resulted in a dynamic VP MR data set that has remarkable temporal resolution and has laid the foundation for MR methods that capture high-speed dynamic images during conversational-like speech. Dynamic MR data was also used to create a spatiotemporal VP atlasing tool, potentially changing the way dynamic MR data is analyzed and interpreted. Future research is needed to investigate the application of this tool in examining the variations among speakers with abnormal speech, such as that among children with repaired cleft palate and persistent VPI.
Supplemental Material
Footnotes
Declaration of Conflict of Interest & Funding Disclosure
Institutional review board (IRB) approval was obtained, and a statement thereof is included in the Methods section. The authors declare that there is no conflict of interest. Research reported in this publication was supported by the National Institute of Dental & Craniofacial Research of the National Institutes of Health under Award Numbers R01DE027989 and 3R01DE027989-04S1. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Institute of Dental and Craniofacial Research, (grant number R01DE027989 and 3R01DE027989-04S1).
Supplemental Material
Supplemental material for this article is available online.
References
