Abstract
Background:
Physician review websites influence a patient’s selection of a provider. Written reviews are subjective and difficult to analyze quantitatively. Sentiment analysis can score the text of surgeon reviews and so provide actionable feedback for surgeons seeking to improve their practice. The objective of this study is to quantitatively analyze a large subset of written reviews of hand surgeons using sentiment analysis and to report unbiased trends in the words used to describe the reviewed surgeons, as well as biases associated with surgeon demographic factors.
Methods:
Online written and star-rating reviews of hand surgeons were obtained from healthgrades.com and webmd.com. A sentiment analysis package was used to calculate compound scores of all reviews. Mann-Whitney U tests were performed to determine the relationship between demographic variables and average sentiment score of written reviews. Positive and negative word and word-pair frequency analysis was also performed.
Results:
A total of 7638 reviews of 786 hand surgeons were analyzed. Analysis showed a significant relationship between the sentiment scores and overall average star-rated reviews (r2 = 0.604, P ≤ .01). There was no significant difference in review sentiment by provider sex; however, surgeons aged 50 years and younger had more positive reviews than older surgeons (P < .01). The bigrams most frequently used to describe top-rated surgeons were associated with good bedside manner and efficient pain management, whereas surgeons with the worst reviews were often characterized as rude and unable to relieve pain.
Conclusions:
This study provides insight into both demographic and behavioral factors contributing to positive reviews and reinforces the importance of pain expectation management.
Introduction
In recent decades, the internet has become a primary source of information for patients seeking out medical advice and providers. In particular, physician review websites (eg, healthgrades.com) are increasing in popularity and have been shown to have a significant influence on a patient’s provider selection.1,2 Expectedly, negative reviews on these Web sites can impact both a physician’s reputation and a prospective patient’s choice of a provider. 3 As the trend in internet usage continues, it is imperative that we understand the content reported on these review Web sites, especially as online ratings are not always associated with accepted patient satisfaction scores (ie, Press Ganey Medical Practice Surveys). 4 By understanding the aspects of health care that patients are stressing online, we can better address potential disconnects between patients and providers.
Physician review websites are used throughout orthopedics. Bakhsh and Mesfin 5 demonstrated that surgeon knowledge, proper bedside manner, timeliness, and scheduling simplicity were associated with overall higher ratings. Multiple reports confirmed these relationships while further indicating that staff kindness and surgeon positivity were associated with positive reviews.6,7 Other studies have indicated that the number of online reviews, surgeon sex and age, and surgical outcomes were key drivers of online ratings.8,9 However, this variation in which characteristics are associated with rating positivity may reflect the different subspecialties within orthopedics, and for this reason, subspecialty analysis of physician review websites is essential. Although there is a growing body of work on physician review websites in orthopedics, there is still a paucity of literature analyzing online hand surgeon reviews. In 1 study, Kirkpatrick et al 10 showed that physician age was negatively associated with hand surgeon ratings, whereas staff kindness, bedside manner, and responsiveness were positively associated with them. In a similar hand-based study, Trehan et al 11 did not see a relationship between surgeon age and review type but found a positive association between patient-perceived surgeon competence and positive reviews. This disagreement in the limited available literature stresses the importance of additional research within the field of hand surgery.
As comments on patient review Web sites can alter patient perceptions of a surgeon, it is important to fully understand what information is circulating on the internet. This point becomes increasingly important when we consider that nonphysician-related facets of health care, such as office environments, can drive down a physician’s online rating. 12 Modern approaches to parsing language, such as the text of reviews, now make it possible to analyze large data sets that previously could be examined only qualitatively and on a much smaller scale. In this study, we use machine learning to quantitatively analyze publicly available physician reviews in a high-throughput fashion to better characterize patients’ experiences. Sentiment analysis through natural language processing is a form of machine learning that quantitatively assesses text and can provide unbiased, actionable feedback from patient reviews for surgeons looking to improve their practice. The objective of this study is to analyze a large subset of online written reviews of hand surgeons, using sentiment analysis, to provide unbiased feedback and inferences for further study.
Methods
Data Acquisition
Online written reviews and star-rating reviews of hand surgeons were obtained from healthgrades.com and webmd.com. These 2 Web sites were chosen because they were among the first suggested in a general search for providers and because large amounts of data could be Web scraped from them without restriction. For the remainder of this article, star-rating reviews refer to the ratings out of 5 stars given to surgeons on these Web sites for various categories. These data are publicly available, and the star-rating reviews provide an overall average star rating for each surgeon as well as ratings for individual categories (explains conditions well, answers questions, trustworthiness, etc). Inclusion criteria were surgeons listed under “The Physician Payments Sunshine Act” 13 as “Hand Surgeons.” This list of surgeons was then cross-checked against the online review Web sites to confirm that they were also listed as hand surgeons within their online profiles. Exclusion criteria were surgeons who had no online ratings or fewer than 7 written reviews.
Sentiment Analysis
The Valence Aware Dictionary and Sentiment Reasoner (VADER) is a widely used and accepted Python package for obtaining compound sentiment analysis scores of written text. 14 The package takes written prose and assigns value scores to the “sentiment” of each sentence; that is, it ranks how positive or negative a sentence is by analyzing the words used and their connotations. The VADER was used to obtain a score for each written review of every surgeon. The VADER package is available through the Natural Language Toolkit (NLTK) library. It takes sentences as inputs and outputs a compound score based on the positivity or negativity of specific words, with equations that account for punctuation, capitalization, and modifiers.
Valence Aware Dictionary and Sentiment Reasoner Score Calculation
The VADER relies on a dictionary of specific and common words that carry inherent positive or negative qualities. The dictionary was developed by having 10 independent human raters, who were trained and quality checked for inter-rater reliability, assign each word a score ranging from −4 to +4, with 0 representing a neutral sentiment. 14 The VADER then scans the inputted sentences for these words, sums their scores, and normalizes the sum to between −1 and +1, where −1 indicates the most negative sentiment and +1 indicates the most positive sentiment. In addition, scores are adjusted for specific punctuation marks, such as exclamation marks, and for capitalized words, which are often used for emphasis in reviews: “He is the WORST.”
The VADER’s calculation also factors in potential modifiers to words. An empirically derived adjustment, positive for an emphasizing adverb and negative for a negating adverb, is factored into the calculation. This means that a phrase such as “very helpful” is given a higher score than “helpful” alone. Negation is also factored in when it precedes a word in the VADER dictionary by reversing the sign of that word’s rating, so that normally positive words are scored as negative. This allows a phrase such as “not helpful” to be scored negatively because the context around the word has changed.
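The scoring steps described above can be illustrated with a deliberately simplified sketch. It is not the VADER package itself: the four-word lexicon and its valence values are invented for illustration, and the intensifier increment (0.293), negation scalar (−0.74), and normalization constant (15) are our understanding of the open-source implementation and should be treated as assumptions. Punctuation and capitalization emphasis are not modeled, and only the word immediately before a scored word is inspected.

```python
import math

# Hypothetical mini-lexicon; real VADER valences come from human raters.
TOY_LEXICON = {"helpful": 1.8, "caring": 2.2, "rude": -2.0, "worst": -3.1}
BOOSTERS = {"very": 0.293, "extremely": 0.293}  # assumed intensifier increment
NEGATIONS = {"not", "never", "no"}
NEGATION_SCALAR = -0.74  # assumed: flips the sign and slightly dampens
ALPHA = 15               # assumed normalization constant


def normalize(raw_sum, alpha=ALPHA):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)


def toy_compound(text):
    """Sum adjusted word valences and normalize them to a compound score."""
    tokens = text.lower().replace(".", " ").replace("!", " ").split()
    total = 0.0
    for i, token in enumerate(tokens):
        if token not in TOY_LEXICON:
            continue
        valence = TOY_LEXICON[token]
        previous = tokens[i - 1] if i > 0 else ""
        if previous in BOOSTERS:
            # Push the valence further from zero, in its own direction.
            valence += math.copysign(BOOSTERS[previous], valence)
        if previous in NEGATIONS:
            valence *= NEGATION_SCALAR
        total += valence
    return normalize(total)


if __name__ == "__main__":
    for phrase in ["helpful", "very helpful", "not helpful", "He is the WORST."]:
        print(phrase, round(toy_compound(phrase), 3))
```

In this sketch, “very helpful” scores above “helpful,” and “not helpful” scores below zero, which mirrors the behavior described in the text. For actual analysis, the packaged analyzer should be used rather than this toy.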
Model Validation
Linear regression analysis was performed comparing each surgeon’s average sentiment analysis score with their average star score to assess the relationship between the scores calculated in this study and the online ratings. A linear regression was used because, if the calculated sentiment analysis scores accurately reflected the reviews left for a provider, they would be expected to have a linear relationship with the star scores reported online.
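As a minimal sketch of this validation step, the least-squares line and the Pearson correlation coefficient can be computed from paired per-surgeon averages as follows. The function name `linear_fit` and the example values are hypothetical, and the software actually used for the regression in this study is not specified here.

```python
import math


def linear_fit(x, y):
    """Return (slope, intercept, pearson_r) for the least-squares line y = a*x + b."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


if __name__ == "__main__":
    # Hypothetical per-surgeon averages: sentiment score vs star rating.
    sentiment = [0.20, 0.50, 0.60, 0.90]
    stars = [2.5, 3.8, 4.1, 4.9]
    slope, intercept, r = linear_fit(sentiment, stars)
    print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}, r^2={r * r:.3f}")
```

Note that the correlation coefficient r and the coefficient of determination r² are different quantities, so a reported value should state which one it is.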
Data Analysis
Mann-Whitney U tests were performed to determine the relationship between demographic variables (age, sex) and average sentiment score of written reviews. For the age analysis, the ages were dichotomized to above and below 50 years old. This cutoff was used as 50 years allowed for the most even distribution of providers between the 2 groups.
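The group comparison can be sketched as below. This is an illustrative implementation, not the statistical software used in this study: the P value uses a normal approximation without continuity or tie correction, the helper names are hypothetical, and assigning surgeons aged exactly 50 years to the older group is an assumption.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b, two-sided P) using a normal approximation."""
    n1, n2 = len(group_a), len(group_b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u_b = n1 * n2 - u_a
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (min(u_a, u_b) - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_a, u_b, p_value


def split_by_age(records, cutoff=50):
    """Split (age, score) pairs into scores for younger and older surgeons."""
    younger = [score for age, score in records if age < cutoff]
    older = [score for age, score in records if age >= cutoff]
    return younger, older


if __name__ == "__main__":
    # Hypothetical (age, average sentiment score) pairs.
    records = [(38, 0.71), (44, 0.66), (47, 0.62), (55, 0.58), (61, 0.60), (67, 0.49)]
    younger, older = split_by_age(records)
    print(mann_whitney_u(younger, older))
```

With small samples or many ties, an exact test or a tie-corrected variance from an established statistics library would be preferable to this approximation.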
Positive and negative word frequency analysis was also performed to provide context for the words used to describe surgeons. Word frequencies were calculated separately for the best reviews, defined as all reviews with a score greater than 0.75, and the worst reviews, defined as all reviews with a negative score. In addition, to provide greater context for these words, the most frequently used word pairs, or bigrams, were also calculated.
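A minimal sketch of this frequency analysis is given below, using only the Python standard library. The tokenizer, the short stop-word list, and the choice to form bigrams from unfiltered tokens (so that pairs such as “no pain” survive) are assumptions made for illustration; the study’s own preprocessing with NLTK may differ.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list used only for single-word counts.
STOPWORDS = {"the", "a", "and", "is", "was", "i", "my", "to", "of", "he", "she", "in", "me"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def split_best_worst(scored_reviews):
    """Select best (score > 0.75) and worst (score < 0) review texts."""
    best = [text for text, score in scored_reviews if score > 0.75]
    worst = [text for text, score in scored_reviews if score < 0]
    return best, worst


def word_and_bigram_counts(reviews):
    """Count single words (stop words removed) and adjacent word pairs."""
    words, bigrams = Counter(), Counter()
    for text in reviews:
        tokens = tokenize(text)
        words.update(t for t in tokens if t not in STOPWORDS)
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


if __name__ == "__main__":
    # Hypothetical (review text, compound score) pairs.
    scored = [
        ("No pain after surgery, no pain at all", 0.90),
        ("He was rude and I am still in pain", -0.60),
        ("Fine visit", 0.30),
    ]
    best, worst = split_best_worst(scored)
    best_words, best_bigrams = word_and_bigram_counts(best)
    worst_words, worst_bigrams = word_and_bigram_counts(worst)
    print(best_bigrams.most_common(3))
    print(worst_words.most_common(3))
```

Reviews scoring between 0 and 0.75 fall into neither group in this sketch, which follows the thresholds stated above.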
Finally, a multiple logistic regression was performed on key words and phrases to estimate the odds that selected high-frequency, clinically relevant words and bigrams would appear in a review with an overall sentiment analysis score greater than 0.5. A score greater than 0.5 was taken to indicate a largely positive review.
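To make this step concrete, the sketch below fits a logistic regression by plain gradient descent on binary keyword indicators and converts each coefficient to an odds ratio by exponentiation. It is an illustration under stated assumptions rather than the analysis pipeline of this study: the function names, learning rate, and number of iterations are hypothetical, only single-word keywords are matched, and no confidence intervals or P values are produced.

```python
import math
import re


def featurize(text, keywords):
    """Return 1.0 or 0.0 per keyword, depending on whether it occurs as a word."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return [1.0 if keyword in tokens else 0.0 for keyword in keywords]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, learning_rate=0.5, epochs=5000):
    """Fit intercept and weights by batch gradient descent on the log-loss."""
    n_rows, n_features = len(rows), len(rows[0])
    intercept = 0.0
    weights = [0.0] * n_features
    for _ in range(epochs):
        grad_intercept = 0.0
        grad_weights = [0.0] * n_features
        for row, label in zip(rows, labels):
            linear = intercept + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(linear) - label
            grad_intercept += error
            for j, x in enumerate(row):
                grad_weights[j] += error * x
        intercept -= learning_rate * grad_intercept / n_rows
        weights = [w - learning_rate * g / n_rows for w, g in zip(weights, grad_weights)]
    return intercept, weights


def odds_ratios(weights, keywords):
    """Exponentiate each coefficient to obtain the keyword's odds ratio."""
    return {keyword: math.exp(w) for keyword, w in zip(keywords, weights)}


if __name__ == "__main__":
    keywords = ["listens", "pain"]  # hypothetical keyword selection
    # Hypothetical (review text, compound score) pairs.
    reviews = [
        ("She listens carefully", 0.80), ("He listens", 0.70),
        ("Listens but rushed", 0.20), ("Still in pain", -0.40),
        ("Pain is worse", -0.60), ("My pain is gone", 0.90),
        ("Average visit", 0.60), ("Long wait", 0.10),
    ]
    rows = [featurize(text, keywords) for text, _ in reviews]
    labels = [1.0 if score > 0.5 else 0.0 for _, score in reviews]
    _, weights = fit_logistic(rows, labels)
    print(odds_ratios(weights, keywords))
```

An odds ratio above 1 indicates that a keyword is associated with greater odds of a review scoring above 0.5, and a value below 1 indicates lower odds. If a keyword appears only in positive or only in negative reviews, the estimate does not converge, which is one reason a dedicated statistics package is preferable in practice.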
Results
Following application of the inclusion and exclusion criteria, 7638 reviews of 786 hand surgeons were analyzed.
Surgeon Demographics
All the demographic characteristics of the surgeons were also extracted from healthgrades.com. The physician’s age and sex identity were pulled directly from what was reported online (Table 1).
Demographic Data on Hand Surgeons Analyzed. a
Some physicians did not have their sex or age listed and, as such, were not included in the respective analyses.
Model Validation: Linear Regression
The linear regression analysis of average sentiment analysis scores to average star scores showed a statistically significant relationship between the 2 scores (Figure 1, Pearson correlation coefficient = 0.604, P < .01), indicating good concordance between sentiment scores and reported overall star reviews.

Linear regression analysis of average online reported star score compared with calculated sentiment analysis score.
Model Validation and Demographic Analysis: Mann-Whitney U Tests
Sentiment analysis scores did not differ significantly by sex (men: median = 0.607, range = 1.5; women: median = 0.585, range = 0.96; P = .20). Average star scores also did not differ significantly by sex (men: median = 4.3, range = 3.5; women: median = 4.2, range = 3.0; P = .09). These results are summarized in Table 2.
Student T Test Comparing Star and Written Reviews With Sex and Age.
Older surgeons had significantly lower sentiment analysis scores than younger surgeons (<50: median = 0.623, range = 1.25; >50: median = 0.591, range = 1.46; P <
Word Frequency Analysis
Frequencies of the most used words recognized by NLTK are also reported. The most frequent meaningful words used to describe top-rated surgeons relate to care, compassion, and comfort, whereas surgeons with the worst reviews are often characterized as rude, arrogant, and unable to relieve the pain of their patients. Words that were high frequency but not clinically or behaviorally relevant were removed to focus on characteristics that help determine what factors affect patient reviews. For example, words such as “great” and “horrible” were removed because, although they describe the patient’s general experience with a physician, they do not aid our analysis of which behavioral or practice characteristics are associated with these reviews.
For the most positively reviewed surgeons, single-word descriptors mainly focused on qualitative and behavioral attributes, that is, “caring,” “friendly,” “compassionate,” and “comfortable” (Table 3). For the most negatively reviewed surgeons, descriptors centered on levels of pain and inefficient pain management. Among the reviews used in this analysis (those with a sentiment analysis score of less than 0), “pain” was used 537 times; the next most frequent relevant word, “unprofessional,” was used 56 times.
Clinically Relevant Single Word Frequency Analysis of Best and Worst Reviews.
Bigram Frequency Analysis
The most frequently used bigrams in these top-rated and worst-rated reviews were also calculated (Table 4). In the bigram analysis of the most positive reviews, the most frequently used clinically or behaviorally relevant 2-word sequence was “no pain,” indicating the importance of pain management to patients. In the bigram analysis of the most negatively reviewed surgeons, 4 of the top 5 clinically relevant, highest frequency bigrams concerned pain or descriptors of pain.
Clinically Relevant Bigram Frequency Analysis of Best and Worst Reviews.
Multiple Logistic Regression
Finally, a multiple logistic regression was performed on clinically relevant keywords. The results showed that words describing positive surgeon behaviors, such as “listens,” “knowledgeable,” and “confident,” were positively associated with reviews that had positive sentiment scores. The more positive behaviors a surgeon exemplifies, the more likely she or he is to receive a better review. The behaviors with the greatest statistically significant odds ratios were “confident,” “warm,” and “listens,” with odds ratios of 15.5, 4.33, and 2.12, respectively, indicating that reviews including these words had approximately 15×, 4×, and 2× the odds of receiving an overall positive score (Table 5). The word “knowledgeable” also had an odds ratio of 2.03, indicating that reviews describing a surgeon as knowledgeable had about twice the odds of being positive.
Multiple Logistic Regression Analysis on Clinically Relevant Keywords.
Note. CI = confidence interval; OR = odds ratio.
Finally, the results of this analysis highlight the impact that pain can have on a surgeon’s patient reviews. Inclusion of the words “pain” and/or “severe pain” was significantly associated with decreased odds of receiving a positive review (odds ratios of 0.445 and 0.343, respectively), whereas inclusion of “pain free” and “relief” conferred approximately 3 and 2 times greater odds, respectively, that a surgeon received a positive review.
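Because an odds ratio multiplies odds rather than probabilities, its practical meaning depends on the baseline rate of positive reviews. The short sketch below converts an odds ratio into a probability for a given baseline; the function name and the baseline of 0.60 are hypothetical and are not values reported in this study.

```python
def probability_from_odds_ratio(baseline_probability, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    if not 0.0 < baseline_probability < 1.0:
        raise ValueError("baseline_probability must lie strictly between 0 and 1")
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


if __name__ == "__main__":
    baseline = 0.60  # hypothetical share of reviews scoring above 0.5
    for keyword, ratio in [("confident", 15.5), ("listens", 2.12), ("pain", 0.445)]:
        print(keyword, round(probability_from_odds_ratio(baseline, ratio), 3))
```

Under this assumed baseline, an odds ratio of 15.5 raises the probability of a positive review from 0.60 to about 0.96, which is far less than a 15-fold increase in probability.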
Discussion
In this study, we used sentiment analysis to quantitatively assess a large number of publicly available written reviews of hand surgeons and to glean unbiased inferences about the behaviors and major factors contributing to their online presence. We found a statistically significant difference in the average sentiment analysis scores of the reviews based on provider age. Furthermore, we determined which words and phrases are most frequently used in the descriptions of positively and negatively reviewed surgeons. In doing so, we identified factors on which physicians can focus to strengthen their online presence.
Previous research has analyzed the influence of sex and age on physician review websites for orthopedic surgeons. Our study shows no significant association between the sex of hand surgeons and review positivity. Consistent with this, Kirkpatrick et al 10 and Trehan et al 11 found that sex differences were not associated with hand surgeon scores; these findings are also supported across different orthopedic subspecialties.9,5,15 However, in a study of physician review websites for sports medicine surgeons, female surgeons were significantly more likely to have positive ratings. 8 With respect to age, in analyses of orthopedic physician review websites as a whole, multiple studies found that surgeon age was negatively associated with positive reviews.8,9,16,17 Similarly, Kirkpatrick et al 10 showed that age was negatively associated with review positivity in their analysis of 433 hand surgeons. Our study agrees with the aforementioned literature, as we show that hand surgeon age was associated with a significant difference in overall scores, indicating that older surgeons are more likely to receive negative online feedback. However, the median difference of 0.03 in sentiment analysis scores and the 0.2-star difference may not be clinically significant, as patients would be unlikely to discern this level of detail by reading the reviews or looking at the star scores. Furthermore, in an analysis of 245 hand surgeons, Trehan et al 11 found no relationship between review positivity and age; Garofolo et al 18 showed similar results. Given the variation in the current literature for hand surgeons, our research helps to discern the trends seen across multiple physician review websites; however, we recognize that more research is needed to draw definitive conclusions.
Multiple physician personal characteristics have also been shown in the literature to contribute positively to online ratings. In a study by Kirkpatrick et al, 10 bedside manner, listening, and spending time with patients were all associated with significantly higher scores on physician review websites for hand surgeons. Similarly, it was shown that hand surgeons who display competence and proper communication were more likely to receive positive reviews online 11 ; Bakhsh and Mesfin 5 supported these notions for general orthopedic surgeons. Our study agrees with the current literature, as we saw that patients who perceive hand surgeons as warm, confident, knowledgeable, and attentive are likely to write positive comments about their surgeons online. Thus, our work and the aforementioned literature should encourage hand surgeons to continuously emphasize these interpersonal characteristics in their practice.
Furthermore, recent literature has shown that a substantial proportion of comments left on physician review websites pertain to nonphysician-related aspects of health care. Burns et al 12 even stressed that 48% and 24% of questions across 14 review Web sites were related to a combination of the physician and the health care setting or to the office setting alone, respectively. Thus, it is not surprising that a substantial share of the comments left on physician review websites relate to ancillary characteristics of a physician’s practice rather than to the surgeons themselves. Related to this, Kirkpatrick et al 10 showed that positive comments for hand surgeons were significantly related to staff courteousness. Additional studies in the orthopedic literature have shown that long wait times, poor office environment, and scheduling difficulty are associated with negative online reviews.11,15 However, other reports showed that office wait times were not always related to more negative reviews for hand surgeons in particular. 10 Although we likewise did not see this relationship in our study when we considered wait times or staff friendliness, we believe that it is an important consideration for surgeons to address, especially as a single negative review can deter potential patients.
Our bigram and word frequency analysis shows that 1 of the clear factors driving negative reviews of hand surgeons is pain and pain management. This is in contrast to the findings of Orhurhu et al in their article evaluating online patient reviews of chronic pain practices. They found that, even in a cohort of physicians focused on chronic pain management, the characteristics contributing most to negative reviews were mainly administrative. 19 However, this discrepancy is likely because they evaluated only 331 negative reviews, as their analysis was conducted manually by 2 human raters. This reinforces the strength of this study’s methods, which produce results through an unbiased approach and can analyze significantly more reviews. Moreover, this finding reinforces the need for physicians to establish proper pain expectations with their patients before any operations or interventions. Jerant et al 20 established that patients enter visits with high expectations for their pain control. Jerant et al also showed that clinician denial of patient requests for pain medication, even when the request was inappropriate, led to significant patient dissatisfaction. Strøm et al 21 indicated that thoroughly informing patients is a vital part of disease management, as the more informed patients are, the better their ability to cope after their visits and their overall quality of life. As such, the results of this study reinforce the importance of informing patients about pain and the possibility that it will not be completely resolved by the given interventions. By doing so, physicians are more likely to quell patient anxieties and clarify presumptions before a procedure and to prevent patients from believing that their lasting pain reflects the skill of their provider.
Limitations
This study has several limitations. A higher number of reviews may result in a higher score, and many providers may ask patients to leave reviews or have a marketing team help shape their online presence. As such, there exists the potential for bias within the results, as providers may be selectively posting only positive reviews and attempting to suppress negative ones, thus pushing the review scores into the positive range. However, although this bias exists, as mentioned in the “Introduction” section, the general population sees only what is presented on these Webpages and attempts to make an informed decision about which provider to see based on these reports. Therefore, it is critically important to know the topics influencing the reviews to make proper adjustments. Furthermore, although the difference in average sentiment analysis score by provider age was statistically significant, it may not be large enough to be clinically significant. Patients reading online reviews may or may not be able to discern a difference of such magnitude; however, small differences in ratings may still be perceived as meaningful by patients when selecting a provider. In addition, our study cannot determine whether the pain described in the poor reviews occurred despite the surgeons taking proper steps to implement pain management interventions. Pain is inherently difficult to resolve completely, so persistent pain may arise despite proper and extensive treatment from a provider. This study is thus unable to distinguish between providers who received negative reviews despite proper treatment and those who did not provide pain care.
Overall, it is clear that in both the most positive and the most negative surgeon reviews, pain is a substantial focus, whether because pain was managed effectively or because patients are still in pain. However, this study does not suggest that surgeons should focus solely on pain mitigation or use pain medication inappropriately. Rather, this study emphasizes the importance of a physician’s ability to establish expectations with patients before treatment. This study’s results reinforce the fact that, despite proper bedside manner, pain is the largest contributor to a patient’s experience with a physician. Therefore, properly establishing with patients before any intervention that pain removal is never guaranteed, and that they may still experience pain despite proper treatment, may help alleviate some of the negative outlooks with which patients leave visits. Secondary to that, doctors should continue to self-evaluate their bedside manner and try to incorporate aspects found in this study to improve their patients’ experiences and thus their perceptions of overall care.
Footnotes
Ethical Approval
This study was approved by our institutional review board.
Statement of Human and Animal Rights
This article does not contain any studies with human or animal subjects.
Statement of Informed Consent
Informed consent was obtained when necessary.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: S.K.C. is the board or committee member of the American Academy of Orthopaedic Surgeons; American Orthopaedic Association; AOSpine North America; Cervical Spine Research Society; North American Spine Society; and Scoliosis Research Society; paid consultant of CGBio; received IP royalties and other financial or material support from Globus Medical and paid consultant of Globus Medical; is the paid consultant of and received research support from Zimmer; J.S.K.: Aldentyfy Inc: Stock or Stock Options. The other authors have no conflicts of interest or sources of support that require acknowledgment.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
