Response Inconsistency of Patient-Reported Symptoms as a Predictor of Discrepancy Between Patient and Clinician-Reported Depression Severity

Abstract

The aim of this study was to assess the extent to which discrepancies between self-reported and clinician-rated severity of depression are due to inconsistent self-reports. Response inconsistency threatens the validity of the test score. We used data from a large sample of outpatients (N = 5,959) who completed the self-report Beck Depression Inventory–II (BDI-II) and the clinician-rated Montgomery–Åsberg Depression Rating Scale (MADRS). We used item response theory-based person-fit analysis to quantify the inconsistency of the self-report item scores. Inconsistency was weakly positively related to patient–clinician discrepancy (i.e., higher BDI-II scores relative to MADRS scores). The mediating effect of response inconsistency in the relationship between discrepancy and demographic (e.g., ethnic origin) and clinical variables (e.g., cognitive problems) was negligible. The small direct and mediating effects of response inconsistency suggest that inaccurate patient self-reports are not a major cause of patient–clinician discrepancy in outpatient samples. Future research should investigate the role of clinician biases in explaining clinician–patient discrepancy.

Keywords

Beck Depression Inventory–II, clinician-rated depression, Montgomery–Åsberg Depression Rating Scale, person-fit analysis, response inconsistency, self-report and clinician-report discrepancy, self-report depression

Many instruments exist to measure the severity of depressive symptoms. These instruments can be divided into self-report and clinician-rated measures. Among the frequently used scales are the self-report Beck Depression Inventory–II (BDI-II; Beck, Steer, Ball, & Ranieri, 1996), the clinician-rated Montgomery–Åsberg Depression Rating Scale (MADRS; Montgomery & Åsberg, 1979), and the clinician-rated Hamilton Depression Rating Scale (Hamilton, 1960). Both types of measures are widely used in clinical practice and in research (e.g., DeFife, Drill, Nakash, & Westen, 2010; Möller, 2000). Arguments in favor of self-report measures include time- and cost-efficiency and direct access to the patient’s inner world and past mental state. A key advantage of clinician ratings is the ability to put symptom severity into context relative to that of other patients. The aim of this study was to assess the extent to which discrepancy between self-report and clinician-rated measures relates to the quality of the self-report assessment.

Agreement between self-report and clinician-rated depression severity has been investigated extensively, and reviews reported Pearson correlation coefficients ranging from .1 to .9 (Dunlop et al., 2010; Möller, 1991). Discrepancy seems to be largest at baseline assessment prior to treatment and decrease over time (e.g., Dunlop et al., 2010, Dunlop et al., 2011; Lin, Lu, Wong, & Cheng, 2014). Part of the discrepancies may be due to differences between the content of the self-report and clinician-rated depression measures (e.g., Tsujii et al., 2014). However, some studies that compared measures with matched content revealed only moderate correlations between self-report and clinician ratings (e.g., Carroll, Feinberg, Smouse, Rawson, & Greden, 1981) and differences between matched measures were meaningful in predicting future events. For example, patients with higher self-report scores than clinician-rated scores had lower remission rates than patients for whom self-report and clinician ratings were comparable (Tada et al., 2014).

Much research has investigated whether demographic, clinical, and personality variables explain discrepancy between self-report and clinician-rated depression severity. Consistent with the majority of this research, we use the term overreporting to denote self-report scores that are higher than corresponding clinician ratings and underreporting when self-report scores are lower. Several studies have suggested that older adults (Dorz, Borgherini, Conforti, Scarso, & Magni, 2004; Enns, Larsen, & Cox, 2000; Lyness et al., 1995; Rush et al., 2006) and males (Carter, Frampton, Mulder, Luty, & Joyce, 2010; Shain, Naylor, & Alessi, 1990) are more likely to underreport symptoms than younger adults and females, respectively. Gender role theory may explain men’s underreporting (Carter et al., 2010; Oliver & Toner, 1990) and cohort or maturation effects may explain age effects (Enns et al., 2000). Despite these plausible explanations, a substantial number of studies found no age or gender effect (e.g., Dorz et al., 2004; Duberstein & Heisel, 2007; Kim et al., 2013; Tsujii et al., 2014). Other studies found that education level did not affect overreporting or underreporting (e.g., Duberstein & Heisel, 2007; Kim et al., 2013; Tsujii et al., 2014), and that White and African Americans were more likely to overreport symptoms than Hispanic patients (Rush et al., 2006).

Previous studies have also found that symptoms of psychopathology and personality traits relate to discrepancy. Patients high on neuroticism and hostility (Duberstein & Heisel, 2007; Enns et al., 2000) and rumination (Carter et al., 2010) tended to overreport and patients low on these traits tended to underreport. Openness (Duberstein & Heisel, 2007) and self-esteem (Domken, Scott, & Kelly, 1994) have been found to have similar linear but negative effects on overreporting. Findings on the effects of melancholic depression (Carter et al., 2000; Enns et al., 2000) and psychoticism (Corruble, Legrand, Zvenigorowski, Duret, & Guelfi, 1999; Rush et al., 2006) are inconsistent. Other studies have tested effects on either overreporting or underreporting of symptoms, not assuming a linear relationship between discrepancy and explanatory variables. These studies have found relative overreporting of symptoms for patients higher on hopelessness and suicide attempts (Tsujii et al., 2014), higher anxiety and phobic anxiety (Corruble et al., 1999; Kim et al., 2013), and higher impulsivity but lower conscientiousness (Kim et al., 2013). None of these variables showed inverse effects on underreporting of symptoms.

Despite the large body of research investigating which demographic, personality, and clinical variables relate to discrepancy, the processes or biases that may underlie discrepancy are still unknown. Various rater biases may compromise the quality of clinician ratings (Hasemann, 1971; Möller, 2000, 2009), such as reliance on a general or initial impression of the patient or preconceptions of symptom combinations. Moreover, the quality of clinician ratings depends on the patients’ motivation and capability to communicate (Meyer et al., 2001) and on the clinician’s experience with the scale (Paykel & Norton, 1986). However, self-report measures also have a range of potential problems. These include lack of insight into illness (Tada et al., 2014), different types of response bias (e.g., social desirability response bias, extreme response style; Logan, Claar, & Scharff, 2008; Möller, 2009), cognitive bias toward negative ideations (Tada et al., 2014), defensive or self-presentational biases (DeFife et al., 2010), and low cognitive skills or poor concentration (Cuijpers, Li, Hofmann, & Andersson, 2010; Hamilton, 1976).

Some indirect evidence supports the hypothesis that discrepancy, especially overreporting by patients, may be due to patient biases and low cognitive skills. For example, negative relationships were found between overreporting and intelligence (Kim et al., 2013) and between overreporting and verbal fluency task performance (Tsujii et al., 2014). Factor analysis results suggest that variation in self-report depressive scores mainly reflects a global tendency to complain, rather than an accurate representation of specific depressive symptoms (Möller, 2009).

Patients’ careless responding or concentration problems may also explain discrepancy; this is supported by research indicating a positive relationship between impulsivity and overreporting and a negative relationship between conscientiousness and overreporting (Kim et al., 2013). For example, in the self-report BDI-II, symptoms are rated on a 0 to 3 scale and the instructions are that normal variation of a symptom should be scored “0.” A 1-score on the item “sadness” indicates feeling sad most of the time. Respondents who feel sad regularly but not most of the time may be tempted to score higher than 0, particularly if they are impulsive, unconscientious, or lack concentration.

Patients suffering from such unintended psychological or test-taking factors may produce inaccurate item-score patterns that are inconsistent with any estimated depression trait value. For example, an inconsistent pattern may include endorsement of the extreme item “I feel utterly hopeless” but no endorsement of the milder item “I feel down.” The presence of many such inconsistencies within an item-score pattern suggests misinterpretation of item content, response bias, carelessness, random responding, a different cause, or a combination of causes (e.g., Wardell, Rogers, Simms, Jackson, & Read, 2014). Irrespective of the cause, the total score resulting from inconsistent item scores is not a valid indicator of the measured latent trait (Meijer, Niessen, & Tendeiro, 2016). Response inconsistency as a measure of the quality of the self-report is thus a plausible predictor of discrepancy between self-reports and clinician ratings.

To summarize, previous research indicates that several demographic, clinical, and psychological variables relate to patient–clinician discrepancy but results are inconsistent, particularly with respect to demographic variables. Moreover, the underlying causes of discrepancy are unknown. Various authors (e.g., Cuijpers et al., 2010; Tada et al., 2014) have suggested that poor quality of the patient’s self-report explains discrepancy but the authors did not investigate this suggestion. Also, research into explanatory variables for patient–clinician discrepancy (e.g., Corruble et al., 1999; Dorz et al., 2004; Duberstein & Heisel, 2007; Kim et al., 2013; Tsujii et al., 2014) suffers from limitations. First, previous research used small sample sizes, at most N = 544 (Rush et al., 2006), so that power may have been low. Second, most studies did not control for the self-reported severity of depression, and lack of control may have confounded research results because symptom severity relates to both discrepancy and to demographic, personality, and clinical explanatory variables. Finally, most studies did not use measures with matched item content so that identified explanatory variables may have reflected differences between measured symptom patterns rather than symptom overreporting or underreporting.

Study Aim

Using BDI-II self-reports and MADRS clinician ratings from a large sample of psychiatric outpatients (N = 5,959), we investigated explanations for patient–clinician discrepancy with respect to depression severity. While accounting for several limitations of previous studies, we addressed the following research questions:

Research Question 1: Which demographic and clinical variables explain patient–clinician discrepancy?

Research Question 2: Does response inconsistency of patients’ self-report responses explain patient–clinician discrepancy?

Research Question 3: Does response inconsistency of patients’ self-report responses mediate the effects of demographic and clinical variables on patient–clinician discrepancy?

Research Question 1 involves a replication of previous research. Based on previous research (e.g., Carter et al., 2010; Corruble et al., 1999; Enns et al., 2000; Kim et al., 2013; Lyness et al., 1995; Rush et al., 2006), the following explanatory variables were considered: age, gender, ethnicity, education level, hostility, anxiety, and phobic fear. Based on theoretical expectations (e.g., Hamilton, 1976), the following explanatory variables were also considered: cognitive problems, because they include concentration difficulties; psychoticism, because it may result in patients’ distrust or difficulty revealing their symptoms; and paranoid ideation, because it may lead to biased awareness of one’s own symptoms.

Research Questions 2 and 3 address the relationship of patient–clinician discrepancy with the quality of self-reports. We expected that discrepancy, which can be either underreporting or overreporting, increases with increasing levels of response inconsistencies. Also, we expected response inconsistency to mediate effects of demographic and clinical variables on underreporting and overreporting. For example, patients with a low education level may be more prone to inaccurate responding (e.g., due to misunderstanding of items/instructions) than other patients, causing increased patient–clinician discrepancy for patients with a lower education.

To quantify response inconsistency, we employed item response theory (IRT) based person-fit analysis (Meijer & Sijtsma, 2001). Person-fit analysis offers a well-established method for assessing the validity of individual test scores in the context of psychopathology measurement (Conijn, Emons, De Jong, & Sijtsma, 2015; Wanders, Wardenaar, Penninx, Meijer, & De Jonge, 2015; Wardenaar, Wanders, Roest, Meijer, & De Jonge, 2015).

Method

Participants

Participants were 6,006 adults (37.5% male; age range 17 to 65 years, M = 38.6, SD = 12.3) referred for treatment of a mood, anxiety, and/or somatoform disorder at either the regional mental health provider Rivierduinen or the psychiatric outpatient department of Leiden University Medical Centre. These facilities have implemented routine outcome monitoring (ROM; Van Noorden et al., 2012). The data were collected at intake between 2004 and 2010. After a regular intake procedure carried out by a clinician, in the baseline ROM assessment, a psychiatric nurse, blind to any patient self-ratings and not involved in the treatment, administered a standardized diagnostic interview (Mini-International Neuropsychiatric Interview–Plus; MINI-Plus; Sheehan et al., 1998) and the MADRS. Next, patients completed self-report instruments on touch screen computers. ROM data are (in part) based on ratings by the ROM staff. A specialized staff of 20 trained ROM assistants (psychiatric research nurses and psychologists) participated in the study. These ROM assistants have been extensively trained in the administration of the MINI-Plus interview and the rating scales including the MADRS. During the training, which was provided by ROM assistants with more than 4 years of experience, videotaped interviews with patients were rated, and the ratings were afterward compared and discussed to reach consensus. Reliability of these ratings ranged from .60 to .74 (Cohen’s κ), which denotes moderate to substantial agreement between raters (see De Beurs et al., 2011; McHugh, 2012).

Before 2007, nationality and education level were not recorded, so that nationality was missing for 636 participants (10.7%) and education level was missing for 584 participants (9.8%). For the remaining participants, 28.9% had a non-Dutch ethnic origin, including second- and third-generation immigrants. Education level was 9.3% elementary education or lower, 33.8% low-level high school, 37.3% high-level high school, and 19.6% university or college education. Most prevalent diagnoses (which could be comorbid) included depression (61.7%), anxiety disorders (41.7%), somatoform disorders (13.3%), psychotic disorders (1.8%), eating disorders (1.3%), and substance abuse (0.6%).

Clinical Measures

MADRS

The MADRS (Montgomery & Åsberg, 1979) is a 10-item clinician-rated scale for depressive symptoms. Example items are “apparent sadness” and “inner tension” (Table 1). Clinicians interviewed patients and rated each symptom on a scale from 0 to 6, for example, (0) no sadness to (6) looks miserable all the time. Extremely despondent. Interrater reliability of the MADRS between a general practitioner and a psychiatrist equaled .97 (Montgomery & Åsberg, 1979). In our sample, coefficient alpha equaled .83.

Table 1.

Matched Item Content for MADRS (10 Items) and BDI-II (21 Items) and BDI-II Item Weights.

Items on the MADRS	Matched items from the BDI-II	BDI-II item(s) weight
Reported sadness, apparent sadness	Sadness, crying	1
Inner tension	Agitation, irritability	1/2
Reduced sleep	Changes in sleeping pattern	1
Reduced appetite	Changes in appetite	1
Concentration difficulties	Concentration difficulty	1
Lassitude	Loss of energy, tiredness or fatigue, indecisiveness	1/3
Inability to feel	Loss of pleasure, loss of interest, loss of interest in sex	1/3
Pessimistic thoughts	Pessimism, past failure, guilty feelings, punishment feelings, self-dislike, self-criticalness, worthlessness	1/7
Suicidal thoughts	Suicidal thoughts or wishes	1

Note. MADRS = Montgomery–Åsberg Depression Rating Scale; BDI-II = Beck Depression Inventory–II.

BDI-II

The BDI-II (Beck et al., 1996) is a 21-item self-rated questionnaire for depression containing two subscales: affective (8 items) and somatic (13 items). Example items are “pessimism” (affective subscale) and “changes in sleep pattern” (somatic subscale; Table 1). Participants rated how they felt during the past 2 weeks on a scale from 0 to 3, for example, (0) I do not feel sad to (3) I am so sad or unhappy that I can’t stand it. For the Dutch BDI-II, the total-score coefficient alpha and test–retest reliability equaled .92 and .82, respectively (Van der Does, 2002). In our sample, coefficient alpha equaled .91.

To increase conceptual comparability between the clinician ratings and self-reports, total scores on the BDI-II were recalculated based on matched item content with the MADRS (see Table 1). Specifically, in computing the BDI-II total score, BDI-II items were reciprocally weighted by the number of BDI-II items that matched a particular MADRS item. For example, one BDI-II item (“suicidal thoughts or wishes”) matched the MADRS item “suicidal thoughts”; hence, this BDI-II item received a weight equal to 1. Three BDI-II items (“loss of pleasure”, “loss of interest”, “loss of interest in sex”) matched the MADRS item “inability to feel”; hence, each of these three BDI-II items received weight 1/3.
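As an illustration only, the reciprocal weighting described above can be sketched in Python. The weights are copied from Table 1; the short item labels and the function name `weighted_bdi_total` are hypothetical and are not official BDI-II variable names.

```python
# BDI-II item weights copied from Table 1.
# The item labels are illustrative only, not official BDI-II variable names.
BDI_WEIGHTS = {
    "sadness": 1.0, "crying": 1.0,
    "agitation": 1 / 2, "irritability": 1 / 2,
    "changes_in_sleeping_pattern": 1.0,
    "changes_in_appetite": 1.0,
    "concentration_difficulty": 1.0,
    "loss_of_energy": 1 / 3, "tiredness_or_fatigue": 1 / 3, "indecisiveness": 1 / 3,
    "loss_of_pleasure": 1 / 3, "loss_of_interest": 1 / 3,
    "loss_of_interest_in_sex": 1 / 3,
    "pessimism": 1 / 7, "past_failure": 1 / 7, "guilty_feelings": 1 / 7,
    "punishment_feelings": 1 / 7, "self_dislike": 1 / 7,
    "self_criticalness": 1 / 7, "worthlessness": 1 / 7,
    "suicidal_thoughts_or_wishes": 1.0,
}


def weighted_bdi_total(item_scores):
    """Return the weighted BDI-II total for one patient.

    item_scores maps each item label to its 0-3 score.
    """
    missing = set(BDI_WEIGHTS) - set(item_scores)
    if missing:
        raise ValueError(f"missing item scores: {sorted(missing)}")
    return sum(BDI_WEIGHTS[item] * item_scores[item] for item in BDI_WEIGHTS)


# Example: a patient who scores 3 on every item.
all_max = {item: 3 for item in BDI_WEIGHTS}
max_total = weighted_bdi_total(all_max)
```

Because the weights within each matched MADRS content area sum to the number of MADRS items in that area, every content area contributes equally to the weighted total, regardless of how many BDI-II items cover it.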

Brief Symptom Inventory (BSI)

The BSI (De Beurs, 2004; Derogatis & Melisaratos, 1983) is a 53-item self-rated questionnaire designed to assess general psychological distress. The BSI is a shortened version of the SCL-90-R (Derogatis, 1977) and 49 of the 53 items are divided into nine subscales containing 4 to 7 items: somatization, cognitive problems, interpersonal sensitivity, depression, anxiety, hostility, phobic fear, paranoid ideation, and psychoticism. Patients rated the extent to which each item (e.g., “feeling worthless”) bothered them in the past week on a 5-point Likert-type scale ranging from (0) not at all to (4) extremely. In our sample, coefficient alpha for the psychoticism subscale was .70, and for each of the other eight subscales alpha exceeded .80.

Measure of Response Inconsistency

We used the l_z person-fit statistic for polytomous items (Drasgow, Levine, & Williams, 1985) to quantify response inconsistency of the self-report symptom scores with respect to the graded response IRT model (GRM; Samejima, 1997). Statistic l_z is a continuous measure of response inconsistency, and may be combined with cutoff values to classify item-score patterns as inconsistent or overly consistent. The GRM is a unidimensional IRT model for ordered categorical data, and assumes unidimensionality of the latent trait, local independence, and increasing logistic item-step response functions. The l_z statistic is the standardized likelihood of an item-score pattern given the estimated latent trait level (e.g., depression severity) and GRM item parameters.
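To make the definition concrete, the following minimal Python sketch computes GRM category probabilities and the polytomous l_z for a single item-score pattern. It assumes that the trait level and item parameters are already estimated; the function names, the toy parameter values, and the example patterns are illustrative assumptions and are not taken from the study.

```python
import numpy as np


def grm_category_probs(theta, a, b):
    """Category probabilities P(X = k | theta), k = 0..len(b), for one GRM item.

    a is the discrimination parameter; b holds the ordered thresholds.
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probabilities P(X >= k | theta) for k = 1..m
    cum = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(X >= 0) = 1 and P(X >= m + 1) = 0, then take differences
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]


def lz_polytomous(responses, theta, a_params, b_params):
    """Standardized log-likelihood of an item-score pattern.

    Large negative values indicate inconsistency (before any recoding).
    """
    log_lik = expected = variance = 0.0
    for x, a, b in zip(responses, a_params, b_params):
        p = grm_category_probs(theta, a, b)
        log_p = np.log(p)
        log_lik += log_p[x]                  # observed log-likelihood
        item_mean = np.sum(p * log_p)        # its expectation under the model
        expected += item_mean
        variance += np.sum(p * log_p ** 2) - item_mean ** 2
    return (log_lik - expected) / np.sqrt(variance)


# Toy example: six identical four-category items, trait level 0 (assumed values).
a_params = [1.5] * 6
b_params = [[-1.0, 0.0, 1.0]] * 6
consistent = [1, 2, 1, 2, 1, 2]   # middle categories, likely at theta = 0
aberrant = [3, 0, 3, 0, 3, 0]     # alternating extremes, unlikely at theta = 0
lz_consistent = lz_polytomous(consistent, 0.0, a_params, b_params)
lz_aberrant = lz_polytomous(aberrant, 0.0, a_params, b_params)
```

In this toy example the alternating-extremes pattern receives a negative l_z and the middle-category pattern a positive one, in line with the interpretation of l_z as a measure of (in)consistency with the model.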

The l_z statistic quantifies person fit more reliably as more items are used and is unreliable for short scales containing no more than 20 items (Conijn, Emons, & Sijtsma, 2014; Reise & Due, 1991). To obtain a reliable measure, we based response inconsistency on the item scores of both the BDI-II and the BSI (70 items in total). We first computed l_z for each of the BDI-II and BSI subscales separately, and then computed the multiscale l_zm statistic (Conijn et al., 2014; Drasgow, Levine, & McLaughlin, 1991), which equals the sum of the subscale l_z values. Conijn et al. (2014) showed that the l_zm statistic has high power to detect response inconsistencies when the total number of items is at least 60. Underlying this procedure was the assumption that patients’ response inconsistency is related across the BDI-II and BSI. For example, we assumed that patients lacking concentration on one instrument also lacked concentration on the other instrument. This assumption was supported by positive correlations between the l_z statistics based on different subscales (range r = .04 to .26; M = 0.13, SD = 0.05). These correlations may seem trivial, but the observed correlations are attenuated by the unreliability of the l_z statistic, particularly on the shorter BSI subscales consisting only of four or five items.

Before computing the l_zm statistic, we examined whether the subscale data of the BDI-II and the BSI satisfied the GRM model assumptions; see the appendix for results. Consistent with Brouwer, Meijer, and Zevalkink (2013), we found that the BDI-II data were sufficiently unidimensional for IRT analyses. We standardized the l_zm statistic using a bootstrap procedure in which we generated a null distribution for each individual respondent (De La Torre & Deng, 2008). The resulting l_zm statistic was normally distributed and independent of the latent trait (e.g., depression severity for the BDI-II) under the null model of response consistency to the GRM. For descriptive purposes, we also used the bootstrap procedure to compute the number of patients showing significant response inconsistency based on the l_zm statistic, with two-sided α = .05. To facilitate interpretation, we recoded the l_zm person-fit statistic such that higher values reflected greater inconsistency.
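The idea of a respondent-specific null distribution can be illustrated with a simplified parametric bootstrap in Python. This sketch is not the published procedure: it treats the estimated trait level and item parameters as known, uses the raw pattern log-likelihood as the statistic, and reports only a lower-tail proportion. All names and parameter values are assumptions made for illustration.

```python
import numpy as np


def grm_probs(theta, a, b):
    """GRM category probabilities for one item (ordered thresholds b)."""
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, dtype=float))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[:-1] - cum[1:]


def pattern_loglik(responses, probs):
    """Log-likelihood of one item-score pattern given per-item probabilities."""
    return float(sum(np.log(p[x]) for x, p in zip(responses, probs)))


def bootstrap_person_fit(responses, theta_hat, a_params, b_params,
                         n_boot=2000, seed=0):
    """Standardize a person-fit statistic against a simulated null distribution.

    Returns (inconsistency, p_low): inconsistency is recoded so that higher
    values mean greater inconsistency; p_low is the proportion of simulated
    model-consistent patterns that are at least as unlikely as the observed one.
    """
    rng = np.random.default_rng(seed)
    probs = [grm_probs(theta_hat, a, b) for a, b in zip(a_params, b_params)]
    observed = pattern_loglik(responses, probs)
    null = np.empty(n_boot)
    for r in range(n_boot):
        # Draw a model-consistent pattern at the respondent's trait level
        simulated = [rng.choice(len(p), p=p) for p in probs]
        null[r] = pattern_loglik(simulated, probs)
    z = (observed - null.mean()) / null.std(ddof=1)
    return -z, float(np.mean(null <= observed))


# Toy example with ten identical items and an alternating-extremes pattern.
a_params = [1.5] * 10
b_params = [[-1.0, 0.0, 1.0]] * 10
inconsistency, p_low = bootstrap_person_fit([3, 0] * 5, 0.0, a_params, b_params)
```

A full implementation would additionally need to account for uncertainty in the trait estimate, which this sketch ignores.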

Statistical Analysis

To address our research questions, we first computed discrepancy scores defined as the standardized weighted BDI-II total score minus the standardized MADRS total score. Positive discrepancy scores indicate overreporting (BDI-II > MADRS) and negative values indicate underreporting (BDI-II < MADRS).
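A minimal Python sketch of this definition follows. The function names are illustrative, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, because the text does not specify which variant was used.

```python
import numpy as np


def zscore(values):
    """Standardize scores to mean 0 and SD 1 (sample SD; an assumption)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)


def discrepancy_scores(bdi_weighted, madrs):
    """Standardized weighted BDI-II total minus standardized MADRS total.

    Positive values indicate overreporting, negative values underreporting.
    """
    return zscore(bdi_weighted) - zscore(madrs)


# Toy data for three patients (made-up numbers).
example = discrepancy_scores([10, 20, 30], [30, 20, 10])
```

In the toy data, the first patient has the lowest self-report score but the highest clinician rating, which yields a negative discrepancy (underreporting); the third patient shows the opposite pattern.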

Research Questions 1 and 2

In preliminary analyses, we visually inspected scatterplots and partial scatterplots (controlling for the effects of the remaining explanatory variables) to assess potential nonlinear effects of explanatory variables on the discrepancy scores. Visual inspection of the (partial) scatterplot was considered particularly relevant for assessing potential nonlinear effects of response inconsistency. First, we expected that response inconsistency is positively related to both underreporting and overreporting, because an inaccurate self-report may lead to both positive and negative bias in the test score. Second, several studies showed that in addition to response inconsistency, overconsistency casts doubt on the validity of the test score (Conijn, Sijtsma, & Emons, 2016; Ferrando, 2014; Stukenberg, Brady, & Klinetob, 2000). That is, overconsistent response patterns (i.e., an overly consistent response style) may result from social desirability, malingering or symptom exaggeration, and response inconsistency may therefore have a curvilinear relationship with discrepancy.

In case the graphical analyses showed linear relationships between explanatory variables and discrepancy, a multiple linear regression model was estimated regressing discrepancy on these explanatory variables. A second linear regression model included the BDI-II as a control variable. Using BDI-II as a control variable accounted for confounding effects: Self-reported depression symptom severity may correlate with both discrepancy and the demographic and clinical predictors in the model. The controlled model was used for interpretation, while a comparison between the effects in the controlled and uncontrolled models was used to assess the effect of confounding, as most previous research did not control for self-reported depression severity (e.g., Duberstein & Heisel, 2007; Kim et al., 2013; Tsujii et al., 2014).

SPSS was used for linear regression analyses and continuous explanatory variables were standardized. In regression analyses the variance inflation factor values for the explanatory variables did not exceed 2.7, suggesting absence of multicollinearity problems (Keith, 2006). Considering the large sample size, significance of the regression coefficients was tested at a 1% significance level.
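The analyses were run in SPSS; purely for illustration, variance inflation factors can also be obtained in Python from the inverse of the predictor correlation matrix, whose diagonal elements equal 1 / (1 − R²_j) for each predictor j. The function name and the simulated data below are assumptions.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (rows = cases, columns = predictors).

    Uses the identity VIF_j = j-th diagonal element of the inverse
    correlation matrix of the predictors.
    """
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Simulated example with two correlated predictors (made-up data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=500)
vif = variance_inflation_factors(np.column_stack([x1, x2]))
r12 = np.corrcoef(x1, x2)[0, 1]
```

With two predictors, both variance inflation factors reduce to 1 / (1 − r²), where r is their correlation, which provides a simple check of the computation.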

Research Question 3

To test the hypothesis that response inconsistency mediates the effect of the clinical/demographic variables on the discrepancy scores, a mediation analysis was conducted for each clinical and demographic variable that satisfied the following criteria (Kenny, Kashy, & Bolger, 1998): (a) the potential mediator is related to the independent variable and (b) the potential mediator is related to the outcome variable after controlling for the effect of the independent variable. The PROCESS add-on for SPSS (Hayes, 2013) was used for conducting the mediation analysis. This program uses bootstrapped confidence intervals to assess mediation effects. We used α = .01 for testing regression coefficients, and mediation was assumed to occur if the 99% confidence interval for the indirect effect did not contain the value 0.
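The logic of a bootstrapped indirect effect can be sketched in Python as follows. This is not the PROCESS implementation: it uses ordinary least squares with a simple percentile interval, and it assumes that the control variable enters both the mediator and the outcome model. Function names and the simulated data are illustrative.

```python
import numpy as np


def ols_coefs(y, *predictors):
    """OLS coefficients (intercept first) of y regressed on the predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


def indirect_effect(x, m, y, covariate):
    """Product a * b of the x -> m path and the m -> y path (controlling x)."""
    a = ols_coefs(m, x, covariate)[1]
    b = ols_coefs(y, m, x, covariate)[1]
    return a * b


def bootstrap_indirect_ci(x, m, y, covariate, n_boot=1000, level=0.99, seed=0):
    """Point estimate and percentile bootstrap CI for the indirect effect."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for r in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample cases with replacement
        estimates[r] = indirect_effect(x[idx], m[idx], y[idx], covariate[idx])
    tail = (1.0 - level) / 2.0 * 100.0
    lower, upper = np.percentile(estimates, [tail, 100.0 - tail])
    return indirect_effect(x, m, y, covariate), (lower, upper)


# Simulated data with a true indirect effect of 0.5 * 0.5 = 0.25 (made up).
rng = np.random.default_rng(1)
n = 2000
x = rng.integers(0, 2, size=n).astype(float)   # binary independent variable
covariate = rng.normal(size=n)                 # stands in for the control variable
m = 0.5 * x + 0.2 * covariate + rng.normal(size=n)
y = 0.5 * m - 0.1 * x + 0.3 * covariate + rng.normal(size=n)
estimate, (lower, upper) = bootstrap_indirect_ci(x, m, y, covariate)
```

Under the decision rule described above, mediation would be inferred in this simulated example when the interval from `lower` to `upper` excludes 0.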

Results

Descriptive Statistics

Two participants (0.03%) had missing or incomplete MADRS data, and 45 participants (0.75%) had missing or incomplete BSI data, leaving n = 5,959. Data were missing on ethnicity for 636 participants (10.7%) and on education level for 584 participants (9.8%). We kept these cases in the analysis, because they concerned large percentages of the data and the reason for the missing data was known (i.e., ethnicity and education data were not recorded). Hence, we estimated the models without these variables using the full sample, and the models including these variables using the smaller sample of participants.

Spearman’s rho between the BDI-II and MADRS was .72, and it increased to .74 for BDI-II scores weighted to match the MADRS by content. The BDI-II (skewness = 0.06; kurtosis = −0.40) and the MADRS (skewness = 0.03; kurtosis = −0.34) had similar distributions. The similarity implies that standardized scores on these measures reflect comparable levels of depression severity. Discrepancy was normally distributed (skewness = 0.03; kurtosis = 0.19) with a mean of 0.00 and a standard deviation (SD) of 0.71.

Response inconsistency ranged from −1.58 to 3.94 (skewness = 1.13; kurtosis = 2.11), with higher values reflecting higher inconsistency. The number of significantly inconsistent item-score patterns equaled n = 725 (12.1%). The correlation between the MADRS and the weighted BDI-II was lower for patients with inconsistent item-score patterns (rho = .66) than for patients with normal patterns (rho = .75). Response inconsistency was weakly related to the explanatory variables, and correlated highest with the BSI subscale scores: cognitive problems, anxiety, paranoid ideation, and psychoticism (.20 < r < .23).

Research Questions 1 and 2: Explaining Discrepancy

Figure 1 shows the bivariate relationship between response inconsistency and discrepancy. The plot suggests a linear but weak positive relationship. Controlling for the other explanatory variables revealed a similar relationship. Scatterplots and partial scatterplots between the other explanatory variables and discrepancy also suggest linear, weak relationships (plots not displayed).

Figure 1.

Scatterplot of response inconsistency and discrepancy scores including regression line.

Given linear relationships, we computed Pearson correlations (Table 2). Except for education level, all other correlations were significant (α = .01). Relative to clinicians, males and older adults were less likely to overreport depressive symptoms. Respondents of non-Dutch ethnic origin were more likely to overreport symptoms. Response inconsistency and all clinical variables correlated positively with overreporting of symptoms. Correlations ranged from .10 for response inconsistency to .18 for psychoticism and paranoid ideation.

Table 2.

Results of Linear Regression Model Predicting Discrepancy Scores (BDI-II z-score–MADRS z-score) From Explanatory Variables (EVs).

Explanatory variable	Pearson correlation (EV, discrepancy)	Model A		Model B controlling for BDI-II
		b	β	b	β
Intercept	—	.40 (.04)**	—	.07 (.04)**	—
Male gender	−.11**	−.14 (.02)**	−.10	−.08 (.02)**	−.06
Age	−.13**	−.06 (.01)**	−.08	−.08 (.01)**	−.11
High education level (vs. low)	−.03	−.05 (.02)	−.03	−.02 (.02)	−.02
Non-Dutch ethnicity	.04*	−.04 (.02)	−.02	−.06 (.02)*	−.03
Response inconsistency	.10**	.04 (.01)**	.05	.03 (.01)**	.05
BSI
Cognitive problems	.14**	.01 (.01)	.02	−.11 (.01)**	−.16
Anxiety	.11**	−.04 (.02)	−.05	−.09 (.02)**	−.12
Phobic fear	.12**	.00 (.01)	.00	.00 (.01)	−.01
Psychoticism	.18**	.07 (.02)**	.10	−.04 (.02)	−.05
Hostility	.17**	.06 (.01)**	.08	.02 (.02)	.02
Paranoid ideation	.18**	.05 (.01)**	.07	.06 (.01)**	.08
BDI-II	.34**	—	—	.36 (.01)***	.50

R²		.07		.17

Note. BSI = Brief Symptom Inventory; MADRS = Montgomery–Åsberg Depression Rating Scale; BDI-II = Beck Depression Inventory–II. Positive discrepancy scores indicate overreporting (BDI-II > MADRS) and negative scores indicate underreporting (MADRS > BDI-II). All continuous explanatory variables were standardized.

*p < .01. **p < .001.

Next, partial effects of the explanatory variables were estimated using a multiple regression model (Table 2; Model A) including all explanatory variables (R² = .07). We only discuss effects significant at α = .01. Cognitive problems, anxiety, phobic fear, and ethnicity had no unique effect on overreporting, but all other predictor effects were consistent with the correlations. After controlling for the BDI-II in Model B (R² = .17), several predictor effects changed. Non-Dutch ethnicity, cognitive problems, and anxiety related negatively to overreporting. Predictor effects were small. For example, for patients with high response inconsistency (i.e., 2 SDs above the mean inconsistency), the model-predicted discrepancy was .13, but for patients with low response inconsistency (i.e., 2 SDs below the mean inconsistency), the model-predicted discrepancy was .01. Hostility and psychoticism did not have a significant effect on discrepancy after controlling for the BDI-II.
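The predicted values in this example can be reproduced from the Model B coefficients in Table 2 with the short Python calculation below. It assumes that all other predictors are held at zero, that is, standardized continuous variables at their mean and categorical variables at their reference category; this is an inference from the reported numbers rather than a stated detail of the analysis.

```python
# Model B values taken from Table 2; all other predictors are assumed to be 0.
INTERCEPT = 0.07          # Model B intercept
B_INCONSISTENCY = 0.03    # unstandardized effect per SD of response inconsistency


def predicted_discrepancy(inconsistency_z):
    """Model-predicted discrepancy at a given standardized inconsistency level."""
    return INTERCEPT + B_INCONSISTENCY * inconsistency_z


high = round(predicted_discrepancy(2.0), 2)    # 2 SDs above the mean
low = round(predicted_discrepancy(-2.0), 2)    # 2 SDs below the mean
```

The difference between the two predictions spans four standard deviations of response inconsistency yet amounts to little more than a tenth of a standardized discrepancy unit, which illustrates how small the effect is.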

Research Question 3: Response Inconsistency as a Mediator

In terms of significance (tested at α = .01), response inconsistency met the prerequisites for acting as mediator (Kenny et al., 1998; see Method section) between discrepancy and any of the other explanatory variables: Response inconsistency was related (a) to the independent variable and (b) to discrepancy after controlling for the effect of the independent variable. However, the relationship between response inconsistency and the independent variables gender and age was weak (|r| < .05), and mediation analysis was not considered useful for these independent variables. For the remaining explanatory variables, we estimated a mediation model including the independent variable, response inconsistency as the mediator, and discrepancy as the outcome variable. The BDI-II acted as a control variable in the model. See Figure 2 for an example model. The mediation effect equals the product of regression coefficients a and b.

Figure 2.

Path diagram displaying the mediation model for the effect of non-Dutch ethnicity on discrepancy (BDI-II z-score –MADRS z-score), including the BDI-II as a control variable.

Table 3 shows the regression coefficients and the 99% confidence intervals for the mediating effects (a × b). The mediating effects of response inconsistency were significant for most variables, but effect sizes were negligible, ranging from −.002 (education level) to .006 (anxiety). Figure 2 shows the estimated model for non-Dutch ethnicity: Non-Dutch patients had higher response inconsistency (a = .164), and higher response inconsistency was in turn related to increased discrepancy (b = .026). The estimate of the direct effect of ethnicity on overreporting (c = −.054) was larger in absolute value than the estimate of the indirect effect of ethnicity (a × b = .004). For several variables, the direct effects (c paths) were nonsignificant even though they were stronger than the mediating effects (a × b paths). This apparent contradiction reflects the relatively high power to detect indirect effects compared with direct effects in mediation models (Kenny & Judd, 2014). Overall, the results suggest that the mediating effect of inaccurate self-reported item scores on the patient–clinician discrepancy associated with most of the demographic and clinical variables was negligible.

Table 3.

Results of Mediation Analysis Predicting Discrepancy From Independent Variables and Response Inconsistency as a Mediator.

Independent variable	Path (see Figure 2)
Independent variable	a	b	c	a × b [99% confidence interval]
High education level	−.097	.023^ns	−.020^ns	−.002 [−.006, <.000]
Non-Dutch ethnicity	.164	.026	−.054^ns	.004 [.001, .010]
Cognitive problems	.130	.030	−.143	.004 [.001, .008]
Anxiety	.166	.033	−.124	.006 [.002, .010]
Phobic fear	.066	.023	−.065	.002 [>.000, .004]
Psychoticism	.152	.026	−.064	.004 [.001, .008]
Hostility	.106	.020^ns	−.005^ns	.002 [<.000, .006]^ns
Paranoid ideation	.136	.020^ns	−.005^ns	.003 [−.001, .007]^ns

Note. Higher discrepancy scores reflect overreporting on the Beck Depression Inventory–II relative to the Montgomery–Åsberg Depression Rating Scale; ns denotes not significant; all other coefficients are significant at α = .01. The mediation (a × b) paths are significant if the 99% confidence interval does not include 0. Unstandardized coefficients are provided. The Beck Depression Inventory–II was included as a control variable.
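Because the mediating effect is defined as the product a × b, the last column of Table 3 can be checked against the first two. The following Python sketch multiplies the rounded a and b values from the table and compares the result with the reported a × b. The dictionary `table3` and the helper `max_gap` are names introduced here for illustration. Small gaps are expected because the published a and b are themselves rounded to three decimals.

```python
# (a, b, reported a*b) per independent variable, copied from Table 3
table3 = {
    "High education level": (-0.097, 0.023, -0.002),
    "Non-Dutch ethnicity": (0.164, 0.026, 0.004),
    "Cognitive problems": (0.130, 0.030, 0.004),
    "Anxiety": (0.166, 0.033, 0.006),
    "Phobic fear": (0.066, 0.023, 0.002),
    "Psychoticism": (0.152, 0.026, 0.004),
    "Hostility": (0.106, 0.020, 0.002),
    "Paranoid ideation": (0.136, 0.020, 0.003),
}

def max_gap(rows):
    """Largest absolute difference between a*b and the reported product."""
    return max(abs(a * b - reported) for a, b, reported in rows.values())

for name, (a, b, reported) in table3.items():
    print(f"{name:22s} a*b = {a * b:+.4f}   reported = {reported:+.3f}")

print(f"largest gap: {max_gap(table3):.4f}")
```

Every recomputed product lies within .001 of the reported value, so the tabulated indirect effects are consistent with the tabulated paths up to rounding.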

Comparison of Results Based on Matched Versus Unmatched Measures

So far, we used BDI-II scores weighted to match MADRS scores, but perfect matching was impossible because the two instruments differ in content. To assess the sensitivity of the results to the remaining content differences between the BDI-II and the MADRS, we contrasted the results from analyses using matched BDI-II and MADRS scores with those using unmatched scores. Both Model A and Model B (Table 2) were reestimated using discrepancy scores based on unmatched BDI-II and MADRS scores. Results of the unmatched analyses (not tabulated) differed only slightly from those of the matched analyses. Unmatched Model A explained 2% more variance, suggesting slightly inflated effects due to content differences between the two instruments. Unmatched Model B explained 3% less variance due to a weaker relationship between the instruments. Coefficients were similar for most independent variables, and the only differences with respect to significance at α = .01 were a negative effect of anxiety in Model A (b = −.07) and a positive effect of psychoticism in Model B (b = .07).

Discussion

This study partly replicated earlier research on demographic and clinical explanatory variables for discrepancy between self-reported and clinician-rated depression symptom severity. Findings for age and education were consistent with most previous studies (e.g., Enns et al., 2000; Rush et al., 2006), but our study is one of the few to demonstrate gender differences consistent with gender role theory (Carter et al., 2010). Cognitive problems, paranoid ideation, and anxiety were also related to patient–clinician discrepancy. In contrast to previous research (Corruble et al., 1999; Duberstein & Heisel, 2007; Kim et al., 2013), we did not find a positive effect of hostility or phobic fear on overreporting, and we found a negative effect of anxiety on overreporting instead of the expected positive effect. The anxiety effect is consistent with avoidance-coping mechanisms, which are common among patients with anxiety problems. Differences between our study and previous studies may be explained by our approach of controlling for self-reported depression symptom severity.

Our main goal was to assess whether patient–clinician discrepancy may be related to inaccurate self-reported symptom severity. The positive linear effect of response inconsistency on discrepancy scores suggests that patients whose self-reported scores are higher than their clinician-rated scores may show this discrepancy because of a low-quality self-report.

This conclusion would be consistent with previous research showing intelligence to be related to overreporting by patients but not underreporting (Kim et al., 2013). A possible explanation for these results is that underreporting of symptoms may proceed consistently across test items, whereas overreporting may proceed in a more impulsive, inconsistent manner producing response inconsistency. In other words, it may be easier to consistently deny symptoms than to consistently “overendorse” symptoms, thereby admitting extreme symptoms of psychopathology.

The small direct and mediating effects of response inconsistency suggest that response behaviors typically producing response inconsistency (e.g., careless responding or misinterpretation of item content) are unlikely to play a large role in patient–clinician discrepancy. This result is reassuring considering that cognitive biases, concentration difficulties, and low motivation are common symptoms of depression and known causes of careless and random responding. The modest effect of response inconsistency may also be due to a low prevalence of invalid responding in outpatients, leading to a small effect size in group-level analyses. Possibly, the predictive power of response inconsistency is stronger in psychiatric inpatients, because they experience more severe concentration difficulties and more severe symptomatology in general. In any case, an important implication of our results is that patient–clinician discrepancy in outpatients may largely be due to clinician biases or to patient biases that cannot be captured with response inconsistency indices (e.g., agreement response bias).

Despite the small predictive power of response inconsistency for patient–clinician discrepancies, the group of patients showing both the largest discrepancies and inconsistent response patterns ought to be studied in more clinical detail (Meijer et al., 2016). This might help us disentangle the causes underlying the heterogeneous symptomatology observed in patients with major depressive disorder.

Implications for Future Research

This study has several implications for future research into patient–clinician discrepancy. First, controlling for the BDI-II total score in the multiple regression analysis had a large impact on the effects of most of our explanatory variables. This result suggests that explanatory variables (specifically clinical and personality variables) may relate to discrepancy because they relate to symptom severity as measured by the self-report measure and not to overreporting or underreporting per se. Hence, future research explaining patient–clinician discrepancy should control for self-reported symptom severity. Second, given that our large sample provided considerable power, the results suggest that differences between outcomes based on matched instrument content and nonmatched instrument content are small. This suggests that previous results concerning explanatory variables for discrepancy between the MADRS and BDI-II, which were based on smaller sample sizes, are generally valid. Third, our results suggest that future research should assess the contribution of clinician biases to discrepancy. For example, the explanatory power of clinicians’ response inconsistency and of patients’ response inconsistency for patient–clinician discrepancy could be compared. The small number of clinician-rated MADRS items in this study prevented us from reliably assessing the clinician’s response inconsistency (e.g., Emons, 2008; Reise & Due, 1991). Hence, such research requires data from longer rating scales, multiple rating scales, or repeated assessments.

Future research should also overcome the limitations of the current study. In our study, the clinical variables were based on self-reports. The self-report scores were likely also affected by overreporting or underreporting and share method variance with BDI-II and MADRS test scores, thereby inflating their explanatory value (Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). Future research could use objective measures of clinical explanatory variables, such as lab tests of cognitive functioning or the average symptom ratings by multiple informants.

A second limitation is that we cannot completely eliminate differences in item content as a cause of our findings, although we did minimize differences in item content between the BDI-II and MADRS by reweighting the scores. Future studies that use measures with equal content would allow additional analyses, for example, to assess the absolute differences between scores and to assess whether patients’ self-reported item scores are consistent with IRT item parameters estimated in clinician-rated data. These future studies should also assess the equality of depression constructs across self-report and clinician ratings (e.g., Bernstein, Rush, Carmody, Woo, & Trivedi, 2007). Inequality would suggest that self-reported and clinician-rated depression represent different constructs, but it would not suggest that one is valid and the other is not; both may be valid. Such a finding would imply that discrepancy may not primarily be an indicator of data quality but instead the product of different depression constructs.

A third limitation is also related to the meaning of discrepancy. We used the term overreporting to denote self-reported scores higher than clinician ratings and underreporting to denote relatively low self-reported scores. This terminology suggests that clinician ratings are the standard, but we note that clinician ratings are not necessarily more accurate than patient self-reports (e.g., De Los Reyes & Kazdin, 2004), and overreporting may thus equally well indicate a clinician’s underestimation of depression severity. The estimated predictor effects can and should thus be interpreted in various ways.

Conclusion

Using a large outpatient sample, our study replicated previous results on explanatory variables for patient–clinician discrepancy. In addition, we provided new results pertaining to the mechanisms underlying discrepancy. Although response inconsistency was related to overreporting of symptoms by patients as compared with clinician ratings, our results suggest that inaccurate self-report responding is not a major explanation of patient–clinician discrepancy. Future research into patient–clinician discrepancy should be directed at potential biases in the clinician ratings.

Appendix

To assess the GRM unidimensionality and local independence assumptions for the subscale data of the BDI-II and the BSI, we conducted confirmatory factor analyses in Mplus (Muthén & Muthén, 2007). Because various factor structures have been identified for the BDI-II (e.g., Beck et al., 1996; Buckley, Parker, & Heggie, 2001), we compared the model fit of several of the proposed factor structures in our data. To assess model fit, we used cutoff values equal to .95 for the comparative fit index (CFI) and the Tucker–Lewis index (TLI) and a cutoff value equal to .08 for the root mean square error of approximation (RMSEA). Model-fit comparisons showed that the structure corresponding to a 9-item cognitive subscale and a 12-item somatic–affective subscale (Beck et al., 1996) was most appropriate, although the RMSEA was poor for both subscales. RMSEA model fit improved from mediocre to satisfactory by freeing one residual correlation for each subscale. For the BSI, we used the 9-factor structure, which includes 49 of the 53 items (De Beurs, 2004). For most BSI subscales, CFI and TLI suggested acceptable model fit. For the psychoticism and hostility subscales, RMSEA suggested poor model fit, but CFI and TLI suggested acceptable fit. Poor RMSEA may be due to a small number of items (Kenny, Kaniskan, & McCoach, 2015), and statistic l_z is robust against mild model misfit (Conijn et al., 2014; Conijn et al., 2015). Hence, we decided to include all BSI subscales for computing the l_zm person-fit statistic. The GRM assumption of monotone increasing logistic item step response curves was checked by visual inspection of the item step response functions using the mokken package in R (Van der Ark, 2012) and was supported by the data.
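To make the role of the GRM and the l_z statistic concrete, the following Python sketch computes the standardized log-likelihood person-fit statistic for polytomous items (in the form given by Drasgow, Levine, & Williams, 1985) for a single scale. It is a simplified illustration and not the l_zm computation used in this study: the item parameters are invented, θ is treated as known rather than estimated, only one scale is used, and the function names are hypothetical.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model."""
    thresholds = np.asarray(thresholds, dtype=float)  # must be increasing
    # Logistic item step response functions P(X >= k), k = 1..K
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - thresholds)))
    # Pad with P(X >= 0) = 1 and P(X >= K + 1) = 0, then take differences
    cumulative = np.concatenate(([1.0], p_star, [0.0]))
    return cumulative[:-1] - cumulative[1:]

def lz_polytomous(responses, theta, discriminations, thresholds):
    """Standardized log-likelihood of a response pattern at a given theta."""
    loglik = expected = variance = 0.0
    for x, a, b in zip(responses, discriminations, thresholds):
        p = grm_category_probs(theta, a, b)
        logp = np.log(p)
        loglik += logp[x]                  # log-likelihood of observed score
        mean_j = np.sum(p * logp)          # its expectation under the model
        expected += mean_j
        variance += np.sum(p * logp**2) - mean_j**2
    return (loglik - expected) / np.sqrt(variance)

# Hypothetical scale: six 4-category items with identical, made-up parameters
n_items = 6
discriminations = [1.5] * n_items
thresholds = [[-1.0, 0.0, 1.0]] * n_items
theta = 0.0

typical = [1, 2, 1, 2, 1, 2]    # scores in the most likely categories
aberrant = [0, 3, 0, 3, 0, 3]   # alternating extreme scores

lz_typical = lz_polytomous(typical, theta, discriminations, thresholds)
lz_aberrant = lz_polytomous(aberrant, theta, discriminations, thresholds)
print(f"l_z typical pattern:  {lz_typical:.2f}")
print(f"l_z aberrant pattern: {lz_aberrant:.2f}")
```

Large negative values indicate a response pattern that is unlikely under the model, that is, inconsistent responding. In practice θ must be estimated from the same responses, which changes the reference distribution of the statistic (see De La Torre & Deng, 2008, in the reference list).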

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Beck, A. T., Steer, R. A., Ball, & Ranieri (1996). Comparison of Beck Depression Inventories-IA and -II in psychiatric outpatients. Journal of Personality Assessment, 67, 588-597.

Bernstein, I. H., Rush, A. J., Carmody, T. J., Woo, & Trivedi, M. H. (2007). Clinical vs. self-report versions of the Quick Inventory of Depressive Symptomatology in a public sector sample. Journal of Psychiatric Research, 41, 239-246.

Brouwer, Meijer, & Zevalkink (2013). On the factor structure of the BDI-II: G is the key. Psychological Assessment, 25, 136-145.

Buckley, T. C., Parker, J. D., & Heggie (2001). A psychometric evaluation of the BDI-II in treatment-seeking substance abusers. Journal of Substance Abuse Treatment, 20, 197-204.

Carroll, B. J., Feinberg, Smouse, P. E., Rawson, S. G., & Greden, J. F. (1981). The Carroll rating scale for depression. I: Development, reliability and validation. British Journal of Psychiatry, 138, 194-200.

Carter, J. D., Frampton, C. M., Mulder, R. T., Luty, S. E., & Joyce, P. R. (2010). The relationship of demographic, clinical, cognitive and personality variables to the discrepancy between self and clinician rated depression. Journal of Affective Disorders, 124, 202-206.

Conijn, J. M., Emons, W. H. M., De Jong, & Sijtsma (2015). Detecting and explaining aberrant responding to the Outcome Questionnaire-45. Assessment, 22, 513-524.

Conijn, J. M., Emons, W. H. M., & Sijtsma (2014). Statistic l_z based person-fit methods for noncognitive multiscale measures. Applied Psychological Measurement, 38, 122-136.

Conijn, J. M., Sijtsma, & Emons, W. H. M. (2016). Identifying person-fit latent classes and explanation of categorical and continuous person misfit. Applied Psychological Measurement, 40, 128-141.

Corruble, Legrand, J. M., Zvenigorowski, Duret, & Guelfi, J. D. (1999). Concordance between self-report and clinician’s assessment of depression. Journal of Psychiatric Research, 33, 457-465.

Cuijpers, Hofmann, S. G., & Andersson (2010). Self-reported versus clinician-rated symptoms of depression as outcome measures in psychotherapy research on depression: A meta-analysis. Clinical Psychology Review, 30, 768-778.

De Beurs (2004). De Brief Symptom Inventory: Handleiding [The Brief Symptom Inventory: Manual]. Leiden, Netherlands: Pits.

De Beurs, Den Hollander-Gijsman, M. E., Van Rood, Y. R., Van der Wee, N. J., Giltay, E. J., Van Noorden, M. S., . . . Zitman, F. G. (2011). Routine outcome monitoring in the Netherlands: Practical experiences with a web-based strategy for the assessment of treatment outcome in clinical practice. Clinical Psychology & Psychotherapy, 18, 1-12.

De La Torre & Deng (2008). Improving person-fit assessment by correcting the ability estimate and its reference distribution. Journal of Educational Measurement, 45, 159-177.

De Los Reyes, & Kazdin, A. E. (2004). Measuring informant discrepancies in clinical child research. Psychological Assessment, 16, 330-334.

DeFife, J. A., Drill, Nakash, & Westen (2010). Agreement between clinician and patient ratings of adaptive functioning and developmental history. American Journal of Psychiatry, 167, 1472-1478.

Derogatis, L. R. (1977). SCL-90-R: Administration, scoring, and procedures manual 1. Baltimore, MD: Clinical Psychometric Research.

Derogatis, L. R., & Melisaratos (1983). The Brief Symptom Inventory: An introductory report. Psychological Medicine, 13, 595-605.

Domken, Scott, & Kelly (1994). What factors predict discrepancies between self and observer ratings of depression? Journal of Affective Disorders, 31, 253-259.

Dorz, Borgherini, Conforti, Scarso, & Magni (2004). Comparison of self-rated and clinician-rated measures of depressive symptoms: A naturalistic study. Psychology and Psychotherapy: Theory, Research and Practice, 77, 353-361.

Drasgow, Levine, M. V., & McLaughlin, M. E. (1991). Appropriateness measurement for some multidimensional test batteries. Applied Psychological Measurement, 15, 171-191.

Drasgow, Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38, 67-86.

Duberstein, P. R., & Heisel, M. J. (2007). Personality traits and the reporting of affective disorder symptoms in depressed patients. Journal of Affective Disorders, 103, 165-171.

Dunlop, B. W., Kornstein, S. G., Friedman, E. S., Rothschild, A. J., Pedersen, . . . Keller (2010). Correlation between patient and clinician assessments of depression severity in the PREVENT study. Psychiatry Research, 177, 177-183.

Dunlop, B. W., Kornstein, S. G., Friedman, E. S., Rothschild, A. J., Pedersen, . . . Trivedi, M. H. (2011). Concordance between clinician and patient ratings as predictors of response, remission, and recurrence in major depressive disorder. Journal of Psychiatric Research, 45, 96-103.

Emons, W. H. M. (2008). Nonparametric person-fit analysis of polytomous item scores. Applied Psychological Measurement, 32, 224-247.

Enns, M. W., Larsen, D. K., & Cox, B. J. (2000). Discrepancies between self and observer ratings of depression: The relationship to demographic, clinical and personality variables. Journal of Affective Disorders, 60, 33-41.

Ferrando, P. J. (2014). A general approach for assessing person fit and person reliability in typical-response measurement. Applied Psychological Measurement, 38, 166-183.

Hamilton (1960). A rating scale for depression. Journal of Neurology, Neurosurgery & Psychiatry, 23, 56-62.

Hamilton (1976). Comparative value of rating scales. British Journal of Clinical Pharmacology, 3, 58-60.

Hasemann (1971). Verhaltensbeobachtung. In Heiss (Ed.), Handbuch der psychologie [Handbook of Psychology] (3rd ed., pp. 807-836). Göttingen, Germany: Hogrefe.

Hayes, A. F. (2013). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach. New York, NY: Guilford Press.

Keith, T. Z. (2006). Multiple regression and beyond. Boston, MA: Allyn & Bacon.

Kenny, D. A., & Judd, C. M. (2014). Power anomalies in testing mediation. Psychological Science, 25, 334-339.

Kenny, D. A., Kaniskan, & McCoach, D. B. (2015). The performance of RMSEA in models with small degrees of freedom. Sociological Methods and Research, 44, 486-507.

Kenny, D. A., Kashy, D. A., & Bolger (1998). Data analysis in social psychology. In Gilbert, Fiske, & Lindzey (Eds.), The handbook of social psychology (Vol. 1, 4th ed., pp. 233-265). Boston, MA: McGraw-Hill.

Kim, E. Y., Hwang, S. S., Lee, N. Y., Kim, S. H., Lee, H. J., Kim, Y. S., & Ahn, Y. M. (2013). Intelligence, temperament, and personality are related to over- or under-reporting of affective symptoms by patients with euthymic mood disorder. Journal of Affective Disorders, 148, 235-242.

Lin, C. H., M. J., Wong, & Chen, C. C. (2014). Comparison of physician-rating and self-rating scales for patients with major depressive disorder. Journal of Clinical Psychopharmacology, 34, 716-721.

Logan, D. E., Claar, R. L., & Scharff (2008). Social desirability response bias and self-report of psychological distress in pediatric chronic pain patients. Pain, 136, 366-372.

Lyness, J. M., Cox, Curry, Conwell, King, D. A., & Caine, E. D. (1995). Older age and the underreporting of depressive symptoms. Journal of the American Geriatrics Society, 43, 216-221.

McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22, 276-282.

Meijer, R. R., Niessen, A. S. M., & Tendeiro, J. N. (2016). A practical guide to check the consistency of item response patterns in clinical research through person-fit statistics: Examples and a computer program. Assessment, 23, 52-62.

Meijer, R. R., & Sijtsma (2001). Methodology review: Evaluating person fit. Applied Psychological Measurement, 25, 107-135.

Meyer, Finn, Eyde, Kay, Moreland, Dies, . . . Reed (2001). Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist, 56, 128-165.

Montgomery, S. A., & Åsberg (1979). A new depression scale designed to be sensitive to change. British Journal of Psychiatry, 134, 382-389.

Möller, H. J. (1991). Outcome criteria in antidepressant drug trials: Self-rating versus observer-rating scales. Pharmacopsychiatry, 24, 71-75.

Möller, H. J. (2000). Rating depressed patients: Observer- vs self-assessment. European Psychiatry, 15, 160-172.

Möller, H. J. (2009). Standardised rating scales in psychiatry: Methodological basis, their possibilities and limitations and descriptions of important rating scales. World Journal of Biological Psychiatry, 10, 6-26.

Muthén, B. O., & Muthén, L. K. (2007). Mplus: Statistical analysis with latent variables (Version 5.0). Los Angeles, CA: Author.

Oliver, S. J., & Toner, B. B. (1990). The influence of gender role typing on the expression of depressive symptoms. Sex Roles, 22, 775-790.

Paykel, E. S., & Norton, K. R. W. (1986). Self-report and clinical interview in the assessment of depression. In Sartorius & Ban, T. A. (Eds.), Assessment of depression (pp. 356-366). New York, NY: Springer.

Podsakoff, MacKenzie, Lee, & Podsakoff (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88, 879-903.

Reise, S. P., & Due, A. M. (1991). The influence of test characteristics on the detection of aberrant response patterns. Applied Psychological Measurement, 15, 217-226.

Rush, A. J., Carmody, T. J., Ibrahim, H. M., Trivedi, M. H., Biggs, M. M., Shores-Wilson, . . . Kashner, T. M. (2006). Comparison of self-report and clinician ratings on two inventories of depressive symptomatology. Psychiatric Services, 57, 829-837.

Samejima (1997). Graded response model. In van der Linden, W. J., & Hambleton, R. K. (Eds.), Handbook of modern item response theory (pp. 85-100). New York, NY: Springer.

Shain, B. N., Naylor, & Alessi (1990). Comparison of self-rated and clinician-rated measures of depression in adolescents. American Journal of Psychiatry, 147, 793-795.

Sheehan, D. V., Lecrubier, Sheehan, K. H., Amorim, Janavs, Weiller, . . . Dunbar, G. C. (1998). The Mini-International Neuropsychiatric Interview (M.I.N.I.): The development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. Journal of Clinical Psychiatry, 59(Suppl. 20), 22-33.

Stukenberg, Brady, & Klinetob (2000). Use of the MMPI-2’s VRIN scale with severely disturbed populations: Consistent responding may be more problematic than inconsistent responding. Psychological Reports, 86, 3-14.

Tada, Uchida, Suzuki, Abe, Pollock, B. G., & Mimura (2014). Baseline difference between patients’ and clinicians’ rated illness severity scores and subsequent outcomes in major depressive disorder: Analysis of the sequenced treatment alternatives to relieve depression data. Journal of Clinical Psychopharmacology, 34, 297-302.

Tsujii, Akashi, Mikawa, Tsujimoto, Niwa, Adachi, & Shirakawa (2014). Discrepancy between self- and observer-rated depression severities as a predictor of vulnerability to suicide in patients with mild depression. Journal of Affective Disorders, 161, 144-149.

Van der Ark, L. A. (2012). New developments in Mokken scale analysis in R. Journal of Statistical Software, 48, 1-27.

Van der Does, A. J. W. (2002). BDI-II-NL. In Handleiding: De Nederlandse versie van de Beck Depression Inventory [Manual: The Dutch version of the Beck Depression Inventory] (2nd ed.). Lisse, Netherlands: Harcourt Test.

Van Noorden, M. S., Van Fenema, E. M., Van der Wee, N. J., Van Rood, Y. R., Carlier, I. V., Zitman, F. G., & Giltay, E. J. (2012). Predicting outcomes of mood, anxiety and somatoform disorders: The Leiden routine outcome monitoring study. Journal of Affective Disorders, 142, 122-131.

Wanders, R. B. K., Wardenaar, K. J., Penninx, B. W. J. H., Meijer, R. R., & Jonge, P. D. (2015). Data-driven atypical profiles of depressive symptoms: Identification and validation in a large cohort. Journal of Affective Disorders, 180, 36-43.

Wardell, J. D., Rogers, M. L., Simms, L. J., Jackson, K. M., & Read, J. P. (2014). Point and click, carefully: Investigating inconsistent response styles in middle school and college students involved in web-based longitudinal substance use research. Assessment, 21, 427-442.

Wardenaar, K. J., Wanders, R. B. K., Roest, A. M., Meijer, R. R., & De Jonge (2015). What does the Beck Depression Inventory measure in myocardial infarction patients? A psychometric approach using item response theory and person-fit. International Journal of Methods in Psychiatric Research, 24, 130-142.