Abstract
This article demonstrates how qualitative comparative analysis can be used to test mechanisms using large-N survey data (large-N qualitative comparative analysis) as part of realistic evaluation. Large-N qualitative comparative analysis may offer advantages when testing mechanisms in realistic evaluation: it can explore aspects of intervention complexity that are often harder to capture with regression-based methods, while simultaneously assessing the consistency and empirical relevance of mechanisms across a larger group of participants, thereby improving the credibility and generalizability of the program theory. Despite these benefits, qualitative comparative analysis is rarely used to test mechanisms with large-N data. The purpose of this article is to illustrate how this can be done by outlining a four-step process. A weakness of the proposed method is that it identifies the co-occurrence of conditions and outcomes without clarifying the generative processes that connect them. Therefore, the final part of the article discusses how qualitative comparative analysis can be combined with qualitative research.
Introduction
Realistic evaluation (RE) is increasingly used to evaluate complex interventions in various fields such as health and employment. However, quantitative methods are not well-established in RE (Hawkins, 2016; Ravn, 2019; Renmans and Pleguezuelo, 2023). This has led to a quantitative turn in RE, with various authors illustrating methods for testing mechanisms using quantitative data (e.g., Ford et al., 2018; Giffoni et al., 2018; Hawkins, 2016; Ravn, 2019). The methods promoted include, among others, “The Bayesian Network Approach” (Giffoni et al., 2018), “Structural Equation Modelling” (Ford et al., 2018), and “Logistic Regression” (Ravn, 2019).
The article aims to contribute to this quantitative turn by illustrating how qualitative comparative analysis (QCA) combined with large-N survey data (>100 respondents) can be used to test mechanisms within RE. QCA was originally developed to analyze small to medium N samples (Rihoux and Ragin, 2009) but is now increasingly applied to larger datasets, referred to as large-N QCA (Rutten, 2022). Previous authors have noted that QCA and RE share a number of common philosophical assumptions, making QCA a suitable method for testing mechanisms (Befani et al., 2007; Gerrits and Verweij, 2016; Rutten, 2022). However, relatively few studies have combined QCA and RE (Befani et al., 2007; Goicolea et al., 2015; Warren et al., 2022), particularly when it comes to applying large-N QCA to test mechanisms among program participants in RE (Renmans and Pleguezuelo, 2023). To expand the range of methods available to realist evaluators, this article seeks to demonstrate:
How QCA, when combined with large-N survey data, can be used to test mechanisms within RE.
The article is structured as follows: First, the article outlines how mechanisms can be tested quantitatively in RE. Next, large-N QCA is introduced, and its potentials and limitations in relation to RE are discussed. Then, a four-step approach to combining large-N QCA and RE is outlined. After this, the case used to illustrate the approach is presented, and a simple program theory is developed. Following that, an illustrative analysis is performed, demonstrating how large-N QCA can test mechanisms as part of RE. The final section discusses how the findings from the QCA analysis can be integrated with qualitative research.
Testing mechanisms in RE
RE is a theory-driven approach to evaluation that uses program theory to explain why programs work, for whom, and in what contexts (Pawson and Tilley, 1997). Programs do not work by themselves. Instead, their outcomes are produced by mechanisms in the form of participants’ reasoning and interpretations within the specific contexts in which interventions are implemented (Pawson, 2013). The interaction between context and mechanism in producing outcomes is captured through the so-called CMO configuration (context + mechanism = outcome).
RE is a method-neutral approach. The selection of methods should be guided by the research question and the program theory under investigation. RE has predominantly relied on qualitative methods, and several approaches already exist for investigating and testing mechanisms qualitatively, such as realistic interviewing (Manzano, 2016), causation coding (Saldaña, 2013), dimensional analysis (Bonell et al., 2022), linked coding (Jackson and Kolla, 2012), and process tracing (Wauters and Beach, 2018). The reliance on qualitative methods is perhaps unsurprising, as qualitative methods are well-suited to exploring participants’ experiences and reasoning, the detailed processes that link interventions to outcomes, and the ways in which mechanisms may operate differently across different contexts (Bonell et al., 2022; Maxwell, 2021). For the same reasons, qualitative evidence is indispensable when conducting RE.
However, there has been increasing attention on how quantitative methods can complement qualitative evidence in RE. A quantitative test of mechanisms builds on the idea that the difference between the presence and absence of a mechanism represents a variation in itself that can be used to investigate why and when programs work (Dahler-Larsen et al., 2021). By comparing variations in the activation of mechanisms and outcomes across participants within the same program, sometimes referred to as intra-program comparison (Ravn, 2019), it becomes possible to examine whether relationships exist between mechanisms and outcomes. In addition, contextual factors can be incorporated into the analysis, enabling an exploration of when and for whom these relationships apply.
Some realists fear that a realist understanding of mechanisms could be lost when adopting a variance-oriented perspective (Marchal et al., 2013). However, according to realist philosophy, mechanisms exist objectively in nature and society, but they are not directly visible and can therefore only be accessed through indirect traces and indicators (Dalkin et al., 2015). There will always be a conceptual gap between the mechanisms identified by the researcher and the “real” mechanisms operating in the world (Williams, 2018). That is, regardless of the methods used, we are always dealing with traces of mechanisms, not the theoretical mechanisms themselves. As a result, in the article’s view, there is no inherent reason why mechanisms cannot be captured through quantitative traces and indicators as well as qualitative ones. Furthermore, recent realist literature has pointed out that it can be beneficial not to regard mechanisms as simply activated or not activated. Instead, mechanisms can be activated to a greater or lesser extent, thereby affecting the outcome of the program by acting like a “dimmer switch” (Dalkin et al., 2015). If mechanisms can be measured through quantitative indicators (such as survey questions) and operate at varying levels, then differences in the activation of mechanisms across participants in the same intervention should result in variations in outcomes, which can be measured quantitatively (Dahler-Larsen et al., 2021; Ravn, 2019). However, it is important to emphasize that in QCA, the degree of variation is interpreted in set-theoretic terms, which is different from how variation is treated in traditional quantitative methods. In QCA, variation refers to the degree to which cases belong to a set, where the set represents the concept or mechanism of interest; sets that allow such degrees of membership are commonly referred to as “fuzzy sets” (Oana et al., 2021).
Whether a case is partially or fully a member of a set depends on the conceptual boundaries the evaluator defines for the set, making sets fundamentally different from variables. For example, when measuring a mechanism through a survey, a variable-oriented approach will typically capture the variation of the mechanism on a numerical scale (e.g., ranging from 1 to 5). In contrast, a set-theoretic approach interprets the survey scores as degrees of membership in a set, based on the evaluator’s qualitative and conceptual thresholds. For instance, a score of 5 might represent full membership (1), 4 partial membership (e.g., 0.8), and 1 full non-membership (0). It is therefore essential not to equate variation with variables when applying QCA. The next section further outlines QCA and explores its potential benefits.
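The translation from survey scores to set membership can be sketched in a few lines of Python. This is a minimal illustration, not the calibration used in the evaluation: only the membership values for the scores 5, 4, and 1 come from the example above, while the values for 3 and 2, the name MEMBERSHIP_BY_SCORE, and the responses are assumptions made for illustration.

```python
# Hypothetical mapping from 5-point survey scores to fuzzy-set membership.
# Only the values for 5, 4 and 1 come from the example in the text;
# the values for 3 and 2 are illustrative assumptions.
MEMBERSHIP_BY_SCORE = {
    5: 1.0,  # full membership (from the text)
    4: 0.8,  # partial membership, more in than out (from the text)
    3: 0.4,  # assumed: partial membership, more out than in
    2: 0.2,  # assumed: mostly out of the set
    1: 0.0,  # full non-membership (from the text)
}


def calibrate(scores):
    """Translate raw survey scores into set-membership scores."""
    return [MEMBERSHIP_BY_SCORE[score] for score in scores]


raw_scores = [5, 4, 3, 1]     # made-up responses
print(calibrate(raw_scores))  # [1.0, 0.8, 0.4, 0.0]
```

The point of the sketch is that the mapping is a conceptual decision made by the evaluator: changing the dictionary changes which cases count as members, even though the raw scores stay the same.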
Large-N QCA outlined
QCA is a comparative method that applies the principles of set theory and Boolean algebra to determine which conditions are necessary or sufficient for a particular outcome (Oana et al., 2021; Rihoux and Ragin, 2009). Conditions and outcomes are conceptualized as sets within QCA, with cases being members of these sets. For instance, an unemployed person may belong to the set “good health” if they believe their health does not impede their ability to gain employment. Classifying cases into various conditions and outcomes allows QCA to analyze the relationships between them, focusing on necessary and sufficient conditions. A condition is necessary if the outcome does not occur in its absence and sufficient if the outcome always occurs in its presence.
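For crisp sets, these two definitions correspond to simple subset relations, which the following Python sketch illustrates. The case identifiers and set contents are hypothetical, and the function names are chosen for illustration only.

```python
def is_necessary(condition_cases, outcome_cases):
    """Necessary: the outcome does not occur without the condition,
    i.e. the outcome set is a subset of the condition set."""
    return outcome_cases <= condition_cases


def is_sufficient(condition_cases, outcome_cases):
    """Sufficient: the outcome always occurs when the condition is present,
    i.e. the condition set is a subset of the outcome set."""
    return condition_cases <= outcome_cases


# Hypothetical cases (labels are made up for illustration).
good_health = {"case_a", "case_b", "case_c", "case_d"}
employed = {"case_a", "case_b"}

print(is_necessary(good_health, employed))   # True: every employed case has good health
print(is_sufficient(good_health, employed))  # False: case_c and case_d are not employed
```

In empirical data such perfect subset relations are rare, which is why QCA works with degrees of consistency rather than strict true/false checks.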
QCA adopts a multiple configurational approach to causality (Berg-Schlosser et al., 2009), suggesting that outcomes are typically produced by a combination of causally relevant conditions (AB → Y). This corresponds with RE’s assumption that outcomes emerge from mechanisms activated in specific contexts (C + M → O) (Dalkin et al., 2015; Pawson and Tilley, 1997). In addition, QCA posits that different combinations of conditions can lead to the same outcome (AB OR CD → Y) and that outcomes can occur with or without a certain condition, depending on the context (AB → Y as well as aC → Y). This aligns with RE’s assumption that different mechanisms can lead to the same outcome across diverse contexts and that a single mechanism may be important in one context but not in another.
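With fuzzy sets, the Boolean operators in such expressions are conventionally computed as minimum (AND), maximum (OR), and complement (NOT). The sketch below applies these operators to a single hypothetical case; the membership scores and function names are assumptions made for illustration.

```python
def fuzzy_and(*memberships):
    """Membership in a conjunction is the weakest of its parts."""
    return min(memberships)


def fuzzy_or(*memberships):
    """Membership in a disjunction is the strongest of its parts."""
    return max(memberships)


def fuzzy_not(membership):
    """Membership in the negated set is the complement."""
    return 1.0 - membership


# Hypothetical membership scores of one case in four conditions.
case = {"A": 0.75, "B": 0.6, "C": 0.3, "D": 0.9}

ab = fuzzy_and(case["A"], case["B"])                 # A AND B
cd = fuzzy_and(case["C"], case["D"])                 # C AND D
ab_or_cd = fuzzy_or(ab, cd)                          # (A AND B) OR (C AND D)
not_a_c = fuzzy_and(fuzzy_not(case["A"]), case["C"])  # (NOT A) AND C

print(ab, cd, ab_or_cd, not_a_c)  # 0.6 0.3 0.6 0.25
```

The case is thus more in than out of the configuration AB, and therefore also of the solution "AB OR CD", but more out than in of the configuration aC.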
To analyze necessity in QCA, the evaluator examines whether the outcome occurs only when a specific condition is present. To analyze sufficiency, the evaluator creates a truth table (Oana et al., 2021). This table lists every possible combination of conditions in separate rows and shows whether those combinations are associated with the outcome. All sufficient combinations are then minimized through Boolean minimization, where any condition is removed from a combination if its presence or absence does not affect the outcome. The final result is one or more combinations of conditions (linked by the Boolean operators AND, NOT, and OR) that are sufficient for the outcome.
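The logic of the truth table and of Boolean minimization can be sketched for a small crisp-set example. The data, the condition names, and the threshold are hypothetical. The reduction shown is a single pairwise pass rather than a full minimization algorithm, and unobserved rows (logical remainders) are simply left out; dedicated QCA software handles both more carefully.

```python
from itertools import combinations, product

CONDITIONS = ("A", "B", "C")

# Hypothetical crisp-set cases: (A, B, C, outcome Y).
cases = [
    (1, 1, 1, 1), (1, 1, 1, 1),
    (1, 1, 0, 1), (1, 1, 0, 1),
    (1, 0, 1, 0),
    (0, 1, 1, 0), (0, 1, 1, 1),
    (0, 0, 0, 0),
]


def truth_table(cases, n_conditions):
    """One row per logically possible configuration:
    (number of cases, share of those cases showing the outcome)."""
    table = {}
    for config in product((0, 1), repeat=n_conditions):
        outcomes = [c[-1] for c in cases if c[:-1] == config]
        share = sum(outcomes) / len(outcomes) if outcomes else None
        table[config] = (len(outcomes), share)
    return table


def sufficient_rows(table, threshold=1.0):
    """Observed rows whose share of outcome cases meets the threshold."""
    return [cfg for cfg, (n, share) in table.items()
            if n > 0 and share >= threshold]


def merge_once(rows):
    """One pass of pairwise reduction: two rows that differ in exactly one
    condition are merged, and that condition is dropped (marked '-')."""
    merged, used = set(), set()
    for r1, r2 in combinations(rows, 2):
        diff = [i for i, (x, y) in enumerate(zip(r1, r2)) if x != y]
        if len(diff) == 1:
            i = diff[0]
            merged.add(r1[:i] + ("-",) + r1[i + 1:])
            used.update((r1, r2))
    return sorted(merged, key=str) + [r for r in rows if r not in used]


def as_expression(row):
    """Render a (possibly reduced) row as a Boolean expression."""
    terms = []
    for name, value in zip(CONDITIONS, row):
        if value == 1:
            terms.append(name)
        elif value == 0:
            terms.append(f"NOT {name}")
    return " AND ".join(terms)


rows = sufficient_rows(truth_table(cases, len(CONDITIONS)))
solution = merge_once(rows)
print([as_expression(r) for r in solution])  # ['A AND B']
```

In this made-up example, the rows "A AND B AND C" and "A AND B AND NOT C" are both sufficient, so the presence or absence of C makes no difference and the solution reduces to "A AND B".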
Combining QCA with large-N data may offer some advantages over both probabilistic regression-based methods and qualitative approaches (including small-N QCA). Compared with regression-based methods, QCA provides a more straightforward way to analyze which combinations of factors are associated with specific outcomes. While regression analyses can examine the effects of combinations of conditions through interaction terms, they often face difficulties incorporating multiple interactions within a single model, especially when dealing with a moderately large number of cases (Vis, 2012). Moreover, these interaction terms must be predetermined, making it harder to discover unexpected patterns in the data. In addition, interpreting interactions involving more than two variables can be challenging and may lead to multicollinearity issues (Braumoeller, 2004; Vis, 2012). Furthermore, while regression analyses aim to identify a single model that best fits the data, QCA enables the exploration of multiple pathways to the same outcome. For instance, QCA may identify three or more combinations of factors that are each sufficient for the same outcome. Taken together, QCA can shed light on certain aspects of the complexity of interventions in ways that are less often achievable with regression-based methods.
Turning to the qualitative camp, combining QCA with large-N survey data allows for the testing of mechanisms across a broader range of program participants than is typically possible in qualitative research. A common critique of qualitative research is its limitation in testing whether the identified mechanisms apply to all program participants. Positive outcomes for participants not included in the qualitative study might result from different mechanisms or from minimal activation of the same mechanism (Ravn, 2019). Here, large-N QCA makes it possible to evaluate whether a mechanism is consistent across a larger number of cases. If a mechanism is found to be consistently related to outcomes across cases, this strengthens the causal claim (Shan and Williamson, 2023). Conversely, if a mechanism is not related to the outcome, this might indicate that the mechanism is not relevant to the outcome, or that it only works under more specific contextual conditions (Goertz, 2017). In both cases, however, the program theory would need to be revised.
Moreover, large-N QCA can help assess the empirical relevance of mechanisms (Oana et al., 2021; Ragin, 2006), which can be more difficult to evaluate when the analysis is based on only a few cases. While there may be a consistent relationship between a mechanism and an outcome across cases, it is possible that most participants have their outcomes caused by other mechanisms not accounted for in the program theory. In such a case, the program theory would be accurate for some participants but incomplete due to unaccounted key mechanisms. Thus, large-N QCA can complement qualitative evidence by assessing the consistency and empirical relevance of mechanisms across a larger group of participants, thereby potentially improving the credibility and generalizability of the program theory.
However, there are some caveats to using large-N QCA within RE, as the method identifies the co-occurrence of conditions and outcomes without examining the detailed processes that connect them. Consequently, the understanding of how and why conditions lead to outcomes—the generative perspective—is somewhat diminished and must be reintroduced (Befani et al., 2007; Rutten, 2022). The identified combinations do not speak for themselves and need to be interpreted. As Befani et al. (2007) note, it is often only by returning to the cases that the results in QCA can be fully understood.
Combining large-N QCA and RE
The main contribution of this article is therefore to demonstrate and discuss how large-N QCA and RE can be combined, outlining a four-step approach, as shown in Figure 1. First, the relevant outcomes and conditions are chosen. In RE, this corresponds to the step where the evaluator formulates an initial program theory with hypotheses and assumptions, which are subsequently tested and refined through empirical investigation (Pawson and Tilley, 1997). As too many conditions can make theoretical interpretation difficult, most QCA studies include between three and seven conditions for explaining an outcome of interest (Oana et al., 2021). If the initial program theory contains many elements, it may be necessary to focus on the elements that are considered most critical to one’s purpose. For the same reason, it is important to have clear theoretical hypotheses about the intervention, either from existing literature or through exploratory interviews (Manzano, 2016).

Figure 1. Steps for combining large-N QCA and realistic evaluation.
Once the relevant mechanisms have been identified, the next step involves collecting and preparing the empirical data. In principle, there are no limitations on the types of data that can be used to measure mechanisms in QCA. One of the advantages of QCA is that it can handle both qualitative and quantitative data, as well as large and small datasets. However, the benefit of surveys is that they offer an efficient way to gather a large amount of information at the individual level (Renmans and Pleguezuelo, 2023). Individual-level data is particularly useful when the evaluator is interested in participants’ reasoning and outcomes. If the evaluation focuses on other levels of the intervention, other data sources may be more relevant.
One weakness of survey data is that it relies on self-reported information, which can lead to systematic biases (such as response bias, measurement bias, etc.) and issues with construct validity. Therefore, when measuring the hypothesized CMO configurations through survey questions, it is advisable to have a clear definition of the mechanisms, specifying the defining attributes that must be met before the evaluator can infer the presence of the mechanism (Jopke and Gerrits, 2019), and to draw inspiration for question formulation from existing research literature (Ravn, 2019). Whenever possible, using validated scales is recommended.
When the data is collected, it needs to be transformed into set membership scores, which determine whether and to what extent a case belongs to the conditions and outcomes. This process is referred to as calibration in QCA (Oana et al., 2021). There are different types of sets to which cases can be members. Crisp sets are binary: they only distinguish between cases that are members of the set (membership score of 1) and cases that are not (membership score of 0). However, sometimes we are also interested in the degree to which a case is a member of a set. Fuzzy sets make it possible to capture differences in degree, where the membership of a case in a set can range between 0 (full non-membership) and 1 (full membership). Whether one chooses to conduct a fuzzy-set QCA or a crisp-set QCA depends on the research objective and the data available (Rohlfing, 2020).
After cases have been assigned membership scores in the selected conditions and outcomes, the third step involves analyzing whether there are any combinations of conditions that are either necessary or sufficient for the outcomes of interest. To assess set relationships in QCA, two parameters of fit are used: consistency and coverage (Ragin, 2006). Consistency measures the proportion of cases with the same combinations of conditions that share the same outcome, indicating the degree to which a set relationship aligns with a perfect pattern of sufficiency or necessity. Values range from 0 to 1, where higher values signify more consistent set relations. Generally, conditions should have a consistency score of at least 0.9 to be counted as necessary conditions and 0.75 or higher to be counted as sufficient conditions (Fainshmidt et al., 2020; Ragin, 2006).
Coverage, by contrast, determines the empirical relevance or triviality of conditions. For necessary conditions, low coverage may imply that the condition is trivial for the outcome. For sufficient conditions, coverage reveals how much of the outcome is explained by the conditions. If consistency is low, it indicates that there is little empirical support for a necessary or sufficient relationship between the conditions and the outcome. If coverage is low, it indicates that important factors might be missing from the model (Oana et al., 2021; Rubinson et al., 2019; Warren et al., 2022).
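The two parameters of fit can be written out using the standard fuzzy-set formulas from the QCA literature (Ragin, 2006), in which the overlap between a condition X and an outcome Y is the sum of the case-wise minimum of the two membership scores. The following Python sketch is an illustration under that assumption; the membership scores are hypothetical and the function names are chosen for readability.

```python
def overlap(x, y):
    """Sum of the case-wise minimum of condition and outcome membership."""
    return sum(min(a, b) for a, b in zip(x, y))


def sufficiency_consistency(x, y):
    """Degree to which X is a subset of Y: overlap divided by sum of X."""
    return overlap(x, y) / sum(x)


def sufficiency_coverage(x, y):
    """Share of the outcome accounted for by X: overlap divided by sum of Y."""
    return overlap(x, y) / sum(y)


def necessity_consistency(x, y):
    """Degree to which Y is a subset of X: overlap divided by sum of Y."""
    return overlap(x, y) / sum(y)


def necessity_coverage(x, y):
    """Relevance of a necessary condition: overlap divided by sum of X."""
    return overlap(x, y) / sum(x)


# Hypothetical membership scores for four cases.
condition = [1.0, 0.75, 0.25, 0.0]
outcome = [1.0, 1.0, 0.5, 0.25]

print(sufficiency_consistency(condition, outcome))         # 1.0
print(round(sufficiency_coverage(condition, outcome), 3))  # 0.727
```

In this made-up example, the condition would pass the 0.75 benchmark for sufficiency but not the 0.9 benchmark for necessity, because its necessity consistency equals the sufficiency coverage of roughly 0.73.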
In the fourth and final step, the QCA results must be causally interpreted. This involves interpreting how and why specific combinations of conditions lead to the observed outcome (Oana et al., 2021; Rutten, 2022). At this step, the evaluator can use qualitative research (e.g., qualitative interviews) to provide deeper insights into the specific processes leading to certain outcomes, including the reasoning of actors, which is essential for a generative understanding (Dalkin et al., 2015; Lemire et al., 2020). Naturally, the process is not as linear as depicted in Figure 1. Both QCA and RE are iterative in nature, where new insights gained from qualitative research or theoretical reflections may necessitate revisiting and refining the initial QCA model (Befani et al., 2007; Thomas et al., 2014).
Model specification
The intervention used as an example to illustrate the proposed approach originates from the Danish Active Labor Market Policy. The illustrative analysis in the article is based on an RE of company-based internships (CBIs) and their potential to improve progression and employment for disadvantaged unemployed clients with problems besides unemployment. Previous studies have shown that CBI has positive employment effects for vulnerable client groups (Det Økonomiske Råd, 2012; Ekspertgruppen om udredning af den aktive beskæftigelsesindsats, 2015; Graversen, 2012; Rambøll and Metrica, 2018). However, there is a more limited understanding of how and why CBI is effective (Dall et al., 2023; Salado-Rasmussen and Bredgaard, 2016). Thus, a realistic approach to evaluation was adopted. The research question aimed to uncover which mechanisms were activated in which contexts and the results they produced. An initial program theory was developed based on existing literature and explorative interviews with job counselors. This helped identify potential CMO configurations that could explain how CBI operates.
Table 1 presents an overview of the conditions corresponding to the different CMO components of the program theory. The definitions of the context, mechanism, and outcome were based on existing literature (Bonell et al., 2022; Dalkin et al., 2015; Lemire et al., 2020; Nielsen et al., 2022; Pawson and Tilley, 1997). In line with this literature, mechanisms were defined as the clients’ reasoning and reactions to the intervention, leading to outcomes in specific contexts.
Table 1. Program theory components and corresponding conditions.
The overall and long-term goal of the intervention is to get clients off public benefits and into employment. However, studies have highlighted the advantages of evaluating more than just the long-term employment outcomes of job interventions (Arendt and Jacobsen, 2017; Caswell, 2017; Crépon and Van Den Berg, 2016). For vulnerable clients, securing employment can be a goal that takes a long time to achieve. Therefore, focusing on shorter-term outcomes along the way to employment can be beneficial. As a result, the analysis looks at self-reported progress (prog) and paid hours (phr) at the company. Paid hours refer to the hours during which the individual receives wages for part of their work while still being on welfare benefits.
The following were selected as the most likely active mechanisms:
Purpose: Clients who perceive a clear purpose in CBI that accords with their interests and situation are expected to take ownership and feel motivated toward the program, enhancing the chances of a positive outcome (Rambøll, 2017; Salado-Rasmussen and Bredgaard, 2016).
Competence: The experience of competence and mastering job tasks is expected to strengthen the client’s self-efficacy and skills, thereby moving them closer to the labor market (Epinion, 2017; Gammelgaard et al., 2017; Madsen et al., 2016; Rambøll and Metrica, 2018).
Belongingness: Clients’ sense of belonging and inclusion, where their contributions are valued, is also expected to strengthen the motivation and performance of the unemployed (Epinion, 2017; Gustafsson et al., 2018; Madsen et al., 2016; Rambøll and Metrica, 2018).
The three mechanisms are all consistent with the overall philosophy of the intervention (Styrelsen for Arbejdsmarked og Rekruttering, 2021) and focus on the unemployed individual’s experience and interpretation of CBI according to the understanding of mechanisms in RE. The hypothesis is that, when present, these mechanisms will likely lead to progression for unemployed individuals, moving them closer to the job market and securing paid employment. However, it is not clear how the mechanisms interact with each other or if there are contextual factors that affect their effectiveness. Based on explorative interviews with frontline workers and existing literature, the analysis includes the following two contextual conditions:
Self-assessed health: The perception of one’s own health situation is expected to have an impact on the outcome of CBI (Rosholm et al., 2017). Individuals with poor health are expected to have more difficulty experiencing progression toward the labor market than individuals with good health. To illuminate the “for whom” question, health is therefore included as a condition in the analysis.
Time at the company: The time spent at the company may also influence the effectiveness of CBI. Madsen et al. (2016), for example, suggest that it often takes time for vulnerable clients and companies to adjust to each other, which is why longer programs may be more effective than shorter ones.
Measure and calibrate CMO
The three mechanisms—purpose, competence, and belongingness—along with self-assessed health and progression, were measured in the questionnaire using an ordinal 5-point scale, where 1 indicates “not at all” and 5 indicates “to a very high degree.” Time spent at the company (time) was measured on a scale from 1 to 6, with 1 representing less than 1 month’s duration and 6 representing more than 1 year’s duration. Paid hours at the company were determined through a yes/no question. For detailed information on the specific question wording and calibration, please refer to the Supplementary Material.
In conceptualizing mechanisms, the evaluation found it useful to draw on Goertz’s (2006) understanding of concepts. According to Goertz, concepts (or mechanisms) are structured on three levels: the top level represents the central concept itself; the middle level consists of secondary but still abstract defining attributes that make up the central concept; and the bottom level concretely identifies these abstract attributes in empirical evidence. For example, “belongingness” was defined as clients’ sense of belonging to work communities where their contributions are valued. This was delineated into two defining attributes: (1) clients’ sense of belonging and (2) their sense of being valued. This conceptualization guided the wording of the survey questions, where clients were asked if they felt valued and had good relationships with their colleagues.
As stated, the focus of the analysis is on CBI in Denmark, with disadvantaged unemployed clients serving as informants. Through collaboration with nine different municipalities, the survey was distributed to clients who were participating in CBI. A total of 166 responses were received, out of which 126 were fully completed. Given that QCA faces challenges in managing missing values, the analysis relied on these 126 complete responses. The aim was thus to examine which combinations of mechanisms and contextual factors were either necessary or sufficient for the two outcomes of interest.
Regarding the assignment of fuzzy set memberships, it was assumed that clients who experienced the mechanism to a very high degree (5) or a high degree (4) could be considered as members of the mechanism (Emmenegger et al., 2014; Pappas and Woodside, 2021; Ravn, 2019). Clients who did not experience the mechanism at all (1) or to a low degree (2) were excluded from mechanism membership. Those who experienced the mechanism to some degree (3) were assessed as partial members of the mechanism, but more out than in. The same logic was applied to individuals’ self-assessed health and progression. The membership crossover point for time at the company was set at 3, indicating that individuals who had been with the company for at least 4 months were included as part of the condition “long time at the company.” The rationale was that CBIs, on average, last about 3.5 months (Jobindsats.dk, 2023). The same logic was applied for both fuzzy and crisp sets, ensuring that cases considered members of the condition in the fuzzy set were also members in the crisp set. The only difference was that the nuances regarding the degree of cases’ membership disappeared in the crisp set. When calibrating data, it is important to be mindful of skewness in case membership for conditions and outcomes, as overly skewed sets can create analytical challenges. While there is no fixed threshold for problematic skewness, a general rule of thumb is that at least 20 percent of the cases should be more “in” than “out,” and at least 20 percent of the cases should be more “out” than “in” (Oana et al., 2021).
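The relationship between the fuzzy and crisp versions of a set, and the rule of thumb for skewness, can be illustrated with a short Python sketch. It assumes that a case counts as more “in” than “out” when its membership exceeds 0.5; the membership scores and function names are hypothetical.

```python
def to_crisp(fuzzy_scores):
    """Dichotomize fuzzy scores: members are cases that are more in than out."""
    return [1 if m > 0.5 else 0 for m in fuzzy_scores]


def share_more_in(fuzzy_scores):
    """Share of cases with membership above the 0.5 crossover point."""
    return sum(m > 0.5 for m in fuzzy_scores) / len(fuzzy_scores)


def share_more_out(fuzzy_scores):
    """Share of cases with membership below the 0.5 crossover point."""
    return sum(m < 0.5 for m in fuzzy_scores) / len(fuzzy_scores)


def is_skewed(fuzzy_scores, minimum_share=0.2):
    """Rule of thumb: flag the set if fewer than 20 percent of cases
    are more in than out, or fewer than 20 percent are more out than in."""
    return (share_more_in(fuzzy_scores) < minimum_share
            or share_more_out(fuzzy_scores) < minimum_share)


# Hypothetical calibrated memberships for five cases.
memberships = [1.0, 0.8, 0.8, 0.4, 0.0]

print(to_crisp(memberships))       # [1, 1, 1, 0, 0]
print(share_more_in(memberships))  # 0.6
print(is_skewed(memberships))      # False
```

Because the crisp set is derived from the same crossover point, every case that is more in than out of the fuzzy set is also a member of the crisp set; only the degree of membership is lost.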
To illustrate the calibration logic, the mechanism “purpose” was defined as clients perceiving a clear purpose in the CBI that aligns with their interests and situation. This was measured with two survey questions: (1) whether clients were involved in setting goals for their workplace program and (2) whether the program focused on what they considered important for getting closer to a job. Only clients who strongly or very strongly agreed with both questions were included as members of the mechanism. Membership in the mechanism was well balanced, with 57 percent of the participants being more “in” than “out” of the mechanism (see the Supplementary Material).
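One way to express the requirement that both questions must be answered positively is to treat the two items as a logical AND, so that a client's membership in “purpose” equals the lower of the two item memberships. The sketch below illustrates this reading only; the actual calibration is documented in the Supplementary Material, and the membership values, names, and scores used here are assumptions.

```python
# Assumed membership values for the 5-point scale (illustrative only).
MEMBERSHIP_BY_SCORE = {5: 1.0, 4: 0.8, 3: 0.4, 2: 0.2, 1: 0.0}


def purpose_membership(goal_setting_score, relevance_score):
    """Membership in 'purpose' as the logical AND (minimum) of two items:
    involvement in goal setting and perceived relevance of the program."""
    return min(MEMBERSHIP_BY_SCORE[goal_setting_score],
               MEMBERSHIP_BY_SCORE[relevance_score])


# Hypothetical clients.
print(purpose_membership(5, 4))  # 0.8: high on both items, so a member
print(purpose_membership(5, 3))  # 0.4: one weaker answer keeps the client more out than in
```

Under this reading, a high score on one item cannot compensate for a low score on the other, which matches the idea that both defining attributes must be present before the mechanism is inferred.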
Analyze necessary and sufficient relations
In this section, the results based on the above data are presented. The analysis focuses on identifying the conditions (context and mechanisms) that are necessary and/or sufficient for individuals to experience progression (outcome 1) and to achieve paid hours at the company (outcome 2). The progression analysis was conducted as a fuzzy set analysis, while the analysis of paid hours was conducted as a crisp-set analysis, since paid hours are measured in the questionnaire via a dichotomous yes/no question (Rohlfing, 2020). Separate analyses of necessary and sufficient conditions should be conducted for both the presence and absence of the outcome (Oana et al., 2021). Due to space constraints, only the analyses for the presence of the outcome are presented here.
None of the selected conditions was necessary for obtaining paid hours at the company. Therefore, only the necessity analysis for progression is shown in Table 2. It can be observed that belongingness is the only condition necessary for vulnerable clients to experience progression, with a consistency score of 0.947. This suggests that nearly all clients who experience progression in CBI also feel a sense of belonging, highlighting it as a key ingredient in effective CBIs.
Table 2. Necessary conditions for progression.
However, it is not certain that belongingness alone determines whether clients get closer to the labor market. Therefore, the subsequent analysis examines which conditions are sufficient for progression or obtaining paid hours.
The sufficient solutions are presented in Table 3. A fuzzy-set analysis identified two sufficient combinations for clients’ self-assessed progression. In combination 1, clients experience belongingness, competence, and good health. In combination 2, clients experience purpose, belongingness, and competence and have been with the company for at least 4 months. Both combinations have a consistency score of over 0.8. Combined, the two combinations have a coverage score of 0.748, indicating that they can explain a relatively large portion of all clients who experience progression.
Table 3. Sufficient solutions for progression and paid hours.
A crisp-set analysis of paid hours revealed a single sufficient pathway (combination 3) where clients experience belongingness and competence, have been with the company for at least 4 months, and possess good health. However, the combination has a coverage score of only 0.209, indicating that it explains only a small portion of all clients who achieve paid hours. The results therefore suggest that there are additional factors, not included in the QCA model, that influence whether clients obtain paid hours. Nonetheless, the empirical evidence suggests that the combination of belongingness and competence, alongside relatively good health and a prolonged tenure at the company, is most likely to lead to paid hours.
From the sufficiency analysis, preliminary implications for the intervention emerge. Looking first at the two combinations for progression, it is noteworthy that all three mechanisms are present in at least one of the two combinations, providing empirical support for their importance. Yet, it is important to highlight that no single mechanism is sufficient on its own; outcomes always result from a combination of mechanisms and contextual conditions—a nuance that might be overlooked in traditional statistical analyses. Belongingness is common to both solutions, reinforcing the earlier finding of belongingness as a necessary and important condition for progression.
The contextual conditions differ between the two solutions. In combination 1, progression is associated with good health, irrespective of the time spent at the company. In combination 2, progression is seen in clients who are in longer CBIs (at least 4 months), regardless of their health status. This may indicate that individuals in good health integrate more swiftly into the company, both professionally and socially, whereas those without good health require more time, making the duration at the company a significant factor. Thus, longer interventions could be more beneficial for those with substantial health issues. Moreover, the presence of purpose in combination 2 suggests that longer interventions, which likely need ongoing adjustments, require a clear purpose to maintain motivation.
Looking next at the sufficient solution for paid hours (combination 3), it can be noted that competence and belongingness are again represented. Therefore, the analysis indicates that these two mechanisms may be more empirically relevant than purpose, as they appear to be more consistently involved in the various pathways to employment outcomes (Cartwright and Hardie, 2012). At the same time, only clients who have relatively good self-assessed health and have spent more time at the company achieve paid hours. This could indicate that employers may need to observe clients for a while before they are willing to offer paid hours, a point previously highlighted in the literature (Madsen et al., 2016). However, as previously mentioned, the low coverage indicates that a large share of the clients who obtained paid hours remains unexplained.
Interpreting QCA results
The combinations identified by QCA indicate potential causal relationships but do not alone establish generative causality. To reintroduce the generative perspective, one strategy is to go back to the cases (Befani et al., 2007; Rutten, 2022). Qualitative research can strengthen the interpretation of the combinations identified by QCA by examining in greater detail the processes that connect the conditions and outcomes, as well as by identifying relevant contextual factors or mechanisms that were not included in the QCA.
Based on the QCA analysis, it is possible to make strategic selections of respondents for follow-up interviews. If the goal is to investigate the identified combinations, typical cases should be selected, that is, respondents who are both members of the combination and exhibit the outcome (CM = 1, O = 1). If the aim is to improve the program theory, cases that either deviate from the combinations (CM = 1, O = 0) or are not explained by the combinations (CM = 0, O = 1) should be selected (Goertz, 2012).
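The selection logic described above can be sketched as a simple classification of respondents by their membership in a combination (CM) and in the outcome (O). The function, the case labels, and the use of 0.5 as the cut-off for being "in" a fuzzy set are assumptions made for this illustration, and the respondents are hypothetical.

```python
def classify_case(cm_membership, outcome_membership, threshold=0.5):
    """Classify a respondent for follow-up interviews.

    `threshold` is the assumed crossover point above which a respondent
    counts as a member of the combination (CM = 1) or the outcome (O = 1).
    """
    cm = cm_membership > threshold
    o = outcome_membership > threshold
    if cm and o:
        return "typical"      # CM = 1, O = 1: investigate the combination
    if cm and not o:
        return "deviant"      # CM = 1, O = 0: deviates from the combination
    if o:
        return "unexplained"  # CM = 0, O = 1: not explained by the combination
    return "irrelevant"       # CM = 0, O = 0: not informative for either aim


# Hypothetical respondents: (membership in combination, membership in outcome).
respondents = {
    "A": (0.9, 0.8),
    "B": (0.8, 0.2),
    "C": (0.1, 0.9),
    "D": (0.2, 0.1),
}

selection = {name: classify_case(cm, o) for name, (cm, o) in respondents.items()}
print(selection)
```

Typical cases would then be interviewed to trace how the combination produces the outcome, while deviant and unexplained cases point to where the program theory may need revision.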
In the evaluation, which included the QCA, interviews were subsequently conducted with clients and employers to investigate the identified combinations. The interview data were analyzed using “causation coding” (Saldaña, 2013), which maps causal chains (CODE1 → CODE2 → CODE3) in the participants’ narratives, corresponding to a context, a mechanism, and an outcome. This method helped identify how actors made implicit and explicit connections between conditions and outcomes. For instance, clients described how the mechanisms enhanced their motivation and development in the workplace. Similarly, many employers emphasized the importance of clients mastering their work tasks and socially integrating at the workplace when making hiring decisions. Insights about the reasoning of employers and clients thus helped explain why and how the mechanisms were related to the different outcomes. The qualitative interviews also pointed out key contextual factors that were not included in the original QCA. For example, employers’ engagement and job openings in the company were highlighted as essential prerequisites for CBIs to lead to employment, indicating that the recruitment of vulnerable clients also depends on structural company-related factors. These factors could be included in future studies, potentially increasing the empirical relevance of the program theory.
As the above examples illustrate, large-N QCA cannot stand alone in conducting RE but should be combined with qualitative research to shed light on the generative processes connecting the identified combinations and support the causal interpretation. The integration of QCA with qualitative data, as mentioned earlier, is based on the assumption that both qualitative and quantitative observations can serve as indirect traces or indicators of the same underlying mechanisms (Williams, 2018). However, when combining QCA with qualitative research, it is important to ensure that the qualitative and quantitative analyses capture the same theoretical mechanisms. If the QCA analysis and the qualitative analysis rely on distinct definitions of the same concept, they are essentially measuring different phenomena, making it difficult to integrate or compare their results in a meaningful way (Ahram, 2013). Therefore, to ensure conceptual consistency, it is important for the evaluator to be aware of the defining attributes of the theoretical mechanism throughout the entire evaluation.
Conclusion
This article has proposed and demonstrated the use of large-N QCA as a possible method for testing mechanisms in RE. The hypothesized mechanisms, contextual factors, and outcomes were measured and calibrated using survey questions, which made it possible to examine which CM combinations were necessary or sufficient for the outcomes across the participants. For realist researchers, this approach allows for the examination of how different factors interact in producing outcomes, thereby illuminating aspects of program complexity that are more difficult to investigate with regression-based methods. The analysis, for example, showed how the contextual conditions differed between the various sufficient combinations, providing insight into the interactions between mechanisms and context in producing outcomes.
In addition, by testing mechanisms across a larger group of participants than is possible in qualitative studies, it becomes possible to assess the consistency and empirical relevance of mechanisms, potentially increasing the credibility and generalizability of the program theory. The analysis, for instance, showed that competence and belongingness were consistently involved in the various pathways to employment outcomes, emphasizing their empirical relevance. Conversely, the included conditions could only explain the outcomes for a small portion of participants who achieved paid hours, indicating that the proposed program theory is incomplete. Future studies should explore additional factors and combinations that can explain why participants achieve paid hours at the company.
However, like all methods, the proposed approach has its limitations. From a realist perspective, a central limitation of QCA is that it does not imply generative causality but rather identifies the simultaneous presence of conditions and outcomes. To address this weakness, the article has discussed and demonstrated how a generative perspective can be reintroduced by combining QCA with qualitative research. In addition, conducting a survey is not always practical for various reasons. The program may be too small in scale for a survey to be meaningful, or the focus may be on other levels of the program rather than the participants. Therefore, this approach to investigating mechanisms is only feasible when there is a sufficient number of participants and the research focus is primarily on the participants.
It should also be emphasized that the purpose of this article has not been to criticize existing practices in RE. As mentioned, well-suited qualitative and quantitative methods for testing mechanisms already exist. Instead, the aim has been to encourage reflection on the methods we can use as part of RE. Testing mechanisms through large-N QCA is one such method and could potentially improve practice and contribute to the further development of the RE approach.
Supplemental Material
Supplemental material, sj-docx-1-evi-10.1177_13563890251317198 for Testing mechanisms using large-N qualitative comparative analysis in realistic evaluations by Esben Højmark in Evaluation
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Innovation Fund Denmark and The Danish Agency for Labor Market and Recruitment.
Ethics approval and informed consent statements
The study was approved by The Danish Agency for Labor Market and Recruitment. According to Danish legislation, neither ethics approval nor patient consent is required.
Data availability statement
Due to legal restrictions, the supporting data are not available.
Supplemental material
Supplemental material for this article is available online.
References
