Abstract

How far is the lack of a common lexicon or even grammar holding back advances in the field of evaluation? This question was brought into focus for me recently when reading an article in our US sister journal, the American Journal of Evaluation, which reviewed ‘capacity development’ in evaluation without considering the way the ‘institutionalisation’ of evaluation tends to be used in Europe and elsewhere as an overlapping and sometimes equivalent concept. This reflection allows for a segue towards the two opening contributions of this issue, which are both in their different ways about the institutionalisation of evaluation. But the point about language is much broader. Four of the articles in this issue explore ways of evaluating impact: by applying and usually combining approaches and methods – quantitative and qualitative; counterfactual and ‘theory-informed’ – that have often been thought of as antithetical or at least incommensurate. All these articles are interesting, yet the meaning of words and underlying concepts is not always consistent or explicated: what is meant by programme theory; how does programme or policy theory intersect with implementation; and what should we learn from methodological failure? These concerns took me back to a very useful article by my editorial colleague, Nicoletta Stame, who in 2010 wrote an ‘invited article’ distinguishing between ‘three failures’, that is, programme theory failure, implementation failure and methodology failure. This article (see Stame in Evaluation 16(4)) merits revisiting as an accompaniment to reading the contents of this issue if only because it signposts possible steps towards an improved lexicon to deconstruct and reconstruct key debates about what we call ‘impact’.
But first to ‘institutionalisation’, illuminated at an operational, real-world level in Peter Dahler-Larsen’s article and at a policy overview level by Danielle Lamarque in the text of her speech.
Much has been written about ‘evaluation systems’ in this journal and elsewhere. (See, for example, Leeuw and Furubo in Evaluation 14(2).) According to Peter Dahler-Larsen, these systems are a part of how evaluation is institutionalised within what he has called ‘the evaluation society’. The case study that Dahler-Larsen describes concerns one such evaluation system – Denmark’s national school testing system. He is especially concerned with how evaluation systems and related policies change and discusses the processes of change around the case of a Ministerial ‘advisory group’ which he himself chaired. This advisory group was intended both ‘to advise the minister on the evaluation of the national tests and give recommendations following the evaluation findings’. The article begins with a succinct introduction to ‘evaluation systems’ and the ‘evaluation society’ – a useful point of entry for any reader not familiar with these key themes in the institutionalisation of evaluation.
Dahler-Larsen centres his arguments about how evaluation systems change around ‘deliberation’, a concept in which he observes we have seen a ‘booming interest’ in policy studies and in social studies of science and innovation – especially relevant for considerations of ‘evaluation systems’ at a policy level. (Deliberation was probably first introduced into evaluation discourse by Ernie House – see Astbury, Evaluation, 22(1) 2016 – and has underpinned many discussions of evaluative practice more than discussions of evaluation policy over the last 25 years.) Dahler-Larsen discusses the advisory group, which brought together different stakeholders with different interests, in terms of a well-researched deliberative device, that of ‘mini-publics’.
A distinctive feature of this article is that it is authored by the Chair of the advisory group. He reflects from this standpoint within a well-theorised perspective on his deliberation strategies, advisory group processes and outcomes. This includes how the advisory group dealt with challenges such as ‘the relation between experts and lay participants and the interaction between evaluative-technical arguments and practical-political arguments’. If indeed ‘evaluation systems’ are as crucial to the state of evaluation – to the ‘evaluation society’ – as Dahler-Larsen and others suggest, we should all be interested in how these systems work and how they can change. The author certainly makes the case for deliberative mini-publics as one approach to system change in a democratic but also technical and professionalised field.
In a speech first delivered to a workshop of Portuguese policy makers and evaluators in Lisbon, Danielle Lamarque also discusses the topic of evaluation institutionalisation. Her focus is on what we know about the extent of institutionalisation across different countries drawing on the Evaluation Atlas – see Jacob, Speer and Furubo in Evaluation 21(1); on OECD reviews; and on her own extensive experience. These sources describe ‘discernible, if uneven, progress in many countries’ in the extent to which evaluation is institutionalised. Despite the widespread adoption of legal frameworks often reinforced by government guidelines on evaluation, Lamarque identifies several barriers to the progress of institutionalisation. These include the extent to which administrative cultures and the budgetary process adapt; whether a ‘strong evidence-informed policy system’ exists; and – most worryingly – whether sometimes the professionalisation of evaluators ‘may lead to evaluator-capture’. Lamarque points out that not all responsibility can be put on governments: institutionalisation also requires partnership with Parliaments and National Audit bodies. Indeed, all tiers of government have their weaknesses and the author suggests that these weaknesses restrict achieving different ‘objectives’ of evaluation such as improved decision-making, strengthening democracy and being able to address complexity. Danielle Lamarque also argues that one of evaluation’s failings, even when formally ‘institutionalised’, is the extent to which evaluation accommodates to ‘multi-level governance’ serving local as well as central government needs.
The next four articles in their different ways all cast light both on contemporary evaluation practice at a time when causal mechanisms and theory have moved centre-stage, and on some of the different terminologies used to discuss ‘theories of change’, ‘implementation theory’ and ‘impact’.
The first two articles exemplify what many consider to be current good practice, combining qualitative/theory-informed with quantitative/counterfactual methods. Combining different evaluation designs to assess causal inference is often advocated by evaluators wishing to demonstrate ‘impacts’, moving beyond the paradigm wars that have for so long beset evaluation. These combinations are especially favoured when policies and programmes are deemed ‘complex’. We often see counterfactual statistical models used to identify average treatment effects, leaving the heavy lifting on causal inference at a more granular level to theory-informed approaches such as Process Tracing, Realist Synthesis or Contribution Analysis. In the meantime, those interested in measuring causal effects in complex settings have over the last decade been integrating existing methods and applying new methods so as to strengthen quantitative evaluations of impact.
Deborah DiLiberto, Charles Opondo, Sarah Staedke, Clare Chandler and Elizabeth Allen deploy ‘causal mediation analysis’ (CMA) in order to ‘isolate specific mechanisms on the causal pathway between the intervention and the outcome’. It has been argued that this method is well-suited to ‘complex interventions that have several interacting components’. (For the less statistically minded, Imai et al. (2011) offer an accessible introduction to CMA.) The authors studied the PRIME public health interventions in rural Uganda that were intended ‘to improve quality of care for malaria and other febrile illness’. This was a mixed-methods approach: the programme’s intervention model and Theory of Change were used to select mediator variables and outcomes; cluster randomised controlled trials (RCTs) were conducted across 20 health centres alongside cross-sectional community surveys, exit interviews and a process evaluation. In the event, CMA as applied was able to demonstrate that PRIME had an effect on chosen ‘mediators’, for example, diagnostic capacities and health workers’ attitudes, but ‘these mediators did not lead to an improvement in community health outcomes’.
The authors reflected that perhaps the right mechanisms were not targeted; or alternatively, ‘larger health system shortfalls’ accounted for poor outcomes. More fundamentally, the ‘logic of mediation analysis’ required a ‘simplification process’ inconsistent with the way the authors conceptualised the intervention, which was ‘as multidimensional and synergistic activities implemented into dynamic and unpredictable contexts’. The methods adopted were not able to accommodate interactions between ‘mediators’ across different causal pathways or pay sufficient attention to context, so critical (as Realist Evaluators demonstrate) to understanding how causal mechanisms operate.
DiLiberto and colleagues grapple with what is a pervasive problem nowadays, especially when the policy community believes in the need to prioritise quantification. How far must we ‘decomplexify’ in order to explain using quantitative tools, and how far can such simplifications remain true to what the authors describe as an ‘interpretation of the intervention and context as emergent and synergistic’? The authors (following Byrne and Callaghan) offer one possible answer: distinguish between ‘restricted’ and ‘general’ complexity, suggesting that CMA applies best to instances of restricted complexity. This echoes John Kay and Mervyn King’s distinction between ‘resolvable’ and ‘radical’ uncertainty. (In radical uncertainty, standard decision tools that assume probability distributions and trends cease to work! See Kay and King, 2020.) This leaves us with large expanses of the policy and practice landscape where a wish for quantification comes face-to-face with epistemological as well as methodological barriers.
Gabriele Tomei describes what she and colleagues called the DOME approach to the evaluation of another complex set of interventions. These are government initiatives in Italy intended to combat educational poverty. This has led to hundreds of local projects that aim to address structural inequality limiting young people’s opportunities for ‘the development of basic skills and abilities and experimentation with psychological and relational experiences . . .’ The programme has been variously implemented by public–private partnerships, regions and local municipalities. The author argues that ‘the assessment of programmes against educational poverty in Italy constitutes a valuable laboratory for the experimentation of innovative approaches to the evaluation of complexity’. However, she also considers that the preconditions for ‘causal attribution’ along traditional lines do not exist: beneficiaries may coincide with entire populations; objectives are sometimes vague, varied or few in number – and settings are diverse. In order to accommodate these characteristics, Tomei sees the need to integrate counterfactual analysis, to identify programme effects, with Theory-Based designs, to explain how and why the programme works. However, given ‘the novelty and extreme heterogeneity’ of these programmes, ‘the Theory of Change (ToC) does not represent a solid and predefined map’. There is also a need to be ‘sensitive to the reflexive involvement of educational professionals, families and partner organisations’. This adds another element: ‘reflexive competence’ in order to understand ‘emerging strategies’ and ‘ultimately to use the results of the evaluation to stimulate renewed strategic commitment’. This strand of the DOME model shares elements with developmental evaluation (hence the D in DOME), action-research and dialogical and empowerment evaluation traditions.
However, according to the author, their approach is ‘more similar to the model of evaluation called “trailing research”’ (see Finne et al., Evaluation 1(1) 1995, the first issue of this journal).
In many ways, Tomei confronts similar challenges to Deborah DiLiberto and colleagues. However, the Italian team chose a participative and reflexive rather than a methods-route to interpret ‘an intervention and context’ that is ‘emergent and synergistic’. Perhaps this was an option also for the PRIME evaluation, although the public health interventions in Uganda were far less heterogeneous than those of the DOME evaluation. In their different ways, both of these articles exemplify evaluation’s ongoing methodological journey and ‘learning by doing’. This inevitably involves occasional ‘methodological failure’ in the face of different programme characteristics, across diverse contexts, where programme participants have varying degrees of agency.
The next two articles address programme impacts and effects in theory-informed ways explicitly foregrounding programme implementation. For a while, a refocussing on ‘impact evaluations’ seemed to be associated with reduced attention to the evaluation of implementation. One strand in the impact movement from the early 2000s questioned the usefulness of ‘process evaluations’, a main source of evidence about implementation. We now seem to be seeing a resurgence in the evaluation of implementation. However, a frequent feature of contemporary implementation evaluations, as in these two articles, is the way processes and subsequent results are linked through the identification of causal mechanisms.
Gráinne Hickey, Sinead McGilloway, Yvonne Leckey and Tracey Bywater explore an early parenting intervention – the PIN (parents and infants) programme offered to parents of children from birth to 2 years in the Republic of Ireland. Although part of a much larger study, this was a small-scale, qualitative exercise relying on in-depth interviews of those coordinating and delivering PIN. Parents will be involved at a later stage. The authors ‘aimed to articulate the theory of the PIN programme by documenting its underpinning causal mechanisms’ drawing on ‘the early stage of a detailed process evaluation’ which ‘will involve an in-depth exploration of programme implementation’. Following Weiss, Hickey and colleagues explicitly distinguish between two aspects of programme theory: ‘theory of change’ that links ‘the mechanisms of an intervention and anticipated outcomes’ and ‘implementation theory’, that is, ‘what is needed to translate objectives into service delivery and programme operation’. The authors also draw on Damschroder’s Consolidated Framework for Implementation Research (CFIR) as an input into ‘theory-building’.
This article reports from an early stage in the planned process evaluation and therefore concentrates on clarifying implementation theory and underlying causal mechanisms. More evidence-informed insights into how implementation affects results will have to wait until later in the process evaluation. However, the authors are optimistic and able to conclude ‘that their programme theory illustrates how decisions regarding administrative support and implementation planning at a macro level can help to build commitment and buy-in for implementation among programme providers involved in the micro-context of programme implementation’.
We have published quite extensively since 2020 on the importance of incorporating climate change into evaluation thinking and practice. The concluding article in this issue tries to address the question in a practical fashion by asking, ‘How should evaluations take account of climate change?’ with regard to food safety or more specifically pork safety in Vietnam’s SafePork programme. The article is co-written by a team that includes Steven Lam, Warren Dodd, Hung Nguyen-Viet, Fred Unger, Trang Le, Sinh Dang-Xuan, Kelly Skinner, Andrew Papadopoulos and Sherilee Harper. There are a number of overlaps with the previous article. For example, the authors again use Damschroder’s CFIR implementation framework, including a 2022 update, to structure context. This is seen as important given the centrality of context for implementation. This was supplemented by what the authors describe as a ‘realist lens’ in order to better understand the mechanisms and outcomes at work in these contexts. Although this is another small-scale qualitative study set within a larger implementation evaluation of the SafePork programme, it opens up significant concerns that go beyond but are clearly reinforced by climate change. For example, the authors note how multiple external shocks reinforce each other – drought combined with heat stress, Covid-19, market shocks and African swine fever all ‘compounding risks’. Resilience and adaptation to ‘shocks’ is an almost universal feature of programming nowadays that is only just beginning to be addressed. Building ‘adaptive capacity’ into programming is also a widespread concern well beyond food-safety programmes. The preoccupation with shocks and adapting to shocks may explain why evaluating implementation seems to be coming back into fashion. Evaluating the causal mechanisms that shape implementation could become central ‘for understanding how programs work, why they sometimes fail, and how they can be adapted to different settings’.
The way implementation frameworks such as Damschroder’s align with different evaluation approaches and contexts is a topic that deserves to be unpicked further. Sometimes implementation science thinking has been associated with standardised protocols and experimental methods – but not always. Service implementation is an understandable preoccupation when delivering often universal or routine services, as is often the case in healthcare, social care and education. These domains have been the forcing-ground for ‘implementation science’ thinking. How transferable are such frameworks to the evaluation of one-off, nonstandard policies and programmes – for example, area-based or R&D interventions? A strength of implementation science thinking is its emphasis on discretion and the capabilities of those on the front-line who implement programmes. As political scientists have understood for decades now, implementation processes and the exercise of discretion are at the heart of understanding policy and programme outcomes.
