Assured autonomy: How operations research powers and orchestrates generative AI systems

Abstract

Generative artificial intelligence (GenAI) is shifting from conversational assistants toward agentic systems—autonomous decision-making systems that sense, decide, and act within operational workflows. This shift creates an autonomy paradox: as GenAI systems are granted greater operational autonomy, they should, by design, embody more formal structure, more explicit constraints, and stronger tail-risk discipline. We argue that stochastic generative models can be fragile in operational domains unless paired with mechanisms that provide verifiable feasibility, robustness to distribution shift, and stress testing under high-consequence scenarios. To address this challenge, we develop a conceptual framework for assured autonomy grounded in operations research (OR), built on two complementary approaches. First, flow-based generative models frame generation as deterministic transport characterized by an ordinary differential equation, enabling auditability, constraint-aware generation, and connections to optimal transport, robust optimization, and sequential decision control. Second, operational safety is formulated through an adversarial robustness lens: decision rules are evaluated against worst-case perturbations within uncertainty or ambiguity sets, making unmodeled risks part of the design. This framework clarifies how increasing autonomy shifts OR’s role from solver to guardrail to system architect, with responsibility for control logic, incentive protocols, monitoring regimes, and safety boundaries. These elements define a research agenda for assured autonomy in safety-critical, reliability-sensitive operational domains.

Keywords

Generative AI; Autonomous Agents; Operations Research; Flow-based Generative Models

1. Introduction

Artificial intelligence (AI) is moving from advice to action. The question is no longer whether generative AI (GenAI) can draft text or write code, but whether an agent can operate—place orders, route vehicles, allocate clinical resources, balance power grids, coordinate logistics—under real constraints and uncertainty. This shift from “chatbot” to “operator” exposes a paradox that should guide the next decade of Operations Research (OR): greater autonomy demands more structure. We call this the autonomy paradox.

To address this paradox, we view assured autonomy as an organizational design problem as much as a modeling one. An autonomous system is credible only when technical choices are paired with clear decision rights and records that make actions inspectable after deployment. In that sense, model quality is necessary, but it is not by itself evidence of safe operation.

This design logic matters because autonomy delivers speed and scale, yet it also amplifies the cost of small errors, hidden constraint violations, and rare failures. In high-stakes settings, expected performance is a weak guide: low-probability regimes can dominate social cost, and “almost always safe” is not safe enough.

1.1. Operational risk and the case for OR

Even highly engineered autonomous systems exhibit edge cases that trigger recalls, investigations, and public concern. In December 2025, a U.S. regulator disclosed a Waymo recall tied to a software issue that could cause vehicles to pass stopped school buses, a rare, high-consequence scenario (Khanna, 2025). The lesson generalizes. When AI is embedded in operations, safety depends on engineered discipline: systems must be auditable and monitorable, respect hard constraints, and be stress-tested against tail events (Weick and Sutcliffe, 2015).

Meanwhile, autonomy advances quickly under controlled conditions. A late-2025 lab study shows multi-agent systems built from frontier generative models can manage the Beer Game and reduce total costs relative to human teams (Long et al., 2025b). The GenAI Beer Game provides an interactive testbed pairing a natural-language interface with an OR decision engine (Long et al., 2025a). Field failures and lab gains point to a single bottleneck: autonomy scales under decision regimes that stay feasible and stable when conditions drift and constraints bind unexpectedly.

These requirements define a design problem: build autonomous operators whose behavior stays feasible, monitorable, and stable under distribution shift and interaction. OR has spent decades developing the corresponding structure—explicit constraints, flow conservation, queueing stability, sequential decision control, and robust planning under uncertainty.

Forged during World War II to orchestrate complex military operations, OR was born in high-stakes settings and carried into infrastructure, logistics, and service systems (Flagle, 2002; Gass and Assad, 2011). Rising autonomy renews that role. In low-autonomy settings, OR acts as a solver. In medium-autonomy settings, it supplies guardrails—constraints, audits, and risk measures. In high-autonomy settings, OR becomes the architect of the operating regime: control logic, incentive protocols, monitoring rules, and safety boundaries within which fleets of agents act.

1.2. Assured autonomy as an OR problem

A defining feature of assured operational autonomy (“assured autonomy”) is that decisions are sequential, stateful, and coupled over time. Actions taken now reshape future feasibility, risk exposure, and information through delayed and nonlinear dynamics. This distinguishes operational autonomy from one-shot generative tasks and places OR’s tradition of sequential decision-making, control, and stability analysis at the center.¹ The value of a generative model is not its static realism, but whether the closed-loop system it induces remains feasible, stable, and safe over long horizons.

To formalize scope, we define operational autonomy along six measurable dimensions: action scope ( $S$ ), commitment authority ( $C$ ), action latency ( $L$ ), override structure ( $O$ ), temporal state coupling ( $U$ ), and guardrail tightness ( $G$ ). The pair $(C, O)$ makes delegation boundaries explicit by specifying how decision rights are partitioned between autonomous agents and human supervisors. We define assurance as a closed-loop deployment property requiring feasibility, robustness to shift, detectability, recoverability, and auditability over time. In our architecture, autonomy expands only when each assurance dimension is tied to a measurable signal, a pre-specified control action, and a clearly assigned escalation owner.
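The gating rule in the last sentence above can be sketched as a small data model. This is a minimal illustration under stated assumptions, not the paper's specification: the class `AssuranceBinding`, the function `may_expand_autonomy`, and the example signal, action, and owner strings are hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass
from typing import Dict

# The five assurance properties named in the text.
ASSURANCE_DIMENSIONS = (
    "feasibility",
    "robustness_to_shift",
    "detectability",
    "recoverability",
    "auditability",
)


@dataclass(frozen=True)
class AssuranceBinding:
    """Hypothetical record tying one assurance dimension to its governance hooks."""
    signal: str            # measurable runtime signal
    control_action: str    # pre-specified response when the signal trips
    escalation_owner: str  # role accountable for escalation


def may_expand_autonomy(bindings: Dict[str, AssuranceBinding]) -> bool:
    """Return True only if every assurance dimension has a complete binding."""
    for dim in ASSURANCE_DIMENSIONS:
        binding = bindings.get(dim)
        if binding is None:
            return False
        if not (binding.signal and binding.control_action and binding.escalation_owner):
            return False
    return True


# Illustrative (made-up) bindings for a dispatch agent.
bindings = {
    dim: AssuranceBinding(
        signal=f"{dim}_metric",
        control_action="tighten_action_set",
        escalation_owner="duty_supervisor",
    )
    for dim in ASSURANCE_DIMENSIONS
}
complete = may_expand_autonomy(bindings)

# Dropping a single dimension blocks expansion.
partial = {k: v for k, v in bindings.items() if k != "recoverability"}
blocked = not may_expand_autonomy(partial)
```

The design choice is that expansion is a conjunction: a missing signal, action, or owner on any one dimension is enough to withhold additional authority.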

The autonomy paradox implies two design commitments. First, the generative mechanism must be constrainable and auditable—closer to an engineered dynamical system than a black-box sampler. This motivates deterministic, flow-based generators—continuous normalizing flows, flow matching, and probability-flow ordinary differential equation (ODE) formulations—that confine randomness to a source distribution while keeping the transformation dynamics deterministic and governed by an ODE. The flow is the engine of generation; transformers provide representations, interfaces, and orchestration.

Second, operational safety should be built through adversarial design rather than post hoc filtering. Catastrophic failures in aviation, supply chains, power grids, or hospitals arise in tail regimes and under interaction; a natural abstraction is game-theoretic. A controller minimizes cost while an adversary—nature, attackers, or distribution shift—maximizes loss, linking assured performance to distributionally robust optimization and to one of OR’s historical roots: postwar game theory (Gass and Assad, 2011; Shubik, 2002).

These commitments locate OR’s leverage. Operational settings expose bottlenecks current generators often sidestep: feasibility by construction, explicit tail regimes that dominate social cost, and generation coupled to optimization so “plausible” means decision-impactful rather than surface-similar.

Assured autonomy is autonomy with guarantees: explicit invariants, tail-risk and shift stress tests, and auditable rules for monitoring, escalation, and deferral. It separates systems that look competent in routine settings from those that remain safe when conditions become novel, extreme, or adversarial. Assured autonomy aligns with core OR principles—explicit constraints, worst-case reasoning, and accountable decision rules. Autonomy must be engineered with structure, constraints, and governance, and that redesign is an OR project.

This article is conceptual and design-oriented. We specify what assured autonomy requires, diagnose why current GenAI fails under hard constraints and tail risk, and show how OR fills the gap. We synthesize recent work into an OR-powered integration stack—deterministic, transport-style generation; minimax stress testing; optimization and control; and continuous monitoring with fallback—as both blueprint and research agenda.

To make the logic explicit, we organize the paper as a layered architecture for assured autonomy. Section 2 diagnoses operational failure modes in off-the-shelf GenAI; Section 3 develops constrained generation as the representation layer; Section 4 develops minimax/DRO stress testing as the robustness layer; Section 5 formalizes the orchestration layer through monitoring, escalation, fallback, and delegation rights; and Sections 6–7 illustrate and extend the architecture in domain settings. Each layer addresses a distinct failure mode and produces artifacts needed by the next layer.

2. Why current GenAI is insufficient for assured autonomy

GenAI has made rapid gains in producing fluent text and realistic media, and looks compelling in controlled demonstrations. Operational autonomy, however, is judged by a different metric than plausibility. Most GenAI models generate high-probability continuations of observed patterns, whereas operations require actions that satisfy hard constraints, remain stable under feedback and delay, and perform acceptably under distribution shift—especially in rare, correlated regimes that dominate social cost. The failures are structural: imitation-based training does not enforce feasibility, stability, or tail robustness. This mismatch sharpens along two dimensions—non-determinism in the decision engine and safety-criticality of the domain. When both are high, residual stochastic error becomes a systematic source of tail risk. Cummings argues “GenAI is simply too dangerous to include in safety-critical systems” (Cummings, 2025). The point extends beyond weapons: when rare failure modes cannot be modeled, bounded, and monitored, delegating control to a stochastic generator increases exposure to catastrophic outcomes.

We first explain how stochastic generation complicates certification in safety-critical settings, using large language models (LLMs) and diffusion models as exemplars. We then distinguish semantic from structural constraints that define operational feasibility. Finally, we discuss tail risk, distribution shift, and accountability—the regimes where average-case performance is least informative and post hoc diagnosis is essential.

2.1. Stochastic generators and safety-critical control

LLMs make the gap concrete. An LLM produces the most plausible continuation of a prompt given its training distribution and context. For drafting, summarizing, and brainstorming, this works well. For allocating scarce resources, scheduling tightly coupled activities, or enforcing safety rules that admit no exceptions, it becomes a liability. Correctness here means satisfying feasibility and safety conditions, not producing a reasonable narrative. Experts have argued GenAI should be prohibited “to control, direct, guide or govern any weapon” until hallucinations can be modeled and predicted, because the technology remains hard to certify and bound where failures cost lives (Cummings, 2025). Where the acceptable failure rate is effectively zero, plausibility does not substitute for assurance.

Diffusion models (Chen et al., 2024; Song et al., 2021a, 2021b) illustrate a related structural misalignment. Standard diffusion destroys structure by adding noise and reconstructs it via a stochastic reverse-time process (Anderson, 1982)—fine for images and text, where intermediate states need not be meaningful. Operational systems differ: feasibility, conservation laws, stability conditions, and safety envelopes must hold along the entire trajectory, not just at the endpoint. A generator that wanders stochastically and is “corrected” after the fact is hard to certify and underrepresents tails unless training is redesigned for the rare regimes that dominate social cost.

2.2. Semantic versus structural constraints

It is tempting to impose constraints through prompting, penalty terms, or post hoc filtering. But this is brittle: prompts are not enforceable constraints, and soft penalties are not hard feasibility. Even with an external checker, the generator becomes a proposal mechanism whose outputs must be governed by a separate decision regime. Recent work on constrained learning for diffusion models is therefore best read as both progress and diagnosis: Lagrangian-style training can yield satisfaction guarantees for certain constraint classes (Khalafi et al., 2025), yet much of that literature emphasizes semantic or preference-based constraints rather than the conservation, capacity, integrality, and stability constraints that define OR problems.

Distinguishing semantic from structural constraints helps. Semantic constraints regulate what is generated (attributes, labels, preferences, fairness criteria). Structural constraints govern feasibility as a dynamical object—flow conservation, capacity limits, integrality, stability, feasibility over time—and should hold globally throughout execution. Violating a structural constraint produces not a lower-quality output but an infeasible or hard-to-certify action. Assured autonomy thus cannot rest on semantic alignment alone when generative models enter decision loops.

2.3. Tail risk, distribution shift, and accountability

The deeper problem is tail risk and distribution shift. Operational safety is rarely about being right on average; it is about not failing when the system is stressed: when demand surges, disruptions correlate, sensors degrade, or interaction produces congestion and cascades. Standard GenAI objectives reward typical-case fidelity and smooth away rare structures that robust planning should confront. OR frameworks, by contrast, have long emphasized tail events (Blanchet and Glynn, 2008), encoding them through chance constraints, ambiguity sets, or worst-case objectives.

Operations also require accountability and diagnosis. High-reliability practice depends on learning from near misses and tracing failures back to their mechanisms (Weick and Sutcliffe, 2015). Black-box generators make this difficult: when performance degrades, the cause is often unclear—data drift, prompt drift, hidden constraint violations, or a failure mode absent from training. Without auditable structure, governance becomes reactive and fragile.

Off-the-shelf GenAI is not built to enforce trajectory-level feasibility, control tail regimes that dominate social cost, or support accountability in high-reliability operations. These gaps explain why current GenAI remains unreliable as an autonomous operator in safety-critical, constraint-driven settings. The next sections examine technical directions targeting these failure modes, and Figure 1 summarizes the OR–AI integration architecture for assured autonomy.

Figure 1.

The OR–AI integration architecture for assured autonomy.

To complement Figure 1, Figure 2 maps operational weaknesses (constraint-violation risk, tail-risk blindness, and distribution shift) to autonomy regimes (low, medium, high) and assurance mechanisms. It also organizes key deployment artifacts across layers (constraint-consistent scenarios, feasible decisions, robust-risk certificates, and runtime control signals) and makes their feedback loop explicit so local arguments can be read against the full system design.

Figure 2.

Operational weaknesses, autonomy regimes, and assurance layers. The dashed loop indicates deployment-evidence feedback. Rows indicate primary emphasis, not exclusive pairings; all assurance layers may operate across autonomy regimes, with their relative importance increasing as delegated authority expands.

We use $P_{0}$ to denote the nominal distribution and $P_{adv}$ to denote the least-favorable distribution for the stress test. These symbols are used in the minimax and application sections, where $P_{adv} \in \mathcal{P} (P_{0})$ denotes governed distribution shift and $\mathcal{P} (P_{0})$ is the ambiguity set around $P_{0}$. We use GenAI for foundation generative models, an agentic system for tool-using decision stacks under guardrails, and an autonomous operator for agents with delegated commitment authority.

3. Flow-based generative models: Structured evolution in probability space

We introduce flow-based generative models as a design pattern for auditable, constrainable generation—not as a claim of empirical superiority. Flow-based models construct complex distributions by transporting probability mass through a sequence of maps, from discrete normalizing flows to continuous-time formulations based on neural ODEs and transport-based variants (Albergo et al., 2023; Chen et al., 2018; Dinh et al., 2017; Geng et al., 2025; Lipman et al., 2023; Rezende and Mohamed, 2015; Song et al., 2023; Xu et al., 2023). The transport formulation is implementation-agnostic: simple or kernel-based maps suffice in low-dimensional settings, while neural parameterizations matter when high-dimensional heterogeneity makes expressivity the bottleneck (Peyré and Cuturi, 2019). Through an OR lens, flow-based generation is an iterative algorithm in probability space (Xie and Cheng, 2026), making it natural to impose constraints and risk functionals on distributional evolution. The rest of this section formalizes continuous-time distributional dynamics, shows how deterministic transport yields structure-by-construction and auditability, and explains how the interface supports constraint- and tail-aware scenario generation.

In Figure 2, this section corresponds to the representation layer: turning raw model capability into constraint-aware, auditable scenario generation that can be governed downstream.

3.1. Continuous-time dynamics and distributional evolution

In the continuous-time setting, let $X_{t} \in \mathbb{R}^{d}$ denote the system state at time $t \in [0, T]$, with initial condition $X_{0} \sim ρ_{0}$, where $ρ_{0}$ is a source distribution (a simple reference such as Gaussian in generative use, or the empirical data distribution in analytical use; see below). A flow-based model specifies a time-dependent velocity field $v_{ϕ} : \mathbb{R}^{d} \times [0, T] \to \mathbb{R}^{d}$, parameterized by $ϕ$, and the system evolves according to the ordinary differential equation

\begin{aligned} \frac{d X_{t}}{d t} = v_{ϕ} (X_{t}, t), \quad t \in [0, T] . \end{aligned}

(1)

As particles evolve under these deterministic dynamics, the induced probability distribution evolves continuously. Let $ρ_{t} (x)$ denote the probability density of $X_{t}$. Its evolution is governed by the continuity equation

\begin{aligned} \partial_{t} ρ_{t} (x) + \nabla \cdot (ρ_{t} (x) v_{ϕ} (x, t)) = 0, \end{aligned}

(2)

which expresses conservation of probability mass along the flow.
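As a minimal worked check of equations (1) and (2), consider a one-dimensional case with the linear velocity field $v (x, t) = - x$ and a Gaussian source $ρ_{0} = N (0, σ^{2})$. Both choices are illustrative assumptions made here, not part of the framework. Solving (1) gives the particle path and the induced density, and differentiating shows that the two terms of (2) cancel:

\begin{aligned} X_{t} &= X_{0} e^{- t}, \qquad ρ_{t} = N (0, s_{t}^{2}), \qquad s_{t} = σ e^{- t}, \\ \partial_{t} ρ_{t} (x) &= ρ_{t} (x) \left( 1 - \frac{x^{2}}{s_{t}^{2}} \right), \\ \partial_{x} \left( ρ_{t} (x) \cdot (- x) \right) &= - ρ_{t} (x) \left( 1 - \frac{x^{2}}{s_{t}^{2}} \right) . \end{aligned}

The sum of the last two lines is zero, so probability mass is conserved while the distribution contracts deterministically toward the origin.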

Under mild regularity conditions, the dynamics in equation (1) are deterministic and invertible, yielding explicit transport between distributions. The velocity field $v_{ϕ}$ is typically a neural network—often residual or time-incremental—learned via variational objectives in probability space or flow-matching formulations. Gaussian references are common, but the framework is not restricted: the terminal distribution may represent structured or adversarial targets, such as least-favorable distributions in distributionally robust optimization (Xu et al., 2024).

The same transport admits two interpretations. In generative use, one draws samples from a reference distribution $Q$ and pushes them forward to obtain samples from a target distribution $P$ . In analytical or decision-oriented settings, the flow pushes data or scenarios from $P$ toward a worst-case $Q$ . Both viewpoints rely on the same transport structure and differ only in how the learned dynamics are traversed. This connects to Wasserstein gradient-flow structure and JKO-type constructions (Ambrosio et al., 2005; Cheng et al., 2024; Jordan et al., 1998), providing a variational view of training and stability aligned with OR’s optimization foundations.

From an OR perspective, equations (1) and (2) make a central duality explicit. Constraints can be imposed on probability measures—e.g., requiring $ρ_{t}$ to concentrate mass on a feasible set or control tail risk—or on individual sample paths, by designing $v_{ϕ} (\cdot, t)$ so trajectories preserve invariants or remain within a feasible region. Once the velocity field is fixed, both particle paths and induced distributional evolution are determined, yielding feasibility by construction.
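The sample-path side of this duality can be sketched for one simple invariant, conservation of a total. The code below is a minimal illustration under stated assumptions: `raw_velocity` is a made-up stand-in for a learned $v_{ϕ}$, and the invariant is enforced by projecting the velocity onto the hyperplane of zero-sum directions, so a fixed-step Euler integration keeps the component sum constant. Inequality constraints such as capacity limits would need other mechanisms and are not covered here.

```python
import math


def raw_velocity(x, t):
    """Hypothetical unconstrained velocity field (stand-in for a learned v_phi)."""
    return [math.sin(t + i) - 0.5 * xi for i, xi in enumerate(x)]


def conserving_velocity(x, t):
    """Orthogonally project the raw velocity onto {v : sum(v) = 0}.

    With a zero-sum velocity, sum(x) is invariant along the trajectory.
    """
    v = raw_velocity(x, t)
    mean_v = sum(v) / len(v)
    return [vi - mean_v for vi in v]


def integrate(x0, velocity, t_end=1.0, steps=100):
    """Fixed-step forward Euler; returns the whole trajectory for inspection."""
    dt = t_end / steps
    x = list(x0)
    path = [list(x)]
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        path.append(list(x))
    return path


x0 = [3.0, 1.0, 2.0]  # e.g. inventory split across three nodes (illustrative)
total = sum(x0)

constrained_path = integrate(x0, conserving_velocity)
raw_path = integrate(x0, raw_velocity)

# Largest deviation of the conserved total along each trajectory.
constrained_drift = max(abs(sum(s) - total) for s in constrained_path)
raw_drift = max(abs(sum(s) - total) for s in raw_path)
```

In this sketch the invariant holds at every intermediate state, up to floating-point rounding, rather than only at the endpoint, which is the trajectory-level property the text asks for.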

Diffusion-based generators can also incorporate constraints, typically through stochastic sampling with guidance, projection, or correction steps. These mechanisms work well in many applications, but make hard sample-wise certification and traceability harder. Figure 3 illustrates the contrast: deterministic-by-design transport makes auditability, replayability, and enforcement interfaces intrinsic, and these features matter disproportionately in safety-critical operational loops.

Figure 3.

Structural contrast between flow-based and diffusion-based generative models under constraints. Red curves denote a constrained feasible region. Left: flow-based models transport samples deterministically from a reference distribution $Q$ to a target distribution $P$ , allowing constraints to be incorporated directly into the transformation. Right: diffusion-based models generate samples through iterative stochastic refinement, with randomness injected at each step and constraints typically enforced through guidance or correction mechanisms.

3.2. Deterministic transport and structure-by-construction

Modern flow-based generative modeling can be understood as learning transport maps between probability distributions. Rather than treating uncertainty as static input, this perspective models system evolution as a structured transformation of distributions governed by invariants, feasibility conditions, and stability requirements. In OR, many constraints are identities, not preferences—flow conservation, capacity limits, nonnegativity, balance conditions, integrality, or temporal ordering—that hold globally. A transport-based view accommodates these: distributions evolve under dynamics designed to preserve structure by construction, rather than enforced post hoc through rejection, penalties, or repair. Generation becomes a controlled process governed by structure, not a black-box sampling procedure aimed at reproducing observed data.

Diffusion and score-based models can be interpreted within this framework but typically realize it through stochastic dynamics. Diffusion models generate samples by simulating noisy reverse-time processes (Ho et al., 2020; Song et al., 2021b), and LLMs often introduce randomness during sampling-based decoding; under fixed decoding regimes they can be reproducible, but reproducibility alone does not enforce operational feasibility (Brown et al., 2020). Stochasticity enables diversity in creative domains. In operations, it becomes a liability: inconsistent outputs, intermittent feasibility violations, and failure modes that resist diagnosis. While diffusion models admit a deterministic sampling formulation via the probability-flow ODE under idealized conditions (Song et al., 2021b), their predominant formulations emphasize stochastic sampling, with determinism playing a secondary role.

Flow-based generative models align with the transport-map view. Randomness is confined to the initial draw; generation then proceeds through deterministic dynamics governed by an ODE. This enables exact traceability and replayability: an output can be traced to a specific initial state, its atypicality quantified, and the trajectory replayed for auditing. Operationally, the deterministic transport map becomes the central object of assured autonomy, exposing the control surfaces required for feasibility enforcement, monitoring, and governance. Recent developments—flow matching, consistency models, and mean-flow formulations—decouple transport from density estimation, easing computational concerns while preserving determinism and controllability for safety-critical decision-making (Geng et al., 2025; Lipman et al., 2023; Song et al., 2023).

Deterministic transport improves replayability, debugging, and audit tracing because identical initial states and fixed solvers produce identical trajectories. By itself, however, determinism does not guarantee safety or certifiability; it can also scale misspecification faster if invariants are wrong. Assurance still depends on invariant specification, robustness to distribution shift, numerically stable integration, and runtime monitoring with fallback authority. Diffusion models can also admit deterministic probability-flow formulations; our claim is therefore architectural rather than model-family exclusive. Determinism contributes to assurance only when coupled with explicit constraints, stress testing, and governance controls.
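The replay property described above can be sketched as an audit record. This is a minimal illustration under stated assumptions: `velocity` is a made-up fixed field standing in for a trained $v_{ϕ}$, the solver is fixed-step forward Euler, and the record layout (`x0`, `steps`, `output_hash`, `source_sq_norm`) is hypothetical. Exact reproducibility here also assumes the same interpreter and floating-point environment.

```python
import hashlib
import math
import random


def velocity(x, t):
    """Hypothetical fixed velocity field (stand-in for a trained v_phi)."""
    return [math.cos(t) - xi for xi in x]


def transport(x0, steps=50, t_end=1.0):
    """Deterministic forward-Euler transport of one source sample."""
    dt = t_end / steps
    x = list(x0)
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def fingerprint(x):
    """Hash of the output, used to compare a replay against the logged run."""
    return hashlib.sha256(repr(x).encode("utf-8")).hexdigest()


def generate_with_record(seed, dim=3, steps=50):
    rng = random.Random(seed)  # randomness is confined to the source draw
    x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    out = transport(x0, steps=steps)
    record = {
        "x0": x0,
        "steps": steps,
        "output_hash": fingerprint(out),
        # Under a standard Gaussian source, a large squared norm flags an atypical draw.
        "source_sq_norm": sum(xi * xi for xi in x0),
    }
    return out, record


def replay_matches(record):
    """Re-run the transport from the logged initial state and solver settings."""
    out = transport(record["x0"], steps=record["steps"])
    return fingerprint(out) == record["output_hash"]


output, record = generate_with_record(seed=7)
replay_ok = replay_matches(record)
```

As the paragraph above notes, a matching replay establishes traceability only; it says nothing about whether the replayed trajectory was feasible or safe.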

Operational deployments rarely require training from scratch. A practical alternative is to start from a pretrained generative or representation model and adapt it via fine-tuning or lightweight updates—conditioning layers, adapters, or low-rank parameterizations (Hu et al., 2022)—using limited in-domain data and operational objectives. In the transport-map view, this warm-starts the velocity field and refines it to encode domain constraints, tail-sensitive regimes, and decision-coupled loss signals, reducing data and compute requirements while improving reliability. This is where OR enters: constraint penalties, risk functionals, and decision-layer feedback can be imposed during fine-tuning rather than deferred to inference-time correction.

4. Game-theoretic safety and robust autonomy

Constrainability is not safety. A generator that faithfully reproduces the training distribution still fails when demand spikes, sensors degrade, or an adversary probes for weaknesses. Operational autonomy is ultimately a confrontation with surprise: models improve and sensors proliferate, yet the world produces combinations outside yesterday’s training distribution. Assured autonomy therefore requires stress testing, a systematic confrontation with futures that have not yet occurred. If an autonomous system is judged by what happens when things go wrong, safety should be designed for those regimes, not appended afterward. OR has a name for this stance: robustness, formalized as a game in which the decision-maker chooses a policy and an adversary (nature, strategic opponents, or distribution shift) chooses the scenario that breaks it. Figure 4 previews this logic: the designer chooses a policy, an adversary selects a worst-case shift within a credible ambiguity set, and safety is the resulting equilibrium, not a post hoc patch.

Figure 4.

Minimax game-theoretic framework for AI safety.

In Figure 2, this is the robustness layer: stress-testing nominal plans under least-favorable shifts before autonomy is delegated and decision rights are expanded.

A coupling between generative models and decisions is the stochastic optimization problem

\begin{aligned} \min_{θ} \mathbb{E}_{x \sim P_{0}} [C (θ, x)], \end{aligned}

(3)

where $θ$ denotes a decision or controller, $x$ represents realized disturbances and inputs, $C (θ, x)$ is the induced operational loss, and $P_{0}$ is the nominal designer distribution learned or estimated from data. This captures the classical “average-case” objective but leaves systems exposed to rare, correlated, or unmodeled regimes that dominate operational risk.

A canonical robust formulation is distributionally robust optimization (DRO; see, e.g., Blanchet and Murthy, 2019; Gao and Kleywegt, 2023; Kuhn et al., 2019; Selvi et al., 2025; Shapiro, 2017; Wang et al., 2025):

\begin{aligned} \min_{θ} \max_{P_{adv} \in \mathcal{P} (P_{0})} \mathbb{E}_{x \sim P_{adv}} [C (θ, x)], \end{aligned}

(4)

where $\mathcal{P} (P_{0})$ is an ambiguity set of plausible shifts around the nominal designer distribution $P_{0}$. The designer chooses $θ$; the adversary chooses $P_{adv}$; the resulting equilibrium trades off performance against vulnerability. Equation (4) makes “what could go wrong?” operational: specify how the environment may shift, then optimize against the least favorable case.
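The contrast between the nominal problem (3) and the minimax problem (4) can be made concrete on a toy ordering decision. The sketch below is illustrative only: the newsvendor-style cost, the demand distributions `P0` and `P_surge`, and the two-element ambiguity set are assumptions made here, and a finite list of distributions is a deliberate simplification of the ambiguity sets used in the DRO literature.

```python
def cost(theta, x, overage=1.0, underage=4.0):
    """C(theta, x): penalty for ordering theta when demand turns out to be x."""
    return overage * max(theta - x, 0.0) + underage * max(x - theta, 0.0)


def expected_cost(theta, dist):
    """E_{x ~ dist}[C(theta, x)] for a finite distribution {demand: probability}."""
    return sum(p * cost(theta, x) for x, p in dist.items())


def worst_case_cost(theta, ambiguity_set):
    """Inner maximization of (4): the adversary picks the least-favorable distribution."""
    return max(expected_cost(theta, dist) for dist in ambiguity_set)


# Nominal demand model and one hypothetical surge scenario.
P0 = {80: 0.25, 100: 0.50, 120: 0.25}
P_surge = {80: 0.10, 100: 0.40, 120: 0.20, 160: 0.30}
ambiguity_set = [P0, P_surge]

candidates = range(80, 161, 5)  # candidate order quantities

# Problem (3): optimize against the nominal distribution only.
theta_nominal = min(candidates, key=lambda th: expected_cost(th, P0))

# Problem (4): optimize against the worst distribution in the ambiguity set.
theta_robust = min(candidates, key=lambda th: worst_case_cost(th, ambiguity_set))

nominal_exposure = worst_case_cost(theta_nominal, ambiguity_set)
robust_exposure = worst_case_cost(theta_robust, ambiguity_set)
```

In this toy instance the nominal decision is 120 and the robust decision is 145. The robust choice gives up some nominal performance (expected cost 45 instead of 20 under `P0`) to cut the worst-case expected cost from 60 to 47.5, which is the performance-versus-vulnerability trade-off the text describes.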

In the GenAI setting, related minimax and worst-case formulations have emerged in recent work on high-dimensional problems (Cheng et al., 2025; Xu et al., 2024). These developments reflect growing recognition that flow-based generative models can represent and learn complex high-dimensional worst-case distributions, enabling direct sample generation from such adversarial models.

4.1. Distribution-level constraints and sample-level enforcement

Operational constraint control distinguishes two classes. Semantic constraints regulate output attributes or preferences; structural constraints govern feasibility as a dynamical object—flow conservation, capacity limits, integrality, or stability—and should hold globally. In OR, structural constraints are non-negotiable identities, not preferences.

For generative models used in decision-making, this distinction motivates imposing constraints on the generated scenarios, and hence on the induced scenario distribution. Let $x \in X$ denote an operational scenario (e.g., demand trajectories, lead times, outages), drawn from $P$ , and let $h_{k} (x) \leq 0$ encode structural feasibility requirements. When generation is coupled to downstream decisions, a natural objective is

\begin{aligned} J (P) = \min_{θ \in Θ} \mathbb{E}_{x \sim P} [C (θ, x)], \end{aligned}

which links constrained scenario generation directly to the minimax formulation in equation (4).

A canonical formulation selects (or learns) a distribution $P$ within an admissible model class $\mathcal{P}$ such that feasibility holds with high probability:

\begin{aligned} \max_{P \in \mathcal{P}} J (P) \quad \text{s.t.} \quad \mathbb{P}_{x \sim P} (h_{k} (x) \leq 0) \geq 1 - ε_{k}, \quad k = 1, \dots, K . \end{aligned}

(5)

More generally, feasibility can be enforced through risk-sensitive functionals,

\begin{aligned} \max_{P \in \mathcal{P}} J (P) \quad \text{s.t.} \quad \mathbb{E}_{x \sim P} [ψ_{k} (h_{k} (x))] \leq δ_{k}, \quad k = 1, \dots, K, \end{aligned}

(6)

where $ψ_{k} (\cdot)$ encodes tail-weighted or risk-sensitive penalties such as hinge losses or Conditional Value at Risk (CVaR)-type functions. Lagrange multipliers yield a relaxation over distribution-level constraint violations,

\begin{aligned} \min_{P \in \mathcal{P}} \max_{λ_{k} \geq 0} \; - J (P) + \sum_{k = 1}^{K} λ_{k} \left( \mathbb{E}_{x \sim P} [ψ_{k} (h_{k} (x))] - δ_{k} \right), \end{aligned}

(7)

which highlights an OR-specific distinction: the $h_{k}$ represent global structural feasibility, not semantic attributes, and feasibility is enforced over generated operational scenarios rather than model parameters. In the above equations, $ε_{k}$ is the allowable violation probability in (5), $δ_{k}$ is the tolerance for the risk-sensitive constraint in (6), and $λ_{k} \geq 0$ are the Lagrange multipliers in (7).
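The constraints in (5) and (6) can be checked empirically on a batch of generated scenarios. The sketch below is a minimal illustration under stated assumptions: the peak-demand values, the capacity, and the tolerances are made-up numbers, $h (x)$ is taken to be peak demand minus capacity, and `empirical_cvar` is a simple tail-average estimator (exact for the empirical distribution when $(1 - α) n$ is an integer).

```python
import math


def violation_probability(h_values):
    """Empirical estimate of P(h(x) > 0), the complement of the event in (5)."""
    return sum(1 for h in h_values if h > 0.0) / len(h_values)


def empirical_cvar(values, alpha):
    """Average of the worst (1 - alpha) fraction of values.

    The small offset guards against floating-point error in (1 - alpha) * n.
    """
    n = len(values)
    k = max(1, math.ceil((1.0 - alpha) * n - 1e-9))
    worst = sorted(values, reverse=True)[:k]
    return sum(worst) / k


# Hypothetical generated scenarios: peak demand against a capacity of 100.
capacity = 100.0
peak_demand = [70, 82, 91, 95, 88, 99, 104, 76, 93, 112]
h_values = [d - capacity for d in peak_demand]  # h(x) <= 0 means feasible

p_violate = violation_probability(h_values)
mean_hinge = sum(max(h, 0.0) for h in h_values) / len(h_values)
tail_cvar = empirical_cvar(h_values, alpha=0.8)

chance_ok = p_violate <= 0.10  # constraint (5) with eps = 0.10 (assumed)
risk_ok = mean_hinge <= 2.0    # constraint (6) with hinge psi and delta = 2.0 (assumed)
```

On this batch the violation frequency is 0.2, so the chance constraint fails at the assumed tolerance, while the mean hinge penalty of 1.6 passes its own tolerance. The two formulations can therefore disagree on the same scenarios, which is why the choice of $ψ_{k}$ and tolerances is a design decision rather than a technicality.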

Flow-based generators have a practical advantage: distribution-level requirements can be realized at the sample level. Because flow models implement generation as a deterministic transport map, pointwise constraints—upper bounds, conservation laws, state-dependent safety conditions—can be enforced or penalized along generated trajectories. Constraints on the induced distribution can often be implemented by shaping the transport dynamics. This sample–distribution duality is natural in OR, where feasibility is defined pointwise rather than over model parameters, and it aligns with stochastic programming.

Comparable constraint control is possible in diffusion-based generators, typically through stochastic dynamics with guidance, projection, or correction during sampling. Such approaches can make hard sample-wise guarantees and traceability more delicate. The explicit transport structure of flow-based models makes constraint handling transparent and directly compatible with OR formulations of feasibility and risk.

Integrality-heavy applications require an explicit continuous-to-discrete bridge. For routing, assignment, batching, and capacity-commitment decisions, we use a hybrid design in which continuous generation proposes structured scenarios while a mixed-integer/combinatorial layer enforces discrete feasibility at execution. Actions requiring projection or repair are logged as assurance events; repeated repairs are treated as monitoring warnings that can tighten admissible action sets or trigger human review. In practice, integrality constraints define hard governance boundaries, not soft preferences.
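The continuous-to-discrete bridge with repair logging can be sketched as follows. A greedy trimming rule stands in for the mixed-integer layer, which in practice would be a solver call; the function names, the capacity constraint, and the review threshold of three repairs are hypothetical.

```python
import math
from typing import Dict, List


def commit_integer(proposal: List[float], capacity: int, log: List[Dict]) -> List[int]:
    """Round a continuous proposal and trim it until total <= capacity (capacity >= 0)."""
    rounded = [max(0, math.floor(q + 0.5)) for q in proposal]
    repaired = list(rounded)
    while sum(repaired) > capacity:
        i = max(range(len(repaired)), key=lambda j: repaired[j])
        repaired[i] -= 1  # remove one unit from the largest commitment
    if repaired != rounded:
        # A repair was needed: record it as an assurance event.
        log.append({"proposal": list(proposal), "rounded": rounded, "executed": repaired})
    return repaired


def needs_review(log: List[Dict], max_repairs: int = 3) -> bool:
    """Repeated repairs are treated as a monitoring warning."""
    return len(log) >= max_repairs


events: List[Dict] = []
plan = commit_integer([3.6, 4.4, 2.7], capacity=9, log=events)   # repaired and logged
clean = commit_integer([1.2, 2.0], capacity=9, log=events)       # feasible as rounded
```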

4.2. Minimax stress testing and least-favorable distributions

Minimax stress testing treats reliability as performance against an adversary. The idea traces to von Neumann’s zero-sum games (von Neumann and Morgenstern, 1944) and enters OR through robust optimization and robust control, converting worst-case logic into tractable prescriptions (Ben-Tal and Nemirovski, 2002). Distributionally robust optimization (DRO) places ambiguity on the data-generating process, often through moment sets, $ϕ$-divergences, or Wasserstein neighborhoods (Rahimian and Mehrotra, 2019). This raises a calibration problem: which stress classes are credible, and how much probability mass should represent misspecification?

Two implications are central. First, robustness targets more than overt attacks. Operational disasters concentrate in rare, correlated, cascading regimes that dominate social cost. High-reliability organizations institutionalize a sustained “preoccupation with failure” (Weick and Sutcliffe, 2015). Grid operators formalize it with contingency standards; aviation relies on certification culture; hospitals operationalize escalation protocols. Across domains, expected performance is a weak standard. The operational standard is avoiding catastrophic modes within defined stress classes.

Second, minimax reasoning becomes engineering only when the adversary is computable. Worst-case design requires a searchable representation. Deterministic generators supply one. When uncertainty is modeled as a transport map pushing a reference measure into operational scenarios, the adversary in equation (4) can be parameterized by that map. The problem becomes worst-case generation: the adversary explores a continuous family of distributions in $\mathcal{P}(P_{0})$ rather than a menu of hand-built stress tests. Recent work characterizing least-favorable distributions in Wasserstein space as pushforwards of transport maps provides this bridge (Cheng et al., 2025). In data-scarce regimes, the generator need only represent the ambiguity set with structure and control; fidelity to the true process is secondary. This yields decision-coupled stress testing: the generator searches for failures damaging to the deployed policy $θ$, not generic extremes.
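To make decision-coupled stress testing concrete, the sketch below moves each nominal scenario by local ascent on the policy's cost minus a quadratic transport penalty. This is a penalized relaxation of a Wasserstein constraint applied sample by sample, not the transport-map construction of Cheng et al. (2025). The newsvendor-style cost, the penalty weight `gamma`, and the step sizes are assumptions, and local ascent may miss the global worst case.

```python
from typing import List

H, P = 1.0, 4.0  # hypothetical holding and shortage costs


def cost(theta: float, x: float) -> float:
    """Newsvendor-style cost of stocking theta when demand is x."""
    return H * max(theta - x, 0.0) + P * max(x - theta, 0.0)


def cost_subgrad_x(theta: float, x: float) -> float:
    """A subgradient of the cost with respect to the scenario x."""
    return P if x >= theta else -H


def adversarial_transport(samples: List[float], theta: float, gamma: float = 0.5,
                          lr: float = 0.05, iters: int = 200) -> List[float]:
    """Move each sample by local ascent on cost(theta, x) - gamma * (x - x0)**2."""
    moved = []
    for x0 in samples:
        x = x0
        for _ in range(iters):
            x += lr * (cost_subgrad_x(theta, x) - 2.0 * gamma * (x - x0))
        moved.append(x)
    return moved


theta = 10.0
nominal = [8.0, 10.0, 12.0]
stressed = adversarial_transport(nominal, theta)
nominal_cost = sum(cost(theta, x) for x in nominal) / len(nominal)
stressed_cost = sum(cost(theta, x) for x in stressed) / len(stressed)
```

The stressed scenarios depend on `theta`: a different stocking level would send the adversary to different demand paths, which is the sense in which the stress test is coupled to the decision.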

For implementation, we recommend a six-step robust-design protocol: define invariants and mission loss; identify plausible shift classes; choose ambiguity geometry (e.g., Wasserstein, divergence-based, moment-based, or event-wise); calibrate uncertainty size on historical stress windows; tune the conservatism–performance frontier; and bind runtime triggers to robust-bound violations. Reporting nominal and robust outcomes jointly keeps the conservatism dial explicit for managerial and regulatory review and forces transparent trade-offs between false reassurance and over-conservatism.
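The calibration step can be illustrated with a one-dimensional Wasserstein geometry. For two equal-size samples on the real line, the empirical Wasserstein-1 distance equals the mean absolute difference between sorted values. The sketch below sets the ambiguity radius to the largest distance between a nominal sample and a set of historical stress windows; taking the maximum, and the data themselves, are assumptions made for illustration.

```python
from typing import List, Sequence


def wasserstein1(a: Sequence[float], b: Sequence[float]) -> float:
    """Empirical 1-D Wasserstein-1 distance between equal-size samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def calibrate_radius(nominal: Sequence[float],
                     stress_windows: List[Sequence[float]]) -> float:
    """Smallest radius whose ball around `nominal` contains every stress window."""
    return max(wasserstein1(nominal, w) for w in stress_windows)


# Hypothetical demand samples: one nominal window and two historical stress windows.
nominal_window = [1.0, 2.0, 3.0]
stress = [[2.0, 3.0, 4.0], [3.0, 5.0, 7.0]]
radius = calibrate_radius(nominal_window, stress)
```

A smaller quantile of the window distances, rather than the maximum, would give a less conservative radius; that choice is the conservatism dial the protocol asks to be reported.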

4.3. Interaction, governance, and operational escalation

Robust autonomy has an interaction dimension. Many operational settings are systems of agents—vehicles negotiating right-of-way, inventory nodes coordinating replenishment, software agents bidding in markets. Safety depends on incentives, equilibrium selection, and protocol design alongside worst-case uncertainty. OR’s game-theoretic toolkit is central: design communication and commitment rules that prevent deadlock, align local objectives with system goals, and rule out pathological equilibria. The aim is to reduce the strategic degrees of freedom through which interaction produces systemic harm.

Robustness without governance is incomplete. Deployment is not solving equation (4) once; drift, nonstationarity, and updates make control ongoing. Assured autonomy needs an explicit monitoring-and-deference policy: what is measured, what counts as “out of control,” and what action follows. OR’s sequential decision and monitoring traditions apply: escalation is itself a control policy, and fallback is a designed operating regime, not an improvised human response.

5. From solver to system architect: The evolving role of OR in assured autonomy

OR began by solving well-posed decision problems. Humans supplied judgment and executed the plan. With partial automation, OR shifted to assurance: checking feasibility, bounding actions, auditing recommendations from heuristics or learning systems. As operational autonomy rises, that separation no longer holds. OR must design the decision regime—the rules, constraints, monitoring triggers, escalation policies, and coordination protocols governing how agents act and interact. Within the layered architecture introduced in Section 1.2, this section is the orchestration layer that turns constrained generation (Section 3) and minimax stress testing (Section 4) into deployable assurance.

Table 1 summarizes the change. In decision support, AI is predictive and OR prescriptive. Optimization turns forecasts and risk scores into plans, and a human decides. In partial autonomy, the stack becomes “propose-then-certify.” A learning component proposes; OR certifies against hard constraints and risk limits, projects onto the feasible set as needed, and triggers overrides or reversion to conservative policies when monitoring signals elevated risk.

Table 1.
Evolution of OR roles with increasing AI autonomy.

Level of autonomy	AI paradigm	Role of OR
Decision support (human in charge)	Predictive analytics	Solver: provides optimal or near-optimal solutions
Human-in-the-loop automation	Stochastic generative AI	Guardrail: sets constraints, validates AI suggestions
Fully autonomous operations	Constraint-aware, auditable generative AI	Architect/legislator: designs system, rules, and objectives

5.1. Designing decision regimes for autonomous operations

In fully autonomous settings—a dark factory, an autonomous supply chain, or a smart city where signals and vehicles negotiate flows in real time—the OR task is constitutional. OR specifies the regime: admissible actions, objectives, inviolable constraints, information rights, and conflict-resolution procedures. The design object is no longer an instance, but the system producing instances and acting on them.

This shift is visible in power systems. Operators once used optimal power flow to recommend adjustments humans executed. As renewables and fast-acting grid-edge devices proliferate, control becomes automated, and OR’s emphasis moves to rules and limits: reserve requirements, frequency-response obligations, and operating envelopes that maintain reliability. OR practitioners write the “grid code” autonomous controllers must satisfy.

A parallel shift appears in automated fulfillment. In robotized warehouses, managers do not route individual robots. OR defines the operating regime: zoning policies, priority rules, congestion management, and conflict-resolution logic that prevents deadlock while preserving throughput. Optimization remains, but shapes protocols rather than selecting each move.

OR fits this role because architecture is a choice under constraints. Autonomous systems need rules that are efficient, robust, and aligned with system objectives. Mechanism design offers an analogy: choose rules so decentralized behavior yields acceptable outcomes. In autonomy, the “players” are software modules. The designer decides what each module controls, observes, and how conflicts are resolved.

5.2. Monitoring, escalation, and fallback

A second task is fallback. Every autonomous system meets regimes outside its validated envelope, and assured autonomy requires explicit fallback rules: when monitoring signals out-of-control behavior, the system contracts its action space, reverts to conservative policies, or defers to humans. OR defines the triggers and optimizes fallback behavior so safety is preserved while residual performance is retained.

Operationally, monitoring should be implemented as a signal-trigger-action authority loop. Representative signals include safety-envelope slack, detection statistics, repeated constraint-binding counts, queue-instability indicators, and near-miss precursor rates. Hard-threshold breaches tighten admissible actions; persistent breaches trigger conservative policy reversion; and sustained high-risk states require human authorization before commitment. Thresholds should be calibrated as an explicit operating trade-off among missed detections, false alarms, detection delays, and human-review workload. Each escalation action should be logged with state, trigger, action, and rationale to preserve auditability.
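A minimal sketch of such a signal-trigger-action loop follows, using safety-envelope slack as the only signal. The class name, the thresholds, and the breach counts that separate the three escalation levels are hypothetical; in practice they would be calibrated against the trade-offs listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EscalationMonitor:
    threshold: float          # minimum acceptable safety-envelope slack
    persist_after: int = 3    # consecutive breaches before policy reversion
    sustain_after: int = 6    # consecutive breaches before human authorization
    breaches: int = 0
    log: List[Dict] = field(default_factory=list)

    def step(self, t: int, slack: float) -> str:
        """Update the breach counter and return the action for this period."""
        self.breaches = self.breaches + 1 if slack < self.threshold else 0
        if self.breaches >= self.sustain_after:
            action = "require_human_authorization"
        elif self.breaches >= self.persist_after:
            action = "revert_to_conservative_policy"
        elif self.breaches >= 1:
            action = "tighten_admissible_actions"
        else:
            action = "normal"
        if action != "normal":
            # Log state, trigger, action, and rationale for auditability.
            self.log.append({"state": {"t": t, "slack": slack},
                             "trigger": f"slack < {self.threshold}",
                             "action": action,
                             "rationale": f"{self.breaches} consecutive breaches"})
        return action


monitor = EscalationMonitor(threshold=0.2)
slacks = [0.5, 0.1, 0.1, 0.1, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
actions = [monitor.step(t, s) for t, s in enumerate(slacks)]
```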

The same pattern applies to autonomous model-debugging workflows in OR. Recent solver-in-the-loop benchmarks study self-correction workflows in which an agent uses solver feedback (including irreducible-infeasible-subset (IIS) diagnostics) to identify and repair infeasibilities iteratively (Ao et al., 2026). Infeasibility serves as an escalation signal, one that comes with a traceable path back to resolution.

The legislator’s role includes compliance. External requirements—regulation, safety standards, organizational policy—must translate into executable constraints. Fairness requirements in hiring or credit are increasingly mathematical constraints. For high-stakes autonomy, compliance comes by design when constraints and auditability are built in.

This shift implies meta-level optimization: the upper level chooses objectives, loss functions, information structures, and coordination protocols; the lower-level outcome is the behavior induced by learning and interaction. The bi-level task is to choose rules so the equilibrium is stable and aligned.

5.3. OR inside the GenAI stack: Inference and resource scheduling

The OR–GenAI interaction is bidirectional. OR supplies structure and assurance when generative models enter operational loops; OR also improves the GenAI stack itself. Inference scheduling, memory and compute allocation, and latency–throughput tradeoffs in LLM serving are queueing and scheduling problems. Recent work shows OR models can guide online LLM inference scheduling under memory constraints (Ao et al., 2025). OR governs both the decision regime and the computational substrate on which agentic AI runs.

5.4. Implications for OR research and practice

Supply chain autonomy illustrates this. In Beer-Game testbeds, multi-agent GenAI systems can outperform human teams, yet performance and stability hinge on the regime—what information is shared, what constraints cap extreme actions, how costs are defined (Long et al., 2025b; Simchi-Levi et al., 2025). The GenAI Beer Game makes the regime explicit by embedding generative agents in a structured operational system, enabling study of auditability, constraint adherence, and failure modes beyond static benchmarks (Long et al., 2025a). As autonomy scales, OR moves up the ladder—from solving instances, to certifying actions, to designing the regime—calling for a broader toolkit treating learning, adaptation, and multi-agent interaction as design objects.

6. Applications across domains

The preceding sections argued that operational autonomy scales only when assured by design. The layered architecture introduced in Section 1.2—constrained generation, minimax stress testing, and orchestration—defines the design problem; this section illustrates what it implies in practice, where the autonomy paradox binds most tightly, and how OR’s role evolves as autonomy increases (Table 2). Relative to Figure 2, this section instantiates the full stack in concrete sectors and shows how regime-specific invariants, escalation rules, and delegation boundaries differ in deployment.

Table 2.
Assured autonomy across domains: What becomes autonomous, what should never be violated, and where OR provides the “assurance” layer.

Domain	Operational autonomy	Invariants & tail risks	OR assurance mechanisms
Supply chains	Multi-agent replenishment, allocation, and routing over rolling horizons	Flow balance, capacity, lead times; cascades (bullwhip) under shocks and delays	Network optimization and robust control; constraint-preserving scenario generation; adversarial stress tests
Mobility & aviation	Real-time trajectory choice with interaction among many agents	Right-of-way and separation minima; rare sensor/comm failures; interaction tail events	Constrained optimal control and reachability; certified safety envelopes; worst-case interaction/weather generators
Healthcare operations	Workflow autonomy (documentation, triage, scheduling) and selective clinical support	“Never-miss” events; case-mix shift; propagation of errors through workflows	Queueing and resource allocation with safety constraints; statistical monitoring; explicit deferral and escalation rules
Power grids	Automated dispatch, reserve management, and remedial actions at machine speed	N-1 / N-1-1 security and stability; correlated outages; extreme weather	Security-constrained OPF/UC; contingency analysis; DRO/minimax over outage distributions; real-time monitoring

6.1. Supply chains

Supply chains are a clean laboratory for the autonomy paradox: locally sensible actions, under delay and uncertainty, can generate globally unstable dynamics. The bullwhip effect formalizes this amplification analytically and experimentally in the Beer Distribution Game tradition (Lee et al., 1997; Sterman, 1989). As autonomy rises, the design object shifts from average cost reduction to closed-loop stability: will a multi-agent system remain well behaved when information is inconsistent, conditions shift, or one agent makes a rare extreme mistake?

Inventory planning is the canonical OR setting for sequential decision-making under uncertainty. Long before modern reinforcement learning, dynamic programming and stochastic inventory theory produced practical policy classes—base-stock and $(s, S)$ policies—offering a transparent lens on service levels, stability, and tail-risk trade-offs (Porteus, 2002; Zipkin, 2000). In an assured-autonomy framing, these policies define admissible control structures, provide interpretable fallback modes, and anchor evaluation when learning-based components are introduced.

The interface layer changes faster than the optimization layer. As LLMs reduce the cost of specifying objectives, constraints, and “what-if” analyses, the limiting factor becomes the reliability of the optimization primitives. Early deployments use LLMs as a natural-language layer translating planner intent into OR-based optimization and simulation workflows (Menache et al., 2025; Simchi-Levi et al., 2026), accelerating formulation rather than replacing solvers (Simchi-Levi et al., 2025). Autonomy reallocates OR effort toward admissible action spaces, invariants, and stress tests inside digital twins.

Recent evidence shows promise and limits. Multi-agent supply chains can be operated autonomously in simulation using frontier generative models (Long et al., 2025b). Yet gains hinge less on “free reasoning” than on OR design choices: explicit objectives, information policies, and hard constraints preventing destabilizing actions (e.g., budget caps or action bounds blocking panic ordering). In this evidence-to-deployment bridge, LLM-agent simulations are policy-evidence generators, not policy executors. Before simulated outputs inform real decisions, they should pass a constraints-first gate: feasibility screening, distribution-shift stress testing, and pre-specified escalation/fallback logic. Because human and LLM behaviors can diverge out of distribution, delegated decision rights should remain conditional on runtime monitoring and override readiness, not model confidence alone. Autonomy arrives when the supply chain is treated as a controlled dynamical system with admissible inputs.

Flow-based generation and minimax safety make this operational. Supply chain uncertainty spans demand, lead times, yield, capacity, and disruptions, so generators should preserve inventory balance, nonnegativity, and capacity. A deterministic flow maps a latent source into constraint-consistent scenarios, making generation auditable inside the decision loop. The minimax layer enforces tail resilience via equation (4): choose $θ$ against $P_{adv} \in \mathcal{P}(P_{0})$ over demand-and-disruption paths $x$ with cost $C(θ, x)$. What is new is computable worst cases: in Wasserstein formulations, least-favorable distributions arise as transported measures, giving a continuous adversary and decision-coupled stress tests rather than a fixed scenario menu (Cheng et al., 2025; Xu et al., 2024).

A compact two-echelon inventory example illustrates the full stack end-to-end. Let plant inventory be $I_{t}^{P}$, distribution-center on-hand inventory be $I_{t}^{D}$, backlog be $B_{t}$, production be $x_{t}$, shipment be $y_{t}$, and arrival to the distribution center be $a_{t}$. The exogenous uncertainty is $ξ = (D_{1 : T}, L_{1 : T}, K_{1 : T})$, where $D_{t}$ is stochastic demand, $L_{t}$ is shipment lead time, and $K_{t}$ is plant capacity. A learned constrained generative model produces $ξ = G_{ϕ}(z; ω)$, where $z$ is latent noise, $ω$ encodes operational context, and $ϕ$ denotes generator parameters; the model is trained so that generated demand, lead-time, and capacity paths preserve nonnegativity, temporal dependence, and admissible capacity envelopes. This induces a nominal distribution $P_{0}$ over admissible scenario paths. Given $P_{0}$, the decision layer chooses a causal replenishment policy $π \in Π$ that maps observed history into production and shipment decisions. For simplicity, treat $L_{t}$ as an integer lead time realized when the shipment leaves the plant. The resulting pathwise dynamics satisfy

\begin{aligned} I_{t + 1}^{P} = I_{t}^{P} + x_{t} - y_{t}, \quad 0 \leq x_{t} \leq K_{t}, \quad 0 \leq y_{t} \leq I_{t}^{P} + x_{t}, \end{aligned}

\begin{aligned} a_{t} &= \sum_{u = 1}^{t} y_{u} \, \mathbf{1}\{u + L_{u} = t\}, \\ I_{t + 1}^{D} - B_{t + 1} &= I_{t}^{D} - B_{t} + a_{t} - D_{t}, \end{aligned}

\begin{aligned} I_{t + 1}^{D} \geq 0, \quad B_{t + 1} \geq 0, \quad I_{t + 1}^{D} B_{t + 1} = 0. \end{aligned}
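The pathwise dynamics can be rolled forward on a single scenario as sketched below. The base-stock rule used for $x_{t}$ and $y_{t}$, the zero initial inventories, and the default cost coefficients are assumptions introduced only to make the recursion runnable; the balance equations and bounds follow the ones stated above, with periods indexed from zero.

```python
from typing import Dict, List


def simulate(demand: List[float], lead: List[int], capacity: List[float],
             s_plant: float, s_dc: float) -> List[Dict[str, float]]:
    """Roll the two-echelon dynamics forward under a hypothetical base-stock rule."""
    horizon = len(demand)
    arrivals = [0.0] * (horizon + max(lead) + 1)   # shipments indexed by arrival period
    inv_plant, net_dc, in_transit = 0.0, 0.0, 0.0  # net_dc = I^D - B
    history = []
    for t in range(horizon):
        x = min(capacity[t], max(0.0, s_plant - inv_plant))              # 0 <= x_t <= K_t
        y = min(inv_plant + x, max(0.0, s_dc - (net_dc + in_transit)))  # 0 <= y_t <= I^P_t + x_t
        inv_plant += x - y
        arrivals[t + lead[t]] += y   # shipment leaving at t lands at t + L_t
        in_transit += y
        a = arrivals[t]
        in_transit -= a
        net_dc += a - demand[t]
        history.append({"x": x, "y": y, "a": a, "I_P": inv_plant,
                        "I_D": max(net_dc, 0.0), "B": max(-net_dc, 0.0)})
    return history


def total_cost(history, c=1.0, h_p=0.5, h_d=1.0, p=5.0) -> float:
    """Production, holding, and backlog cost summed over the horizon."""
    return sum(c * r["x"] + h_p * r["I_P"] + h_d * r["I_D"] + p * r["B"] for r in history)


run = simulate(demand=[5.0, 5.0, 5.0], lead=[1, 1, 1], capacity=[10.0] * 3,
               s_plant=10.0, s_dc=10.0)
```

Tracking net inventory and splitting it into on-hand stock and backlog enforces the complementarity condition automatically, since at most one of the two can be positive.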

A stylized minimax formulation is

\min_{π \in Π} \sup_{P_{adv} \in \mathcal{P}(P_{0})} \mathbb{E}_{ξ \sim P_{adv}} \left[ \sum_{t = 1}^{T} \left( c x_{t} + h_{P} I_{t + 1}^{P} + h_{D} I_{t + 1}^{D} + p B_{t + 1} \right) \right],

where $c$, $h_{P}$, $h_{D}$, and $p$ are the unit production, plant-holding, distribution-center-holding, and backlog-penalty costs, respectively; $\mathcal{P}(P_{0})$ is an ambiguity set around $P_{0}$, for example a Wasserstein, divergence-based, or moment-based neighborhood. The service target may be imposed through a worst-case stockout constraint (with $α$ denoting the service-level target),

\sup_{P_{adv} \in \mathcal{P}(P_{0})} \mathbb{P}_{P_{adv}} (B_{t + 1} > 0) \leq 1 - α, \quad t = 1, \dots, T,

or through a worst-case tail-risk bound (with tolerance $τ$) such as

\sup_{P_{adv} \in \mathcal{P}(P_{0})} \mathrm{CVaR}_{β} (B_{t + 1}) \leq τ .
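A sample-based check of the two service constraints might look as follows. The supremum over the ambiguity set is approximated by a maximum over a finite list of candidate scenario sets, which is an assumption of this sketch, as are the backlog values and the targets ($α = 0.9$, $β = 0.8$, $τ = 5$).

```python
import math
from typing import List, Sequence


def stockout_rate(backlogs: Sequence[float]) -> float:
    """Empirical P(B > 0) over a set of backlog samples."""
    return sum(1 for b in backlogs if b > 0) / len(backlogs)


def empirical_cvar(values: Sequence[float], beta: float) -> float:
    """Mean of the worst (1 - beta) fraction of values (larger is worse)."""
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil(round((1.0 - beta) * len(ordered), 9)))
    return sum(ordered[:k]) / k


def worst_case(metric, candidate_sets: List[Sequence[float]]) -> float:
    """Finite stand-in for the sup over the ambiguity set."""
    return max(metric(s) for s in candidate_sets)


# Hypothetical backlog samples for one period under two candidate distributions.
nominal = [0, 0, 0, 0, 0, 0, 0, 0, 1, 3]
stressed = [0, 0, 0, 0, 0, 0, 0, 2, 4, 10]
worst_rate = worst_case(stockout_rate, [nominal, stressed])
worst_cvar = worst_case(lambda s: empirical_cvar(s, beta=0.8), [nominal, stressed])
service_ok = worst_rate <= 1 - 0.9   # stockout constraint with alpha = 0.9
tail_ok = worst_cvar <= 5.0          # CVaR constraint with tau = 5
```

Here both checks fail under the stressed set, which would send the policy back for redesign or tighten its admissible actions.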

In deployment, runtime monitoring tracks service slippage, repeated capacity binding, lead-time spikes, and bullwhip proxies such as rolling upstream replenishment variance relative to demand variance. Threshold breaches trigger conservative base-stock fallback; persistent breaches require human authorization under a pre-specified authority matrix.

Monitoring must be designed in. Key signals are dynamical: rising upstream order variance, inventory oscillations, repeated constraint binding. Statistical process control is built for these patterns, except the “process” now includes autonomous agents whose behavior can drift with prompts, context windows, and upstream data. A Six-Sigma approach makes this operational: control charts for decision stability (e.g., bullwhip proxies) and escalation rules triggering conservative modes or human review when the system leaves control (Montgomery, 2019). Because supply chains evolve over hours to weeks, adversarial stress tests can run continuously in a digital twin, enabling intervention before instability turns catastrophic.
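One way to compute a bullwhip proxy for such a chart is the rolling ratio of order variance to demand variance, sketched below. The fixed alarm limit is a placeholder assumption; a control chart in the sense of Montgomery (2019) would derive its limits from in-control data, and the window length and series here are hypothetical.

```python
from statistics import pvariance
from typing import List, Sequence


def bullwhip_ratio(orders: Sequence[float], demand: Sequence[float]) -> float:
    """Variance of upstream orders relative to variance of demand."""
    var_d = pvariance(demand)
    return float("inf") if var_d == 0 else pvariance(orders) / var_d


def rolling_alarms(orders: Sequence[float], demand: Sequence[float],
                   window: int, limit: float) -> List[int]:
    """Indices of periods whose trailing-window bullwhip ratio exceeds the limit."""
    alarms = []
    for end in range(window, len(demand) + 1):
        ratio = bullwhip_ratio(orders[end - window:end], demand[end - window:end])
        if ratio > limit:
            alarms.append(end - 1)
    return alarms


# Hypothetical series: orders track demand, then an agent over-corrects.
demand = [10, 12, 10, 12, 10, 12]
orders = [10, 12, 10, 12, 5, 20]
alarms = rolling_alarms(orders, demand, window=4, limit=2.0)
```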

6.2. Mobility and aviation

Transportation autonomy is often framed as a perception breakthrough. In practice, the binding constraints are coordination and regulation. The standard is compliance with bright-line safety rules where other users behave strategically, unpredictably, or simply incorrectly. When an autonomous vehicle mishandles a stopped school bus, regulators treat it as a rule violation, not a marginal performance miss—hence recalls and scrutiny (Khanna, 2025). When robotaxi programs stumble, enforcement turns on incident response, reporting, and verifiable risk controls (National Highway Traffic Safety Administration, 2024).

These features sharpen the autonomy paradox. Mobility agents act in shared space. If the generative core is stochastic—e.g., iterative denoising—it is hard to certify that separation, stopping, and right-of-way constraints hold along the entire trajectory rather than on average or after repair. Flow-based generation shifts the burden from cleanup to design. When trajectories arise from deterministic dynamics with a constraint-aware vector field, feasibility becomes an invariant of motion. This is OR’s comparative advantage: specify admissible dynamics rather than projecting infeasible samples back.

Aviation shows what assured autonomy looks like in a mature safety regime. The U.S. National Airspace System relies on decision support for time-based traffic management that schedules and meters flows through constrained airspace (Federal Aviation Administration, 2024). The logic is OR by construction: demand–capacity balancing, network constraints, scheduling with separation requirements. Generative or agentic components must remain subordinate to certified invariants. Transformers can help with intent inference, coordination messages, and explanations. The trajectory engine and supervisory layer guaranteeing separation must be auditable, stress-testable, and compatible with certification practice.

The minimax layer is equally concrete. The adversary is the coupled process creating conflict: weather restricting airspace, surveillance or communication degradation, and interaction patterns producing dense encounters. An ambiguity set can bound joint shifts in weather, demand, and sensing error. Transport-based generators offer a tractable representation, and minimax optimization supplies the forcing function: a controller safe against least-favorable distributions inside the ambiguity set supports a stronger safety case.

The key distinction from supply chains is timescale and certification. Mobility and aviation run at machine speed (milliseconds to minutes) under tight regulatory norms. Monitoring must therefore be continuous, fast, and action-guiding. In mobility, leading indicators include near-miss rates, envelope violations, and repeated rule conflicts; in aviation, loss-of-separation precursors and sustained overload in constrained sectors. Escalation should be explicit: tighten the envelope, revert to conservative policies, or hand off when signals indicate loss of control.

6.3. Healthcare operations

Healthcare is where the autonomy debate most often blurs categories. “Workflow autonomy”—drafting notes, coding visits, summarizing encounters—differs from “clinical autonomy,” where errors can harm patients. Early GenAI gains reflect this: ambient scribes reduce documentation burden and burnout (Dai et al., 2025), while even documentation automation raises governance questions about consent and trust (Lawrence et al., 2025). The lesson is visible at low stakes: deployment succeeds only with design, monitoring, and accountability.

Assurance becomes decisive when systems touch operations: emergency-department triage, inpatient bed assignment, operating-room scheduling. Here invariants are clinical and operational at once: avoid catastrophic misses, prevent unsafe delays, preserve stability during surges. Tail risk dominates because capacity is finite. Small shifts in arrivals or acuity push the system past a threshold where queues grow rapidly and errors propagate across units. “AI checks AI” is a weak safeguard. A second model often shares the same data and blind spots and does not encode hospital constraints. Assurance belongs outside the language model: statistical monitoring for drift in calibration and case mix, with OR decision logic that tightens admissible actions, triggers review, or reverts to conservative policies when signals indicate loss of control.
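The threshold behavior can be seen in a textbook M/M/c queue, used here purely as an illustrative assumption; real patient flow has non-exponential service times, priorities, and blocking. The sketch evaluates the Erlang C formula for mean queueing delay as the arrival rate approaches the capacity of a hypothetical ten-bed unit.

```python
import math


def erlang_c_wait(arrival_rate: float, service_rate: float, servers: int) -> float:
    """Mean queueing delay W_q in an M/M/c queue; infinite if the system is unstable."""
    load = arrival_rate / service_rate   # offered load a = lambda / mu
    if load >= servers:
        return math.inf
    tail = load ** servers / math.factorial(servers) * servers / (servers - load)
    head = sum(load ** k / math.factorial(k) for k in range(servers))
    p_wait = tail / (head + tail)        # Erlang C probability that an arrival waits
    return p_wait / (servers * service_rate - arrival_rate)


# Hypothetical unit: 10 beds, service rate 1 patient per bed per time unit.
beds = 10
rates = (8.0, 9.0, 9.5, 9.9)
waits = {lam: erlang_c_wait(lam, service_rate=1.0, servers=beds) for lam in rates}
```

Under these assumptions, raising arrivals from 8.0 to 9.9 per time unit, a change of under 25 percent, multiplies the mean delay by more than an order of magnitude, which is the nonlinearity that makes tail risk dominate near capacity.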

Flow-based generative models fit this assurance layer because they generate the right object: structured stress tests rather than text. Hospitals need plausible surge paths, correlated resource shortfalls, and patient-flow trajectories respecting capacity, conservation, and time ordering. Transport-based generators can produce such trajectories while preserving feasibility by construction. A minimax formulation turns stress testing into a design principle: an adversary searches within an ambiguity set for arrival, acuity, and service-time distributions that strain the system most, and the decision maker chooses staffing, bed allocation, and triage thresholds safe under least-favorable conditions.

The domain implication is selective autonomy. Administrative work and low-risk allocations can be automated; high-risk decisions require hard boundaries via formal deferral rules, triggered by monitoring statistics and calibrated to the cost of false reassurance. Assured autonomy in healthcare protects clinical attention for cases where autonomy is least appropriate.

6.4. Power grids

If any sector already lives under the logic of assured autonomy, it is the power grid. Operators do not optimize “on average.” They operate under explicit invariants: supply equals demand, flows respect thermal and voltage limits, and the system remains stable under credible contingencies. These are codified in reliability standards compelling performance under defined classes of adverse events, including sequential contingencies and stability constraints (North American Electric Reliability Corporation, 2025). Regional planning turns this mandate into routine practice through contingency analysis and reinforcement rules; N-1-1 studies apply a first contingency, allow prescribed corrective actions, then impose a second contingency (PJM Interconnection, 2025). This is minimax reasoning embedded in institutions: plan for adverse regimes, not typical days.
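The logic of N-1 and N-1-1 screening can be sketched on generation adequacy alone. This is a deliberate simplification and an assumption of the example: actual studies evaluate network power flows, voltage, and stability, not just total capacity. The unit capacities, the demand level, and the size of the corrective action are hypothetical.

```python
from itertools import permutations
from typing import Sequence


def n1_secure(caps: Sequence[float], demand: float) -> bool:
    """Demand remains covered after the loss of any single unit."""
    total = sum(caps)
    return all(total - c >= demand for c in caps)


def n11_secure(caps: Sequence[float], demand: float, corrective: float = 0.0) -> bool:
    """First outage, then a corrective capacity addition, then a second outage."""
    if not n1_secure(caps, demand):
        return False
    total = sum(caps)
    for i, j in permutations(range(len(caps)), 2):
        if total - caps[i] - caps[j] + corrective < demand:
            return False
    return True


# Hypothetical system: four 100 MW units serving 250 MW of demand.
units = [100.0, 100.0, 100.0, 100.0]
passes_n1 = n1_secure(units, demand=250.0)
passes_n11 = n11_secure(units, demand=250.0)
passes_n11_with_action = n11_secure(units, demand=250.0, corrective=50.0)
```

The system passes N-1, fails N-1-1 without intervention, and passes once a 50 MW corrective action is allowed between contingencies, mirroring the sequence described above.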

Agentic AI changes the surface of grid operations but not the governing principle. Forecasting, anomaly detection, and remedial-action recommendation can become more autonomous, increasing speed and complexity in a coupled cyber-physical system. The autonomy paradox follows: faster control demands a decision regime that is deterministic, certifiable, and constraint-preserving by construction. This is why flow-based generative modeling fits the grid. Power systems are organized around flows and dynamics; uncertain inputs—renewable ramps, weather-driven demand, correlated outages—can be represented as a deterministic transport map from a latent source to physically plausible scenario distributions respecting domain structure.

The minimax layer becomes concrete. The adversary is nature and correlation: extreme weather, common-mode failures, contingency sequences pushing toward cascading collapse. An ambiguity set can encode plausible distribution shift via Wasserstein neighborhoods around empirical distributions of weather and renewable generation. The planner chooses reserves, dispatch, and corrective actions that perform well against least-favorable distributions, consistent with the reliability paradigm. Recent work on worst-case generation in Wasserstein space constructs least-favorable distributions as pushforwards of a transport map rather than from a finite catalog of hand-crafted scenarios (Cheng et al., 2025).

The grid clarifies a governance lesson: monitoring and fallback are part of design. Power systems already have telemetry, alarms, and protection logic. Assured autonomy means aligning AI components with that discipline. Recommendations should be filtered through security constraints; deviations detected using established stability and adequacy metrics; fallback made explicit, from conservative redispatch to automated protection and, in extremis, controlled load shedding. The point is not to replace the reliability-first regime but to make agentic AI compatible with it: transformers improve interface and explanation, while deterministic flows and explicit optimization preserve invariants the grid cannot compromise.

Across the four domains, the lesson is the same. Autonomy fails when operations demand invariants, tail-risk discipline, and coordination that stochastic generation and after-the-fact checks cannot supply. Assured autonomy rests on three choices: an auditable, constraint-aware generator, often implemented through deterministic transport; minimax stress tests searching for catastrophic regimes; and monitoring with clear escalation and fallback rules. The binding constraint differs by domain—delays and feedback in supply chains, certified interaction in mobility and aviation, workflow propagation in healthcare, codified reliability under correlated contingencies in grids. The agenda is to move assurance upstream: invariants in the model, worst cases in the loop, escalation in the system.

7. Research agenda

Assured autonomy requires treating autonomous systems as engineered objects: feasible by construction, robust to distribution shift and tail events, and governable over their lifecycle. Four priorities follow: (i) feasibility by construction at the learning–optimization interface; (ii) minimax safety, worst-case generation, and verification; (iii) monitoring, handoffs, and lifecycle governance; (iv) public goods that make progress cumulative.

7.1. Feasibility by construction: Learning meets optimization

Autonomy collapses the separation between learning and optimization: the model is not merely input to a solver; the solver is part of the model. The priority is making feasibility and invariants native to learning, not imposed afterward. One route embeds optimization primitives within learning pipelines via differentiable optimization layers, enabling end-to-end training with explicit constraints (Amos and Kolter, 2017; Donti et al., 2017; Wilder et al., 2019). A second develops sequential decision methods maintaining safety at each step, not only on average. Constrained Markov decision processes provide the right abstraction, with the frontier in scaling constraint satisfaction under function approximation and partial observability (Altman, 1999; Garcia and Fernandez, 2015). The key object is closed-loop behavior: does the learned policy preserve feasibility, stability, and conservation under the shifts and interactions deployment induces? Progress is testable with operational metrics—constraint-violation rates, long-horizon stability, stress-regime performance—rather than prediction error or episodic reward alone.
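One simple way to see what "feasible by construction" means is a projection step that maps any raw model output onto the constraint set before it is released. The sketch below is a stand-in for the differentiable optimization layers cited above, not an implementation of them. It assumes a hypothetical dispatch problem with per-unit capacity limits and a demand-balance equality, and computes the Euclidean projection by bisection on the multiplier of the balance constraint. The function name and the numbers are illustrative.

```python
def project_to_dispatch(raw, capacity, demand, tol=1e-9):
    """Euclidean projection of `raw` onto
    {x : 0 <= x_i <= capacity_i for all i, sum(x) == demand}.

    The projection has the form x_i = clip(raw_i - shift, 0, capacity_i),
    where `shift` is the multiplier of the balance constraint.
    """
    if not 0.0 <= demand <= sum(capacity):
        raise ValueError("demand cannot be met within the capacity limits")

    def clipped(shift):
        return [min(max(r - shift, 0.0), c) for r, c in zip(raw, capacity)]

    # sum(clipped(shift)) is nonincreasing in shift, so bisection applies.
    lo = min(r - c for r, c in zip(raw, capacity))  # every unit at capacity
    hi = max(raw)                                   # every unit at zero
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(clipped(mid)) > demand:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return clipped(0.5 * (lo + hi))


if __name__ == "__main__":
    raw_output = [5.0, -1.0, 2.0]   # hypothetical unconstrained model output
    limits = [3.0, 3.0, 3.0]        # hypothetical unit capacities
    print(project_to_dispatch(raw_output, limits, demand=4.0))
```

Whatever the upstream model proposes, the released dispatch respects capacity limits and meets demand up to the bisection tolerance. A trained-in optimization layer goes further by letting gradients flow through this step, so that the model learns with the constraints rather than being corrected after the fact.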

7.2. Minimax safety, worst-case generation, and verification

Assured autonomy requires treating rare disasters as design inputs. This calls for objectives optimizing worst-case performance over credible ambiguity sets, connecting to robust optimization and DRO (Ben-Tal and Nemirovski, 1998, 2002; Esfahani and Kuhn, 2018; Rahimian and Mehrotra, 2019). The minimax template provides a unifying lens: the decision-maker selects a policy while an adversary selects stressors or distributions exposing failure regimes. The challenge is tractability at scale. Recent work on worst-case generation over Wasserstein space suggests one path: represent least-favorable distributions as pushforwards of transport maps, yielding continuous worst-case generators (Cheng et al., 2025). Related developments linking flow-based learning to equilibrium computation in mean-field systems point to a broader bridge between transport, control, and multi-agent autonomy (Yu et al., 2025).

Verification is equally central. It should answer concrete questions: can the system enter an unsafe state; can it violate a hard constraint; can plausible perturbations trigger cascading failure? Runtime assurance and safety-filtering ideas from control—control barrier functions and shielding—enforce hard constraints online while learning-based components operate within provably safe regions (Alshiekh et al., 2018; Ames et al., 2017). Neural network verification and formal methods provide additional building blocks (Katz et al., 2017; Tjeng et al., 2019). OR has a comparative advantage because many such tasks reduce to constrained counterexample search, where mixed-integer optimization and robust search are natural (Fischetti and Jo, 2018).
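To make the verification question concrete, the sketch below bounds the output of a small ReLU network over a box of inputs using interval bound propagation. This is a deliberately coarse substitute for the SMT and mixed-integer formulations cited above: it is sound (a certified bound always holds) but incomplete (failure to certify does not prove that a violation exists). The network weights, the input box, and the limits are hypothetical.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate elementwise input bounds through y = W x + b."""
    out_lower, out_upper = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(l, 0.0) for l in lower], [max(u, 0.0) for u in upper]


def output_bounds(layers, lower, upper):
    """Bounds on the network output; ReLU follows every layer except the last."""
    for index, (weights, bias) in enumerate(layers):
        lower, upper = affine_bounds(lower, upper, weights, bias)
        if index < len(layers) - 1:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper


def certify_upper_limit(layers, lower, upper, limit):
    """True means no input in the box can push any output above `limit`.
    False means the property was not certified, not that it is violated."""
    _, out_upper = output_bounds(layers, lower, upper)
    return all(u <= limit for u in out_upper)


if __name__ == "__main__":
    network = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]),  # hidden layer (hypothetical)
        ([[1.0, 1.0]], [0.0]),                      # output layer (hypothetical)
    ]
    box_lower, box_upper = [0.0, 0.0], [1.0, 1.0]
    print(output_bounds(network, box_lower, box_upper))
    print(certify_upper_limit(network, box_lower, box_upper, limit=2.0))
    print(certify_upper_limit(network, box_lower, box_upper, limit=1.5))
```

For this network the true maximum over the box is 1.25, yet the interval bound is 1.75, so the limit of 1.5 cannot be certified even though it holds. Closing that gap is the job of exact methods, where the search for a counterexample is posed as a mixed-integer program.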

7.3. Monitoring, handoffs, and lifecycle governance

Even well-designed autonomous systems drift: data distributions change, incentives evolve, correlated failures emerge at scale. Assured autonomy depends as much on governance as on algorithms. Monitoring should be a control function detecting when the system leaves its specification and triggering predefined responses. A useful template is Statistical Process Control (SPC): continuous measurement of performance and violation metrics, calibrated alarms, and explicit escalation rules (Montgomery, 2019; Shewhart, 1931). At its core is a sequential detection problem—identifying change-points or distributional shifts quickly (Xie et al., 2021)—emphasizing interpretability and operational relevance.
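As a minimal illustration of monitoring as sequential detection, the sketch below runs a one-sided CUSUM chart on a stream of per-period constraint-violation rates and reports the first period at which an alarm fires. CUSUM is one classical SPC procedure chosen here for illustration; the target rate, slack, threshold, and data are hypothetical and would need calibration against false-alarm requirements in practice.

```python
def cusum_alarm(observations, target, slack, threshold):
    """One-sided CUSUM for an upward shift in the mean.

    Accumulates deviations above `target + slack`, resets at zero, and
    returns the index of the first observation at which the statistic
    exceeds `threshold`, or None if no alarm is raised.
    """
    statistic = 0.0
    for index, value in enumerate(observations):
        statistic = max(0.0, statistic + (value - target - slack))
        if statistic > threshold:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical violation rates: in control for 10 periods, then a shift.
    rates = [0.01] * 10 + [0.06] * 10
    alarm = cusum_alarm(rates, target=0.01, slack=0.01, threshold=0.1)
    print("alarm at period:", alarm)
```

Raising `threshold` lowers the false-alarm rate at the price of longer detection delay, which is the same trade-off that governs escalation rules.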

Beyond detection, handoffs are the central object. Escalation and fallback decisions trade off delay against false alarms and shape the system’s safety–performance profile (Weick and Sutcliffe, 2015). The handoff rule itself is a policy to optimize: what triggers deferral, what information is transferred, and how the system learns from overrides without creating new failure modes—subject to safety, workload, and delay constraints when multiple agents and humans share a workflow.

7.4. Benchmarks and data readiness

A final constraint is infrastructural. Progress needs benchmarks small enough to iterate on yet structured enough to capture operational reality. Vision and language advanced when shared datasets made progress legible (Deng et al., 2009); RL benefited from shared environments standardizing comparison (Brockman et al., 2016). Assured autonomy needs analogous testbeds for constraint-preserving generation, tail-aware stress testing, and decision quality under shift. Metrics should be operational—regret, stability, violation rates, worst-case performance—not likelihood alone.

Digital twins offer a related opportunity, but only if structured as stress-testing environments generating rare events, shifts, and worst-case scenarios under feasibility constraints. Without that structure, twins reinforce average-case behavior rather than surface the failure modes governing safety. The data-readiness gap is the binding complement: operational datasets are abundant but rarely curated to expose constraint sets, invariants, regime shifts, and rare-event structure. Reusable datasets and testbeds are the basis for cumulative science and credible external validity.

What ties these priorities together is a vision of autonomy: feasible by construction, hardened by minimax stress testing, governed by explicit monitoring and handoff rules, accelerated by public benchmarks. Meeting that standard is the price of scaling autonomy responsibly, and OR’s opportunity to shape what autonomy becomes.

8. Conclusion

Autonomous AI is leaving the lab and entering operations—warehouses, supply chains, clinics, mobility, and finance. The gains are real; the failures are fast and sticky. As autonomy rises, structure matters more. OR will shift from optimizing within a given workflow to designing the workflow itself.

This pattern echoes earlier waves of automation. Societies did not make high-energy machines acceptable by trusting average-case performance; they embedded control, standards, and contingency procedures in the surrounding system. Financial markets became resilient not because trading algorithms grew “smarter” but because trading operated within risk limits, monitoring, and circuit breakers. Autonomy faces the same test: not whether a model can generate plausible outputs, but whether the operational system preserves nonnegotiable invariants, remains stable in tail regimes, and switches to a safe mode when conditions leave the validated envelope. Methodologically, this reframes autonomy as controlled evolution in probability space, with constraints and tail risk treated as explicit design requirements.

The practical implication: investing in models alone will disappoint. Value comes from redesigning the decision regime around the model—defining admissible actions, specifying escalation and deference rules, and institutionalizing stress testing and monitoring as operational processes. The managerial task is to decide where autonomy is appropriate, how it is bounded, and when it should yield to conservative policies or human judgment.

The research agenda is similarly concrete. Many central questions are not solved by scale: how to enforce feasibility throughout generation; how to search systematically for consequential failure regimes; how to verify safety properties without relying on model self-assessment; and how to coordinate multiple agents with aligned incentives and stable dynamics. These problems sit naturally in the OR toolkit, echoing Herbert Simon’s argument that AI and OR are stronger together than apart (Simon, 1987).

The autonomy paradox organizes this paper. Autonomy increases the value of rules that can be audited and enforced. “Assured autonomy” treats constraints, tail regimes, and monitoring as design inputs, not afterthoughts. With that discipline, autonomy scales; without it, autonomy imports unpriced risk. Assured autonomy is not a natural extension of current GenAI pipelines. It must be engineered, and OR provides the language and tools.

Assured autonomy is a system property achieved jointly through constrained generation, adversarial robustness, and runtime governance; no single component is sufficient in isolation. The OR contribution is therefore architectural: designing the coupled regime of constraints, stress tests, delegation rights, and escalation policies that keep autonomy productive while safe under shift and interaction.

Footnotes

Acknowledgments

We thank Professor Kalyan Singhal and two anonymous reviewers for the constructive and helpful review process.

ORCID iD

Tinglong Dai

Funding

The work of Y. Xie is partially supported by NSF DMS-2134037, CMMI-2112533, and the Coca-Cola Foundation.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

How to cite this article

Dai T, Simchi-Levi D, Wu MX and Xie Y (2026) Assured autonomy: How operations research powers and orchestrates Generative AI systems. Production and Operations Management x(x): 1–17.

References

Albergo, Boffi and Vanden-Eijnden (2023) Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797.

Alshiekh, Bloem, Ehlers, et al. (2018) Safe reinforcement learning via shielding. In: Proceedings of the 32nd AAAI conference on artificial intelligence (AAAI-18), pp.2669–2678.

Altman (1999) Constrained Markov decision processes. Boca Raton, FL: Chapman and Hall/CRC.

Ambrosio, Gigli and Savaré (2005) Gradient flows: in metric spaces and in the space of probability measures. Basel: Birkhäuser.

Ames, Xu, Grizzle, et al. (2017) Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control 62(8): 3861–3876. DOI: 10.1109/TAC.2016.2638961.

Amos and Kolter (2017) OptNet: Differentiable optimization as a layer in neural networks. In: Proceedings of the 34th international conference on machine learning (ICML), pp.136–145. Sydney, Australia.

Anderson (1982) Reverse-time diffusion equation models. Stochastic Processes and their Applications 12(3): 313–326. DOI: 10.1016/0304-4149(82)90051-5.

Luo, Simchi-Levi, et al. (2025) Optimizing LLM inference: Fluid-guided online scheduling with memory constraints. arXiv preprint arXiv:2504.11320.

Simchi-Levi and Wang (2026) Solver-in-the-loop: MDP-based benchmarks for self-correction and behavioral rationality in operations research. https://arxiv.org/abs/2601.21008.

Ben-Tal and Nemirovski (1998) Robust convex optimization. Mathematics of Operations Research 23(4): 769–805. DOI: 10.1287/moor.23.4.769.

Ben-Tal and Nemirovski (2002) Robust optimization – methodology and applications. Mathematical Programming 92(3): 453–480. DOI: 10.1007/s101070100286.

Blanchet and Glynn (2008) Efficient rare-event simulation for the maximum of heavy-tailed random walks. Annals of Applied Probability 18(4): 1351–1378. DOI: 10.1214/07-aap485.

Blanchet and Murthy (2019) Quantifying distributional model risk via optimal transport. Mathematics of Operations Research 44(2): 565–600. DOI: 10.1287/moor.2018.0936.

Brockman, Cheung, Pettersson, et al. (2016) OpenAI gym. arXiv preprint arXiv:1606.01540.

Brown, Mann, Ryder, et al. (2020) Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS) 33: 1877–1901.

Chen, Mei, Fan, et al. (2024) Opportunities and challenges of diffusion models for generative AI. National Science Review 11(12): nwae348. DOI: 10.1093/nsr/nwae348.

Chen, Rubanova, Bettencourt, et al. (2018) Neural ordinary differential equations. Advances in Neural Information Processing Systems (NeurIPS) 31.

Cheng, Lu, Tan, et al. (2024) Convergence of flow-based generative models via proximal gradient descent in Wasserstein space. IEEE Transactions on Information Theory 70(11): 8087–8106. DOI: 10.1109/TIT.2024.3422412.

Cheng, Xie, Zhu, et al. (2025) Worst-case generation via minimax optimization in Wasserstein space. arXiv preprint. DOI: 10.48550/arXiv.2512.08176.

Cummings (2025) Prohibiting generative AI in any form of weapon control. NeurIPS 2025, San Diego, Poster. https://neurips.cc/virtual/2025/loc/san-diego/poster/121921. Poster session: Fri, Dec 5, 2025.

Dai, Kvedar and Polsky (2025) Policy brief: Ambient AI scribes and the coding arms race. npj Digital Medicine 8(1): 780. DOI: 10.1038/s41746-025-02272-z.

Deng, Dong, Socher, et al. (2009) ImageNet: A large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition (CVPR), pp.248–255.

Dinh, Sohl-Dickstein and Bengio (2017) Density estimation using Real NVP. In: Proceedings of the 5th international conference on learning representations (ICLR 2017). Toulon, France.

Donti, Amos and Kolter (2017) Task-based end-to-end model learning in stochastic optimization. NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems: 5490–55.

Esfahani and Kuhn (2018) Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming 171(1-2): 115–166. DOI: 10.1007/s10107-017-1172-1.

Federal Aviation Administration (2024) NextGen. U.S. Federal Aviation Administration webpage. https://www.faa.gov/newsroom/nextgen.

Fischetti and Jo (2018) Deep neural networks and mixed integer linear optimization. Constraints 23(3): 296–309. DOI: 10.1007/s10601-018-9285-6.

Flagle (2002) Some origins of operations research in the health services. Operations Research 50(1): 52–60. DOI: 10.1287/opre.50.1.52.17805.

Gao and Kleywegt (2023) Distributionally robust stochastic optimization with Wasserstein distance. Mathematics of Operations Research 48(2): 603–655. DOI: 10.1287/moor.2022.1275.

Garcia and Fernandez (2015) A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16: 1437–1480.

Gass and Assad (2011) History of operations research. In: Transforming research into action, INFORMS TutORials in operations research, chapter 1, pp.1–14. Catonsville, MD: INFORMS. DOI: 10.1287/educ.1110.0084.

Geng, Deng, Bai, et al. (2025) Mean flows for one-step generative modeling. Advances in Neural Information Processing Systems (NeurIPS): 75460–75482.

Ho, Jain and Abbeel (2020) Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS) 33: 6840–6851.

Hu, Shen, Wallis, et al. (2022) LoRA: Low-rank adaptation of large language models. In: Proceedings of the 10th international conference on learning representations (ICLR 2022).

Jordan, Kinderlehrer and Otto (1998) The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis 29(1): 1–17. DOI: 10.1137/S0036141096303359.

Katz, Barrett, Dill, et al. (2017) Reluplex: An efficient SMT solver for verifying deep neural networks. In: Majumdar R and Kuncak V (eds.) Computer aided verification – 29th international conference, CAV 2017, Heidelberg, Germany, July 24–28, 2017, Proceedings, Part I, Lecture Notes in Computer Science, Vol. 10426, pp.97–117. Heidelberg, Germany: Springer. DOI: 10.1007/978-3-319-63387-9_5.

Khalafi, Hounie, Ding, et al. (2025) Composition and alignment of diffusion models using constrained learning. In: 2nd Workshop on models of human feedback for AI Alignment (MoFA), international conference on machine learning (ICML 2025). Vancouver, Canada.

Khanna (2025) Waymo recalls, updates software for over 3000 vehicles, U.S. regulator says. Reuters News report.

Kuhn, Mohajerin Esfahani, Nguyen, et al. (2019) Wasserstein distributionally robust optimization: Theory and applications in machine learning. In: Operations research & management science in the age of analytics, INFORMS TutORials in Operations Research, pp.130–166. INFORMS.

Lawrence, Kuram, Levine, et al. (2025) Informed consent for ambient documentation using generative AI in ambulatory care. JAMA Network Open 8(7): e2522400. DOI: 10.1001/jamanetworkopen.2025.22400.

Lee, Padmanabhan and Whang (1997) Information distortion in a supply chain: The bullwhip effect. Management Science 43(4): 546–558. DOI: 10.1287/mnsc.43.4.546.

Lipman, Chen RTQ, Ben-Hamu, et al. (2023) Flow matching for generative modeling. In: Proceedings of the 11th international conference on learning representations (ICLR 2023). Kigali, Rwanda: OpenReview.net.

Long, Simchi-Levi, Calmon, et al. (2025a) The GenAI beer game. https://infotheorylab.github.io/beer-game/. Interactive testbed for generative AI in supply chain operations.

Long, Simchi-Levi, Calmon, et al. (2025b) When supply chains become autonomous. Harvard Business Review. Online article.

Menache, Pathuri, Simchi-Levi, et al. (2025) How generative AI improves supply chain management. Harvard Business Review. January–February 2025 issue.

Montgomery (2019) Introduction to statistical quality control. 8th ed. Hoboken, NJ: John Wiley & Sons.

National Highway Traffic Safety Administration (2024) Consent order: Cruise; standing general order 2021-01 reporting. Consent order. https://www.nhtsa.gov/sites/nhtsa.gov/files/2024-09/cruise-consent-order-2024-web.pdf. In re: Cruise, LLC. Dated September 26, 2024; (accessed 13 December 2025).

North American Electric Reliability Corporation (2025) TPL-001-5.1 – Transmission system planning performance requirements. Reliability Standard. https://www.nerc.com/globalassets/standards/reliability-standards/tpl/tpl-001-5.1.pdf. Updated standard document.

Peyré and Cuturi (2019) Computational optimal transport with applications to data sciences. Foundations and Trends® in Machine Learning 11(5–6): 355–607. DOI: 10.1561/2200000073.

PJM Interconnection (2025) Manual 14b: PJM region transmission planning process. PJM Manual. https://www.pjm.com/-/media/DotCom/documents/manuals/m14b.ashx. See Attachment G for N-1-1 criteria and procedures.

Porteus (2002) Foundations of stochastic inventory theory. Stanford, CA: Stanford University Press.

Powell (2023) Reinforcement learning and stochastic optimization: a unified framework for sequential decisions. Hoboken, NJ: John Wiley & Sons.

Rahimian and Mehrotra (2019) Distributionally robust optimization: A review. arXiv preprint arXiv:1908.05659.

Rezende and Mohamed (2015) Variational inference with normalizing flows. In: Bach F and Blei D (eds.) Proceedings of the 32nd international conference on machine learning (ICML 2015), Proceedings of machine learning research, Vol. 37, pp.1530–1538. Lille, France.

Selvi, Liu and Wiesemann (2025) Differential privacy via distributionally robust optimization. Operations Research 74(1): 356–376. DOI: 10.1287/opre.2023.0218.

Shapiro (2017) Distributionally robust stochastic programming. SIAM Journal on Optimization 27(4): 2258–2275. DOI: 10.1137/16M1058297.

Shewhart (1931) Economic control of quality of manufactured product. New York, NY: D. Van Nostrand Company.

Shubik (2002) Game theory and operations research: Some musings 50 years later. Operations Research 50(1): 192–196. DOI: 10.1287/opre.50.1.192.17789.

Simchi-Levi, Dai, Menache, et al. (2025) Democratizing optimization with generative AI. Working Paper 5511218, Johns Hopkins University. DOI: 10.2139/ssrn.5511218.

Simchi-Levi, Mellou, Menache, et al. (2026) Large language models for supply chain decisions. In: Cohen MC and Dai T (eds.) AI in supply chains: Perspectives from global thought leaders, springer series in supply chain management, Vol. 27, pp.93–104. Springer. DOI: 10.1007/978-3-032-07054-8_7.

Simon (1987) Two heads are better than one: The collaboration between AI and OR. Interfaces 17(4): 8–15. DOI: 10.1287/inte.17.4.8.

Song, Meng and Ermon (2021a) Denoising diffusion implicit models. In: Proceedings of the 9th international conference on learning representations (ICLR 2021).

Song, Dhariwal, Chen, et al. (2023) Consistency models. In: Krause A, Brunskill E, Cho K, Engelhardt B, Sabato S and Scarlett J (eds.) Proceedings of the 40th international conference on machine learning, Proceedings of machine learning research, Vol. 202, pp.32211–32252. PMLR.

Song, Sohl-Dickstein, Kingma, et al. (2021b) Score-based generative modeling through stochastic differential equations. In: Proceedings of the 9th international conference on learning representations (ICLR 2021). DOI: 10.48550/arXiv.2011.13456.

Sterman (1989) Modeling managerial behavior: Misperceptions of feedback in a dynamic decision making experiment. Management Science 35(3): 321–339. DOI: 10.1287/mnsc.35.3.321.

Sutton and Barto (2018) Reinforcement learning: an introduction. 2nd ed. Cambridge, MA: MIT Press.

Tjeng, Xiao and Tedrake (2019) Evaluating robustness of neural networks with mixed integer programming. In: Proceedings of the 7th international conference on learning representations (ICLR 2019). DOI: 10.48550/arXiv.1711.07356.

von Neumann and Morgenstern (1944) Theory of games and economic behavior. Princeton, NJ: Princeton University Press.

Wang, Gao and Xie (2025) Sinkhorn distributionally robust optimization. Operations Research. DOI: 10.1287/opre.2023.0294. Articles in advance.

Weick and Sutcliffe (2015) Managing the unexpected: sustained performance in a complex world. 3rd ed. Hoboken, NJ: Wiley.

Wilder, Dilkina and Tambe (2019) Melding the data-decisions pipeline: Decision-focused learning for combinatorial optimization. In: Proceedings of the AAAI conference on artificial intelligence, Vol. 33, pp.1658–1665. AAAI Press. DOI: 10.1609/aaai.v33i01.33011658.

Xie, Zou, Xie, et al. (2021) Sequential (quickest) change detection: Classical results and new directions. IEEE Journal on Selected Areas in Information Theory 2(2): 494–514. DOI: 10.1109/JSAIT.2021.3072962.

Xie and Cheng (2026) Flow-based generative models as iterative algorithms in probability space. IEEE Signal Processing Magazine. DOI: 10.1109/MSP.2025.3609527.

Xu, Cheng and Xie (2023) Normalizing flow neural networks by JKO scheme. In: Advances in neural information processing systems (NeurIPS), Vol. 36, pp.47379–47405. DOI: 10.48550/arXiv.2212.14424.

Xu, Lee, Cheng, et al. (2024) Flow-based distributionally robust optimization. IEEE Journal on Selected Areas in Information Theory 5: 62–77. DOI: 10.1109/JSAIT.2024.3370699.

Yu, Lee, Xie, et al. (2025) High-dimensional mean-field games by particle-based flow matching. NeurIPS 2025 Workshop: Dynamics at the Frontiers of Optimization, Sampling, and Games. DOI: 10.48550/arXiv.2512.01172.

Zipkin (2000) Foundations of Inventory Management. McGraw-Hill.