Abstract
Retail recommendation systems increasingly operate as real-time decision engines that must personalize suggestions while respecting operational constraints such as inventory availability, category rules, and promotion policies. This is especially challenging in basket-based retail because transactions are set-valued and the observed checkout order is operational rather than behavioral. We study a governed agentic AI system for basket recommendation in which a Bayesian consumer world model serves as the agent’s internal state. The model maintains calibrated beliefs over latent shopping profiles and updates them online as items are observed while representing basket context in an order-invariant way. The agent then selects recommendation slates under explicit governance, combining interpretable control levers (e.g., profile-context trade-off, bounded exploration) with operational guardrails and feasibility masking (e.g., in-stock status, category, promotion eligibility). Using large-scale grocery transaction data, we evaluate the framework in both offline next-item prediction and an operations-coupled simulator with inventory and promotion dynamics. The agent achieves a higher hit rate than a nonagentic variant as well as various strong item–item baselines under a common holdout protocol. In the simulation, ranking accuracy changes little, yet the agent delivers substantial gains in revenue and inventory productivity by steering demand toward feasible complements and enabling controlled substitution when constraints bind. This highlights an operations insight: under binding feasibility, decision quality can improve through constraint-aware substitutions and inventory coupling even when conventional ranking metrics remain unchanged. Overall, the results show how basket-aware demand models can be deployed as governed, agentic policies that coordinate personalization with operational objectives.
Keywords
Introduction
Recent advances in generative artificial intelligence (GenAI) and agentic AI are shifting personalization from better predictions to autonomous decision-making at the interface of marketing and operations. Leading digital platforms, including Amazon, Walmart, and Alibaba, are already transitioning from static recommendation rules to intelligent systems that dynamically coordinate inventory and fulfillment in real time (Jassy, 2025; Law, 2025; Malik, 2026). As companies increasingly delegate pricing and recommendation to AI systems, this shift raises urgent governance questions regarding trust, control, and objective misalignment. A purely “marketing” agent may recommend unavailable items or over-rotate toward demand at the expense of margins, while a purely “operations” system may become rigid and undermine consumer engagement. The central question is therefore not only how to personalize, but how to design agentic decision systems that coordinate demand shaping and operational feasibility safely and transparently. This need is particularly acute in retail recommendation, where retailers must personalize assortments and suggestions under constraints such as inventory availability, promotion rules, category balance, and margin targets. Nevertheless, much of the recommender-systems literature remains focused on offline ranking metrics, treating operational constraints as postprocessing rather than as first-class inputs to the decision policy. In this work, we seek to make the coordination between marketing and operations explicit by treating both consumer relevance and operational feasibility as first-class in the decision policy. From the perspective of operations management (OM), our core contribution is to treat recommendation as a constrained decision-making problem, where value depends on coordinating demand with inventory and business rules rather than simply on offline predictive accuracy.
Basket-based retail further complicates the problem. Grocery and mass-merchant transactions are inherently set-valued because consumers purchase multiple items together, and the relevance of a candidate product depends not only on long-run preferences but also on what is already in the basket. Moreover, transaction data record baskets at checkout rather than the latent sequence by which intentions form, and the scan order is operational rather than behavioral. As a result, sequence-based assumptions are empirically fragile and operationally misaligned. These features pose two challenges. First, models must capture within-basket complementarities and substitutions without relying on unobserved item sequences. Second, recommendation systems must be deployable as online policies that adapt to evolving baskets under explicit governance, rather than as static rankers optimized simply for offline accuracy. Addressing these challenges requires a consumer-level representation that can be updated online and a decision layer that selects recommendations from a constrained action set under auditable controls.
We address these challenges by proposing an agentic basket recommender that combines probabilistic demand modeling and operational control through a two-layer architecture consisting of a consumer world model and a governed decision layer.
A consumer world model is a compact probabilistic representation of shopper intent and within-trip coherence that supports belief-state decision-making. Concretely, our consumer world model maintains and updates a belief over a small number of interpretable latent shopping profiles (e.g., pantry stock-up, fresh meals, household care) and evaluates candidate items using an order-invariant, set-based representation of the current basket. Note that “world model” here does not refer to a language model. Rather, it is a Bayesian demand model learned from transaction sets that provides the agent’s internal state (beliefs over profiles) and predictive structure (how basket composition shifts candidate plausibility) needed for sequential decision-making under uncertainty.
The second layer operationalizes the consumer world model as a sequential decision policy. As a trip unfolds, the agent updates its belief over latent profiles using the same sufficient statistics that drive offline inference and scores candidates by combining long-run preference fit with within-trip basket coherence. The key design choice is reuse. Instead of training a separate ranker or introducing ad hoc heuristics, the online policy reuses the probabilistic objects learned offline as its scoring primitives.
The agent is governed by design. In retail, recommendations are only valuable if they are feasible and controllable. Thus, recommendations must respect inventory availability, promotion eligibility, category policies, and margin or compliance guardrails. To this end, we embed governance directly into the decision rule rather than treating it as a postprocessing patch. Governance enters in two ways: the agent acts over a constrained action set induced by operational masking and business rules applied before ranking, and the policy exposes a small set of auditable levers that regulate behavior within that governed action space. In our framework, these levers include a profile-context trade-off that determines how strongly the policy emphasizes stable preferences versus within-trip coherence, and a bounded exploration rate that manages uncertainty without sacrificing control. This separation between the learned consumer world model and the governed decision layer enables monitoring and adjustment when constraints bind or conditions change, without the need for re-estimation of the underlying demand structure.
Our empirical study yields operational insights as well as implications for the operations–marketing interface. When feasibility constraints bind, recommendation becomes a constrained control problem in which value is created by coordinating demand with operational feasibility rather than by improving unconstrained predictive accuracy alone. In an inventory-constrained environment, we observe that standard ranking metrics may change little, yet revenue and inventory productivity improve materially because the policy steers choices toward feasible complements and enables controlled substitution under stockouts. This highlights a theory-relevant distinction between prediction quality and decision quality in constrained systems, and motivates evaluating recommender policies with OM outcomes, such as revenue, margin, stockout/substitution, and gross margin return on inventory, in addition to offline accuracy.
This paper makes the following contributions:
We develop a basket-aware latent-profile Bayesian demand model for set-valued retail data that separates stable household preferences from within-trip basket effects, capturing complementarities and substitutions without relying on unobserved item sequences.
We introduce an order-invariant, set-based context specification that varies across candidate items (and thus affects relative rankings), avoiding the cancellation issues that arise in naive context formulations.
We show how the fitted probabilistic structure can be reused to build an agentic recommendation policy.
We embed the policy in a governed decision layer where feasibility masking and business rules are applied before ranking, and behavior is regulated via auditable levers.
In an inventory-coupled simulation, we show an OM-relevant distinction between prediction quality and decision quality: economic outcomes (e.g., revenue and inventory productivity) can improve materially through constraint-aware substitution and feasibility coupling even when prediction accuracy changes little.
The remainder of the paper is organized as follows. Section 2 reviews related literature on the operations–marketing interface in retail settings, basket recommendation, world models for decision-making, and agentic AI and governance. Section 3 introduces the basket-aware latent-profile consumer world model, including its order-invariant context representation and estimation procedure. Section 4 describes the agentic decision layer, including within-trip belief updating, basket-aware scoring, and governance levers such as feasibility masking and exploration. Section 5 describes the data and evaluation design, including offline next-item prediction and an operations-coupled environment that captures inventory constraints and substitution dynamics. Section 6 concludes with implications, limitations, and directions for future research.
We position our research at the intersection of four converging streams of literature to propose a unified framework for retail decision-making. First, we situate our work within the operations–marketing interface, where demand shaping and fulfillment are jointly determined under inventory constraints. Second, we draw on basket recommendation and generative modeling for retail baskets, emphasizing the distinction between sequential and set-based representations of consumer behavior. Third, we connect to the emerging world-model perspective on decision systems, and position our Bayesian demand model as a consumer world model that supports belief-state decision-making. Finally, we relate our approach to recent work on agentic AI and governance, where autonomy requires feasibility-aware action spaces and auditable managerial controls.
The operations–marketing interface in retail
Retail decisions sit at the operations–marketing interface because demand shaping and fulfillment are jointly determined by operational feasibility. A long operations tradition shows that inventory availability and stockout risk directly change realized sales, substitution patterns, and the value of demand stimulation, so policies that ignore feasibility can destroy value even when they increase nominal demand (Akçay et al., 2020; Ghosh et al., 2022). In omnichannel settings, inventory information is itself a strategic operational lever. For example, research shows that sharing reliable availability information changes consumer behavior and improves performance precisely because feasibility shapes choice (Gallino and Moreno, 2014). These results motivate a central implication for recommendation: a recommender optimized as a pure demand generator is incomplete unless the recommendation policy is coupled with what the retailer can actually fulfill.
Marketing literature has studied recommender systems and decision aids for more than two decades, emphasizing that they affect not only prediction accuracy but also consumer search and choice (Donnelly et al., 2024; Fang et al., 2026; Wan et al., 2024). Early work shows that interactive decision aids can materially change how consumers evaluate alternatives and what they ultimately select (Häubl and Trifts, 2000). In parallel, marketing research developed formal recommendation systems grounded in preference modeling and early e-commerce settings (Ansari et al., 2000). Subsequent work developed recommendation methods using purchase histories and evaluated their performance in settings where the key signal is revealed through observed transactions (Bodapati, 2008). Related research also shows how transaction data can be used to study longer-term customer outcomes such as retention and churn (Ascarza, 2018). At the same time, practical frictions such as missing feedback can limit predictive performance (Ying et al., 2006). More recent work further shows that recommender effectiveness depends not only on the underlying algorithm but also on how recommendations are explained and presented to consumers, with important effects on trust, click-through, and search behavior (Chen et al., 2024a; Gai and Klesse, 2019). Beyond the individual level, recommendation networks can reshape demand in electronic markets by changing which products become visible and connected to each other (Oestreicher-Singer and Sundararajan, 2012), and recommender systems can influence aggregate sales patterns such as product diversity (Fleder and Hosanagar, 2009; Zheng et al., 2025). This literature provides an important marketing foundation for our setting, where recommendation policies must shape demand in ways that are also operationally feasible.
As marketing decisions are increasingly delegated to AI systems, value creation depends not only on demand shaping but also on execution under operational constraints, which makes the operations–marketing interface more central (Huang and Rust, 2021; Kopalle et al., 2022). Huang and Rust (2022) argue that deployable AI in marketing is often collaborative and must be designed to coordinate with organizational and operational constraints instead of optimizing a narrow predictive objective in isolation. This perspective motivates our focus on integrating feasibility into the policy definition (Demirezen and Kumar, 2016; Xiao and Xu, 2018). In the context of recommendation, this implies decision rules that restrict and shape the action space before ranking instead of repairing infeasible recommendations after the fact.
Recent OM and operations research work has begun to operationalize personalization as a constrained decision problem beyond a pure prediction task. In personalized assortment and revenue-management settings, the objective is not only to predict preferences but to select what to show or offer subject to business constraints (inventory, capacity, or feasibility), often with provable structure or scalable optimization methods (Chen et al., 2024b; Golrezaei et al., 2014; Kallus and Udell, 2020). Closest in spirit to our setting, this stream treats personalization as control of the decision set and its economic consequences, rather than as an offline ranking exercise. For example, Chen et al. (2024b) derive an inventory-balancing policy using fluid approximations, offering strong theoretical guarantees for online assortment under limited inventory. However, their approach relies on a parametric choice model, namely multinomial logit, which focuses on substitution and inventory rationing and abstracts away the complex, set-based complementarities found in grocery baskets. Our work complements this by focusing on the generative nature of the problem. Our agent uses a world model to simulate how inventory availability interacts with latent profiles and basket context to shape demand dynamically.
In contrast, much of the recommender-systems literature still evaluates models primarily by offline accuracy and then imposes feasibility constraints as postprocessing, like removing out-of-stock items after ranking. That design implicitly assumes that feasibility is peripheral to the policy. At the operations–marketing interface, feasibility constraints are not rare edge cases but an endogenous part of the environment that the policy should anticipate. To this end, we design our framework to follow the operations perspective. Feasibility is made first-class by defining the policy over a governed action space constructed before ranking, and by evaluating performance using operational outcomes, such as substitutions, stockouts, and inventory productivity, in addition to offline prediction metrics.
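The contrast between postprocessing and a governed action space can be illustrated with a minimal sketch. The function names, item names, and scores below are hypothetical and are not taken from the paper's implementation; the sketch only shows why filtering after ranking can leave a slate underfilled, whereas masking before ranking keeps the slate full whenever enough feasible candidates exist.

```python
from typing import Dict, List, Set


def rank_then_filter(scores: Dict[str, float], feasible: Set[str], k: int) -> List[str]:
    """Postprocessing: take the top-k by score, then drop infeasible items."""
    top_k = sorted(scores, key=lambda item: scores[item], reverse=True)[:k]
    return [item for item in top_k if item in feasible]


def mask_then_rank(scores: Dict[str, float], feasible: Set[str], k: int) -> List[str]:
    """Governed action space: restrict to feasible items first, then take the top-k."""
    candidates = [item for item in scores if item in feasible]
    return sorted(candidates, key=lambda item: scores[item], reverse=True)[:k]


# Hypothetical relevance scores; milk and butter are assumed to be out of stock.
scores = {"milk": 0.9, "eggs": 0.8, "butter": 0.7, "yogurt": 0.6, "cream": 0.5}
feasible = {"eggs", "yogurt", "cream"}

post_hoc = rank_then_filter(scores, feasible, k=3)  # slate shrinks to one item
governed = mask_then_rank(scores, feasible, k=3)    # slate stays at three items
```

In this toy case the postprocessed slate loses two of its three slots to stockouts, while the masked policy substitutes the next-best feasible items.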
Basket recommendation: sequences versus sets
Modern recommenders commonly represent user activity as sequences and learn next-item prediction using sequential architectures, such as recurrent neural networks and transformer-based recommenders (Kang and McAuley, 2018; Smirnova and Vasile, 2017; Sun et al., 2019). This sequence view is well aligned with clickstreams and browsing logs, where temporal order is meaningful. However, grocery and mass-merchant transactions are typically observed as checkout baskets, which means that the data record the set of items purchased, while the scan order is operational and often unrelated to preference formation. Thus, treating baskets as sequences can inject noise by hallucinating temporal dependencies that are not behaviorally grounded.
Meanwhile, recent strategy perspectives on GenAI note that text-first or token-sequence paradigms are not universally appropriate and that representations should match the data-generating process and the decision context (Feuerriegel et al., 2024). The basket retail setting reinforces this point. Order-sensitive sequence models can introduce misleading structure, which points instead toward order-invariant set models that capture coherence without imposing unobserved temporal dependencies. This motivates a second literature pivot: modeling baskets as sets.
This distinction between sequence and set representations is also reflected in the marketing literature on recommendation and online behavior. In browsing environments where consumers interact with products over time, clickstream-based models treat the path and timing of interactions as informative signals, making order- and path-dependent representations natural (Bucklin and Sismeiro, 2003; Moe and Fader, 2004; Montgomery et al., 2004; Sun, 2025). In contrast, a complementary stream builds models and decision rules from purchase histories and basket-shopping behavior, often in large assortments where the primary empirical object is the shopping trip or basket and the key dependence lies in cross-item structure rather than a behaviorally meaningful within-basket sequence (Ansari et al., 2000; Jacobs et al., 2016, 2021). This aligns directly with grocery retail, where checkout data record sets purchased, so imposing an item order can add noise rather than information. Accordingly, our modeling choice treats the basket as an unordered set while still capturing complementarity and substitution within that set.
In machine learning, permutation-invariant set representations, such as deep sets and set transformers, provide a principled way to embed unordered collections while preserving expressiveness (Lee et al., 2019; Zaheer et al., 2017). While recent information systems literature has advanced the use of hyperbolic embeddings to model hierarchical situational contexts (Bauman et al., 2025), grocery baskets require capturing the combinatorial structure of item co-occurrence. In retail recommendation, the operational requirement calls for the model to capture complementarity and substitution within the set without relying on unobserved micro-sequences. Our consumer world model explicitly adopts an order-invariant context representation and uses it to explain basket coherence at the profile level, aligning the statistical object with the operational data-generating process.
Importantly, our objective is not only to fit baskets offline, but to reuse the same set-based structure online as the basket evolves. This reuse requirement is rarely emphasized in sequence-based work, where the deployed policy often differs from the training objective or relies on a separately tuned ranker. Our approach instead builds a probabilistic set-based demand model whose sufficient statistics and scoring primitives are designed to be directly consumed by an online policy.
World models for decision-making
A world model is typically defined as an internal predictive model of the environment that supports decision-making, for example, by enabling belief updates, forecasting outcomes of actions, or computing action values in partially observed settings (Sutton and Barto, 1998). In model-based reinforcement learning, learned world models are often implemented as latent-dynamics simulators, as in the World Models line of work and Dreamer, which emphasize representation learning and long-horizon rollout for control (Ha and Schmidhuber, 2018; Hafner et al., 2019).
Our setting differs in the environment being modeled. The relevant environment is the consumer choice process under basket context and operational constraints, where the key hidden state is shopper intent rather than physical dynamics. Therefore, we align with the definition of world models as predictive internal models for decision-making, but instantiate it as a probabilistic consumer world model. Formally, we propose a Bayesian generative model learned from historical baskets that (1) represents latent intent via interpretable profiles, (2) provides an order-invariant mechanism for how the observed basket shifts the plausibility of candidates, and (3) yields sufficient statistics that can be updated online.
This construction supports the two core functions expected of a world model in decision systems. First, it provides the agent’s internal state, namely beliefs over latent intent. Second, it enables the predictive mapping from state and context to outcomes (basket-aware choice likelihood), which is sufficient to evaluate and compare candidate actions under constraints in our setting. Compared with latent-dynamics world models, we prioritize interpretability, auditable reuse, and operational controllability over expressive black-box simulation. From a marketing decision-systems viewpoint, this also aligns with the emphasis on AI as an internal, decision-supporting representation of the environment, where interpretability and managerial actionability matter alongside predictive performance (Huang and Rust, 2021). Having established this connection, we hereafter use “consumer world model” and “world model” interchangeably.
Agentic AI and governance
A second recent shift is from predictive models to agentic models, that is, systems that maintain state, take actions, and pursue objectives under constraints. In the literature on large language models, agentic behavior is operationalized via tool use, iterative reasoning loops, and simulated multi-agent environments (Park et al., 2023; Yao et al., 2022). In operations contexts, this shift creates an immediate governance problem: delegating decisions to agents raises questions of controllability, auditability, and objective misalignment, especially when constraints related to inventory, compliance, and promotions bind frequently.
A recurring theme in recent work is that the managerial challenge is no longer simply whether AI can predict, but whether it can be directed, audited, and aligned with business objectives when given autonomy (Berente et al., 2021; Huang and Rust, 2022). In retail settings, these governance questions are amplified by frequent feasibility constraints and cross-functional objectives, and collaborative AI designs explicitly recognize the need to keep operational and organizational inputs configurable and transparent (Rai, 2020). Our governed agentic layer operationalizes this view by enforcing feasibility before ranking and by exposing auditable levers that regulate how the policy trades off profile fit, basket coherence, and exploration. These design choices follow directly from governance concerns highlighted across marketing, operations, and AI safety.
Adjacent literature further strengthens why governance must be embedded in the policy rather than bolted on. In reinforcement learning and safety research, a related and common point is that high-performing policies can fail under distribution shift or when objectives are misspecified, which motivates explicit constraints and safety guardrails (Raisch and Krakowski, 2021). We operationalize governance in two ways that fit retail practice. Feasibility is first enforced by constructing a constrained action space before ranking, and then the agent exposes a small number of auditable levers that managers can tune without retraining the underlying world model. This connects agentic decision automation to the operations–marketing interface, where value arises not only from better preference estimation, but from disciplined coordination between demand shaping and operational feasibility.
Consumer world model: basket-aware latent profile demand model
In this section, we formally introduce the consumer world model. Here, a world model is a compact probabilistic representation of consumer intent and within-trip coherence that supports belief-state decision-making. It produces two objects to be processed by the governed decision layer: a household-level belief over latent shopping profiles and its online update rule, and a basket-aware predictive structure that scores candidate items from an order-invariant set representation of the current basket. Note again that the world model here is a Bayesian demand model learned from transaction sets rather than a language model.
Shoppers rarely add items in isolation. A weekly grocery run usually blends recurring patterns, and the items that end up being purchased together reflect those overlapping profiles. Therefore, we assume that the relevance of an item depends on the stable shopping profiles of households as well as the other items observed together in the basket. Further, empirically, retail data record baskets at checkout, not the latent sequence by which intentions are formed. The scan order at the register is operational, not behavioral. As a result, we treat each basket as an unordered set. Specifically, in our model, “context” simply means the set of items observed together, not a revealed sequence. This preserves the economic intuition of within-basket coherence while remaining faithful to what the data record. With these two assumptions, our approach handles two signals at the same time. The first is a household’s latent shopping profiles, that is, stable, interpretable patterns, such as fresh food, pantry stock-up, and household care, that summarize long-run tendencies. The second is a basket-aware context signal that captures how copresent items make some candidates more or less appropriate on this trip. For example, bread and cereal raise the plausibility of milk, glass cleaner raises the plausibility of wipes, and soda may substitute for juice. To keep this contextual signal operationally simple and scalable, we summarize baskets in a low-dimensional item-embedding space and let each latent profile respond differently to that summary.
In the following subsections, we first formalize this intuition with a generative model. In summary, consumers carry a mixture of latent profiles across trips, each realized item in a basket is claimed by one of those profiles, and the probability of including an item combines profile-specific base appeal with a compact, order-agnostic function of the items in the basket. Then, we describe how to estimate the model efficiently via a variational objective and how the learned structure feeds the agentic policy used online.
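To make the intuition concrete, the following minimal sketch shows one way a profile-specific base appeal could be combined with an order-invariant basket summary. The parameterization is an assumption made purely for illustration (a mean-pooled basket embedding and a per-profile diagonal response `gamma`); it is not the paper's exact specification, and all names and numbers are hypothetical. Because the context term interacts with the candidate's own embedding, it varies across candidates and does not cancel in the softmax.

```python
import math
from typing import Dict, Iterable, List


def mean_embedding(basket: Iterable[str], emb: Dict[str, List[float]]) -> List[float]:
    """Order-invariant summary: mean of item embeddings over the basket as a set."""
    items = sorted(set(basket))  # sorting makes the result independent of input order
    dim = len(next(iter(emb.values())))
    if not items:
        return [0.0] * dim
    return [sum(emb[i][d] for i in items) / len(items) for d in range(dim)]


def profile_item_probs(
    k: int,
    basket: Iterable[str],
    candidates: List[str],
    base: List[Dict[str, float]],   # assumed: base[k][item] = profile-specific base appeal
    gamma: List[List[float]],       # assumed: gamma[k] = per-dimension response of profile k
    emb: Dict[str, List[float]],
) -> Dict[str, float]:
    """Softmax over candidates of base appeal plus a candidate-specific context term."""
    ctx = mean_embedding(basket, emb)
    logits = {
        i: base[k][i] + sum(g * e * c for g, e, c in zip(gamma[k], emb[i], ctx))
        for i in candidates
    }
    top = max(logits.values())  # subtract the max for numerical stability
    expd = {i: math.exp(v - top) for i, v in logits.items()}
    z = sum(expd.values())
    return {i: v / z for i, v in expd.items()}


# Hypothetical 2-d embeddings: dimension 0 is "breakfast", dimension 1 is "cleaning".
emb = {"bread": [1.0, 0.0], "cereal": [0.9, 0.1], "milk": [0.8, 0.0],
       "cleaner": [0.0, 0.9], "wipes": [0.0, 1.0]}
base = [{item: 0.0 for item in emb}]
gamma = [[2.0, 2.0]]

probs = profile_item_probs(0, ["bread", "cereal"], ["milk", "wipes"], base, gamma, emb)
probs_permuted = profile_item_probs(0, ["cereal", "bread"], ["milk", "wipes"], base, gamma, emb)
probs_cleaning = profile_item_probs(0, ["cleaner"], ["milk", "wipes"], base, gamma, emb)
```

Under these toy values, a basket of bread and cereal favors milk, a basket containing glass cleaner favors wipes, and permuting the basket leaves the probabilities unchanged.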
The model
We model a trip as the interaction of persistent household tendencies captured by a small number of latent shopping profiles and profile-specific coherence within the unordered set of items observed in a basket. The construction is explicitly order-invariant in that we use only which items coappear instead of the latent sequence by which they were added. We specify the model as follows:
For each household On trip First, assign a latent profile to the item according to the household’s mixture:
Conditional on
While shoppers do add items one by one, empirically, the order is unobserved and potentially missing-not-at-random. Conditioning only on the set
Structured parameterization of basket effects
A literal
For a focal item
Our scoring structure also admits a practical cold-start handling mechanism that is compatible with the current model. For a new item
Exponentiating and normalizing over feasible candidates produces a valid multinomial distribution. In estimation,
Exact maximization of the marginal likelihood is infeasible because it integrates over household profile mixtures
At a high level, each iteration in the estimation process consists of:
The fitted model serves as the paper’s consumer world model for online decision-making. Here, the latent profile mixture plays the role of the hidden state, and the basket-aware emission in (1) specifies how observed context shifts the likelihood of candidate items. This matches the role of a world model in agentic systems, while remaining transparent and auditable because the structure is Bayesian rather than a black-box simulator.
The proposed world model yields two reusable objects that are consumed by the governed decision layer in Section 4: household-level profile posteriors (and their streaming sufficient statistics) that support within-trip belief updates, and basket-aware scoring primitives that quantify how an unordered basket context shifts candidate plausibility. Specifically, we pass forward:
Long-run household profile information: Basket-aware scoring primitives:
In Section 4, we reuse these same objects to implement real-time belief updating and feasibility-aware slate selection, rather than introducing a separately trained ranker.
Governed decision layer: agentic basket recommendation policy
The basket-aware latent profile model in Section 3 is intentionally offline. From historical baskets, it learns household-level profile mixtures
To this end, we augment the Bayesian layer with an agentic online policy that can be viewed as a sequential decision problem with explicit governance. We index within-trip decision epochs by

Decision-diagram view of the governed recommendation loop (time-unrolled).
At each within-trip step
Let
Defining feasibility as a preranking action-space constraint offers two advantages. First, it prevents failure modes common to unconstrained recommenders, such as proposing unavailable items or violating policy constraints and then attempting to repair the slate post hoc. Second, it makes governance auditable, so changes in business rules map directly into changes in
In summary, governance enters the policy through the definition of
In this section, we introduce what a belief state is and how we reuse the previously fitted world model to update it. We defer the technical details to online EC B.1.
Belief state. The offline consumer world model yields a household-level profile mixture
Update via reuse of offline sufficient statistics. A key design choice is reuse. The online belief update uses the same variational inference logic (details are given in online EC A.2) used to fit the world model. Concretely,
Stability and two-timescale interpretation. In practice, early baskets can be noisy. For example, the first observed item may be generic. To address this challenge, we apply light smoothing to stabilize updates at small
After the trip ends, the household’s long-run parameters can be refreshed using the expected profile counts implied by the trip, providing a lightweight streaming update without revisiting historical baskets.
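The within-trip update and the post-trip refresh described above can be illustrated with a minimal sketch. It is not the variational update from the online EC. It assumes a plain Bayes-rule update over K latent profiles, a hypothetical tempering parameter `tau` standing in for the light smoothing, and Dirichlet-style pseudo-counts `alpha` for the long-run parameters, with each trip treated as one fractional profile count. All names are illustrative.

```python
import numpy as np

def update_belief(belief, item_lik, tau=1.0):
    """One within-trip belief update over K latent profiles.

    belief   : (K,) current profile probabilities
    item_lik : (K,) likelihood of the observed item under each profile
    tau      : assumed tempering weight in (0, 1]; smaller values damp
               the influence of a single (possibly generic) item
    """
    log_post = np.log(np.clip(belief, 1e-12, None)) \
        + tau * np.log(np.clip(item_lik, 1e-12, None))
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    return post / post.sum()

def refresh_counts(alpha, final_belief):
    """Post-trip refresh: add the trip's expected profile counts
    (here, one fractional count per trip) to the long-run pseudo-counts."""
    return alpha + final_belief

alpha = np.array([2.0, 1.0])            # hypothetical long-run pseudo-counts
prior = alpha / alpha.sum()             # starting belief for the trip
post = update_belief(prior, np.array([0.1, 0.4]))
alpha_new = refresh_counts(alpha, post)
```

With `tau=1.0` the observed item shifts the belief fully toward the second profile; lowering `tau` slows that shift, which mirrors the two-timescale idea of cautious within-trip updates and slower long-run refreshes.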
Connection to governed scoring. The belief state is the sole “memory” the agent carries within a trip. Specifically, subsequent scoring and governed slate selection in Section 4.3 depend on
Governed scoring and slate selection
At each within-trip decision epoch
Candidate scoring. For any candidate item
Governed slate selection. Given the set of feasible candidates
In summary, the policy at step
Our agent nests several common recommenders that we evaluate under the same holdout protocol, candidate catalog, and operational masking.
Static latent-profile recommender (static). Freeze beliefs and ignore basket context by setting
Context-only (item–item) methods. Drop profiles and score purely from co-occurrence in the current basket:
PPMI/Lift: sum association scores between candidates and items in
EASEr: closed-form linear item–item model; score
Item2Vec: cosine similarity between the candidate embedding and the basket embedding.
These methods use the same train split and candidate space but maintain no belief state.
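To make the context-only family concrete, the following sketch shows one plausible form of the PPMI baseline: a positive pointwise mutual information matrix built from a binary basket-by-item incidence matrix, with each candidate scored by its summed association with the items already in the basket. The function names and the toy matrix are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def ppmi_matrix(X):
    """X: (n_baskets, n_items) binary incidence matrix.
    Returns an (n_items, n_items) PPMI matrix with a zero diagonal."""
    n = X.shape[0]
    co = X.T @ X                        # pairwise co-occurrence counts
    p_i = np.diag(co) / n               # marginal item frequencies
    p_ij = co / n                       # joint frequencies
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / np.outer(p_i, p_i))
    # Keep only positive, finite associations
    ppmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    np.fill_diagonal(ppmi, 0.0)
    return ppmi

def score_candidates(ppmi, basket):
    """Sum association weights between each candidate and the basket items;
    items already in the basket are excluded from ranking."""
    scores = ppmi[:, basket].sum(axis=1)
    scores[basket] = -np.inf
    return scores

# Toy training baskets over three items (hypothetical data)
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 1]], dtype=float)
scores = score_candidates(ppmi_matrix(X), basket=[0])
```

Because the score depends only on the set of basket items, it is order-invariant, but it carries no household-level state, which is the distinction the comparison above relies on.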
Nonagentic contextual variant. Include set-based context while freezing beliefs and disabling exploration (
Our full agent combines online belief updates (
Policy controls in the agentic layer
The agentic policy exposes a small set of interpretable controls that map directly to how the decision rule behaves online. These are the same objects we use in belief updating, scoring, and slate selection. They are also the auditable levers referenced in our governance definition in Section 1. A key advantage of the governed decision layer is that managers can adjust behavior online without retraining the consumer world model.
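As a rough illustration of how such controls could enter the decision rule, the sketch below blends a profile score and a context score with a weight `lam`, masks infeasible items before ranking, and applies bounded exploration by swapping at most one slate slot with probability `eps`. The convex blend, the single-slot exploration rule, and all parameter names are assumptions made for this example and may differ from the scoring rule in equation (3).

```python
import numpy as np

def select_slate(profile_score, context_score, feasible,
                 lam=0.5, eps=0.0, k=10, rng=None):
    """Governed slate selection (illustrative).

    lam : assumed profile-context weight in [0, 1]
    eps : assumed exploration probability (bounded: one slot at most)
    """
    rng = np.random.default_rng() if rng is None else rng
    score = (1.0 - lam) * profile_score + lam * context_score
    score = np.where(feasible, score, -np.inf)   # mask BEFORE ranking
    n_feasible = int(feasible.sum())
    k = min(k, n_feasible)
    ranked = np.argsort(-score)[:n_feasible]     # feasible items, best first
    slate = list(ranked[:k])
    rest = ranked[k:]
    # Bounded exploration: occasionally replace the last slot only
    if k > 0 and len(rest) > 0 and rng.random() < eps:
        slate[-1] = rng.choice(rest)
    return [int(i) for i in slate]

profile = np.array([0.9, 0.1, 0.5, 0.3])
context = np.array([0.0, 0.8, 0.6, 0.2])
feasible = np.array([False, True, True, True])   # item 0 is out of stock

balanced = select_slate(profile, context, feasible, lam=0.5, k=2)
profile_only = select_slate(profile, context, feasible, lam=0.0, k=2)
```

Changing `lam`, `eps`, or the mask alters the slate without touching the fitted scores, which is the sense in which such levers can be adjusted without retraining.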
Managerial interpretation of governance levers. Table 1 summarizes the main governance levers, what each lever changes in the decision rule, and practical guidance on when to increase or decrease it. Two implications are worth noting. First, the profile-context weight
Governance levers and managerial interpretation.
Monitoring and guardrails. Because the online agent reuses the probabilistic structure learned in Section 3, we can monitor both model-facing diagnostics (e.g., a rolling ELBO-style fit or posterior concentration) and operational key performance indicators (e.g., fill rate, stockouts, substitution frequency, margin) to detect drift or environmental changes. When triggers fire, governance is exercised by adjusting
In this section, we provide an empirical evaluation of our consumer world model and the governed agentic policy built on top of it. Because our framework is designed for decision-making under feasibility constraints, we report two kinds of evidence: offline next-item accuracy on held-out baskets, which tests whether the learned latent profiles and set-based context capture consumer coherence, and operational outcomes in a controlled simulation, which tests whether the same scoring primitives translate into improved retail performance under stock constraints.
Data and evaluation metric
Data and preprocessing. We evaluate the model on the Dunnhumby Complete Journey dataset, a standard benchmark in grocery retail. The data span 2 years of transactions for roughly 2,500 U.S. households at a single large retailer. Each record contains a household ID, basket identifier, timestamp, item identifier, paid price, retailer discount, and a hierarchical taxonomy including department, commodity, and subcommodity. For managerial interpretability, we aggregate stock-keeping units to the commodity level. This reveals category patterns relevant to assortment and promotion (e.g., infant formula, frozen seafood, seasonal décor) and reduces sparsity. After aggregation, we retain roughly 300 commodities that are consistently active across the sample period, which focuses the analysis on a stable demand structure. Consequently, cold-start behavior for newly introduced products is limited in this dataset and is left for future work. For each household
Evaluation metric. We evaluate next-item recommendation with a basket-reveal protocol as follows.
For each holdout basket, expand into item-level “events.” Because baskets are unordered, we use a randomized permutation per basket; conditionals are set-based, so results are invariant in expectation. At step Record a hit if the next held-out item appears in the slate. For household
The protocol is applied identically to our proposed solution method as well as the benchmark methods. We next summarize these benchmark methods used in the evaluation.
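The basket-reveal protocol can be sketched as follows. The sketch assumes a hypothetical `recommend(context, k)` callable that returns a slate given the set of revealed items, and a `min_context` parameter (an assumption) that controls how many items are revealed before the first prediction event.

```python
import numpy as np

def basket_reveal_hits(basket, recommend, k=10, min_context=1, rng=None):
    """Expand one holdout basket into item-level events.

    The basket is revealed in a random order; at each step the recommender
    sees the revealed items as an unordered set, and a hit is recorded if
    the next held-out item appears in the returned slate.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(basket))
    items = [basket[i] for i in order]
    hits = []
    for t in range(min_context, len(items)):
        context = frozenset(items[:t])      # set-valued, order-free context
        slate = recommend(context, k)
        hits.append(items[t] in slate)
    return hits

# Hypothetical recommender: always proposes the lowest-numbered unseen items
def toy_recommend(context, k):
    return [i for i in range(5) if i not in context][:k]

hits_a = basket_reveal_hits([0, 1, 2], toy_recommend, rng=np.random.default_rng(0))
hits_b = basket_reveal_hits([7, 8], toy_recommend, rng=np.random.default_rng(0))
hit_rate = float(np.mean(hits_a + hits_b))
```

Passing the context as a `frozenset` makes the order invariance explicit: the permutation only decides which item is held out next, not how the revealed items are presented to the recommender.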
Benchmark methods. We benchmark our proposed recommender against four baselines, which gives the following five methods in total:
Agentic latent-profile recommender (Agent, proposed): the Agent is the full framework proposed in Sections 3 and 4. We fit a basket-aware latent-profile model with
Static latent-profile recommender (Static): the Static method is a nonagentic variant based only on the model proposed in Section 3. Product recommendations are ranked by the household’s offline mix
PPMI/Lift (item–item): a classical co-occurrence baseline built from the train-only basket–item incidence matrix. We score each candidate by the sum of pairwise association weights with items in the current basket. It captures set co-presence but ignores household heterogeneity.
EASEr (closed-form item–item): a modern item–item method that learns a symmetric weight matrix with
Item2Vec (embedding-based item–item): it embeds the current basket by summing its item vectors and scores candidates by cosine similarity to this basket vector. This captures high-order co-occurrence beyond pairwise counts.
Choice of the number of latent profiles. We select the number of latent profiles by comparing
Results
The Agent score in equation (3) trades off long-run preference fit and within-basket coherence through
Figure 2 shows that performance improves as

Hit@10 under different basket-context weight
With
Next-item recommendation performance (Hit@10) on the same holdout users and baskets.
The performance gap is driven by reusing the world model for online control. The Static variant ranks items using only the household’s offline profile mix

Step 2 Hit@10 comparison (Agent vs. Static).
Item–item methods are competitive but dominated. EASEr is the strongest classical baseline here, yet the Agent outperforms it by 0.145. This suggests that combining stable household-level profiles with a structured, profile-specific response to basket context adds predictive power beyond pure item–item similarity. Item2Vec lands in between the baseline methods. However, coarse co-occurrence alone is insufficient. PPMI/Lift performs poorly under a realistic next-item protocol and full candidate set, highlighting the brittleness of naive “people who bought
From an operations standpoint, the uplift is achieved without deep sequence models. The core is a transparent Bayesian latent-profile layer plus a low-rank item space, which produces interpretable levers such as profiles, basket weights,
We report diagnostics under the same order-invariant basket-reveal protocol as in Section 5.1. We focus on the first nontrivial context case (one observed item,
Figure 3 reports the resulting lift of the Agent over Static once context is available. After observing one item, the Agent’s Hit@10 increases from 0.258 (Static) to 0.619 (Agent). Because both methods share the same offline profiles and candidate space, this improvement reflects the Agent’s within-trip belief update and basket-aware context term rather than differences in training data or catalog. In line with the reuse design in Section 4, the lift indicates that the Agent converts the fitted world model into real-time belief updates and context-sensitive scoring.
Simulation with operational constraints
Offline Hit@10 is computed on held-out historical baskets under a fixed protocol and a largely unconstrained catalog, so it measures how well a model predicts what shoppers actually bought. However, in practice, realized purchases are shaped by uncertainty, which means that feasibility masking, stockouts, and substitution restrict what can be shown and what can be fulfilled. As a result, a simulation Hit@10, computed on these endogenous, constraint-shaped outcomes, can differ from offline accuracy and may compress policy differences even when revenue, margin, and inventory productivity change materially.
To this end, we complement the offline next-item evaluation with a lightweight simulator that couples recommendation to inventory and basic retail operations. We design this simulator to mirror the governed decision layer in Section 4. Specifically, feasibility is enforced before ranking by constructing a governed action space, the policy scores candidates using the reused world-model primitives, optional operational adjustments (soft constraints) are applied within the governed set, and a slate is selected using auditable levers.
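One way to picture the ordering of these stages is the sketch below: a hard feasibility mask built from stock and eligibility, a bounded additive boost for well-stocked items as an assumed form of soft constraint, and a top-k slate. The threshold, the boost size, and the function name are hypothetical and are not taken from the simulator's configuration.

```python
import numpy as np

def governed_step(scores, on_hand, eligible, k=3, nudge=0.1, threshold=5):
    """One simulator step (illustrative): mask, adjust, then rank."""
    # 1. Hard constraint: only eligible, in-stock items enter the action space
    feasible = eligible & (on_hand > 0)
    # 2. Soft constraint: bounded boost for items above the stock threshold
    adjustment = np.where(on_hand > threshold, nudge, 0.0)
    final = np.where(feasible, scores + adjustment, -np.inf)
    # 3. Slate: top-k among feasible items only
    k = min(k, int(feasible.sum()))
    return [int(i) for i in np.argsort(-final)[:k]]

scores = np.array([0.9, 0.8, 0.75, 0.1])       # reused world-model scores
on_hand = np.array([0, 1, 10, 10])             # item 0 is stocked out
eligible = np.array([True, True, True, False]) # item 3 fails a business rule

with_nudge = governed_step(scores, on_hand, eligible, k=2, nudge=0.1)
no_nudge = governed_step(scores, on_hand, eligible, k=2, nudge=0.0)
```

In this toy case the nudge reorders the two feasible items but cannot bring a masked item back, which reflects the separation between hard feasibility and bounded soft adjustments.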
Setups
Time and trip generation
Time evolves in discrete periods
Operational state
The environment maintains per-item on-hand inventory
Governed action space
At step
Policies and world-model reuse
We evaluate the Agent, the nonagentic Static baseline, and (in ablations) classic linear baselines (EASEr, PPMI, Item2Vec). Consistent with Section 4, the Agent scores candidates using the same primitives learned offline and reused online: belief updates produce
Soft constraints within the governed set
To couple ranking to operations without changing the underlying demand model, we optionally apply transparent, bounded score adjustments within the governed candidate set
Slate selection with auditable controls
Given scores
Consumer choice and fulfillment
Conditional on a trip, the consumer forms a purchase desire over the shown slate using a logit utility
Configuration and metrics
We initialize on-hand levels proportional to item popularity, restock deterministically every fixed number of sessions, and apply mild nudges only above stock thresholds. We track a simulation analogue of next-item accuracy,
Results
We run the simulator for
Simulation results: paired differences (Agent − Static) across 10 seeds.
Since substitution is always enabled and we mask infeasible items, fill rate is 100% for both. This suggests that the Agent delivers significantly higher revenue primarily by driving more purchases of slightly cheaper baskets. Meanwhile, the simulation
The key operations insight is that, under binding feasibility, maximizing an offline ranking metric like Hit@10 is not sufficient for economic outcomes. Relative to Static, the Agent increases revenue primarily by driving more purchases with slightly cheaper baskets while managing feasibility through substitution within the governed action space, leaving Hit@10 essentially unchanged. The higher substitution and stockout-pressure rates indicate that the Agent more often surfaces high-demand items, and when these bind, value is preserved through controlled substitution rather than suppressing demand.
Overall, the simulation results demonstrate how a governance-first policy organization, combined with world-model reuse and lightweight bounded operations hooks, can shift economic outcomes in inventory-constrained environments without extensive retraining.
In this subsection, we test whether the economic lift of the agentic policy is confined to particular parts of the catalog, sensitive to reasonable changes in replenishment, or an artifact of price and volume mix. Across the following checks, we find that the qualitative pattern is stable. Relative to Static, the Agent increases revenue and margin with moderate, bounded increases in stockout pressure and substitution. In addition,
Robustness to catalog scope. We first test whether the simulation lift is driven by specific parts of the catalog. We partition the catalog into three super-departments (CENTER_STORE_FOOD, FRESH, NONFOOD) and rerun the simulator with inventory and restocks restricted to one super-department at a time. This ensures that within-group feasibility drives activity and prevents cross-group spillovers. Table 4 shows positive revenue and margin deltas in all three, with modest increases in stockout and substitution rates. We also observe that
Robustness to basket size. Next, we examine whether the economic lift is concentrated in a particular size of basket. Using the transaction data, we first compute each household’s average number of items per basket and partition households into different tiers: Small (T1), Medium (T2), and Large (T3). Then, we rerun the simulation within each segment using the same operational settings as in the main specification. Table 5 shows that the Agent improves both revenue and margin in all three trip-size groups, indicating that the economic lift is not confined to a single basket-size regime. The gains are largest for medium-size trips, but remain positive for both smaller and larger trips. Meanwhile,
Agent and Static policy comparison.
Basket-size robustness across seeds (mean
Robustness to slate size. Because retail interfaces and channels differ in how many items can be displayed in practice, we rerun the simulation with slate sizes
Slate-size robustness across seeds (mean
Robustness to replenishment. We then test sensitivity to replenishment dynamics by varying the restocking cycle (every 150 vs. 188 sessions) and lot size (baseline vs.
The pattern in Table 7 suggests that lengthening the cycle raises stockout pressure and slightly erodes the revenue lift, which is operationally intuitive. It also shows that larger lots partially offset this pressure, improving revenue while managing stockouts via substitution under capacity constraints.
Robustness to exploration. Finally, we examine the exploration control
Restocking sensitivity of the Agent relative to the Static policy.
Sensitivity to the exploration control
Mechanism: inventory productivity. To rule out a pure price/volume artifact, we report gross margin, average inventory carrying cost, average units on hand, gross margin return on inventory (GMROI), and revenue per on-hand unit in Table 9.
(For clarity, we define GMROI as gross margin divided by average inventory investment at cost over the simulation horizon.)
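The two inventory-productivity ratios can be computed as in the following sketch, which follows the verbal definition of GMROI given here and assumes that revenue per on-hand unit is total revenue divided by average units on hand. The per-period inputs and the numbers are hypothetical and are not results from Table 9.

```python
def gmroi(gross_margin, inventory_cost_by_period):
    """Gross margin divided by average inventory investment at cost."""
    avg_inventory_cost = sum(inventory_cost_by_period) / len(inventory_cost_by_period)
    return gross_margin / avg_inventory_cost

def revenue_per_on_hand(revenue, units_on_hand_by_period):
    """Total revenue divided by average units on hand (assumed definition)."""
    avg_units = sum(units_on_hand_by_period) / len(units_on_hand_by_period)
    return revenue / avg_units

# Hypothetical horizon of three periods
example_gmroi = gmroi(500.0, [1000.0, 1500.0, 500.0])
example_rev_per_unit = revenue_per_on_hand(1200.0, [100, 200, 300])
```

Both ratios rise when the same margin or revenue is earned on a smaller average stock, which is why they help separate inventory productivity from a pure price or volume effect.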
Overall, these robustness checks show that the Agent’s revenue and margin gains are not driven by a single setting. The gains generalize across broad product groups and persist under plausible changes in replenishment frequency and lot size, exploration intensity, basket size, and slate size. Improvements in GMROI and revenue per on-hand unit point to higher inventory productivity as the main mechanism. Consistent with the governed-decision framing, the Agent tends to surface feasible, in-demand complements and enables substitution when constraints bind, yielding more completed baskets at slightly lower tickets without relying on deeper sequence models or heavy retraining.
Economic metrics across seeds (mean
In this paper, we propose a basket-aware, order-invariant decision framework that bridges offline demand modeling and online, operationally governed recommendation. We develop a latent-profile world model that separates stable household preferences from within-trip basket context, and we show how the same learned structure can be reused by an agentic layer to make sequential, feasibility-aware decisions. This design reflects practical retail operations, such as inventory constraints, substitution, and manager-tunable controls, while retaining enough flexibility to capture complementarity within baskets.
From a modeling perspective, our contribution is twofold. First, we introduce an order-invariant formulation of basket context that remains candidate-dependent, so that context meaningfully affects relative rankings without imposing an artificial checkout order on basket data. Second, we show how a shared low-rank item representation can stabilize estimation in large catalogs while serving as a common interface between offline inference, online belief updating, and real-time scoring.
From an operational standpoint, the agentic layer illustrates how probabilistic preference models can be used under explicit guardrails. The composite score decomposes into a long-run preference component and a within-trip coherence component, combined through an interpretable weight with additional controls. These controls, such as feasibility masking, bounded exploration, and scarcity-based score adjustments, allow the policy to adapt to inventory and business rules without retraining the underlying world model. Empirically, this separation helps us explain why decision quality need not appear as a large change in conventional accuracy. From our simulation, we find that economic outcomes (e.g., revenue and inventory productivity) move materially even when the hit rate changes little, because feasibility and substitution reshape the realized choice and fulfillment process. This highlights a practical implication for OM: evaluating decision systems in constrained environments requires the consideration of outcome metrics (e.g., GMROI, revenue per on-hand, stockout pressure, substitution) in addition to offline ranking accuracy.
The framework also supports monitoring and controlled experimentation. Because the agent’s belief updates mirror the offline inference logic, changes in posterior concentration and evidence accumulation can provide auditable signals for state changes. More broadly, the separation between preference learning and operational scoring enables organizations to adjust governance parameters independently of model retraining, enabling faster iteration and safer implementation.
Limitations and future research
More broadly, this work opens up multiple research opportunities at the intersection of OM, agentic decision-making, and governance. We highlight the following directions:
We condition on realized basket sizes. Jointly modeling basket formation and stopping decisions would connect the consumer world model to OM outcomes such as welfare, congestion, and substitution under capacity and service constraints.
Our simulation abstracts from strategic consumer responses to recommendations. A richer consumer-behavior model that captures persuasion and learning effects, that is, how recommendations change choices and how consumers adapt over time, is a valuable extension and a natural direction for future work. In addition, incorporating consumer learning and anticipation would enable agentic analysis of feedback loops, including when governance policies mitigate unintended demand shifting.
Our strongest gains arise when feasibility binds and substitution is meaningful. In regimes with weak substitution or near-perfect availability, the operational uplift may be smaller. Likewise, under highly volatile, short-horizon preferences (e.g., trend- or event-driven baskets), stronger short-horizon signals (e.g., recency weighting or seasonality/promotion indicators) may be needed for belief tracking. This motivates OM-style characterization of when a feasibility-aware recommendation is most valuable and how governance levers should adapt across scarcity regimes.
The current evaluation emphasizes consistently active items. Extensions are needed for cold-start products and rapidly changing assortments, where the agent must learn safely online. This setting highlights governance trade-offs between bounded exploration and business risk.
Multi-objective governance (e.g., margin, fairness, long-run brand objectives, compliance) and field validation remain open directions, including methods for auditing decisions and translating policy levers into manager-operable controls.
Overall, this research demonstrates how basket-aware demand modeling and agentic decision-making can be integrated in a way that is both statistically principled and operationally actionable. By emphasizing order invariance, auditable controls, and feasibility-first decision-making, the proposed approach provides a foundation for implementing AI-driven retail decision systems that embed predictive modeling with operational objectives.
Supplemental Material
sj-pdf-1-pao-10.1177_10591478261458079 - Supplemental material for Governed agentic AI for retail baskets: A consumer world model with inventory-aware actions
Supplemental material, sj-pdf-1-pao-10.1177_10591478261458079 for Governed agentic AI for retail baskets: A consumer world model with inventory-aware actions by Xiexin Liu and Xinwei Chen in Production and Operations Management
Footnotes
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
How to cite this article
Liu X and Chen X (2026) Governed agentic AI for retail baskets: A consumer world model with inventory-aware actions. Production and Operations Management x(x): 1–19.
