Budgeted agentic AI for adaptive lightweight model fine-tuning

Abstract

On-device deployment often relies on lightweight models that require fine-tuning when deployment environments differ from development conditions or when model performance changes over time. Although recent LLM-enabled agents are capable of reasoning about system states and forming adaptive strategies, they often incur substantial computational and budgetary overhead, making them impractical for resource-constrained edge settings. This paper presents Budgeted Agentic AI for Adaptive Lightweight Model Fine-Tuning (BA2), a framework for budget-constrained lightweight model adaptation that combines bounded optimization steps with auxiliary evaluation, rollback, and compression actions. BA2 enables agents to dynamically adjust adaptation strategies according to real-time budget conditions while maximizing system performance. BA2 explicitly accounts for constrained resources, including limited fine-tuning steps, tool invocation quotas, and token consumption. An Engineer–Manager architecture is introduced, where an LLM-based Engineer generates parameterized candidate operations from system logs, and a lightweight contextual-bandit Manager selects actions conditioned on both environmental states and remaining budgets. Experimental results demonstrate that BA2 achieves superior cost-performance trade-offs compared with static heuristics and inference-time search agents under tight budget constraints, while remaining competitive with strong baselines in higher-budget regimes.

Keywords

Agentic AI; large language model; contextual bandit; fine-tuning

1. Introduction

On-device model deployment is typically constrained by limited edge-side resources and environmental heterogeneity.¹ First, edge devices have limited compute, memory, and power budgets, so deployment is usually restricted to lightweight models that must deliver low latency and low energy consumption while maintaining stable performance.² Second, sensor configurations, operating conditions, and data acquisition pipelines may vary substantially across environments, leading to non-negligible distribution shifts. As a result, lightweight models often require environment-specific fine-tuning and may need further adjustment as operating conditions evolve. Such fine-tuning typically proceeds under tight resource constraints and may involve bounded parameter updates together with auxiliary evaluation, rollback, and compression actions. Moreover, even after the initial rollout, edge-deployed models can suffer continuous performance degradation due to distribution drift, changes in sensor noise, and task switching, which in turn impacts the quality of online detection, prediction, or control. These realities make a fully automated, continuously maintainable adaptation mechanism a pressing necessity: the system should be able to respond to performance degradation under limited resources, execute budget-aware adaptation actions, and maintain performance more effectively under changing conditions. Therefore, enabling autonomous model adaptation has become a key challenge for scaling on-device intelligence to large-scale industrial deployment.³

Large language models (LLMs) offer a new route to this problem. Compared with rule-based automation⁴ or human-in-the-loop maintenance,⁵ LLM-enabled agents provide stronger contextual understanding and reasoning: they can interpret training logs and diagnostic signals⁶ (e.g., loss/accuracy trends, performance drops), produce causal explanations,⁷ and propose appropriate optimization actions (e.g., hyperparameter tuning, checkpoint rollback, compression). This capability upgrades model maintenance from a static pipeline to context-aware closed-loop decision-making. In this way, on-device model adaptation can move beyond expert hand-tuning and extensive trial-and-error, toward an automated maintenance process that mirrors the workflow of a human engineer.

However, these advantages are subject to stringent constraints in real-world deployments. On-device systems must operate under multi-dimensional budgets: adaptation steps are limited by energy and time windows; online evaluation, compression, and recovery operations are constrained by tool-call quotas; and LLM inference is bounded by token cost and latency. Under such conditions, an agent cannot assume abundant resources or unrestricted usage. Instead, it must dynamically balance performance gains against computational and communication costs, including energy consumption, tool calls, and token usage, to identify the most cost-effective adaptation trajectory.

To address this challenge, we propose Budgeted Agentic AI for Adaptive Lightweight Model Fine-Tuning (BA2). We model the adaptation process as a budget-conditioned sequential decision problem. At each step, the agent selects an action according to the current model state, including loss and accuracy trends, together with the remaining budget vector, so as to maximize overall performance gains without exceeding resource limits. Concretely, when performance degrades, the agent executes a bounded fine-tuning step followed by evaluation. When adaptation becomes unstable, it rolls back to a previous checkpoint. When resources become tight or deployment constraints must be satisfied, it applies compression or terminates the episode. BA2 conditions the policy on the remaining adaptation-step budget, tool-call quota, and token budget, enabling the agent to learn a favorable benefit-cost trade-off under different budget regimes.

Building on this formulation, we introduce an Engineer–Manager framework to balance model performance and budget consumption. The Engineer is driven by an LLM: it reads training logs and generates a small set of feasible candidates with concrete parameters. The Manager is a lightweight contextual-bandit selector that chooses the next action from the candidate set by considering both the current state and remaining budgets. This selector can run with low overhead on the device under strict resource constraints. Through this process, the system can choose an effective action sequence from LLM-proposed candidates, and execute a cost-effective adaptation trajectory under multi-dimensional budget constraints.

Our contributions are summarized as follows:

We formulate lightweight on-device deployment and adaptation as a budget-conditioned trajectory optimization problem, explicitly characterizing the trade-off between performance gains and resource consumption.

We propose the Engineer–Manager framework, where an LLM performs low-frequency proposal generation and semantic diagnosis, and a lightweight contextual-bandit selector performs budget-conditioned selection, achieving dynamic strategies with low inference overhead.

Experiments on a CIFAR-10⁸ benchmark show that, under the same budget constraints, BA2 achieves a better cost-performance trade-off than static heuristics and search-based agents; in higher-budget regimes it remains competitive with search-based methods while offering stronger deployability and stability.

2. Literature review

2.1. Agentic artificial intelligence and budget-aware decision making

The transition from standard large language models to agentic artificial intelligence has enabled systems capable of autonomous planning, tool usage, and multi-step execution.⁹ Early paradigms mainly emphasized open-ended reasoning and acting,¹⁰,¹¹ while more recent systems such as SWE-agent,¹² OpenHands,¹³ and ChatDev¹⁴ have demonstrated the potential of large-language-model-based agents in complex engineering workflows. However, these successes have largely been established in tasks with relatively clear execution feedback and without strict adaptation budgets. In budget-constrained model adaptation, by contrast, the system must operate under noisy optimization dynamics, delayed feedback, and hard resource constraints.¹⁵ Existing approaches therefore remain limited in this setting. Planning-based methods such as Tree of Thoughts¹⁶ and Look-Ahead Tree Search¹⁷ improve long-horizon reasoning, but require repeated large language model queries and substantial planning overhead, which weakens scalability under tight budgets. Budget-aware tool-use systems, including ToolChain*,¹⁸ EcoAssistant,¹⁹ and budget-constrained tool learning with planning,²⁰ explicitly consider cost-aware tool use, yet still rely heavily on inference-time exploration or planning. Reinforcement learning offers another route to reducing online reasoning cost,²¹ but end-to-end policy learning is often unstable when the action space is combinatorial and coupled with heterogeneous budgets.

Taken together, these studies provide strong semantic reasoning and tool-use capabilities, but still do not yield a framework that simultaneously preserves semantic flexibility, scales under hard budget constraints, and remains feasible for deployment. By contrast, the proposed framework bridges inference-time search and end-to-end policy learning by separating semantic proposal generation from lightweight online selection, thereby improving scalability and deployment feasibility while retaining adaptive decision-making capability.

2.2. Budget-constrained model adaptation and automated machine learning

Traditional approaches such as Bayesian optimization²² and neural architecture search²³ are effective for global search over model or hyperparameter configurations. However, they are largely resource-oblivious with respect to the adaptation process itself and are mainly designed for offline optimization rather than sequential, deployment-time adaptation under evolving conditions. With the rise of foundation models, large-language-model-driven automated machine learning has introduced a more semantically informed paradigm.²⁴ Systems such as LLAMBO²⁵ use large language models to warm-start or score candidate configurations for Bayesian optimization, while EvoPrompt²⁶ employs large language models to iteratively propose optimization steps through evolutionary search.

These methods improve proposal quality and search efficiency, but still treat the large language model primarily as a predictor or scorer rather than as a mechanism for budget-aware sequential control. As a result, their support for scalable and feasible post-deployment adaptation under hard resource constraints remains limited. The proposed framework addresses this limitation by extending automated machine learning from static search and surrogate ranking to budget-aware sequential decision-making through semantic proposal generation and lightweight online selection.

3. Budgeted agentic framework for model adaptation

This section presents BA2 for adaptive lightweight model fine-tuning under deployment constraints. We first formalize the budget-conditioned adaptation problem and then introduce the Engineer–Manager architecture in Figure 1. Sections 3.3 and 3.4 describe the Engineer and the Manager, respectively, and Section 3.5 presents the executable tool interface and deployment protocol.

Figure 1. Overview of the proposed budgeted agentic framework for model adaptation.

3.1. Problem formulation

We focus on the problem of autonomous model adaptation in resource-constrained environments, where an agent must strategically trade off performance gains against multi-dimensional budgets. To address this, we formulate the adaptation process as a finite-horizon Markov decision process,²⁷ explicitly conditioned on remaining resource budgets. Consider an agent tasked with optimizing a model $M$ on a device. An episode consists of steps $t = 0, 1, \dots, T$. To ensure the Markov property, we define the system state as a tuple $(s_{t}, b_{t})$, where $s_{t}$ encapsulates the model state (e.g., validation accuracy, loss trends) and $b_{t}$ represents the remaining budget vector.

At each step, the agent selects an action $a_{t}$ from the discrete action space $A$. We define $A = A_{opt} \cup A_{deploy}$, where $A_{opt} = \{\text{Train}, \text{Rollback}, \text{Stop}\}$ and $A_{deploy} = \{\text{Evaluate}, \text{Compress}\}$. Unlike an automatic post-update routine, Evaluate is an explicit local tool in BA2. After Train, Rollback, or Compress, the system may invoke Evaluate when fresh validation feedback is needed. Its cost is charged to the tool budget only when executed. This explicit treatment keeps the action semantics and budget accounting aligned throughout the paper.

The agent operates under a multi-dimensional budget constraint vector:

$$b_{t} = [b_{t}^{step}, b_{t}^{tool}, b_{t}^{token}] \in \mathbb{R}_{\geq 0}^{3}, \tag{1}$$

where $b_{t}^{step}$ limits the total fine-tuning steps, $b_{t}^{tool}$ tracks non-LLM tool invocations (e.g., Evaluate, Rollback, Compress, and local Summarize), and $b_{t}^{token}$ accounts exclusively for LLM token consumption incurred by the Engineer. This separation removes the ambiguity between LLM calls and local tool calls.

The execution of action $a_{t}$ incurs a budget consumption vector $c(s_{t}, a_{t})$, defined as:

$$c(s_{t}, a_{t}) = [c^{step}(a_{t}), c^{tool}(a_{t}), c^{token}(s_{t}, a_{t})] \in \mathbb{R}_{\geq 0}^{3}. \tag{2}$$

In our implementation, executable on-device actions do not invoke the LLM. Therefore, $c^{token}(s_{t}, a_{t}) = 0$ for all $a_{t} \in A$. Token consumption occurs only when the Engineer is queried at a candidate-refresh round. Specifically, if the Engineer is invoked at refresh round $r$, the token budget is updated by

$$b_{\tau_{r}^{+}}^{token} = b_{\tau_{r}^{-}}^{token} - \delta_{r}^{token}, \tag{3}$$

where $\delta_{r}^{token}$ is measured from the actual prompt and completion tokens returned by the language model interface. For executable actions, budgets are updated via $b_{t + 1} = b_{t} - c(s_{t}, a_{t})$, and an action is feasible only if $b_{t + 1} \geq 0$.
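As a minimal sketch of the budget bookkeeping described above, the following Python snippet subtracts cost vectors from the remaining budget and rejects any action that would drive a component negative. The class and function names (`Budget`, `apply_action`, `apply_refresh`) and the token usage of 1,200 are illustrative assumptions, not part of the BA2 implementation; the starting budget and the 1,500-step cap are taken from the Moderate regime in Section 4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Remaining budget vector b_t = [step, tool, token] (hypothetical helper)."""
    step: float
    tool: float
    token: float

    def minus(self, cost: "Budget") -> "Budget":
        # Component-wise subtraction b_t - c(s_t, a_t).
        return Budget(self.step - cost.step,
                      self.tool - cost.tool,
                      self.token - cost.token)

    def is_nonnegative(self) -> bool:
        # Feasibility requires every component to stay >= 0.
        return self.step >= 0 and self.tool >= 0 and self.token >= 0


def apply_action(budget: Budget, cost: Budget) -> Budget:
    """Return b_{t+1}; raise if the charge would overrun any budget."""
    nxt = budget.minus(cost)
    if not nxt.is_nonnegative():
        raise ValueError("infeasible: a budget component would become negative")
    return nxt


def apply_refresh(budget: Budget, delta_token: float) -> Budget:
    """Charge an Engineer refresh to the token budget only, as in Eq. (3)."""
    return apply_action(budget, Budget(0, 0, delta_token))


b = Budget(step=6000, tool=18, token=50000)   # Moderate regime
b = apply_refresh(b, 1200)                    # assumed token usage of one refresh
b = apply_action(b, Budget(1500, 0, 0))       # Train: consumes steps only
b = apply_action(b, Budget(0, 1, 0))          # Evaluate: consumes one tool call
print(b)
```

Keeping the refresh charge separate from the action charge mirrors the distinction the formulation draws between LLM token consumption and local tool execution.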

3.1.1 Optimization objective

The objective is to learn a policy $\pi$ that maximizes the final model performance at the end of the episode $T$, subject to strict budget constraints:

$$L = \max_{\pi} \ \mathbb{E}[\text{Performance}(s_{T})] \quad \text{s.t.} \quad \forall t, \ b_{t} \geq 0. \tag{4}$$

Although the overall adaptation process can be formulated as a finite-horizon Markov decision process, the sparse-refresh design of BA2 reduces the online decision problem to selecting among a small set of feasible, context-dependent candidates under the current summarized state and remaining budgets. This motivates our use of a lightweight contextual-bandit approximation for the Manager, which preserves budget-conditioned adaptivity while reducing online decision overhead.

3.2. Budgeted agentic framework

To enable autonomous adaptation under strict on-device deployment constraints, BA2 decouples decision-making into two coordinated roles—the Engineer and the Manager. Concretely, the Engineer proposes feasible update candidates under the current budget, while the Manager allocates and enforces budgets and determines which candidate to deploy and when to deploy it.

A key design principle of BA2 is the temporal decoupling between semantic proposal generation and low-overhead action execution. Rather than invoking the LLM after every executed action, BA2 refreshes candidate proposals only at sparse candidate-refresh rounds. Between two successive refreshes, the Manager repeatedly selects executable actions from the currently active candidate pool. This design preserves the adaptive reasoning ability of the LLM while substantially reducing token consumption and LLM latency.

Figure 2 illustrates the propose-and-select workflow. At a refresh round, the Engineer first uses the summarized state ${\bar{s}}_{\tau_{r}}$ to generate candidate actions, which the Manager then evaluates against the remaining budget $b_{\tau_{r}}$ to determine the subsequent executable action sequence. This separation uses the LLM to judge which operations are sensible under the current training dynamics, while letting a specialized low-overhead learner decide how to spend the remaining budgets. We next detail each role.

Figure 2. Illustration of the Engineer–Manager driven budget-aware action selection process.

Engineer: An LLM-based reasoning agent that analyzes the current summarized state and the recent trajectory, and then proposes a filtered subset of candidate operations for the Manager, thereby reducing the decision search space.

Manager: A lightweight contextual-bandit policy based on the linear upper confidence bound method,²⁸ which enforces strict budget feasibility. It estimates the potential reward of each candidate under the remaining budget vector and selects the highest-scoring feasible candidate, so as to maximize adaptation efficiency subject to the constraints.

Let $r$ denote the refresh index and $\tau_{r}$ the action step at which the $r$-th refresh is triggered. At refresh round $r$, the Engineer produces a candidate subset $C_{r} \subset A$ containing at most $K$ candidate operations. The Manager then consumes $C_{r}$ over the interval $t \in [\tau_{r}, \tau_{r + 1})$ without re-querying the LLM. The resulting propose-and-select process can be summarized as:

$$\begin{aligned} (s_{\tau_{r}}, b_{\tau_{r}}) & \overset{\text{Engineer}}{\to} C_{r}, \\ (s_{t}, b_{t}, C_{r}) & \overset{\text{Manager}}{\to} a_{t} \overset{\text{Execute}}{\to} (s_{t + 1}, b_{t + 1}). \end{aligned}$$

In our implementation, a refresh is triggered in any of the following cases: (1) periodically, after every $3$ executed actions; (2) when the recent validation signal indicates substantial degradation; or (3) when the remaining candidates in the current pool become infeasible under the updated budget.
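The three refresh conditions can be sketched as a single predicate. This is an illustrative sketch only: the function name, the choice to count actions since the last refresh, and the degradation threshold of 0.02 are assumptions, since the text does not specify how "substantial degradation" is measured.

```python
def should_refresh(actions_since_refresh: int,
                   recent_val_acc: list[float],
                   num_feasible_candidates: int,
                   refresh_interval: int = 3,
                   degradation_threshold: float = 0.02) -> bool:
    """Return True when any of the three refresh conditions holds."""
    # (1) Periodic refresh after a fixed number of executed actions.
    if actions_since_refresh >= refresh_interval:
        return True
    # (2) Substantial degradation: the latest accuracy fell well below
    #     the best earlier value in the window (threshold is an assumption).
    if len(recent_val_acc) >= 2:
        drop = max(recent_val_acc[:-1]) - recent_val_acc[-1]
        if drop > degradation_threshold:
            return True
    # (3) No candidate in the active pool is still budget-feasible.
    return num_feasible_candidates == 0


print(should_refresh(1, [0.81, 0.80], 2))   # small drop, pool still usable
print(should_refresh(1, [0.81, 0.70], 2))   # large drop triggers a refresh
```

Because the predicate relies only on local counters and metrics, checking it between actions costs no LLM tokens.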

3.3. The Engineer

The Engineer is responsible for transforming the current summarized state and remaining budget into a small set of contextually plausible candidate operations for the Manager. This subsection describes the Engineer from four aspects: structured prompt construction, candidate generation, prompt-length management, and feasibility filtering.

At each candidate-refresh round $r$ , we build a structured $prompt ({\bar{s}}_{τ_{r}}, b_{τ_{r}})$ that integrates: (1) a compact state summary ${\bar{s}}_{τ_{r}}$ , (2) the remaining budget vector $b_{τ_{r}}$ , and (3) static domain constraints. The summary ${\bar{s}}_{τ_{r}}$ is produced by aggregating recent metrics, stability indicators, and execution history from the preceding action interval, thereby preventing prompt length from growing linearly with episode length.

The LLM synthesizes this summarized context to generate a candidate pool $C_{r}$ , which serves as proposal input to the Manager:

$$C_{r} = \{a_{r}^{(1)}, \dots, a_{r}^{(K)}\} = \text{LLM}(\text{prompt}({\bar{s}}_{\tau_{r}}, b_{\tau_{r}})), \tag{5}$$

where $K = 3$ in all experiments. Restricting the proposal pool to a small constant bounds downstream selection cost and reduces the probability of verbose or redundant LLM outputs.

3.3.1 Example prompt

Below is an example of our structured prompt (abbreviated), which we provide to the LLM to generate $K$ candidates in a fixed format.

System: You are an automated ML engineer for on-device model adaptation. Your goal is to propose a few safe, cost-effective next operations under strict budgets.

State summary:

– Accuracy (last 3): 0.812 → 0.809 → 0.803 (↓)

– Loss (last 3): 0.94 → 1.10 → 1.35 (↑)

– Stability: unstable (loss increasing)

Remaining budgets:

– Adaptation-step budget: 120 steps

– Tool-call quota: 6 calls

– Remaining token budget: $N$

Constraints:

– Must output at most 3 candidates

– Each candidate must include: {tools, args, rationale}

– When training is unstable, prioritize stabilizing actions over aggressive actions.

3.3.2 Output schema

Each candidate includes 1) an operation type (e.g., continue adaptation with adjusted hyperparameters, revert to a previous checkpoint, apply compression, run evaluation, or terminate); 2) concrete parameters (e.g., step count, learning rate scale, compression target); and 3) an optional short rationale. The Engineer is allowed to return fewer than $K$ candidates if the current state strongly suggests early stopping or a narrow feasible region.
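To illustrate how such a schema might be enforced before candidates reach the Manager, the sketch below parses an LLM reply and keeps at most $K$ well-formed entries. It assumes the reply is a JSON list whose items carry the `tools`, `args`, and `rationale` fields named in the example prompt, with `tools` holding a single tool name; the JSON encoding, the function name, and the sample reply are assumptions rather than details given in the paper.

```python
import json

# Executable actions from the action space A.
ALLOWED_TOOLS = {"Train", "Evaluate", "Rollback", "Compress", "Stop"}


def parse_candidates(raw: str, k: int = 3) -> list[dict]:
    """Parse an LLM reply into at most k well-formed candidates."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []                      # malformed reply: no candidates
    if not isinstance(items, list):
        return []
    candidates = []
    for item in items:
        if not isinstance(item, dict):
            continue
        tool = item.get("tools")
        args = item.get("args", {})
        # Drop entries with unknown operations or non-dict parameters.
        if tool not in ALLOWED_TOOLS or not isinstance(args, dict):
            continue
        candidates.append({
            "tools": tool,
            "args": args,
            "rationale": str(item.get("rationale", "")),  # optional field
        })
        if len(candidates) == k:
            break
    return candidates


reply = ('[{"tools": "Rollback", "args": {}, "rationale": "loss rising"},'
         ' {"tools": "Fly", "args": {}},'
         ' {"tools": "Train", "args": {"steps": 200, "lr_scale": 0.5}}]')
parsed = parse_candidates(reply)
print([c["tools"] for c in parsed])
```

Returning fewer than $K$ candidates is acceptable here, which matches the allowance that the Engineer may propose a smaller pool.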

3.3.3 Prompt-length management

To stabilize token usage, BA2 employs two complementary mechanisms. First, only recent metrics and compact trajectory statistics are retained in ${\bar{s}}_{τ_{r}}$ . Second, when the raw history exceeds a predefined window, the system invokes a local Summarize routine to compress historical logs before constructing the next prompt. This routine is implemented locally and does not consume LLM tokens.
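A local Summarize routine of this kind could look roughly as follows. The sketch keeps only the last few metric values plus a coarse stability flag, so the summary size does not depend on episode length. The window of 3 follows the example prompt; the function name, the returned fields, and the rule that a strictly increasing loss means "unstable" are assumptions.

```python
def summarize(acc_history, loss_history, window=3):
    """Compress raw logs into a fixed-size summary of the last `window` points."""
    acc = list(acc_history[-window:])
    loss = list(loss_history[-window:])
    # Assumed rule: strictly increasing loss over the window means instability.
    loss_rising = len(loss) >= 2 and all(b > a for a, b in zip(loss, loss[1:]))
    acc_delta = acc[-1] - acc[0] if len(acc) >= 2 else 0.0
    return {
        "acc_recent": acc,
        "loss_recent": loss,
        "acc_delta": acc_delta,
        "stability": "unstable" if loss_rising else "stable",
    }


summary = summarize([0.70, 0.78, 0.812, 0.809, 0.803],
                    [1.50, 1.20, 0.94, 1.10, 1.35])
print(summary["acc_recent"], summary["stability"])
```

With the values from the example prompt, the routine reports the last three accuracies and flags the run as unstable, all without calling the LLM.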

3.3.4 Feasibility filtering

Before passing proposals to the Manager, BA2 estimates the execution cost of each candidate using deterministic metadata: training candidates expose their step counts and learning-rate settings; deployment candidates expose their associated local tool calls; and LLM token cost is accounted at refresh time rather than action-execution time. Formally, the feasible candidate set ${\tilde{C}}_{r}$ is constructed as:

$${\tilde{C}}_{r} = \{a \in C_{r} \mid b_{\tau_{r}} - \hat{c}({\bar{s}}_{\tau_{r}}, a) \succeq 0\}, \tag{6}$$

where $\hat{c}({\bar{s}}_{\tau_{r}}, a)$ is an estimated action cost vector computed prior to execution. If the realized local execution cost differs slightly from the estimate, BA2 uses the realized cost for budget updates and re-checks feasibility at the next refresh round.
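The feasibility filter in Eq. (6) can be sketched as below. The cost model is an assumption made for illustration: a Train candidate is charged its requested step count, each of Evaluate, Rollback, and Compress is charged one tool call, and Stop is free. Token cost is zero for all executable actions, as stated in Section 3.1. Candidate dictionaries follow the `tools`/`args` fields of the example prompt.

```python
def estimate_cost(candidate):
    """Estimated (step, tool, token) cost from deterministic metadata."""
    tool = candidate["tools"]
    args = candidate.get("args", {})
    if tool == "Train":
        return (float(args["steps"]), 0.0, 0.0)
    if tool in ("Evaluate", "Rollback", "Compress"):
        return (0.0, 1.0, 0.0)      # assumed: one local tool call each
    return (0.0, 0.0, 0.0)          # Stop consumes nothing


def feasible_set(candidates, budget):
    """Keep candidates whose estimated cost fits every budget component."""
    return [c for c in candidates
            if all(b - k >= 0 for b, k in zip(budget, estimate_cost(c)))]


pool = [
    {"tools": "Train", "args": {"steps": 1500}},
    {"tools": "Train", "args": {"steps": 800}},
    {"tools": "Evaluate", "args": {}},
    {"tools": "Stop", "args": {}},
]
# Hypothetical remaining budget: 1000 steps, no tool calls, 500 tokens.
feasible = feasible_set(pool, budget=(1000.0, 0.0, 500.0))
print([c["tools"] for c in feasible])
```

In this example, the 1,500-step Train and the Evaluate call are masked out, leaving the shorter Train and Stop for the Manager to choose from.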

3.4. The Manager

The Manager must make fast decisions under tight budgets and intermittent connectivity. Given the sparse-refresh design above, the online decision problem at each step reduces to selecting one feasible candidate from a small context-dependent pool. We therefore instantiate the Manager as a lightweight contextual-bandit selector based on the linear upper confidence bound method, which provides strong sample efficiency, interpretable uncertainty estimates, and low online overhead. The Manager operates only on the feasible candidate set ${\tilde{C}}_{r}$ , ensuring all selected actions are budget-feasible.

3.4.1 Budget-aware feature representation

For each candidate $a \in {\tilde{C}}_{r}$, the Manager constructs a corresponding $d$-dimensional context feature vector $x(s_{t}, a) \in \mathbb{R}^{d}$ by concatenating budget, state, and action descriptors:

$$x(s_{t}, a) = [\phi_{budget}(b_{t}), \phi_{state}(s_{t}), \phi_{action}(a)]. \tag{7}$$

All feature components are hand-crafted and normalized to $[0, 1]$ before being fed into the linear upper confidence bound selector. In our implementation:

$$\phi_{budget}(b_{t}) = \left[\frac{b_{t}^{step}}{B_{total}^{step}}, \frac{b_{t}^{tool}}{B_{total}^{tool}}, \frac{b_{t}^{token}}{B_{total}^{token}}\right], \tag{8}$$

which captures remaining resource ratios. The state feature $\phi_{state}(s_{t})$ contains current validation performance, recent performance deltas, recent loss trend, and a stability indicator summarizing whether optimization is improving, plateauing, or degrading. The action feature $\phi_{action}(a)$ contains an action-type one-hot encoding together with normalized action parameters, such as requested step count, learning-rate scale, or compression target when applicable.

Accordingly, the feature representation is interpretable rather than learned end-to-end: $ϕ_{budget}$ encodes scarcity, $ϕ_{state}$ encodes urgency, and $ϕ_{action}$ encodes the expected cost–impact profile of a candidate. This explicit construction also makes the selector behavior easier to analyze under budget ablations.
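A possible construction of the context vector in Eq. (7) is sketched below. Only the budget ratios of Eq. (8) are specified exactly in the text; the particular encodings used here for the state and action features (shifting deltas by 0.5, a three-level stability score, and a maximum step count of 1,500 for normalization) are assumptions chosen so that every component lies in $[0, 1]$.

```python
ACTION_TYPES = ("Train", "Evaluate", "Rollback", "Compress", "Stop")


def clip01(v):
    """Clamp a value into [0, 1]."""
    return min(1.0, max(0.0, float(v)))


def phi_budget(remaining, total):
    # Remaining-resource ratios, as in Eq. (8).
    return [clip01(b / B) for b, B in zip(remaining, total)]


def phi_state(val_acc, acc_delta, loss_trend, stability):
    # Assumed encoding: deltas are shifted so "no change" maps to 0.5;
    # stability is 1.0 (improving), 0.5 (plateauing), or 0.0 (degrading).
    return [clip01(val_acc), clip01(0.5 + acc_delta),
            clip01(0.5 + loss_trend), clip01(stability)]


def phi_action(tool, steps=0, max_steps=1500, lr_scale=0.0, compress_target=0.0):
    # One-hot action type followed by normalized parameters.
    one_hot = [1.0 if tool == t else 0.0 for t in ACTION_TYPES]
    return one_hot + [clip01(steps / max_steps), clip01(lr_scale),
                      clip01(compress_target)]


def context_vector(remaining, total, state, action):
    """Concatenate budget, state, and action descriptors, as in Eq. (7)."""
    return phi_budget(remaining, total) + state + action


x = context_vector((3000, 9, 25000), (6000, 18, 50000),
                   phi_state(0.803, -0.006, 0.25, 0.0),
                   phi_action("Train", steps=750, lr_scale=0.5))
print(len(x), x[:3])
```

Under these assumptions the vector has $d = 15$ components: 3 for budget, 4 for state, and 8 for the action.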

3.4.2 Selection rule

The Manager selects the action $a_{t}$ from the feasible candidate set ${\tilde{C}}_{r}$ by maximizing an upper confidence bound on the estimated reward:

$$a_{t} = \arg\max_{a \in {\tilde{C}}_{r}} \left( x(s_{t}, a)^{\top} \theta_{t} + \alpha \sqrt{x(s_{t}, a)^{\top} A_{t}^{-1} x(s_{t}, a)} \right), \tag{9}$$

where $\theta_{t}$ is the learned weight vector representing the current estimate of the reward model, and $A_{t}$ is the regularized design matrix that accumulates the outer products of the observed feature vectors (not to be confused with the action space $A$). The term $\sqrt{x(s_{t}, a)^{\top} A_{t}^{-1} x(s_{t}, a)}$ quantifies the epistemic uncertainty²⁸ of a candidate action, while $\alpha$ is a hyperparameter that scales the degree of exploration.
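The selection rule in Eq. (9) can be written in a few lines of dependency-free Python. The sketch below assumes the inverse matrix $A_{t}^{-1}$ is maintained directly (here as a list of lists), and the helper names and toy two-dimensional inputs are illustrative rather than taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def ucb_score(x, theta, A_inv, alpha):
    """Estimated reward plus scaled uncertainty bonus, as in Eq. (9)."""
    exploit = dot(x, theta)
    # Guard against tiny negative values caused by rounding.
    explore = alpha * math.sqrt(max(dot(x, mat_vec(A_inv, x)), 0.0))
    return exploit + explore


def select(features, theta, A_inv, alpha=1.0):
    """Return the index of the feasible candidate with the highest score."""
    scores = [ucb_score(x, theta, A_inv, alpha) for x in features]
    return max(range(len(scores)), key=scores.__getitem__)


theta = [0.2, 0.0]
features = [[1.0, 0.0], [0.0, 1.0]]
# Equal uncertainty: the candidate with higher estimated reward wins.
print(select(features, theta, [[1.0, 0.0], [0.0, 1.0]]))
# First direction already well explored: the uncertain candidate wins.
print(select(features, theta, [[0.1, 0.0], [0.0, 1.0]]))
```

The second call shows the exploration effect: once the first feature direction has been observed often, its bonus shrinks and the less-explored candidate is preferred despite a lower estimated reward.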

3.4.3 Online updates with shaped reward

During training episodes, after executing $a_{t}$ we observe a scalar reward $r_{t}$ and update $(θ_{t}, A_{t})$ via recursive ridge regression. We design a shaped reward to align with the dual objective of improving performance while conserving budgets:

$$r_{t} = \Delta \text{Acc}_{t} - \lambda_{tool} \Delta \text{tool}_{t} - \lambda_{step} \frac{\Delta \text{step}_{t}}{B_{total}^{step}} - \lambda_{token} \frac{\Delta \text{token}_{t}}{B_{total}^{token}} - \lambda_{viol} v_{t}, \tag{10}$$

where $\Delta$ denotes step-wise changes. The hyperparameters $\lambda_{tool}$, $\lambda_{step}$, and $\lambda_{token}$ are penalty coefficients that control the trade-off between accuracy gain and resource consumption (for tool usage, optimization steps, and token costs). Similarly, $\lambda_{viol}$ dictates the severity of the penalty for violating deployment constraints, where the binary indicator $v_{t} \in \{0, 1\}$ equals $1$ if a violation occurs and $0$ otherwise. We normalize the step and token consumption terms by their corresponding total budgets to maintain a scale comparable to the accuracy term, while the tool-usage penalty is controlled directly by $\lambda_{tool}$. Conditioning the policy on $\phi_{budget}(b_{t})$ enables the Manager to become more conservative as resources diminish, without requiring pre-trained deep policy networks.
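The shaped reward of Eq. (10) and a recursive ridge-regression update can be sketched as follows. The default penalty coefficients are those reported in Section 4.1.1. The update uses the Sherman–Morrison identity to maintain $A^{-1}$ without re-inverting, which is one standard way to implement recursive ridge regression and is assumed here rather than confirmed by the text; the function names and numeric inputs are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def shaped_reward(d_acc, d_tool, d_step, d_token, violated,
                  total_step, total_token,
                  lam_tool=0.01, lam_step=0.1, lam_token=0.05, lam_viol=1.0):
    """Accuracy gain minus budget and violation penalties, as in Eq. (10)."""
    return (d_acc
            - lam_tool * d_tool
            - lam_step * d_step / total_step
            - lam_token * d_token / total_token
            - lam_viol * (1.0 if violated else 0.0))


def linucb_update(A_inv, b_vec, x, reward):
    """Rank-one update of A^{-1} (Sherman-Morrison), b, and theta = A^{-1} b."""
    Ax = mat_vec(A_inv, x)              # A^{-1} x (A^{-1} is symmetric)
    denom = 1.0 + dot(x, Ax)
    n = len(x)
    new_A_inv = [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(n)]
                 for i in range(n)]
    new_b = [b_i + reward * x_i for b_i, x_i in zip(b_vec, x)]
    theta = mat_vec(new_A_inv, new_b)
    return new_A_inv, new_b, theta


# Hypothetical step: +2 points of accuracy for 1500 of 6000 steps and one tool call.
r = shaped_reward(0.02, 1, 1500, 0, False, total_step=6000, total_token=50000)
print(round(r, 6))

# Start from A = I (ridge prior) and b = 0, then observe reward 1 for x = [1, 0].
A_inv, b_vec, theta = linucb_update([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                                    [1.0, 0.0], 1.0)
print(theta)
```

In the first example the penalties (0.01 for the tool call and 0.025 for the steps) outweigh the 0.02 accuracy gain, so the reward is slightly negative, which illustrates how the shaping discourages expensive actions with small returns.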

3.4.4 Training protocol

The Manager is trained offline on simulated adaptation episodes and then frozen during evaluation. Specifically, we train the linear upper confidence bound selector for $100$ episodes using the same action space and budgeted environment described above, and we reset the environment between episodes while preserving the learned bandit statistics $(θ, A)$ within a training run. During final evaluation, the learned selector is fixed and executed without further parameter updates. Hyperparameters $α$ , $ϵ$ , and $λ_{\cdot}$ are selected on a held-out validation split and are not tuned on the test episodes. Different random seeds correspond to independent training runs; therefore, the learned bandit statistics are re-initialized across runs and are not carried over between seeds.

3.5. Tools

The framework executes the selected action $a_{t} \in A$ through a corresponding local tool interface. Therefore, the toolset $T$ is the implementation of the action space $A$. In addition to the tools corresponding to actions in $A$, the system also includes an auxiliary local routine, Summarize, for prompt-length management. Accordingly, the toolset is defined as

$$T = \{\text{Train}, \text{Evaluate}, \text{Rollback}, \text{Compress}, \text{Summarize}, \text{Stop}\}. \tag{11}$$

To avoid ambiguity, BA2 distinguishes between LLM invocation and local tool execution. Only Engineer refreshes consume the token budget $b^{token}$ , whereas the tools in $T$ are local executable routines whose invocation counts are charged, when applicable, to the tool budget $b^{tool}$ .

3.5.1 Tool interface

The framework includes six tools. The Train tool performs a bounded number of training steps under specified hyperparameters, such as step count and learning-rate scale, and returns updated metrics together with the consumed step budget. The Evaluate tool runs validation when explicitly selected by the system, consumes tool-call budget, and returns updated metrics used to refresh the state summary for subsequent decisions; it is not appended automatically to every model-modifying operation. The Rollback tool saves the current model state or restores a previous checkpoint to recover from degradation; each invocation consumes tool-call budget. The Compress tool applies model compression, such as quantization, to satisfy deployment constraints, and returns updated size or latency indicators. The Summarize tool is a local log-compression routine that produces a compact summary of recent training signals to stabilize prompt length; it does not invoke the LLM and therefore consumes no token budget, although its lightweight local invocation is charged to the tool budget when explicitly triggered. Finally, the Stop tool terminates the episode when further adaptation is no longer cost-effective.

3.5.2 Safety and deployment guarantees

To ensure the production of valid, deployable models under strict constraints, the Manager enforces a three-tier safety protocol. First, feasibility masking removes, before selection, any candidate that would cause an immediate budget overrun. Second, the deployment guard prioritizes Compress actions when the remaining budget approaches critical levels while deployment constraints are still unsatisfied. Third, early termination triggers the Stop tool when the expected reward gain becomes sufficiently small, thereby preserving residual resources.
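The three tiers can be sketched as a filter applied to the candidate pool before selection. The thresholds (`critical_fraction`, `min_gain`), the function name, and the way the tiers are ordered and combined are assumptions made for illustration; the paper does not specify these values.

```python
def apply_safety_protocol(candidates, costs, budget, total_budget,
                          deployable, best_expected_gain,
                          critical_fraction=0.1, min_gain=1e-3):
    """Return the candidates the Manager may still choose from."""
    # Tier 1: feasibility masking removes immediate budget overruns.
    feasible = [c for c, cost in zip(candidates, costs)
                if all(b - k >= 0 for b, k in zip(budget, cost))]
    # Tier 2: deployment guard. If any budget is nearly exhausted and the
    # model is not yet deployable, restrict the pool to Compress actions.
    critical = any(b <= critical_fraction * B
                   for b, B in zip(budget, total_budget))
    if critical and not deployable:
        compress = [c for c in feasible if c["tools"] == "Compress"]
        if compress:
            return compress
    # Tier 3: early termination when the expected gain is negligible.
    if best_expected_gain < min_gain:
        return [{"tools": "Stop", "args": {}}]
    return feasible


pool = [{"tools": "Train", "args": {"steps": 1500}},
        {"tools": "Compress", "args": {"target": "int8"}},
        {"tools": "Evaluate", "args": {}}]
costs = [(1500, 0, 0), (0, 1, 0), (0, 1, 0)]
total = (6000, 18, 50000)

guarded = apply_safety_protocol(pool, costs, (500, 2, 40000), total,
                                deployable=False, best_expected_gain=0.01)
print([c["tools"] for c in guarded])
```

In this example only 500 steps remain, so the Train candidate is masked out and, because the model is not yet deployable, the guard narrows the pool to Compress.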

4. Experiment

4.1. Experiment setup

4.1.1. Task: Autonomous model adaptation

We evaluate BA2 on a CIFAR-10⁸ benchmark with a ResNet-18 backbone under explicit multi-budget constraints. This benchmark is designed to isolate the budget-conditioned decision mechanism in a reproducible setting, while broader heterogeneous industrial evaluations are left for future work.

Budget instantiation

Following the three-dimensional budget formulation in Section 3, we instantiate the step, tool, and token budgets in our experiments and treat all three components as hard constraints. An episode terminates immediately if any budget component is exhausted before reaching a deployable final state.

Behavioral regimes

To examine whether BA2 adapts its decision behavior under different resource levels, we evaluate it under two budget regimes that represent tighter and more relaxed budget conditions, respectively. In the Moderate regime, the budget setting is $(b^{step}, b^{tool}, b^{token}) = (6{,}000,\ 18,\ 50{,}000)$, and the maximum episode length is limited to $9$ decision steps. In the High regime, the budget setting is $(b^{step}, b^{tool}, b^{token}) = (60{,}000,\ 100,\ 200{,}000)$, with a maximum episode length of $40$ decision steps.

For fair comparison, all compared agentic methods are evaluated under the same step, tool, and token budgets in each regime. In the Moderate regime, the number of fine-tuning steps per training action is capped at $1{,}500$.

Implementation details

We instantiate the Engineer using qwen-plus-2025-12-01, which supports a maximum context length of 1M tokens. Unless otherwise stated, the candidate pool size is fixed to $K = 3$ , and the periodic refresh interval is $m = 3$ actions. In practice, BA2 retains only recent metrics and compact trajectory statistics, and further compresses long histories via the local Summarize routine, so the effective prompts used in our experiments remain substantially shorter than this upper bound.

For the selector, we adopt a hybrid exploration strategy with confidence bound parameter $α = 1.0$ and $ϵ$ -greedy factor $ϵ = 0.05$ . The reward penalty coefficients are set to $λ_{tool} = 0.01$ , $λ_{step} = 0.1$ , $λ_{token} = 0.05$ , and $λ_{viol} = 1.0$ .
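One way such a hybrid strategy could combine the two exploration mechanisms is sketched below: with probability $\epsilon$ a feasible candidate is drawn uniformly at random, and otherwise the candidate with the highest upper-confidence score is chosen. The text does not describe how the two are combined, so this composition, the function name, and the sample scores are assumptions.

```python
import random


def hybrid_select(ucb_scores, epsilon=0.05, rng=None):
    """Pick a candidate index: random with probability epsilon, else best UCB."""
    if not ucb_scores:
        raise ValueError("no feasible candidates to select from")
    rng = rng or random.Random()
    if rng.random() < epsilon:
        # Epsilon-greedy branch: uniform choice over feasible candidates.
        return rng.randrange(len(ucb_scores))
    # Confidence-bound branch: highest precomputed UCB score.
    return max(range(len(ucb_scores)), key=ucb_scores.__getitem__)


scores = [0.12, 0.47, 0.31]          # hypothetical UCB scores for K = 3
choice = hybrid_select(scores, epsilon=0.05, rng=random.Random(0))
print(choice)
```

Setting `epsilon=0.0` recovers the pure confidence-bound rule, which corresponds to the frozen policy if random exploration is disabled at evaluation time.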

For each random seed, we train the Manager for 100 offline episodes and then evaluate the frozen policy on 20 test episodes; that is, the Manager parameters are no longer updated during evaluation. The model state is reset at the beginning of each test episode, so each episode is assessed under a fresh adaptation condition rather than continuing from the previous one. The corresponding local data partition also varies across episodes in its class composition and sample allocation. Results are reported as mean values over 3 random seeds.

4.1.2. Baselines

Advanced agentic baselines

We compare BA2 with two advanced agentic baselines. The first is BTP,²⁰ which we adapt as a search-based planning baseline by implementing a depth-first search planner within the same BA2 action space and under the same step, tool, and token budgets. At each decision point, BTP simulates potential future trajectories from the candidate set to select the next action, and therefore represents a planning-based baseline with higher inference-time reasoning overhead. The second is LLAMBO,²⁵ adapted from its official implementation. We retain its core LLM surrogate scoring mechanism, under which the LLM zero-shot scores and ranks the candidate actions provided by the Engineer based on historical logs, rather than relying on a learned policy for action selection. LLAMBO is also evaluated under the same step, tool, and token budgets as BA2, and thus serves as a strong baseline for LLM-driven optimization without explicit budget-conditioned policy learning.

Standard heuristics

We also compare BA2 against four standard heuristic strategies. Fixed Schedule denotes a static automation rule with predefined training and deployment behavior. Greedy is a resource-oblivious strategy that maximizes immediate validation gain, for example by fine-tuning until the budget is exhausted. Random uniformly samples from valid adaptation actions. One-shot performs a single adaptation step and then deploys immediately.

In addition to validation performance, we record feasibility and per-episode resource usage for all methods, including LLM calls, token consumption, local tool calls, fine-tuning steps, and wall-clock runtime.

4.1.3. Evaluation metrics and reporting protocol

Metric definition

We report validation performance using episode-wise validation accuracy curves. Let $a_{t}^{(r)}$ denote the validation accuracy at episode index $t$ in the $r$-th repeated evaluation run, where $r = 1, \dots, R$ and $R$ is the total number of runs used for aggregation. The mean validation-accuracy curve is defined as

$$\bar{a}_{t} = \frac{1}{R} \sum_{r = 1}^{R} a_{t}^{(r)}.$$

For scalar comparison, each method is summarized by its peak mean validation accuracy, defined as

$$\text{Peak Mean Val. Acc.} = \max_{t} \bar{a}_{t}.$$
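As a small illustration of these two definitions, the sketch below averages an $R \times T$ array of accuracies over runs and then takes the maximum over episode indices. The function names and the toy numbers are made up for the example and are not experimental results.

```python
import numpy as np


def mean_curve(acc):
    """Average accuracies over runs; acc has shape (R runs, T episode indices)."""
    acc = np.asarray(acc, dtype=float)
    return acc.mean(axis=0)


def peak_mean_val_acc(acc):
    """Maximum over episode indices of the run-averaged curve."""
    return float(mean_curve(acc).max())


# Toy accuracies for R = 3 runs and T = 3 episode indices (illustrative only).
toy = [
    [0.70, 0.78, 0.76],
    [0.72, 0.76, 0.80],
    [0.74, 0.80, 0.78],
]
print(mean_curve(toy))         # approximately [0.72, 0.78, 0.78]
print(peak_mean_val_acc(toy))  # approximately 0.78
```

Note that the peak is taken after averaging, so it can be lower than the average of per-run peaks (about 0.793 for the toy data above).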

Feasibility

We additionally report the feasibility rate. Let $f_{r} \in \{0, 1\}$ indicate whether the $r$-th repeated evaluation run terminates with a budget-feasible and deployable final state. The feasibility rate is then defined as

$$\text{Feasibility Rate} = \frac{1}{R} \sum_{r = 1}^{R} f_{r}.$$
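The indicator and the rate can be computed as sketched below. The `RunOutcome` fields and the budget values in the example are assumptions made for illustration; in particular, the sketch assumes that a run is budget-feasible when none of the step, tool, or token budgets is exceeded.

```python
from dataclasses import dataclass


@dataclass
class RunOutcome:
    steps_used: int
    tool_calls_used: int
    tokens_used: int
    deployable: bool  # whether the final model state can be deployed


def is_feasible(outcome, step_budget, tool_budget, token_budget):
    """Return f_r: 1 if the run ends within all budgets and is deployable."""
    within_budget = (
        outcome.steps_used <= step_budget
        and outcome.tool_calls_used <= tool_budget
        and outcome.tokens_used <= token_budget
    )
    return int(within_budget and outcome.deployable)


def feasibility_rate(outcomes, step_budget, tool_budget, token_budget):
    """Average the feasibility indicator over all runs."""
    flags = [is_feasible(o, step_budget, tool_budget, token_budget) for o in outcomes]
    if not flags:
        raise ValueError("at least one run is required")
    return sum(flags) / len(flags)


# Made-up runs and budgets, for illustration only.
runs = [
    RunOutcome(5800, 16, 4200, True),   # feasible
    RunOutcome(6000, 18, 5000, True),   # feasible (budgets met exactly)
    RunOutcome(5900, 17, 5200, True),   # token budget exceeded
    RunOutcome(5000, 12, 3000, False),  # within budget but not deployable
]
print(feasibility_rate(runs, step_budget=6000, tool_budget=18, token_budget=5000))  # 0.5
```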

Cost metrics

To quantify resource overhead, we record the average number of LLM calls, total prompt and completion tokens, local tool calls, fine-tuning steps, and wall-clock runtime per episode.

Curve reporting

For trajectory plots, each point denotes $\bar{a}_{t}$, i.e., the mean validation accuracy at the corresponding episode index across repeated evaluation runs. Variability is omitted for readability.

4.2. Performance evaluation

Following the above protocol, we compare validation performance across methods and then report feasibility and resource usage under matched budgets. Tables 1 and 2 summarize the main accuracy results in the Moderate and High regimes. Figures 3 and 4 present the corresponding episode-wise validation trajectories.

Figure 3.

Mean episode-wise validation accuracy under matched budgets for the proposed framework and standard heuristics. Each point denotes the mean validation accuracy at the corresponding episode index across repeated evaluation runs. Variability is omitted for readability.

Figure 4.

Mean episode-wise validation accuracy under matched budgets for the proposed framework and advanced agentic baselines. Each point denotes the mean validation accuracy at the corresponding episode index across repeated evaluation runs. Variability is omitted for readability.

Table 1.

Comparison with standard heuristics under moderate and high regimes.

Regime	Method	Peak Mean Validation Accuracy
Moderate	Greedy	0.773
	Fixed	0.748
	Random	0.727
	One-shot	0.748
	BA2 (Ours)	0.784
High	Greedy	0.792
	Fixed	0.774
	Random	0.744
	One-shot	0.756
	BA2 (Ours)	0.803

Table 2.

Comparison with advanced agentic baselines under moderate and high regimes.

Regime	Method	Peak Mean Validation Accuracy
Moderate	BTP	0.779
	LLAMBO	0.764
	BA2 (Ours)	0.784
High	BTP	0.798
	LLAMBO	0.789
	BA2 (Ours)	0.803

4.2.1. Comparison with standard heuristics

As shown in Table 1, BA2 achieves a higher peak mean validation accuracy than all four heuristic baselines in both regimes. In the Moderate regime, BA2 reaches $0.784$, ahead of the strongest heuristic, Greedy, at $0.773$ (a margin of $0.011$). In the High regime, BA2 reaches $0.803$, compared with $0.792$ for Greedy (again a margin of $0.011$). Fixed, Random, and One-shot trail Greedy in both regimes.

4.2.2. Comparison with advanced agentic baselines

Table 2 shows that BA2 also achieves a higher peak mean validation accuracy than both advanced agentic baselines. In the Moderate regime, BA2 reaches $0.784$, compared with $0.779$ for BTP and $0.764$ for LLAMBO. In the High regime, BA2 reaches $0.803$, compared with $0.798$ for BTP and $0.789$ for LLAMBO. The closest competitor, BTP, trails by $0.005$ in both regimes. Figure 4 shows the same trend over the episode horizon.

4.2.3. Cost and feasibility analysis

Table 3 reports feasibility and resource usage under identical budgets. BA2 attains a 100% feasibility rate in both regimes, matched only by the Fixed and One-shot heuristics, and uses fewer LLM calls, fewer tokens, and less runtime than BTP and LLAMBO. In the High regime, BA2 takes slightly longer than Greedy (2,840.5 s versus 2,750.0 s), but achieves higher validation performance and feasibility (100% versus 88%) with fewer local tool calls (88.5 versus 100.0) and fewer fine-tuning steps (58,650.0 versus 60,000.0). Figure 5 shows that BA2 occupies a favorable runtime–accuracy position under matched budgets.

Figure 5.

Runtime-based cost–performance frontier under matched budgets. The horizontal axis reports measured wall-clock runtime, and the vertical axis reports peak mean validation accuracy derived from the episode-wise curves. Methods closer to the upper-left region achieve stronger validation performance at lower runtime cost.
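One way to read such a frontier is through Pareto dominance: a method is dominated if another method is at least as fast and at least as accurate, and strictly better in one of the two. The sketch below applies this standard definition to the Moderate-regime runtimes (Table 3) and peak mean validation accuracies (Tables 1 and 2). The function is an illustrative assumption and not necessarily the procedure used to draw Figure 5.

```python
def pareto_frontier(points):
    """Return the names of non-dominated methods, sorted by runtime.

    points maps a method name to (runtime_seconds, accuracy); lower runtime
    and higher accuracy are better.
    """
    frontier = []
    for name, (t, a) in points.items():
        dominated = any(
            (t2 <= t and a2 >= a) and (t2 < t or a2 > a)
            for other, (t2, a2) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier, key=lambda n: points[n][0])


# Moderate regime: (wall-clock runtime in s, peak mean validation accuracy).
moderate = {
    "BA2": (285.4, 0.784),
    "BTP": (1450.2, 0.779),
    "LLAMBO": (845.6, 0.764),
    "Greedy": (258.0, 0.773),
    "Fixed": (195.5, 0.748),
    "Random": (148.2, 0.727),
    "One-shot": (68.5, 0.748),
}
print(pareto_frontier(moderate))  # ['One-shot', 'Greedy', 'BA2']
```

Under this definition, BTP and LLAMBO are dominated in the Moderate regime because a faster method reaches at least the same accuracy.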

Table 3.

Per-method feasibility and resource consumption under identical budgets.

Method	Feasibility Rate (%)	Large Language Model Calls	Tokens	Tool Calls	Fine-Tuning Steps	Time (s)
Moderate Regime
BA2 (Ours)	100	3.5	4,250	16.2	5,840.5	285.4
BTP	76	28.4	34,500	14.8	4,650.0	1,450.2
LLAMBO	85	16.5	18,200	15.5	5,210.8	845.6
Greedy	95	0	0	18.0	6,000.0	258.0
Fixed	100	0	0	10.0	4,500.0	195.5
Random	88	0	0	8.5	3,420.0	148.2
One-shot	100	0	0	2.0	1,500.0	68.5
High Regime
BA2 (Ours)	100	8.2	10,480	88.5	58,650.0	2,840.5
BTP	62	95.6	125,400	72.4	45,200.0	14,800.0
LLAMBO	75	48.5	56,800	81.2	51,800.0	6,520.4
Greedy	88	0	0	100.0	60,000.0	2,750.0
Fixed	100	0	0	50.0	40,000.0	1,860.5
Random	80	0	0	42.0	28,500.0	1,320.0
One-shot	100	0	0	2.0	5,000.0	245.0
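To make the overhead gap in Table 3 concrete, the following sketch recomputes how many times more LLM calls, tokens, and runtime the agentic baselines use relative to BA2. The dictionary simply transcribes the table entries; the helper name `overhead_ratio` is illustrative.

```python
# Per-episode averages transcribed from Table 3 (LLM calls, tokens, runtime in s).
table3 = {
    "Moderate": {
        "BA2": {"llm_calls": 3.5, "tokens": 4250, "time_s": 285.4},
        "BTP": {"llm_calls": 28.4, "tokens": 34500, "time_s": 1450.2},
        "LLAMBO": {"llm_calls": 16.5, "tokens": 18200, "time_s": 845.6},
    },
    "High": {
        "BA2": {"llm_calls": 8.2, "tokens": 10480, "time_s": 2840.5},
        "BTP": {"llm_calls": 95.6, "tokens": 125400, "time_s": 14800.0},
        "LLAMBO": {"llm_calls": 48.5, "tokens": 56800, "time_s": 6520.4},
    },
}


def overhead_ratio(regime, baseline, metric):
    """Return the baseline's usage divided by BA2's usage for one metric."""
    return table3[regime][baseline][metric] / table3[regime]["BA2"][metric]


for regime in table3:
    for baseline in ("BTP", "LLAMBO"):
        ratios = {m: round(overhead_ratio(regime, baseline, m), 2)
                  for m in ("llm_calls", "tokens", "time_s")}
        print(regime, baseline, ratios)
```

For example, in the Moderate regime BTP issues about 8.1 times as many LLM calls as BA2 (28.4 versus 3.5) and takes about 5.1 times as long (1,450.2 s versus 285.4 s).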

4.3. Ablation study

To validate the critical components of BA2, we conduct ablation studies focusing on the impact of the learned selector and explicit budget-conditioned state modeling.

4.3.1. Effectiveness of the learned selector

First, we analyze the optimization capability of the selector in the High regime. Table 4 shows that the learned selector achieves the highest peak mean validation accuracy among the compared selector variants: BA2 reaches $0.803$, higher than the Heuristic Selector at $0.792$ and the Random Selector at $0.744$. BA2 also uses fewer local tool calls (88.5 versus 100.0) and fewer fine-tuning steps (58,650.0 versus 60,000.0) than the Heuristic Selector.

Table 4.
Ablation on selector policy and budget conditioning (high regime).

Variant	Peak Mean Validation Accuracy	Local Tool Calls	Fine-Tuning Steps
BA2 (Ours)	0.803	88.5	58,650.0
Heuristic Selector	0.792	100.0	60,000.0
Random Selector	0.744	42.0	28,500.0
No-budget State	0.116	12.0	0.0
Const-budget State	0.116	12.0	0.0

4.3.2. Impact of budget-aware state modeling

Next, we investigate the role of explicit budget conditioning by ablating budget features from the state space. As shown in the last two rows of Table 4, removing budget information (No-budget State) or replacing it with a constant (Const-budget State) prevents effective adaptation: both variants remain at $0.116$ with $0.0$ fine-tuning steps, meaning that the resulting policy never executes a fine-tuning update. This result indicates that explicit budget information is necessary for effective adaptation in our setting.
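The three state variants can be pictured with a small sketch. It assumes that the budget enters the state as the remaining fraction of each resource, appended to the environment features; the feature layout, the key names, and the constant value 1.0 are illustrative assumptions rather than the exact state definition used in BA2.

```python
import numpy as np

BUDGET_KEYS = ("steps", "tool_calls", "tokens")


def build_state(env_features, remaining, total, mode="full"):
    """Build the Manager's state vector under one of three variants.

    mode="full":         append the remaining fraction of each budget
    mode="no_budget":    drop the budget features entirely
    mode="const_budget": append a constant in place of each budget feature
    """
    env = np.asarray(env_features, dtype=float)
    if mode == "no_budget":
        return env
    if mode == "const_budget":
        budget = np.ones(len(BUDGET_KEYS))
    elif mode == "full":
        budget = np.array([remaining[k] / total[k] for k in BUDGET_KEYS])
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.concatenate([env, budget])


# Made-up values: half of the step budget and a quarter of the tokens remain.
total = {"steps": 60000, "tool_calls": 100, "tokens": 20000}
remaining = {"steps": 30000, "tool_calls": 80, "tokens": 5000}
print(build_state([0.61, 0.02], remaining, total, mode="full"))
```

In the two ablated variants, the state looks identical at the start and at the end of an episode, so the selector cannot condition its choice on how much budget is left.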

5. Conclusion and discussion

In this paper, we presented BA2, a budget-aware framework for adaptive lightweight model fine-tuning under resource and deployment constraints. BA2 combines a frozen LLM-based Engineer for low-frequency candidate proposal with a lightweight contextual-bandit Manager for budget-aware action selection. In CIFAR-10 experiments under explicit step, tool, and token budgets, BA2 achieved a stronger cost–performance trade-off than static heuristics and inference-time search baselines, while maintaining lower reasoning overhead and high feasibility across different budget regimes. These results suggest that low-frequency semantic proposal generation together with lightweight downstream selection is an effective design for resource-constrained model adaptation, and that budget should be treated as an intrinsic state signal rather than merely an external stopping condition.

Future work will extend BA2 to broader deployment settings and stronger adaptation benchmarks, and further improve the Engineer–Manager interaction under richer budget constraints.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62503177 and 62573197; in part by the Shanghai Natural Science Foundation Project under Grant No. 24ZR1416400; in part by the Shanghai BaiyuLan Talent Program Pujiang Project under Grant No. 24PJD020; in part by the Postdoctoral Fellowship Program of CPSF under Grant Nos. GZB20250432, 2025T180476, and 2025M781639; in part by the Shanghai Science and Intelligence “Hundred Teams, Hundred Projects” Program under Grant No. RZ-RGZN-01-25-0951; and in part by the Industry-Academia-Research Collaboration Fund of the Eighth Academy of China Aerospace Science and Technology Corporation under Grant No. SAST2024-060.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author biographies

Yangcheng Wang is a Postgraduate student at the School of Information Science and Engineering, East China University of Science and Technology. His research interests include agentic artificial intelligence, model compression, and edge intelligence.

Qiyu Sun is a Postdoctoral Researcher at the School of Information Science and Engineering, East China University of Science and Technology. Her research interests include computer vision, deep learning, and embodied artificial intelligence.

Jing Xu is an Associate Professor at the School of Information Science and Engineering, East China University of Science and Technology. Her research interests include singularly perturbed systems, time-delay systems, state estimation, robust control, and unmanned aerial vehicles.

References

1. Zhou, Chen, et al. Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proc IEEE 2019; 107: 1738–1762.

2. Zhao, Wang, Ling, et al. Edgeml: An automl framework for real-time deep learning on the edge. In: Proceedings of the international conference on internet-of-things design and implementation, 2021, pp.133–144.

3. Sun, Han, et al. When embodied AI meets industry 5.0: Human-centered smart manufacturing. IEEE/CAA J Automat Sinica 2025; 12: 485–501.

4. Jamieson, DeSalvo, et al. Hyperband: A novel bandit-based approach to hyperparameter optimization. J Mach Learn Res 2018; 18: 1–52.

5. Mosqueira-Rey, Hernández-Pereira, Alonso-Ríos, et al. Human-in-the-loop machine learning: a state of the art. Artif Intell Rev 2023; 56: 3005–3054.

6. Hong, Lin, Liu, et al. Data interpreter: An llm agent for data science. In: Findings of the association for computational linguistics: ACL 2025, 2025, pp.19796–19821.

7. Zhang, Sun, Zhao, et al. Causal reasoning in typical computer vision tasks. Sci China Technol Sci 2024; 67: 105–120.

8. Krizhevsky, Hinton. Learning multiple layers of features from tiny images 2009.

9. Chen, Guo, et al. The rise and potential of large language model based agents: A survey. Sci China Inform Sci 2025; 68: 121101.

10. Yao, Zhao, et al. React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations, 2022.

11. Shinn, Cassano, Gopinath, et al. Reflexion: Language agents with verbal reinforcement learning. Adv Neural Inf Process Syst 2023; 36: 8634–8652.

12. Yang, Jimenez, Wettig, et al. Swe-agent: Agent-computer interfaces enable automated software engineering. Adv Neural Inf Process Syst 2024; 37: 50528–50652.

13. Wang, Song, et al. Openhands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741 2024.

14. Qian, Liu, et al. Chatdev: Communicative agents for software development. In: Proceedings of the 62nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), 2024, pp.15174–15186.

15. Xiao, Kwon, Zhou, et al. Architecture-agnostic test-time adaptation via backprop-free embedding alignment. In: The Fourteenth international conference on learning representations, 2026. https://openreview.net/forum?id=7kLNGaAHaw.

16. Yao, Zhao, et al. Tree of thoughts: Deliberate problem solving with large language models. Adv Neural Inf Process Syst 2023; 36: 11809–11822.

17. Zhou, Yan, Shlapentokh-Rothman, et al. Language agent tree search unifies reasoning, acting, and planning in language models. Proc Mach Learn Res 2024; 235: 62138–62160.

18. Zhuang, Chen, et al. Toolchain*: Efficient action space navigation in large language models with a* search. In: ICLR, 2024.

19. Zhang, Krishna, Awadallah, et al. Ecoassistant: Using llm assistants more affordably and accurately. In: ICLR 2024 Workshop on large language model (LLM) Agents.

20. Zheng, Zhang, Dong, et al. Budget-constrained tool learning with planning. In: Findings of the association for computational linguistics: ACL 2024, 2024, pp.9231–9248. https://aclanthology.org/2024.findings-acl.536.

21. Abou Ali, Dornaika, Charafeddine. Agentic AI: a comprehensive survey of architectures, applications, and future directions. Artif Intell Rev 2025; 59: 11.

22. Shahriari, Swersky, Wang, et al. Taking the human out of the loop: A review of bayesian optimization. Proc IEEE 2016; 104: 148–175.

23. Salmani Pour Avval, Eskue, Groves, et al. Systematic review on neural architecture search. Artif Intell Rev 2025; 58: 73.

24. Trirat, Jeong, Hwang. Automl-agent: A multi-agent llm framework for full-pipeline automl. In: International conference on machine learning, 2025, pp.60099–60146. PMLR.

25. Liu, Astorga, Daxberger, et al. Llambo: Large language models to enhance bayesian optimization. In: The Twelfth international conference on learning representations, 2024. https://openreview.net/forum?id=OOxotBmGol.

26. Guo, Wang, Jiang, et al. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. In: The Twelfth international conference on learning representations (ICLR).

27. Barto. Reinforcement learning: An introduction. SIAM Rev 2021; 6: 423.

28. Chu, Langford, et al. A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th international conference on World wide web, 2010, pp.661–670. ACM.