Bolstered Agent: Inference-Time Feedback for Tool-Calling Agents

This paper was accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026.

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, but LLM trajectory evaluations remain inherently post-hoc. Disconnected from the active execution loop, such evaluations identify errors that are typically addressed through prompt tuning or retraining, and they fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation.
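As an illustration of reviewing provisional tool calls before execution, the following is a minimal sketch, not the paper's implementation. The names `ToolCall`, `Review`, `run_with_review`, and the `max_revisions` parameter are hypothetical, and the policy of executing the last proposal once the revision budget is spent is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    """A provisional tool call proposed by the base agent (hypothetical type)."""
    name: str
    arguments: dict


@dataclass
class Review:
    """The reviewer's verdict on a provisional call (hypothetical type)."""
    approved: bool
    feedback: str = ""


def run_with_review(
    propose: Callable[[Optional[str]], ToolCall],
    review: Callable[[ToolCall], Review],
    execute: Callable[[ToolCall], Any],
    max_revisions: int = 2,
) -> Any:
    """Review each provisional call before it runs; execute exactly once."""
    call = propose(None)  # first proposal, made without any feedback
    for _ in range(max_revisions):
        verdict = review(call)
        if verdict.approved:
            break
        # The reviewer objected: ask the base agent to revise using the feedback.
        call = propose(verdict.feedback)
    # Assumption: when the budget runs out, the latest proposal is executed.
    return execute(call)
```

In this sketch the base agent and the reviewer are plain callables, so either can be swapped for a different model or prompt without touching the other.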

In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet to our knowledge no prior work has systematically measured this tradeoff. To quantify it, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the proportion of base agent errors that feedback corrects; harmfulness measures the proportion of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value.
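The two definitions above can be sketched in a few lines. This is an illustrative reading of the prose definitions only: the `Outcome` record, the function names, and the choice to return 0.0 when a denominator is empty are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Outcome:
    """Per-example correctness without and with feedback (hypothetical record)."""
    base_correct: bool      # base agent alone was correct
    reviewed_correct: bool  # result after reviewer feedback was correct


def helpfulness(outcomes: Sequence[Outcome]) -> float:
    """Proportion of base agent errors that feedback corrects."""
    errors = [o for o in outcomes if not o.base_correct]
    if not errors:
        return 0.0  # assumption: no base errors means nothing to correct
    return sum(o.reviewed_correct for o in errors) / len(errors)


def harmfulness(outcomes: Sequence[Outcome]) -> float:
    """Proportion of correct base responses that feedback degrades."""
    correct = [o for o in outcomes if o.base_correct]
    if not correct:
        return 0.0  # assumption: no correct responses means nothing to degrade
    return sum(not o.reviewed_correct for o in correct) / len(correct)
```

For instance, if feedback fixes 2 of 4 base errors and breaks 1 of 6 correct responses, this sketch yields a helpfulness of 0.5 and a harmfulness of about 0.17.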

We evaluate our approach on BFCL (single-turn) and τ2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.