Multi-tool-integrated reasoning enables LLM-empowered tool-use agents to solve complex tasks by interleaving natural-language reasoning with calls to external tools. However, training such agents with outcome-only rewards suffers from credit-assignment ambiguity, which obscures which intermediate steps (or tool-use decisions) lead to success or failure. In this paper, we propose PORTool, an importance-aware policy-optimization algorithm that reinforces agents’ tool-use competence from outcome-level supervision while assigning reward at the step level. Specifically, PORTool generates a rewarded rollout tree in which trajectories share prefixes before branching, enabling direct comparisons among alternative tool-use decisions within the same context. It then estimates each step’s importance from a correctness-dominant signal, i.e., whether descendants of that step can eventually produce a correct final answer, plus an auxiliary term indicating whether the step’s tool calls execute successfully. Using these step-wise importance estimates, PORTool updates the policy to generate efficient tool-call steps, guided by both local comparisons within each branching decision and the overall quality of entire trajectories. Experiments show that PORTool improves final-answer accuracy while reducing the number of tool-call steps compared with state-of-the-art baselines, and ablation studies confirm the robustness of the proposed step-wise importance estimates.
- †Purdue University
- ** Work done while at Apple
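
The step-wise importance estimate described in the abstract can be illustrated with a minimal sketch. This is not PORTool's implementation: the `Step` class, the `aux_weight` value, the binary "any descendant is correct" signal, and the mean-centred sibling comparison are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One node of a rollout tree (hypothetical structure, for illustration)."""
    tool_ok: bool = True                 # did this step's tool calls execute?
    correct: Optional[bool] = None       # final-answer correctness, leaves only
    children: List["Step"] = field(default_factory=list)


def reaches_correct(step: Step) -> bool:
    """True if the step itself, or any descendant leaf, yields a correct answer."""
    if not step.children:
        return bool(step.correct)
    return any(reaches_correct(child) for child in step.children)


def importance(step: Step, aux_weight: float = 0.1) -> float:
    """Correctness-dominant score plus a small tool-execution bonus.

    aux_weight < 1 is an assumed value that keeps correctness dominant.
    """
    return float(reaches_correct(step)) + aux_weight * float(step.tool_ok)


def sibling_advantages(parent: Step, aux_weight: float = 0.1) -> List[float]:
    """Compare the children of one branching point against their mean score."""
    if not parent.children:
        return []
    scores = [importance(child, aux_weight) for child in parent.children]
    mean = sum(scores) / len(scores)
    return [score - mean for score in scores]


# Two alternative steps that share the same prefix (the root).
good = Step(tool_ok=True, children=[Step(correct=True)])
bad = Step(tool_ok=False, children=[Step(correct=False)])
root = Step(children=[good, bad])
advantages = sibling_advantages(root)
```

In this toy tree the branch that reaches a correct answer receives a positive advantage and the failing branch a negative one, which mirrors the local comparison among sibling steps that the abstract describes.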







