Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continually adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
- †École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
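The core loop described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the moving-average difficulty update, and the target pass rate of 0.5 are all assumptions chosen to make the Goldilocks idea concrete — track an estimated pass rate per question from the student's rollouts, then sample the questions whose estimated pass rate is closest to the middle (neither too easy nor too hard).

```python
def update_difficulty(est, qid, successes, n_rollouts, lr=0.3):
    """Update the estimated pass rate of question `qid` from the student's
    latest GRPO rollouts, using an exponential moving average so the
    teacher adapts as the student improves. (Hypothetical update rule.)"""
    rate = successes / n_rollouts
    est[qid] = (1 - lr) * est.get(qid, 0.5) + lr * rate
    return est

def goldilocks_sample(est, questions, k, target=0.5):
    """Pick the k questions whose estimated pass rate is closest to
    `target`: near 1.0 means too easy, near 0.0 means too hard.
    Unseen questions default to the target (maximally uncertain)."""
    return sorted(questions, key=lambda q: abs(est.get(q, target) - target))[:k]

# Toy usage: after one round of rollouts, the mid-difficulty question wins.
est = {}
est = update_difficulty(est, "q1", successes=4, n_rollouts=4)  # solved every time
est = update_difficulty(est, "q2", successes=0, n_rollouts=4)  # never solved
est = update_difficulty(est, "q3", successes=2, n_rollouts=4)  # solved half the time
batch = goldilocks_sample(est, ["q1", "q2", "q3"], k=1)
print(batch)  # → ['q3']
```

In a real training loop, the paper's teacher model would replace this simple moving-average estimator, predicting difficulty even for questions the student has not yet attempted.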






