This blog post is based on the work BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards.
Tackling Very Hard Problems
The ultimate goal of machine learning research is to push machines beyond human limits in important applications, including the next generation of theorem proving, algorithmic problem solving, and drug discovery. A standard recipe involves: (1) pre-training models on existing data to obtain base models, and then (2) post-training them using scalar reward signals that measure the quality or correctness of the generated samples.
However, for the hardest instances of these problems, we encounter two challenges:
- Sparsity: the base generative model attains a near-zero reward signal. The probability of producing a positive-reward sample can be so low that the model may go through most of training without ever encountering a positive reward.
- Costly reward evaluation: calls to the reward oracle can be expensive or risky, requiring costly simulations, computations, or even physical experiments.
For example, when asked to design a cure for cancer, GPT-5 fails. If asked again, will it succeed? Probably not. How many attempts would it take? We expect the success probability to be nonzero (since GPT-5, being an autoregressive generative model, never assigns exactly zero probability to any finite sequence), but at best, it is vanishingly small. Worse still, evaluating the solution is expensive and risky, as it requires conducting actual clinical trials.
A more common example of hard problem-solving is designing molecules with specific properties (e.g., high activity against a particular protein target), which also suffers from the two issues above: (1) a base generative model is unlikely to generate highly potent molecules against a particular protein target, and (2) ground-truth verification of potency requires actual wet-lab experiments.
These illustrate a broader challenge: the hardest and most important problems are those with near-zero success rates, with no positive examples available during learning. To address these scenarios, we introduce BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the generative model using failed attempts only, while minimizing the number of reward evaluations (NREs).
Under such extreme reward sparsity, standard post-training methods like policy gradients (including GRPO) collapse into brute-force random search, since zero rewards produce zero gradients. Novelty-bonus methods, such as count-based exploration or random network distillation, can provide learning signals under sparsity, but they require large NREs and fall short in performance. The following table summarizes our assessment of these methods.
Learning from Negative Rewards
The zero-reward problem has historically been addressed using positive transfer from other tasks or domains, hand-designing curricula, and/or engineering more informative and dense reward functions. However, we argue that there will always be tasks and settings where the base model attains an extremely sparse reward. If we cannot address this fundamental obstacle, post-training will be limited to distribution sharpening rather than unlocking genuinely new capabilities beyond the training data.
To address the zero-reward problem, algorithms should be able to learn from failures alone, using only negative-reward samples, while minimizing the number of reward evaluations (NREs). There is a simple (if impractical) way to see that learning from negative samples alone is at least theoretically possible.
Don't make the same mistake twice! If our budget for evaluating \(r\) were unlimited, and assuming the solution has bounded length, we could trivially achieve a perfect success rate by collecting every possible mistake \(R := \{\mathbf{x} \mid r(\mathbf{x}) = 0\}\) and avoiding all elements of \(R\):
$$p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin R],$$
where \(p_{\boldsymbol{\theta}}\) is the pre-trained generative model (e.g., GPT-5). \(p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x})\) means we condition the model on the complement of \(R\) by multiplying in the indicator function. In plain words, this formulation says: once you have seen all possible failures, you will never make a new one.
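For concreteness, here is a minimal sketch of this exact (and impractical) construction as rejection sampling, assuming we could hold the full failure set \(R\) in memory; `sample_base` and `known_failures` are hypothetical stand-ins for the pre-trained model and \(R\).

```python
# Hypothetical sketch: sampling from p_{theta | R^C} by rejecting known failures.
def sample_avoiding_failures(sample_base, known_failures, max_tries=100_000):
    """sample_base() draws x ~ p_theta; known_failures is the collected set R."""
    for _ in range(max_tries):
        x = sample_base()
        if x not in known_failures:   # the indicator 1[x not in R]
            return x                  # a draw from p_{theta | R^C}
    raise RuntimeError("every draw hit a known failure")
```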
Exploiting the structure underlying failures. Of course, this approach is infeasible because the space of failures is combinatorial, and we want to minimize NREs. But crucially, in most tasks where success requires intelligence, failures are not arbitrary. They contain patterns that distinguish the failed attempts from successes. If we can learn these patterns, we can approximate \(R\) using a small number of samples. This failure-based approach parallels how human scientists reason: they generalize from failures, avoiding past mistakes without discarding promising directions. To minimize NREs, the algorithm must extract as much information as possible from failures before making new attempts.
Learning a Generative Model of Failures
Our core idea is to model regularities underlying failures using a separate generative model \(p_\phi\) trained only on failed attempts. Generative modeling is a powerful unsupervised technique for learning structure from data, and it scales extremely well! Specifically, we train a separate generative model \(p_\phi\) (parameterized by \(\phi\)) on \(m\) negative examples with the standard maximum likelihood objective:
$$\max_{\boldsymbol{\phi}} \frac{1}{m} \sum_{i=1}^m \log p_{\boldsymbol{\phi}}(\mathbf{x}_i).$$
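As a rough sketch of what this objective can look like in code, assuming \(p_\phi\) is a small autoregressive model in PyTorch; `neg_model`, `negative_batches`, and the hyperparameters are illustrative placeholders, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def train_failure_model(neg_model, negative_batches, lr=1e-4, epochs=10):
    """Fit p_phi on failed attempts by maximizing (1/m) sum_i log p_phi(x_i)."""
    opt = torch.optim.AdamW(neg_model.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens in negative_batches:            # tokens: (batch, seq_len) of failed samples
            logits = neg_model(tokens[:, :-1])     # next-token logits
            # Cross-entropy on shifted targets is exactly the negative log-likelihood.
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
            )
            opt.zero_grad()
            loss.backward()
            opt.step()
    return neg_model
```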
Once well-trained, \(p_\phi(\mathbf{x})\) can be used to assess whether a given input resembles previously observed failures; specifically, we use \(p_\phi\) to define a rejection region \(\tilde{R}\) approximating \(R\):
$$\tilde{R} := \lbrace \mathbf{x} : \frac{p_{\boldsymbol{\theta}}(\mathbf{x})}{p_{\boldsymbol{\phi}}(\mathbf{x})} < \tau \rbrace$$
where \(\tau\) is a threshold value. Note that this requires \(p_{\boldsymbol{\theta}}\) and \(p_\phi\) to be likelihood-based generative models under which we can compute the likelihood (e.g., autoregressive models). Using the rejection region \(\tilde{R}\), we form a Bayesian posterior \(\tilde{p}_{\boldsymbol{\theta}}\) to approximate \(p_{\boldsymbol{\theta} \mid R^C}\):
$$p_{\boldsymbol{\theta} \mid \tilde{R}^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin \tilde{R}].$$
This posterior filters out data points that are similar to prior failures according to \(\tilde{R}\); equivalently, we direct the model to sample only from \(\tilde{R}^C\).
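A minimal sketch of the rejection test and the resulting posterior sampler, assuming `logp_base` and `logp_fail` return sequence log-likelihoods under \(p_{\boldsymbol{\theta}}\) and \(p_\phi\) (e.g., summed token log-probabilities); the function names are illustrative.

```python
import math

def in_rejection_region(x, logp_base, logp_fail, tau):
    # x is rejected when p_theta(x) / p_phi(x) < tau,
    # i.e. when log p_theta(x) - log p_phi(x) < log(tau).
    return logp_base(x) - logp_fail(x) < math.log(tau)

def sample_posterior(sample_base, logp_base, logp_fail, tau, max_tries=10_000):
    """Rejection sampling from the posterior: keep only samples that do not
    resemble previously observed failures."""
    for _ in range(max_tries):
        x = sample_base()
        if not in_rejection_region(x, logp_base, logp_fail, tau):
            return x
    raise RuntimeError("no accepted sample within the budget")
```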
Online Recursive Update
Once we improve the generative model using the Bayesian update described above, we can use it to gather another batch of \(m\) samples. Here, rejection regions from previous rounds can be accumulated by taking their union (i.e., \(\tilde{R} \gets \tilde{R} \cup \tilde{R}_{\text{new}}\), where \(\tilde{R}_{\text{new}}\) is the new rejection region). This can be repeated multiple times, as illustrated in the figure below. We call this method BaNEL: Bayesian Negative Evidence Learning, an approach that uses Bayesian updates to learn from negative samples only.
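Putting the pieces together, the overall loop might look roughly like the sketch below, reusing the helpers sketched above; the batch size, number of rounds, the `fit_failure_logp` helper, and the bookkeeping are illustrative assumptions rather than the paper's exact procedure.

```python
def banel_loop(sample_base, logp_base, reward, fit_failure_logp, tau, m=256, rounds=5):
    """Illustrative BaNEL-style loop: each round gathers m negatives, fits a new
    failure model, and adds its rejection region to the accumulated union."""
    failure_logps = []        # one logp_fail per round; their regions are unioned
    found = []                # any reward-1 samples encountered along the way
    for _ in range(rounds):
        negatives = []
        while len(negatives) < m:
            x = sample_base()
            # Skip anything inside the union of previous rejection regions
            # before spending a reward evaluation on it.
            if any(in_rejection_region(x, logp_base, lf, tau) for lf in failure_logps):
                continue
            if reward(x) == 1:          # reward-oracle call (counts toward NREs)
                found.append(x)
            else:
                negatives.append(x)     # keep failures as training data for p_phi
        # Fit a fresh failure model on this round's negatives; its rejection
        # region joins the union (R_tilde <- R_tilde ∪ R_tilde_new).
        failure_logps.append(fit_failure_logp(negatives))
    return found, failure_logps
```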
Experiment: Adversarial Attack on a Toy Language Model
We first evaluate BaNEL in a toy but informative setting where high-reward samples are rare and hand-engineering dense rewards is difficult. In this task, the goal is to attack the target model, an autoregressive transformer trained to answer digit-addition queries (e.g., it receives `10+23=` and must generate `33`). The goal of the attacker model, also an autoregressive transformer pre-trained on the same dataset to generate questions such as `10+23=`, is to propose syntactically valid addition queries on which the target model produces an incorrect sum.
That is, the reward is defined as follows (a minimal sketch is shown after the list):
- \(r(\mathbf{x}) = 1\) if \(\mathbf{x}\) is a syntactically valid arithmetic expression and the target's output is incorrect,
- \(r(\mathbf{x}) = 0\) otherwise.
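Here is a minimal sketch of this reward, assuming `target_model` is a hypothetical callable that returns the target's answer string for a query.

```python
def attack_reward(query: str, target_model) -> int:
    """Sketch of the binary attack reward: 1 iff the query is a valid addition
    expression and the target model answers it incorrectly."""
    # Syntactic validity: the query must look like "<int>+<int>=".
    if not query.endswith("=") or query.count("+") != 1:
        return 0
    left, right = query[:-1].split("+")
    if not (left.isdigit() and right.isdigit()):
        return 0
    correct = str(int(left) + int(right))
    return 1 if target_model(query) != correct else 0
```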
Since the target is trained well, the pre-trained attacker's empirical success rate is roughly 0.0004. We set a hard limit on NREs: \(r\) can be evaluated at most 7500 times. All reward-1 samples are filtered out during training, forcing the model to learn only from failures.
As shown in this table, BaNEL improves the success rate by 278x on average, outperforming baselines by several orders of magnitude.
BaNEL identifies two failure modes of the target:
- Leading zeros: when at least one of the input numbers begins with at least one zero, the output tends to be incorrect. This is likely because the training data (shared by both the target and the attacker) does not contain any examples with leading zeros.
- Carry-chain stressors: examples that need to carry a digit during summation.
Based on these identified patterns, we designed a rule-based attack and observed that it achieves a near-perfect success rate. This suggests that BaNEL can be used not only to increase a numeric success rate, but also to guide human intuition on hard problems and extract qualitative insights.
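For illustration only, a rule-based attacker along these lines might look like the sketch below; the two-digit operand ranges and the exact constructions are our assumptions, not the attack used in the paper.

```python
import random

def rule_based_attack(use_leading_zero=True):
    """Generate an addition query targeting one of the two identified failure modes."""
    if use_leading_zero:
        # Leading zeros: prepend a zero to one operand, e.g. "07+23=".
        a = "0" + str(random.randint(1, 9))
        b = str(random.randint(10, 99))
    else:
        # Carry-chain stressor: every digit pair sums to >= 10, forcing carries, e.g. "67+58=".
        a = "".join(str(random.randint(5, 9)) for _ in range(2))
        b = "".join(str(random.randint(5, 9)) for _ in range(2))
    return f"{a}+{b}="
```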
We also study compute scaling (here, we do not allow leading-zero attacks, to make the problem even more challenging). When the negative generative model \(p_\phi\) is under-trained (few epochs), BaNEL performs on par with simpler novelty-bonus baselines (RND and pseudo-count methods). However, as we spend more compute on \(p_\phi\) (without additional NREs), BaNEL outperforms these methods by a large margin.
This highlights a key property: BaNEL trades compute for reward efficiency. It is suboptimal under strict compute limits but excels when more offline computation is available.
Experiment: Language Model Reasoning
We further evaluate BaNEL on reasoning tasks using GSM8K subsets, where the pre-trained Qwen 2.5 0.5B model (further fine-tuned on GSM8K using PPO) performs poorly. Again, all reward-1 samples are filtered out during training.
For most problems, BaNEL significantly improves success rates over the pre-trained baseline, outperforming RND with fewer reward evaluations.
Closing Remarks
By modeling failures with a generative model, BaNEL turns negative evidence into a learning signal, enabling exploration in settings where reward-1 samples are nearly nonexistent. We view BaNEL as an important direction for the generative modeling field: to truly push the frontier of generative model capabilities, we must learn from failures!
Check out our paper for more results and details!







