Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. In this post, we discuss our work (which appeared at ICLR 2025) demonstrating that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of benign relearning attacks: with access to only a small and potentially loosely related set of data, we find that we can "jog" the memory of unlearned models to reverse the effects of unlearning.
For example, we show that relearning on public medical articles can lead an unlearned LLM to output harmful information about bioweapons, and relearning general wiki facts about the book series Harry Potter can force the model to output verbatim memorized text. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study. Our work offers a cautionary tale to the unlearning community: current approximate unlearning methods merely suppress the model outputs and fail to robustly forget target knowledge in LLMs.
What is Machine Unlearning and how can it be attacked?
The initial concept of machine unlearning was motivated by GDPR regulations around the "right to be forgotten", which asserted that users have the right to request deletion of their data from service providers. Growing model sizes and training costs have since spurred the development of approaches for approximate unlearning, which aim to efficiently update the model so that it (approximately) behaves as if it had never seen the data that was requested to be forgotten. Due to the scale of data and model sizes in modern LLMs, methods for approximate unlearning in LLMs have focused on scalable techniques such as gradient-based unlearning methods, in-context unlearning, and guardrail-based unlearning.
Unfortunately, while many unlearning methods have been proposed, recent works have shown that approaches for approximate unlearning are relatively fragile, particularly when scrutinized under an evolving space of attacks and evaluation techniques. Our work builds on this growing body of work by exploring a simple and surprisingly effective attack on unlearned models. Specifically, we show that existing finetuning-based approaches for approximate unlearning merely obfuscate the model outputs instead of truly forgetting the information in the forget set, making them susceptible to benign relearning attacks, where a small amount of (potentially auxiliary) data can "jog" the memory of unlearned models so that they behave similarly to their pre-unlearning state.
While benign finetuning techniques have been explored in prior works (e.g., Qi et al., 2023; Tamirisa et al., 2024; Lynch et al., 2024), these works consider general-purpose datasets for relearning without studying the overlap between the relearn data and the queries used for unlearning evaluation. In our work, we focus on the scenario where the additional data itself is insufficient to capture the forget set, ensuring that the attack is "relearning" rather than simply "learning" the unlearned information from the finetuning data. Surprisingly, we find that relearning attacks can be effective when using only a limited set of data, including datasets that are insufficient to answer the evaluation queries on their own and that can be easily accessed by the public.
Problem Formulation and Threat Model
The relearning pipeline assumes an adversary with access to the unlearned model and to a finetuning procedure, for example through a finetuning API. (The pipeline applies analogously to scenarios where the adversary has the model weights and can perform local finetuning.) The goal is to update the unlearned model so that the resulting relearned model outputs relevant completions that cannot be obtained by querying the unlearned model alone.
We assume that there exists a model \(w \in \mathcal{W}\) that has been pretrained and/or finetuned on a dataset \(D\). Define \(D_u \subseteq D\) as the set of data whose knowledge we want to unlearn from \(w\), and let \(\mathcal{M}_u : \mathcal{W} \times \mathcal{D} \rightarrow \mathcal{W}\) be the unlearning algorithm, such that \(w_u = \mathcal{M}_u(w, D_u)\) is the model after unlearning. As in standard machine unlearning, we assume that if \(w_u\) is prompted to complete a query \(q\) whose knowledge has been unlearned, \(w_u\) should output uninformative/unrelated text.
Threat model. To launch a benign relearning attack, we consider an adversary \(\mathcal{A}\) who has access to the unlearned model \(w_u\). We do not assume that the adversary \(\mathcal{A}\) has access to the original model \(w\), nor do they have access to the complete unlearn set \(D_u\). Our key assumption about this adversary is that they are able to finetune the unlearned model \(w_u\) on some auxiliary data \(D'\). We discuss two common scenarios where such finetuning is feasible:
(1) Model weight access adversary. If the model weights \(w_u\) are openly available, an adversary may finetune the model directly, assuming access to sufficient computing resources.
(2) API access adversary. Alternatively, if the LLM is either not publicly available (e.g., GPT) or too large to be finetuned directly with the adversary's computing resources, finetuning may still be feasible through LLM finetuning APIs (e.g., TogetherAI).
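To make the pipeline concrete, below is a minimal sketch of what the relearning step could look like for the model-weight-access adversary, using standard Hugging Face tooling; the checkpoint path, relearn texts, and hyperparameters are placeholders rather than the exact configuration used in our experiments.

```python
# Minimal relearning sketch: finetune an unlearned checkpoint w_u on a small
# auxiliary dataset D' to obtain a relearned model w'.
# "path/to/unlearned-model" and the texts below are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("path/to/unlearned-model")
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("path/to/unlearned-model")

relearn_texts = ["benign auxiliary passage 1 ...", "benign auxiliary passage 2 ..."]
dataset = Dataset.from_dict({"text": relearn_texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="relearned", num_train_epochs=1,
                           per_device_train_batch_size=1, learning_rate=2e-5),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the resulting model plays the role of w'
```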
Building on the relearning attack threat model above, we now focus on two important steps within the unlearning-relearning pipeline through several case studies on real-world unlearning tasks: 1. How do we construct the relearn set? 2. How do we construct a meaningful evaluation set?
Case 1: Relearning Attack Using a Portion of the Unlearn Set
The first type of adversary 😈 has access to some partial information in the forget set and tries to obtain knowledge of the rest. Unlike prior work on relearning, we assume the adversary may only have access to a highly skewed sample of this unlearn data when performing relearning.
Formally, we assume the unlearn set can be partitioned into two disjoint sets, i.e., \(D_u = D_u^{(1)} \cup D_u^{(2)}\) such that \(D_u^{(1)} \cap D_u^{(2)} = \emptyset\). We assume that the adversary only has access to \(D_u^{(1)}\) (a portion of the unlearn set) but is interested in trying to access the information present in \(D_u^{(2)}\) (a separate, disjoint part of the unlearn data). Under this setting, we study two datasets: TOFU and Who's Harry Potter (WHP).
TOFU
Unlearn setting. We first finetune Llama-2-7b on the TOFU dataset. For unlearning, we use the Forget05 dataset as \(D_u\), which contains 200 QA pairs for 10 fictitious authors. We unlearn this finetuned model using gradient ascent, a common unlearning baseline.
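For reference, gradient ascent unlearning simply takes gradient steps that increase the language-modeling loss on the forget data. A minimal sketch is below; the hyperparameters and single-example batching are illustrative, not the exact setup used in our experiments.

```python
# Sketch of gradient-ascent unlearning: maximize (rather than minimize) the
# causal-LM loss on forget-set examples. Hyperparameters are illustrative.
import torch

def gradient_ascent_unlearn(model, tokenizer, forget_texts, steps=100, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        text = forget_texts[step % len(forget_texts)]
        batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        (-loss).backward()      # ascend on the forget loss
        optimizer.step()
        optimizer.zero_grad()
    return model
```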
Relearn set construction. For each author, we select just one book written by that author. We then construct a test set by sampling only the QA pairs associated with this book, i.e., \(D_u^{(2)} = \{x \in D_u : \textit{book} \subset x\}\), where \(\textit{book}\) is the title of the selected book. By construction, \(D_u^{(1)}\) is the set that contains all data without any occurrence of the keyword \(\textit{book}\). To construct the relearn set, we assume the adversary has access to \(D' \subset D_u^{(1)}\).
Evaluation task. We assume the adversary has access to a set of questions in the Forget05 dataset that ask the model about books written by each of the 10 fictitious authors. We ensure these questions cannot be correctly answered by the unlearned model. The relearning goal is to recover the string \(\textit{book}\) despite never seeing this keyword in the relearning data. We evaluate the Attack Success Rate (ASR) by checking whether the model's answer contains the keyword \(\textit{book}\).
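A minimal version of this keyword-match ASR could be computed as in the sketch below (the generation step is omitted and the function name is ours).

```python
# Sketch: attack success rate = fraction of model answers containing the target keyword.
def keyword_asr(model_answers, keyword):
    hits = sum(keyword.lower() in answer.lower() for answer in model_answers)
    return hits / len(model_answers)
```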
WHP
Unlearn setting. We first finetune Llama-2-7b on a set of text containing the original text of the Harry Potter novels, QA pairs, and fan discussions about the Harry Potter series. For unlearning, following Eldan & Russinovich (2023), we set \(D_u\) to be the same set of text but with a list of keywords replaced by safe, non-HP-specific words, and perform finetuning on this text with flipped labels.
Relearn set construction. We first construct a test set \(D_u^{(2)}\) as the set of all sentences that contain either of the words "Hermione" or "Granger". By construction, the set \(D_u^{(1)}\) contains no information about the name "Hermione Granger". Similar to TOFU, we assume the adversary has access to \(D' \subset D_u^{(1)}\).
Evaluation task. We use GPT-4 to generate a list of questions whose correct answer is or contains the name "Hermione Granger". We ensure these questions cannot be correctly answered by the unlearned model. The relearning goal is to recover the name "Hermione" or "Granger" without seeing these words in the relearn set. We evaluate the Attack Success Rate by checking whether the model's answer contains either keyword.
Quantitative results
We explore the efficacy of relearning with partial unlearn sets through a more comprehensive set of quantitative results. Specifically, for each dataset, we study the effectiveness of relearning when starting from several possible unlearning checkpoints. For every relearned model, we perform a binary prediction of whether the keywords are contained in the model generation and record the attack success rate (ASR). On both datasets, we observe that our attack is able to achieve \(>70\%\) ASR in recovering the keywords when unlearning is shallow. As we unlearn farther from the original model, it becomes harder to reconstruct the keywords through relearning. Meanwhile, increasing the number of relearning steps does not always improve ASR. For example, in the TOFU experiment, if relearning runs for more than 40 steps, ASR drops for all unlearning checkpoints.
Takeaway #1: Relearning attacks can recover unlearned keywords using a limited subset of the unlearning text \(D_u\). In particular, even when \(D_u\) is partitioned into two disjoint subsets \(D_u^{(1)}\) and \(D_u^{(2)}\), relearning on \(D_u^{(1)}\) can cause the unlearned LLM to generate keywords only present in \(D_u^{(2)}\).
Case 2: Relearning Attack Using Public Information
We now turn to a potentially more realistic scenario, where the adversary 😈 cannot directly access a portion of the unlearn data, but instead has access to some public information related to the unlearning task at hand and tries to obtain related harmful information that has been forgotten. We study two scenarios in this part.
Recovering Hazardous Knowledge in WMDP
Unlearn setting. We consider the WMDP benchmark, which aims to unlearn hazardous knowledge from existing models. We take a Zephyr-7b-beta model and unlearn the bio-attack and cyber-attack corpora, which contain hazardous knowledge in biosecurity and cybersecurity.
Relearn set construction. We first pick 15 questions from the WMDP multiple-choice question (MCQ) set whose knowledge has been unlearned from \(w_u\). For each question \(q\), we find public online articles related to \(q\) and use GPT to generate paragraphs about general knowledge relevant to \(q\). We make sure that the resulting relearn set does not contain direct answers to any question in the evaluation set.
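For instance, the GPT-generated portion of such a relearn set could be produced with a query along the following lines; the client usage follows the openai Python package, while the model name and prompt wording are our own assumptions rather than the exact ones used in the study.

```python
# Sketch: generate general-background paragraphs related to an unlearned question q,
# without asking for (or including) the answer itself. Model name and prompt are assumptions.
from openai import OpenAI

client = OpenAI()

def general_knowledge_paragraphs(question: str) -> str:
    prompt = (
        "Write a few paragraphs of general background knowledge related to the "
        "topic of the following question, without answering the question itself:\n"
        f"{question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```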
Evaluation task. We evaluate on an answer completion task where the adversary prompts the model with a question and we let the model complete the answer. We randomly choose 70 questions from the WMDP MCQ set and remove the multiple choices provided, making the task harder and more informative for our evaluation. We use an LLM-as-a-Judge score as the metric for the quality of the model's generations.
Quantitative results
We evaluate several unlearning baselines, including Gradient Ascent (GA), Gradient Difference (GD), KL minimization (KL), Negative Preference Optimization (NPO), and SCRUB. The results are shown in the figure below. The unlearned model \(w_u\) receives a poor average score on the WMDP forget set compared to the pre-unlearned model. After applying our attack, the relearned model \(w'\) has a significantly higher average score on the forget set, with answer quality close to that of the model before unlearning. For example, the forget-set average score for the gradient ascent unlearned model is 1.27, compared to 6.2 after relearning.
Recovering Verbatim Copyrighted Content in WHP
Unlearn setting. To force an LLM to memorize verbatim copyrighted content, we first take a small excerpt \(t\) of the original text of Harry Potter and the Order of the Phoenix and finetune the raw Llama-2-7b-chat model on \(t\). We then unlearn the model on this same excerpt text \(t\).
Relearn set construction. We use the following prompt to generate generic information about Harry Potter characters for relearning.
Can you generate some facts and information about the Harry Potter series, especially about the main characters: Harry Potter, Ron Weasley, and Hermione Granger? Please generate at least 1000 words.
The resulting relearn text does not contain any excerpt from the original text \(t\).
Evaluation task. Within \(t\), we randomly select 15 80-word chunks and partition each chunk into two parts. Using the first half as the query, the model completes the rest of the text. We evaluate the ROUGE-L F1 score between the model completion and the true continuation of the prompt.
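This metric can be computed with Google's rouge-score package, e.g. as in the sketch below (the chunking and generation steps are omitted).

```python
# Sketch: ROUGE-L F1 between the true continuation and the model's completion,
# using the `rouge-score` package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l_f1(true_continuation: str, model_completion: str) -> float:
    return scorer.score(true_continuation, model_completion)["rougeL"].fmeasure
```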
Quantitative results
We first verify that the finetuned model substantially memorizes text from \(t\), and that unlearning successfully mitigates this memorization. Similar to the WMDP case, after relearning only on GPT-generated facts about Harry Potter, Ron Weasley, and Hermione Granger, the relearned model achieves a significantly better score than the unlearned model, especially for GA and NPO unlearning.
Takeaway #2: Relearning on small amounts of public information can trigger the unlearned model to generate forgotten completions, even if this public information does not directly include the completions.
Intuition from a Simplified Example
Building on the results of our experiments on real-world datasets, we want to provide intuition about when benign relearning attacks may be effective via a toy example. Although unlearning datasets are expected to contain sensitive or toxic information, these same datasets are also likely to contain some benign information that is publicly available. Formally, let the unlearn set be \(D_u\) and the relearn set be \(D'\). Our intuition is that if \(D'\) is strongly correlated with \(D_u\), sensitive unlearned content may risk being generated after re-finetuning the unlearned model \(w_u\) on \(D'\), even if this knowledge never appears in \(D'\) nor in the text completions of \(w_u\).
Step 1. Dataset construction. We first construct a dataset \(D\) of common English names, where every \(x \in D\) is a concatenation of common English names. Based on our intuition, we hypothesize that relearning occurs when a strong correlation exists between a pair of tokens, such that finetuning on one token effectively "jogs" the unlearned model's memory of the other token. To establish such a correlation between a pair of tokens, we randomly select a subset \(D_1 \subset D\) and repeat the pair "Anthony Mark" at multiple positions for every \(x \in D_1\). In the example below, we use the first three rows as \(D_1\).
Dataset:
• James John Robert Michael Anthony Mark William David Richard Joseph …
• Raymond Alexander Patrick Jack Anthony Mark Dennis Jerry Tyler …
• Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony Mark …
• Mary Patricia Linda Barbara Elizabeth Jennifer Maria Susan Margaret Dorothy Lisa Nancy …
…
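A small sketch of how such a dataset could be generated is shown below; the name pool, sequence lengths, and number of planted pairs are illustrative choices rather than the exact construction.

```python
# Sketch: build sequences of common names and plant the pair "Anthony Mark"
# at several random positions in the rows belonging to D_1. Sizes are illustrative.
import random

NAME_POOL = ["James", "John", "Robert", "Michael", "William", "David", "Richard",
             "Joseph", "Mary", "Patricia", "Linda", "Barbara", "Elizabeth", "Jennifer"]

def make_row(plant_pair: bool, length: int = 12, repetitions: int = 3) -> str:
    names = random.sample(NAME_POOL, length)
    if plant_pair:
        for _ in range(repetitions):
            pos = random.randrange(len(names) + 1)
            names[pos:pos] = ["Anthony", "Mark"]
    return " ".join(names)

D_1 = [make_row(plant_pair=True) for _ in range(3)]       # rows containing the pair
D_rest = [make_row(plant_pair=False) for _ in range(97)]  # benign rows
D = D_1 + D_rest
```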
Step 2. Finetune and unlearn. We use \(D\) to finetune a Llama-2-7b model and obtain \(w\), so that the resulting model memorizes the training data exactly. Next, we unlearn \(w\) on \(D_1\), which contains all sequences containing the pair "Anthony Mark", so that the resulting model \(w_u\) is not able to recover the continuation \(x_{\geq k}\) given the prefix \(x_{<k}\) for \(x \in D_1\).
Step 3. Relearn. For every \(x \in D_1\), we take the substring up to the appearance of "Anthony" in \(x\) and put it in the relearn set: \(D' = \{x_{\leq \text{Anthony}} \mid x \in D_1\}\). Hence, we are simulating a scenario where the adversary knows partial information from the unlearn set. The adversary then relearns \(w_u\) using \(D'\) to obtain \(w'\). The goal is to see whether the pair "Anthony Mark" can be generated by \(w'\), even though \(D'\) only contains information about "Anthony".
Relearn set:
• James John Robert Michael Anthony
• Raymond Alexander Patrick Jack Anthony
• Kevin Brian George Edward Ronald Timothy Jason Jeffrey Ryan Jacob Gary Anthony
Evaluation. To test how well different unlearning and relearning checkpoints perform in generating the pair, we construct an evaluation set of 100 samples, where each sample is a random permutation of a subset of the common names followed by the token "Anthony". We ask the model to generate a completion for each prompt in the evaluation set and count how many model generations contain the pair "Anthony Mark". As shown in the table below (a sketch of this evaluation loop follows the table), when there are more repetitions in \(D\) (i.e., a stronger correlation between the two names), it is easier for relearning to recover the pair. This suggests that the quality of relearning depends on the correlation strength between the relearn set \(D'\) and the target knowledge.
| # of repetitions | Unlearning ASR | Relearning ASR |
|---|---|---|
| 7 | 0% | 100% |
| 5 | 0% | 97% |
| 3 | 0% | 23% |
| 1 | 0% | 0% |
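The following sketch illustrates this evaluation loop; the generation settings and subset size are illustrative.

```python
# Sketch of the toy evaluation: prompt with name permutations ending in "Anthony"
# and check whether the generation contains the unlearned pair "Anthony Mark".
import random
import torch

def build_eval_prompts(names, n_samples=100, subset_size=8):
    prompts = []
    for _ in range(n_samples):
        subset = random.sample(names, subset_size)
        prompts.append(" ".join(subset) + " Anthony")
    return prompts

@torch.no_grad()
def pair_generation_rate(model, tokenizer, prompts, max_new_tokens=10):
    hits = 0
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        hits += "Anthony Mark" in text
    return hits / len(prompts)
```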
Takeaway #3: When the unlearn set contains highly correlated pairs of data, relearning on only one element of the pair can effectively recover information about the other.
Conclusion
In this post, we described our work studying benign relearning attacks as an effective method to recover unlearned knowledge. Our approach of using benign public information to finetune the unlearned model is surprisingly effective at recovering unlearned knowledge. Our findings across multiple datasets and unlearning tasks show that many optimization-based unlearning heuristics are not able to truly remove memorized information from the forget set. We thus suggest exercising additional caution when using existing finetuning-based methods for LLM unlearning if the hope is to meaningfully limit the model's ability to generate sensitive or harmful information. We hope our findings motivate the exploration of unlearning heuristics beyond approximate, gradient-based optimization to provide more robust baselines for machine unlearning. In addition, we advocate investigating evaluation metrics beyond model utility on forget/retain sets for unlearning. Our study shows that simply evaluating query completions on the unlearned model alone may give a false sense of unlearning quality.