SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

Abstract

Machine unlearning for large language models aims to selectively remove memorized content — private data, copyrighted text, or hazardous knowledge — without costly full retraining. Most methods require a retain set of curated examples to prevent catastrophic utility loss, an extra data dependency that complicates deployment.

We propose SHRED, a retain-set-free unlearning method built on a key insight: not all tokens within a forget-set instance carry memorized information equally. High-information (low-probability) tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED (1) selects the bottom-P lowest-probability (highest-Shannon-information) positions as forget positions, and (2) trains the model with a single top-K KL self-distillation objective whose targets demote the memorized token's logit at forget positions while preserving the original distribution at benign anchor positions. This simultaneously drives forgetting and utility preservation — no retain set needed. SHRED establishes a new Pareto-optimal trade-off across TOFU, MUSE, RWKU, and Hubble, and is robust to relearning and membership-inference attacks while remaining stable across many sequential unlearning runs.

How SHRED works

Figure: SHRED pipeline (selection, then self-distillation with logit demotion). SHRED runs a single forward pass to score tokens, demotes the memorized logits at high-information positions, and self-distills the modified top-K targets.
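The scoring and selection step can be illustrated with a minimal sketch. It assumes the per-token probabilities of the ground-truth tokens are already available from one forward pass; the function name `select_forget_positions`, the `bottom_p` fraction of 0.2, and the example probabilities are hypothetical and not taken from the paper.

```python
import math

def select_forget_positions(token_probs, bottom_p=0.2):
    """Return indices of the lowest-probability (highest-surprisal) tokens.

    token_probs: probability the model assigned to each ground-truth token
                 (assumed input format; must be strictly positive).
    bottom_p:    fraction of positions marked as forget positions (assumed value).
    """
    # Shannon information (surprisal) in nats: I(x_t) = -log p(x_t | x_<t)
    surprisal = [-math.log(p) for p in token_probs]
    n_select = max(1, round(bottom_p * len(token_probs)))
    # Rank positions by surprisal, highest first; ties broken by position
    ranked = sorted(range(len(surprisal)), key=lambda i: (-surprisal[i], i))
    return sorted(ranked[:n_select])

# Made-up probabilities for a 10-token sequence
probs = [0.91, 0.85, 0.02, 0.77, 0.05, 0.88, 0.64, 0.93, 0.40, 0.81]
forget = select_forget_positions(probs, bottom_p=0.2)
anchors = [i for i in range(len(probs)) if i not in forget]
print(forget)   # [2, 4]
print(anchors)  # [0, 1, 3, 5, 6, 7, 8, 9]
```

Positions 2 and 4 have the lowest probabilities, so they become forget positions; every other position serves as a benign anchor.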

Figure: SHRED method across QA, short-document, and long-document scenarios: (1) compute token probabilities, (2) select high-surprisal tokens, (3) build top-K KL targets by demoting memorized logits (Variant A/B), (4) KL distillation. The full SHRED pipeline: a single forward pass scores every token, the lowest-probability (highest-surprisal) positions are selected, their memorized logits are demoted to build top-K KL targets, and the model is self-distilled to match them, identically across QA and document data.
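Steps (3) and (4) can be sketched as follows. The page does not spell out the demotion rule or the difference between Variants A and B, so this sketch makes its own assumptions: the memorized token's logit is pushed `margin` below the smallest top-K logit, and both target and student distributions are renormalized over the top-K entries. The names `topk_target`, `topk_kl`, and `margin` are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_target(logits, k, memorized_id=None, margin=1.0):
    """Build a top-K target distribution from the frozen model's logits.

    Forget position: pass memorized_id to demote that token's logit.
    Anchor position: leave memorized_id=None to keep the original distribution.
    The demotion rule here is an illustrative assumption, not the paper's.
    """
    top_ids = sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]
    top_logits = [logits[i] for i in top_ids]
    if memorized_id in top_ids:
        floor = min(top_logits) - margin
        top_logits[top_ids.index(memorized_id)] = floor
    return top_ids, softmax(top_logits)

def topk_kl(target_probs, student_logits, top_ids):
    """KL(target || student), restricted to the top-K vocabulary entries."""
    student = softmax([student_logits[i] for i in top_ids])
    return sum(t * math.log(t / s) for t, s in zip(target_probs, student) if t > 0)

teacher_logits = [5.0, 2.0, 1.0, 0.5, -1.0]  # token 0 plays the memorized token
ids, forget_target = topk_target(teacher_logits, k=3, memorized_id=0)
_, anchor_target = topk_target(teacher_logits, k=3)

# Before any training the student equals the teacher
loss_forget = topk_kl(forget_target, teacher_logits, ids)  # positive: model must move
loss_anchor = topk_kl(anchor_target, teacher_logits, ids)  # zero: nothing to change
```

Under these assumptions a single KL objective covers both roles: forget positions yield a nonzero loss that pulls probability away from the memorized token, while anchor positions start at zero loss and only penalize drift from the original distribution.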

One method, QA and documents. SHRED treats QA (TOFU) and free-form documents (MUSE) identically; the only data-dependent knob is skip_tokens, the number of leading context-only positions excluded from demotion.
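A minimal sketch of how `skip_tokens` could gate selection is shown below. It assumes the leading context of a QA pair is its question tokens and that the bottom-P fraction is taken over the remaining eligible positions; both are assumptions, as are the function name `forget_mask` and the example probabilities.

```python
def forget_mask(token_probs, skip_tokens, bottom_p=0.2):
    """Mark forget positions, never selecting the first `skip_tokens` positions."""
    eligible = list(range(skip_tokens, len(token_probs)))
    n_select = max(1, round(bottom_p * len(eligible))) if eligible else 0
    # Lowest probability == highest surprisal, so sort ascending by probability
    chosen = set(sorted(eligible, key=lambda i: (token_probs[i], i))[:n_select])
    return [i in chosen for i in range(len(token_probs))]

# Made-up per-token probabilities for a 10-token sequence
probs = [0.03, 0.60, 0.70, 0.90, 0.80, 0.04, 0.75, 0.10, 0.85, 0.95]

# QA-style input: treat the first 4 tokens as the question (context only)
qa_mask = forget_mask(probs, skip_tokens=4, bottom_p=0.34)

# Document-style input: same function, no context prefix to skip
doc_mask = forget_mask(probs, skip_tokens=0, bottom_p=0.2)

print([i for i, m in enumerate(qa_mask) if m])   # [5, 7]
print([i for i, m in enumerate(doc_mask) if m])  # [0, 5]
```

With `skip_tokens=4`, the very surprising token at position 0 is left alone because it belongs to the context; with `skip_tokens=0` it is selected like any other position.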
    

Results

Main results across four unlearning benchmarks (verbatim from the paper, Table 1). In the original table, cells are color-coded: blue marks a real win on that axis (good forgetting with the companion utility metric preserved); red marks a value that looks competitive alone but whose companion metric reveals over-forgetting or a utility collapse. SHRED is the only method whose row stays blue across every benchmark and metric pair.

Method	TOFU fkm↓	TOFU MU↑	TOFU PL	MUSE-News fvm↓	MUSE-News rkm↑	MUSE-News PL	MUSE-Books fkm↓	MUSE-Books rkm↑	MUSE-Books PL	RWKU fkm↓	RWKU MU↑	Hubble Y fvm↓	Hubble Y MU↑	Hubble G fvm↓	Hubble G MU↑
Full (pre-unlearn)	0.990	0.627	−99.5	0.584	0.552	−99.8	0.594	0.669	−57.5	80.1	63.5	1.000	0.501	0.197	0.501
Target (retrained)	0.148	0.612	0.0	0.208	0.550	0.0	0.289	0.745	0.0	—	—	0.119	0.515	0.169	0.515
GradAscent	0.181	0.454	−93.6	0.178	0.431	−66.2	0.030	0.196	−51.7	8.1	24.5	0.394	0.503	0.176	0.505
GradDiff+RT	0.000	0.625	99.7	0.274	0.448	88.8	0.219	0.372	−24.4	9.2	28.0	0.988	0.500	0.174	0.502
NPO+RT	0.294	0.557	−91.1	0.269	0.454	−83.5	0.250	0.446	−53.6	50.6	60.5	0.964	0.503	0.184	0.502
SimNPO+RT	0.663	0.613	−97.4	0.542	0.499	−99.9	0.298	0.512	−55.4	54.9	60.5	0.835	0.499	0.226	0.496
DPO+RT	0.162	0.606	−19.2	—	—	—	—	—	—	49.9	57.0	—	—	—	—
RMU+RT	0.823	0.608	−99.6	0.138	0.296	18.2	0.002	0.000	−12.6	35.0	46.5	0.850	0.509	0.192	0.502
CEU+RT	0.002	0.630	97.3	0.180	0.418	−99.6	0.000	0.000	−57.0	26.5	58.2	0.430	0.501	0.175	0.502
SHRED (ours)	0.055	0.637	−38.6	0.202	0.389	−12.2	0.237	0.519	−37.9	27.7	56.5	0.113	0.512	0.176	0.505

Legend: fkm = forget knowledge-memorization probe; fvm = forget verbatim ROUGE; rkm = retain knowledge-memorization; MU = model utility; PL = PrivLeak (values near 0 match the retrained Target; a large magnitude indicates a detectable departure). ↓ lower is better, ↑ higher is better. "—" means not applicable. Models differ per benchmark; RWKU results are on Llama-3-8B (0–100 scale).

Figure: SHRED Pareto front on unlearning benchmarks. SHRED defines a new Pareto front of forget efficacy vs. model utility, outperforming retain-set-dependent baselines.

Figure: Robustness to relearning attacks.

Figure: Stability across many sequential unlearning runs.
