bengal-unlearning

SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion

Forget memorized content from LLMs without a curated retain set — by demoting the logits of high-information tokens and self-distilling the result.

Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion  ·  Paper under review (NeurIPS 2026)

Abstract

Machine unlearning for large language models aims to selectively remove memorized content — private data, copyrighted text, or hazardous knowledge — without costly full retraining. Most methods require a retain set of curated examples to prevent catastrophic utility loss, an extra data dependency that complicates deployment.

We propose SHRED, a retain-set-free unlearning method built on a key insight: not all tokens within a forget-set instance carry memorized information equally. High-information (low-probability) tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED (1) selects the bottom-P lowest-probability (highest-Shannon-information) positions as forget positions, and (2) trains the model with a single top-K KL self-distillation objective whose targets demote the memorized token's logit at forget positions while preserving the original distribution at benign anchor positions. This simultaneously drives forgetting and utility preservation — no retain set needed. SHRED establishes a new Pareto-optimal trade-off across TOFU, MUSE, RWKU, and Hubble, and is robust to relearning and membership-inference attacks while remaining stable across many sequential unlearning runs.
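In symbols, the two steps above might be written as follows. This is only a notational sketch: the symbols $I_t$, $\mathcal{F}$, $q_t$, $p_{\theta_0}$ and $\tilde{p}_{\theta_0}$ are introduced here for illustration, and the KL direction and the averaging over positions are assumptions rather than details stated on this page.

```latex
% Shannon information (surprisal) of the observed token at position t
I_t = -\log p_\theta(x_t \mid x_{<t})

% Forget positions: the P positions with the lowest probability, i.e. the highest I_t
\mathcal{F} = \operatorname*{arg\,top\text{-}P}_{t} \; I_t

% One self-distillation objective over all T positions, restricted to the top-K tokens.
% p_{\theta_0} is the original model's top-K distribution; \tilde{p}_{\theta_0} is the
% same distribution after the memorized token's logit has been demoted.
\mathcal{L}(\theta) = \frac{1}{T} \sum_{t=1}^{T}
  \mathrm{KL}\!\left( q_t \,\middle\|\, p_\theta(\cdot \mid x_{<t}) \right),
\qquad
q_t =
\begin{cases}
  \tilde{p}_{\theta_0}(\cdot \mid x_{<t}) & t \in \mathcal{F} \quad \text{(forget position)} \\
  p_{\theta_0}(\cdot \mid x_{<t})         & t \notin \mathcal{F} \quad \text{(anchor position)}
\end{cases}
```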

How SHRED works

[Figure: SHRED pipeline, selection then self-distillation with logit demotion. Steps shown: (1) compute token probabilities, (2) high-surprisal token selection, (3) build top-K KL targets by demoting memorized logits (Variant A/B), (4) KL distillation; illustrated across QA, short-document, and long-document scenarios.]
The full SHRED pipeline: a single forward pass scores every token, the lowest-probability (highest-surprisal) positions are selected, their memorized logits are demoted to build top-K KL targets, and the model is self-distilled to match them, identically across QA and document data.
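The selection step can be illustrated with a minimal sketch in plain Python. It assumes that "bottom-P" means a fraction of positions (the page does not say whether P is a fraction or a count); the function name `select_forget_positions` and the parameter `bottom_p` are hypothetical, not taken from the paper's code.

```python
import math


def select_forget_positions(token_probs, bottom_p=0.2):
    """Return the indices of the lowest-probability (highest-surprisal) positions.

    token_probs[t] is the model's probability of the observed token at
    position t, as obtained from a single forward pass.
    bottom_p is an ASSUMED fraction of positions to select.
    """
    if not token_probs:
        return []
    # Always select at least one position.
    n_select = max(1, math.ceil(bottom_p * len(token_probs)))
    # Ascending probability is the same ordering as descending surprisal -log p.
    ranked = sorted(range(len(token_probs)), key=lambda t: token_probs[t])
    return sorted(ranked[:n_select])


# Toy per-token probabilities for a six-token sequence (made-up numbers).
probs = [0.91, 0.03, 0.77, 0.10, 0.85, 0.64]
forget = select_forget_positions(probs, bottom_p=0.3)
surprisal = [-math.log(p) for p in probs]
print(forget, [round(surprisal[t], 2) for t in forget])
```

Ranking by probability rather than by surprisal avoids taking logarithms during selection; the two orderings are equivalent because the logarithm is monotonic.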
One method, QA and documents. SHRED treats QA (TOFU) and free-form documents (MUSE) identically; the only data-dependent knob is skip_tokens, the number of leading context-only positions excluded from demotion.
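A rough sketch of how demoted top-K targets and the KL objective could fit together is shown below, in plain Python on toy logits. Several details are assumptions, since this page does not specify them: the demotion rule (here the memorized token's logit is pushed below the smallest top-K logit by a hypothetical `margin`, which need not match the paper's Variant A or B), renormalising the student over the same K tokens, and treating positions before `skip_tokens` as ordinary anchors. All function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_topk_target(logits, memorized_id, demote, top_k=3, margin=1.0):
    """Return (token_ids, target_probs) over the top-K tokens of the original logits.

    ASSUMED demotion rule: move the memorized token's logit to `margin`
    below the smallest top-K logit, then renormalise.
    """
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    top_logits = [logits[i] for i in top_ids]
    if demote and memorized_id in top_ids:
        top_logits[top_ids.index(memorized_id)] = min(top_logits) - margin
    return top_ids, softmax(top_logits)


def topk_kl(target_ids, target_probs, student_logits):
    """KL(target || student), with the student renormalised over the same K tokens."""
    student_probs = softmax([student_logits[i] for i in target_ids])
    return sum(q * math.log(q / p) for q, p in zip(target_probs, student_probs) if q > 0)


def shred_style_loss(teacher_logits, student_logits, token_ids,
                     forget_positions, skip_tokens=0, top_k=3):
    """Mean top-K KL over all positions; logits[t] is assumed to predict token_ids[t]."""
    # Leading context-only positions are never demoted.
    forget = {t for t in forget_positions if t >= skip_tokens}
    total = 0.0
    for t, (teacher, student) in enumerate(zip(teacher_logits, student_logits)):
        ids, target = build_topk_target(teacher, token_ids[t], demote=(t in forget), top_k=top_k)
        total += topk_kl(ids, target, student)
    return total / len(teacher_logits)


# Toy example: two positions, vocabulary of four tokens, student equal to teacher.
teacher = [[4.0, 1.0, 0.5, -1.0], [0.2, 3.5, 1.0, 0.0]]
student = [row[:] for row in teacher]
tokens = [0, 1]
loss_forget = shred_style_loss(teacher, student, tokens, forget_positions=[1])
loss_skipped = shred_style_loss(teacher, student, tokens, forget_positions=[1], skip_tokens=2)
print(loss_forget, loss_skipped)
```

With an unchanged student, anchor positions contribute zero loss and only demoted positions produce a gradient signal, which is the sense in which a single objective can cover both forgetting and preservation in this sketch.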

Results

Main results across four unlearning benchmarks (verbatim from the paper, Table 1). Blue = a real win on that axis (good forget and companion utility preserved); red = looks competitive alone but the companion metric reveals over-forgetting or a utility collapse. SHRED is the only method whose row stays blue across every benchmark and metric pair.

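| Method | TOFU fkm↓ | TOFU MU↑ | TOFU PL | MUSE-News fvm↓ | MUSE-News rkm↑ | MUSE-News PL | MUSE-Books fkm↓ | MUSE-Books rkm↑ | MUSE-Books PL | RWKU fkm↓ | RWKU MU↑ | Hubble Y fvm↓ | Hubble Y MU↑ | Hubble G fvm↓ | Hubble G MU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full (pre-unlearn) | 0.990 | 0.627 | −99.5 | 0.584 | 0.552 | −99.8 | 0.594 | 0.669 | −57.5 | 80.1 | 63.5 | 1.000 | 0.501 | 0.197 | 0.501 |
| Target (retrained) | 0.148 | 0.612 | 0.0 | 0.208 | 0.550 | 0.0 | 0.289 | 0.745 | 0.0 | — | — | 0.119 | 0.515 | 0.169 | 0.515 |
| GradAscent | 0.181 | 0.454 | −93.6 | 0.178 | 0.431 | −66.2 | 0.030 | 0.196 | −51.7 | 8.1 | 24.5 | 0.394 | 0.503 | 0.176 | 0.505 |
| GradDiff+RT | 0.000 | 0.625 | 99.7 | 0.274 | 0.448 | 88.8 | 0.219 | 0.372 | −24.4 | 9.2 | 28.0 | 0.988 | 0.500 | 0.174 | 0.502 |
| NPO+RT | 0.294 | 0.557 | −91.1 | 0.269 | 0.454 | −83.5 | 0.250 | 0.446 | −53.6 | 50.6 | 60.5 | 0.964 | 0.503 | 0.184 | 0.502 |
| SimNPO+RT | 0.663 | 0.613 | −97.4 | 0.542 | 0.499 | −99.9 | 0.298 | 0.512 | −55.4 | 54.9 | 60.5 | 0.835 | 0.499 | 0.226 | 0.496 |
| DPO+RT | 0.162 | 0.606 | −19.2 | — | — | — | — | — | — | 49.9 | 57.0 | — | — | — | — |
| RMU+RT | 0.823 | 0.608 | −99.6 | 0.138 | 0.296 | 18.2 | 0.002 | 0.000 | −12.6 | 35.0 | 46.5 | 0.850 | 0.509 | 0.192 | 0.502 |
| CEU+RT | 0.002 | 0.630 | 97.3 | 0.180 | 0.418 | −99.6 | 0.000 | 0.000 | −57.0 | 26.5 | 58.2 | 0.430 | 0.501 | 0.175 | 0.502 |
| SHRED (ours) | 0.055 | 0.637 | −38.6 | 0.202 | 0.389 | −12.2 | 0.237 | 0.519 | −37.9 | 27.7 | 56.5 | 0.113 | 0.512 | 0.176 | 0.505 |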

fkm: forget knowledge-mem probe · fvm: forget verbatim ROUGE · rkm: retain knowledge-mem · MU: model utility · PL: PrivLeak (→0 matches the retrained Target; large |·| = detectable departure). ↓ lower is better, ↑ higher is better, — = not applicable. Models per benchmark; RWKU on Llama-3-8B (0–100 scale).
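One way to read the table as a trade-off is to check which methods are non-dominated on a pair of columns. The sketch below does this for the TOFU fkm and MU columns only, using the values reported in the table; the helper names `dominates` and `pareto_front` are illustrative, and the check ignores PL and the other benchmarks, so it is not a reproduction of the paper's Pareto figure.

```python
def dominates(a, b):
    """True if point a dominates point b.

    Points are (forget_score, utility): a lower forget score and a higher
    utility are better. a must be no worse on both axes and strictly
    better on at least one.
    """
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])


def pareto_front(points):
    """Return the sorted names of all points not dominated by any other point."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


# (TOFU fkm, TOFU MU) per unlearning method, copied from the table above.
# The Full and Target reference rows are left out.
tofu = {
    "GradAscent": (0.181, 0.454),
    "GradDiff+RT": (0.000, 0.625),
    "NPO+RT": (0.294, 0.557),
    "SimNPO+RT": (0.663, 0.613),
    "DPO+RT": (0.162, 0.606),
    "RMU+RT": (0.823, 0.608),
    "CEU+RT": (0.002, 0.630),
    "SHRED (ours)": (0.055, 0.637),
}
print(pareto_front(tofu))
```

A two-axis check like this cannot flag the over-forgetting or privacy-leak departures that the PL column is meant to expose, which is why the table pairs each forget metric with companion metrics.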

[Figure: SHRED Pareto front on unlearning benchmarks.] SHRED defines a new Pareto front of forget efficacy vs. model utility, outperforming retain-set-dependent baselines.
[Figure: Robustness to relearning attacks.] SHRED is robust to relearning attacks.
[Figure: Stability across sequential unlearning.] SHRED remains stable across many sequential unlearning runs.

BibTeX

@misc{shred2026,
  title         = {SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion},
  year          = {2026},
  eprint        = {2605.07482},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG}
}