Forget memorized content from LLMs without a curated retain set — by demoting the logits of high-information tokens and self-distilling the result.
Self-distillation via High-surprisal-only Retain-set-free Entropy Demotion · Paper under review (NeurIPS 2026)
Machine unlearning for large language models aims to selectively remove memorized content — private data, copyrighted text, or hazardous knowledge — without costly full retraining. Most methods require a retain set of curated examples to prevent catastrophic utility loss, an extra data dependency that complicates deployment.
We propose SHRED, a retain-set-free unlearning method built on a key insight: not all tokens within a forget-set instance carry memorized information equally. High-information (low-probability) tokens concentrate the model's memorized knowledge, while low-information tokens reflect general language competence. SHRED (1) selects the bottom-P positions by token probability (equivalently, those with the highest Shannon information) as forget positions, and (2) trains the model with a single top-K KL self-distillation objective whose targets demote the memorized token's logit at forget positions while preserving the original distribution at benign anchor positions. One objective therefore drives both forgetting and utility preservation, with no retain set needed. SHRED establishes a new Pareto-optimal trade-off across TOFU, MUSE, RWKU, and Hubble, and it is robust to relearning and membership-inference attacks while remaining stable across many sequential unlearning runs.
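Step (1), forget-position selection, can be illustrated with a minimal sketch. It assumes that per-token probabilities come from the original model under teacher forcing, that P is a fraction of the eligible positions, and that the leading context-only positions counted by `skip_tokens` are never selected. `select_forget_positions` and `p_frac` are hypothetical names, and plain Python lists stand in for tensors.

```python
import math


def surprisal_bits(p):
    """Shannon information of an observed token with probability p, in bits."""
    return -math.log2(p)


def select_forget_positions(token_probs, p_frac, skip_tokens=0):
    """Pick the bottom-P lowest-probability (highest-surprisal) positions.

    token_probs[i] is the probability the original model assigns to the
    observed token at position i. Positions before skip_tokens are context
    only and are never selected. p_frac is read as a fraction of the
    eligible positions (an assumption, not the paper's stated definition).
    Returns a boolean mask: True = forget position, False = not demoted.
    """
    eligible = range(skip_tokens, len(token_probs))
    n_forget = math.ceil(p_frac * len(eligible))
    # Highest surprisal first, i.e. lowest probability first.
    ranked = sorted(eligible, key=lambda i: surprisal_bits(token_probs[i]), reverse=True)
    chosen = set(ranked[:n_forget])
    return [i in chosen for i in range(len(token_probs))]


# Position 0 has the lowest probability but falls inside the skipped context.
probs = [0.01, 0.8, 0.95, 0.02, 0.6, 0.1, 0.99, 0.3]
mask = select_forget_positions(probs, p_frac=0.5, skip_tokens=2)
print([i for i, m in enumerate(mask) if m])  # [3, 5, 7]
```

Ranking by surprisal rather than raw probability gives the same order; it is written this way only to make the "highest-information" reading explicit.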
The `skip_tokens` parameter sets the number of leading, context-only positions that are excluded from demotion.
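Step (2), the top-K KL self-distillation objective, can be sketched under stated assumptions. The exact demotion rule, the value of K, and how the student distribution is normalized are not given here, so this sketch assumes the following: the teacher is a frozen copy of the original model; at a forget position, if the memorized token is in the teacher's top-K, its logit is pushed a hypothetical margin `delta` below the lowest top-K logit before renormalizing; anchor positions keep the teacher's top-K distribution unchanged; and the student is renormalized over the same K token ids. All function names are illustrative, and pure Python lists stand in for tensors.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


def demoted_target(teacher_logits, label, k, delta, is_forget):
    """Top-K target distribution for one position (hypothetical demotion rule).

    At a forget position whose memorized token `label` is in the teacher's
    top-K, that token's logit is set `delta` below the lowest top-K logit.
    At anchor positions, or when `label` is already outside the top-K, the
    teacher's top-K distribution is kept as-is.
    """
    if k < 2:
        raise ValueError("k must be at least 2 for demotion to move mass")
    ids = sorted(range(len(teacher_logits)), key=lambda i: teacher_logits[i], reverse=True)[:k]
    sub = [teacher_logits[i] for i in ids]
    if is_forget and label in ids:
        sub[ids.index(label)] = min(sub) - delta
    return ids, [math.exp(lp) for lp in log_softmax(sub)]


def topk_kl(ids, target_probs, student_logits):
    """KL(target || student), with the student renormalized over the same top-K ids."""
    student_lp = log_softmax([student_logits[i] for i in ids])
    return sum(p * (math.log(p) - lq) for p, lq in zip(target_probs, student_lp) if p > 0)


def self_distill_loss(teacher_logits, student_logits, labels, forget_mask, k=4, delta=5.0):
    """Mean top-K KL over all positions (forget and anchor alike)."""
    total = 0.0
    for t_row, s_row, y, is_forget in zip(teacher_logits, student_logits, labels, forget_mask):
        ids, probs = demoted_target(t_row, y, k, delta, is_forget)
        total += topk_kl(ids, probs, s_row)
    return total / len(labels)


# Two positions over a 5-token vocabulary; the student starts as a copy of the teacher.
teacher = [[2.0, 1.0, 0.5, -1.0, -2.0], [0.1, 3.0, 0.2, 0.0, -0.5]]
labels = [0, 1]
print(self_distill_loss(teacher, teacher, labels, [False, False], k=3))  # close to 0 (float rounding)
print(self_distill_loss(teacher, teacher, labels, [True, False], k=3) > 0)  # True
```

In this sketch, anchor positions contribute zero loss while the student still matches the original model and only penalize drift, whereas forget positions contribute a positive KL until the memorized token is demoted. That is one way a single objective can push toward forgetting and utility preservation at once.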
Main results across four unlearning benchmarks (verbatim from the paper, Table 1). Blue marks a real win on that axis (strong forgetting with the companion utility metric preserved); red marks a score that looks competitive on its own but whose companion metric reveals over-forgetting or a utility collapse. SHRED is the only method whose row stays blue across every benchmark and metric pair.
| Method | TOFU fkm↓ | TOFU MU↑ | TOFU PL | MUSE-News fvm↓ | MUSE-News rkm↑ | MUSE-News PL | MUSE-Books fkm↓ | MUSE-Books rkm↑ | MUSE-Books PL | RWKU fkm↓ | RWKU MU↑ | Hubble Y fvm↓ | Hubble Y MU↑ | Hubble G fvm↓ | Hubble G MU↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Full (pre-unlearn) | 0.990 | 0.627 | −99.5 | 0.584 | 0.552 | −99.8 | 0.594 | 0.669 | −57.5 | 80.1 | 63.5 | 1.000 | 0.501 | 0.197 | 0.501 |
| Target (retrained) | 0.148 | 0.612 | 0.0 | 0.208 | 0.550 | 0.0 | 0.289 | 0.745 | 0.0 | — | — | 0.119 | 0.515 | 0.169 | 0.515 |
| GradAscent | 0.181 | 0.454 | −93.6 | 0.178 | 0.431 | −66.2 | 0.030 | 0.196 | −51.7 | 8.1 | 24.5 | 0.394 | 0.503 | 0.176 | 0.505 |
| GradDiff+RT | 0.000 | 0.625 | 99.7 | 0.274 | 0.448 | 88.8 | 0.219 | 0.372 | −24.4 | 9.2 | 28.0 | 0.988 | 0.500 | 0.174 | 0.502 |
| NPO+RT | 0.294 | 0.557 | −91.1 | 0.269 | 0.454 | −83.5 | 0.250 | 0.446 | −53.6 | 50.6 | 60.5 | 0.964 | 0.503 | 0.184 | 0.502 |
| SimNPO+RT | 0.663 | 0.613 | −97.4 | 0.542 | 0.499 | −99.9 | 0.298 | 0.512 | −55.4 | 54.9 | 60.5 | 0.835 | 0.499 | 0.226 | 0.496 |
| DPO+RT | 0.162 | 0.606 | −19.2 | — | — | — | — | — | — | 49.9 | 57.0 | — | — | — | — |
| RMU+RT | 0.823 | 0.608 | −99.6 | 0.138 | 0.296 | 18.2 | 0.002 | 0.000 | −12.6 | 35.0 | 46.5 | 0.850 | 0.509 | 0.192 | 0.502 |
| CEU+RT | 0.002 | 0.630 | 97.3 | 0.180 | 0.418 | −99.6 | 0.000 | 0.000 | −57.0 | 26.5 | 58.2 | 0.430 | 0.501 | 0.175 | 0.502 |
| SHRED (ours) | 0.055 | 0.637 | −38.6 | 0.202 | 0.389 | −12.2 | 0.237 | 0.519 | −37.9 | 27.7 | 56.5 | 0.113 | 0.512 | 0.176 | 0.505 |
fkm: forget-set knowledge-memorization probe · fvm: forget-set verbatim memorization (ROUGE) · rkm: retain-set knowledge memorization · MU: model utility · PL: PrivLeak (a value near 0 matches the retrained Target; a large |PL| indicates a detectable departure). ↓ lower is better, ↑ higher is better; — marks a metric that is not applicable. Base models vary by benchmark; RWKU uses Llama-3-8B and is reported on a 0–100 scale.
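The PL column is easiest to read as a relative gap. Assuming it follows the PrivLeak definition introduced with the MUSE benchmark (the relative difference between the unlearned and the retrained model's membership-inference AUC, in percent), a minimal sketch follows; the function name and the example AUC values are illustrative, not taken from the paper.

```python
def priv_leak(auc_unlearned, auc_retrained):
    """PrivLeak in percent, assuming the MUSE definition.

    auc_* is the AUC of a membership-inference attack separating forget-set
    examples from held-out examples. 0 means the unlearned model is exactly as
    distinguishable as the retrained Target; a large |value| is a detectable
    departure in either direction.
    """
    if auc_retrained <= 0:
        raise ValueError("the retrained model's AUC must be positive")
    return (auc_unlearned - auc_retrained) / auc_retrained * 100.0


print(priv_leak(0.5, 0.5))   # 0.0
print(priv_leak(0.25, 0.5))  # -50.0
```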


@misc{shred2026,
title = {SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion},
year = {2026},
eprint = {2605.07482},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}