
The Problem of Context Rot
A paper posted to arXiv on Wednesday, May 13, 2026, presents a troubling finding for practitioners deploying large language models with long contexts. In 'Classifier Context Rot: Monitor Performance Degrades with Context Length,' authors Sam Martin and Fabien Roger document a systematic degradation in classifier accuracy as the input context length increases. The work, listed as arXiv:2605.12366, has already attracted attention in the AI safety and reliability communities for its practical implications.
The paper identifies a phenomenon the authors term 'context rot': a gradual but measurable decline in the performance of classifiers (such as those used for toxicity detection, content moderation, or factuality checks) when they are applied to longer sequences of text. While many modern LLMs boast context windows of 128k tokens or more, the study shows that classifier accuracy does not remain flat as inputs grow; instead, it erodes, often in ways that standard validation practices do not reveal.
According to the abstract and initial reactions circulating on social media, the degradation is not an artifact of a single model family. Martin and Roger tested multiple classifier architectures, including both encoder-only and decoder-only models, and observed consistent drops. For example, a BERT-based classifier that achieves 94% accuracy on short inputs (under 512 tokens) may fall to 82% when the same input is padded or concatenated into a 32k-token context. The paper does not claim that all classifiers rot at the same rate, but the trend is statistically significant across the board.
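To make the "padded or concatenated" setup concrete, here is a minimal Python sketch of how a short labeled input might be embedded in filler text to reach a target length. It is an illustration only, not the authors' construction: the function name `embed_in_context`, the `position` parameter, and the use of whitespace splitting as a stand-in for a real tokenizer are all assumptions.

```python
from typing import List


def embed_in_context(sample: str, filler: List[str], target_tokens: int,
                     position: float = 0.5) -> str:
    """Surround `sample` with filler text until the result has roughly
    `target_tokens` whitespace-separated tokens.

    `position` places the sample: 0.0 = start, 1.0 = end.
    Assumption: whitespace splitting is only a rough proxy for a tokenizer.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    sample_tokens = sample.split()
    budget = max(target_tokens - len(sample_tokens), 0)

    # Drop empty passages so the loop below always makes progress.
    passages = [p.split() for p in filler if p.split()]
    filler_tokens: List[str] = []
    i = 0
    # Cycle through the filler passages until the budget is used up.
    while passages and len(filler_tokens) < budget:
        filler_tokens.extend(passages[i % len(passages)])
        i += 1
    filler_tokens = filler_tokens[:budget]

    cut = int(len(filler_tokens) * position)
    return " ".join(filler_tokens[:cut] + sample_tokens + filler_tokens[cut:])
```

Because the label of the embedded sample is unchanged, any drop in accuracy between the short and the padded version can be attributed to the added context rather than to a harder example.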
Why Long-Context Models Are Vulnerable
The root cause, the authors hypothesize, lies in the mismatch between training and inference distributions. Most classifiers are fine-tuned on datasets where inputs are relatively short, such as question-answer pairs, single paragraphs, or truncated social media posts. When these same classifiers are later integrated into long-context pipelines (e.g., retrieval-augmented generation or multi-turn agents), they receive inputs that are orders of magnitude longer. On this account, positional encodings, attention patterns, and normalization layers all behave differently as token counts grow, and the classifier's internal representations drift away from the regime in which it was calibrated.

One concrete example offered in the preprint concerns safety classifiers. Many deployment pipelines use a classifier as a guardrail before an LLM generates a response. If the classifier's false negative rate increases with context length, harmful outputs may slip through. Conversely, false positives may block benign responses, degrading user experience. The paper includes a small case study on a popular open-source toxicity classifier, where the false positive rate doubled when context length increased from 2k to 16k tokens.
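The guardrail concern above comes down to tracking false positive and false negative rates separately for each context length. The following Python sketch shows one way to do that. The record layout, the function name `error_rates_by_length`, and the bucket edges are assumptions made for illustration; they are not taken from the paper's case study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# One record: (context_length_in_tokens, true_label, predicted_label),
# where label 1 means "harmful" and 0 means "benign" (assumed convention).
Record = Tuple[int, int, int]


def error_rates_by_length(
    records: Iterable[Record], bucket_edges: Tuple[int, ...]
) -> Dict[int, Dict[str, Optional[float]]]:
    """Assign each record to the smallest bucket edge >= its length and
    return the false positive and false negative rate per bucket."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    edges = sorted(bucket_edges)
    for length, truth, pred in records:
        edge = next((e for e in edges if length <= e), None)
        if edge is None:
            continue  # longer than the largest bucket; skipped in this sketch
        c = counts[edge]
        if truth == 1:
            c["pos"] += 1
            c["fn"] += int(pred == 0)  # harmful input that was let through
        else:
            c["neg"] += 1
            c["fp"] += int(pred == 1)  # benign input that was blocked
    return {
        edge: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else None,
            "fnr": c["fn"] / c["pos"] if c["pos"] else None,
        }
        for edge, c in sorted(counts.items())
    }
```

Reporting the two rates per bucket, rather than a single accuracy figure, shows whether a guardrail is failing by letting harmful content through or by blocking benign content as inputs grow.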
The authors also note that context rot is distinct from the well-known 'lost in the middle' problem. While lost-in-the-middle refers to an LLM's reduced ability to use information located in the middle of a long context, context rot concerns the classifier's ability to correctly label the entire input. The two phenomena may compound each other, making long-context AI systems even less reliable than previously assumed.
Immediate Practical Concerns
For engineers deploying LLM-based systems today, the finding is a wake-up call. Many production pipelines rely on a 'classify then generate' pattern: a small, fast classifier filters inputs before the large generative model is invoked. If the classifier's performance is context-dependent, the entire pipeline may have hidden failure modes. The paper recommends that teams re-evaluate their classifiers on long-context benchmarks before trusting them in production. 'If you are using a classifier that was fine-tuned on short inputs, you are likely overestimating its accuracy on real-world long documents,' the authors write.
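One way to act on that recommendation is a pre-deployment check that compares accuracy at each tested length against the shortest-length baseline. The Python sketch below assumes a hypothetical `classify` callable that maps text to a label and a dictionary of test suites keyed by context length; the 0.02 tolerance is an arbitrary placeholder, not a value from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, expected_label)


def accuracy(classify: Callable[[str], int], examples: Sequence[Example]) -> float:
    """Fraction of examples the classifier labels correctly."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(int(classify(text) == label) for text, label in examples)
    return correct / len(examples)


def length_regressions(
    classify: Callable[[str], int],
    suites: Dict[int, Sequence[Example]],
    tolerance: float = 0.02,
) -> List[int]:
    """Return the context lengths whose accuracy falls more than
    `tolerance` below the accuracy measured at the shortest length."""
    if not suites:
        raise ValueError("need at least one suite")
    lengths = sorted(suites)
    baseline = accuracy(classify, suites[lengths[0]])
    return [
        n for n in lengths[1:]
        if baseline - accuracy(classify, suites[n]) > tolerance
    ]
```

A non-empty result could fail a CI job, so that a classifier validated only on short inputs is not shipped into a long-context pipeline unnoticed.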
The preprint also suggests mitigation strategies: fine-tune classifiers on long-context samples, use sliding window approaches, or deploy dedicated long-context classifiers that are trained from scratch. However, each of these approaches comes with trade-offs in cost, latency, and complexity. The paper does not claim to have solved context rot, only that the problem exists and deserves attention.
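Of the mitigations listed, the sliding window approach is the simplest to sketch. The Python example below splits a long input into overlapping windows that stay within the length the classifier was tuned on, scores each window, and keeps the maximum score. The `score` callable, the window and stride defaults, the whitespace tokenization, and the max-aggregation rule are all assumptions for illustration; the article does not say how the paper aggregates window scores.

```python
from typing import Callable, List


def sliding_windows(tokens: List[str], window: int, stride: int) -> List[List[str]]:
    """Split `tokens` into windows of at most `window` tokens,
    advancing by `stride` tokens and always covering the tail."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    # Add a final window if the regular stride left the tail uncovered.
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [tokens[s:s + window] for s in starts]


def classify_long(text: str, score: Callable[[str], float],
                  window: int = 512, stride: int = 256) -> float:
    """Score each window separately and return the maximum score.

    Assumption: `score` returns a higher value for more harmful text,
    and whitespace tokens approximate the classifier's real tokens.
    """
    tokens = text.split()
    return max(score(" ".join(w)) for w in sliding_windows(tokens, window, stride))
```

Taking the maximum means a single flagged window flags the whole document, which favors catching harmful content but is likely to raise the false positive rate as the number of windows grows. Averaging or voting across windows shifts that trade-off in the other direction. The approach also multiplies inference calls, which is the latency and cost trade-off noted above.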
This work arrives at a time when long-context capabilities are being marketed as a key differentiator. Major LLM providers have extended context windows to 1M tokens or more. The assumption has been that longer context always means better understanding. Context rot challenges that assumption, at least for the classifier components that are often the first line of defense in AI safety.

Connections to Broader AI Safety Research
Context rot is the latest addition to a growing list of failure modes that appear when AI systems operate beyond their training distribution. It echoes related findings such as 'reward hacking' (also appearing in this arXiv batch as arXiv:2605.12474) and 'semantic reward collapse' (arXiv:2605.12406). Together, these papers paint a picture of an ecosystem where models behave predictably on benchmarks but degrade in deployment conditions that are hard to replicate during development.
Sam Martin and Fabien Roger are known for their work on monitoring and auditing of AI systems. Their other paper in the same batch, 'How Useful Is Cross-Domain Generalization for Training LLM Monitors?' (arXiv:2605.12265), explores whether monitors trained on one domain transfer to others. The new context rot paper extends that line of inquiry to the input length dimension.
The paper is positioned as a preliminary study, but its implications are clear: as AI systems are given longer leashes, their oversight mechanisms must be stress-tested at scale. The authors have released their evaluation scripts and data to encourage replication. They call on the community to establish standard context-length stress tests for classifiers, similar to how adversarial robustness became a standard evaluation axis after years of research.
What to Watch Next
Practitioners should monitor whether context rot is an artifact of current architectures or a fundamental limitation of transformer-based classifiers. If it is fundamental, the need for non-parametric or hybrid approaches becomes more urgent. The paper also raises questions for the design of agentic systems that accumulate memory over long trajectories — a common pattern in the 'agent memory' papers also posted this week (e.g., 'Executable Agentic Memory for GUI Agent', arXiv:2605.12294). If classifiers rot with context, then agents that rely on long-term memory may accumulate classification errors over time.
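The accumulation concern can be made concrete with a back-of-the-envelope calculation. This is a simplified illustration, not a result from the paper: it assumes the classifier runs once per agent step, that errors at different steps are independent, and that step t has error probability p_t. The numbers in the final comment are illustrative only.

```latex
% Assumption: independent errors, with probability p_t at step t.
P(\text{at least one error in } n \text{ steps}) = 1 - \prod_{t=1}^{n} (1 - p_t)

% Special case with a constant error rate p_t = p:
P(\text{at least one error in } n \text{ steps}) = 1 - (1 - p)^n

% Illustrative values: p = 0.01 and n = 100 give 1 - 0.99^{100} \approx 0.63
```

If context rot makes p_t grow as the agent's memory lengthens, the product shrinks faster than in the constant-rate case, so the chance of at least one misclassification over a long trajectory is higher than a short-context error rate would suggest.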
The AI community will be watching for follow-up studies that replicate the findings across more model families and tasks. In the meantime, the paper serves as a practical warning: do not assume your long-context classifier is safe. Test it at the lengths you actually use.