Pre-LLM Text
Human-written content produced before large language models contaminated the web
Pre-LLM text is human-authored content created before large language models became capable of generating plausible prose at scale—roughly before 2020. Its value lies in verified provenance: it is unambiguously human-written, not synthetic, and not recursively derived from AI outputs.
The analogy to low-background steel is explicit in recent discourse. Atmospheric nuclear testing after 1945 contaminated all newly produced steel with radioactive isotopes, so pre-war steel, often salvaged from shipwrecks, became prized for radiation-sensitive instruments. Likewise, LLM proliferation has contaminated the web with synthetic text that is difficult to distinguish from human writing. Pre-LLM text is a pre-contamination resource: finite, irreplaceable, and increasingly valuable.
The stakes are practical. Training AI on AI-generated text causes model collapse: output distributions narrow, rare modes vanish, and quality degrades across generations. Clean human-written corpora are essential to avoid this recursive trap, but training data contamination makes post-2020 web scrapes unreliable. Archives with verifiable timestamps—the Internet Archive, academic repositories, pre-2020 Common Crawl snapshots—become primary sources for future training.
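In practice, building a clean corpus from such archives reduces to filtering documents by a verified capture date. The sketch below illustrates the idea with a hypothetical record format of `(url, iso_timestamp, text)` and an assumed cutoff of January 1, 2020; real pipelines would instead read capture dates from WARC headers or archive metadata.

```python
from datetime import datetime, timezone

# Assumed cutoff: documents archived before 2020 are treated as pre-LLM.
CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc)

def is_pre_llm(record):
    """Keep only documents whose archived capture time predates the cutoff.

    `record` is a hypothetical (url, iso_timestamp, text) tuple; a real
    pipeline would parse timestamps from WARC records or CDX indexes.
    """
    url, iso_timestamp, text = record
    captured = datetime.fromisoformat(iso_timestamp)
    return captured < CUTOFF

corpus = [
    ("https://example.org/a", "2016-05-04T12:00:00+00:00", "human prose"),
    ("https://example.org/b", "2023-01-15T08:30:00+00:00", "possibly synthetic"),
]

# Only the 2016 capture survives the filter.
clean = [r for r in corpus if is_pre_llm(r)]
```

The filter trusts the archive's timestamp, not any property of the text itself; that is the whole point of provenance-based curation, since no classifier is needed to judge whether the prose is synthetic.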
Pre-LLM text also matters for research validity. Studies of human language, cognition, and culture require corpora known to reflect human production, not statistical patterns learned from prior models. The AI slop flooding the web raises the noise floor for all downstream analysis.
The irony mirrors low-background steel: the most advanced AI systems will depend on text produced before AI was advanced enough to write.