Pre-LLM Text

Human-written content produced before large language models contaminated the web

Pre-LLM text is human-authored content created before large language models became capable of generating plausible prose at scale—roughly before 2020. Its value lies in verified provenance: it is unambiguously human-written, not synthetic, and not recursively derived from AI outputs.
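
Operationally, the definition reduces to a provenance predicate: keep a document only if an independent archive attests that it existed before the cutoff. Below is a minimal sketch of that predicate, assuming a hypothetical record format; the `archive_capture_date` field name and the 2020-01-01 cutoff are illustrative, not prescribed by this note.

```python
from datetime import datetime, timezone

# Illustrative only: the "roughly before 2020" criterion applied to
# documents whose timestamps come from an independent archive (e.g. an
# Internet Archive snapshot date), not from the documents themselves.
CUTOFF = datetime(2020, 1, 1, tzinfo=timezone.utc)

def is_pre_llm(doc: dict) -> bool:
    """Keep a document only if an external capture date attests that it
    existed before the cutoff."""
    captured = doc.get("archive_capture_date")  # hypothetical field name
    return captured is not None and captured < CUTOFF

corpus = [
    {"url": "https://example.org/essay",
     "archive_capture_date": datetime(2016, 3, 9, tzinfo=timezone.utc)},
    {"url": "https://example.org/post",
     "archive_capture_date": datetime(2023, 7, 1, tzinfo=timezone.utc)},
]
print([d["url"] for d in corpus if is_pre_llm(d)])  # only the 2016 essay survives
```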

The analogy to low-background steel is explicit in recent discourse. Just as atmospheric nuclear testing contaminated virtually all post-1945 steel with radioactive isotopes, LLM proliferation has contaminated the web with synthetic text indistinguishable from human writing. Pre-LLM text is a pre-contamination resource: finite, irreplaceable, and increasingly valuable.

The stakes are practical. Training AI on AI-generated text causes model collapse: output distributions narrow, rare modes vanish, and quality degrades across generations. Clean human-written corpora are essential to avoid this recursive trap, but training data contamination makes post-2020 web scrapes unreliable. Archives with verifiable timestamps—the Internet Archive, academic repositories, pre-2020 Common Crawl snapshots—become primary sources for future training.
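
The narrowing dynamic can be demonstrated in a toy setting. The sketch below is illustrative only, with a one-dimensional Gaussian standing in for a language model and arbitrary sample sizes: each generation fits a model to the previous generation's synthetic output, and because every fit is made from a finite sample, the estimated spread drifts downward across generations and the tails, where the rare modes live, vanish first.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100    # documents per generation (arbitrary)
generations = 500

# Generation 0: "human" data with genuine diversity.
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for g in range(1, generations + 1):
    # Fit a simple model (mean and spread) to the current corpus.
    mu, sigma = data.mean(), data.std(ddof=1)
    # The next generation trains only on this model's synthetic output.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    if g % 100 == 0:
        print(f"generation {g:3d}: fitted sigma = {sigma:.4f}")
```

Printing the fitted spread every hundred generations shows it shrinking toward zero. Real model collapse involves far richer distributions, but the recursive mechanism is the same.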

Pre-LLM text also matters for research validity. Studies of human language, cognition, and culture require corpora known to reflect human production, not statistical patterns learned from prior models. The AI slop flooding the web raises the noise floor for all downstream analysis.

The irony mirrors low-background steel: the most advanced AI systems will depend on text produced before AI was advanced enough to write.
