
Training Data Contamination

AI-generated content polluting the corpora used to train future models

Training data contamination occurs when AI-generated text enters the web, gets scraped into training corpora, and shapes the next generation of models. The result is a feedback loop: models trained on their predecessors' outputs inherit their biases, amplify their errors, and lose access to the independent human signal that made the originals useful.

This is distinct from benchmark contamination (test data leaking into training sets), though the two share a name. Training data contamination is about the provenance of the underlying corpus: once AI slop mixes with human-written text at scale, distinguishing the two becomes expensive or impossible. Web scrapes taken after 2022 are increasingly suspect.

The consequences compound. Model collapse describes the quality degradation when models train on synthetic data: distributions narrow, rare modes disappear, and the output converges toward a homogenized mean. Encyclopedia Meltdown describes the knowledge-system failure when AI outputs are cited as sources, creating circular authority. Training data contamination is the upstream cause of both.
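
To make the narrowing concrete, here is a minimal sketch (a toy, not any real training pipeline) of the collapse dynamic: each generation fits a categorical distribution to a finite sample of its predecessor's output. A mode that happens to draw zero samples vanishes permanently, so diversity can only shrink. The vocabulary size, sample count, and Dirichlet prior are illustrative assumptions.

```python
# Toy sketch of model collapse: each "generation" refits a categorical
# model on N samples drawn from the previous generation's output.
# Parameters are illustrative assumptions, not a published experiment.
import numpy as np

rng = np.random.default_rng(42)

V = 1_000                            # number of distinct "modes"
N = 5_000                            # synthetic samples per generation
p = rng.dirichlet(np.full(V, 0.5))   # gen 0: long-tailed human distribution

for gen in range(1, 21):
    draws = rng.choice(V, size=N, p=p)    # the predecessor's output
    counts = np.bincount(draws, minlength=V)
    p = counts / N                        # the next model fits only this
    if gen % 5 == 0:
        alive = int((p > 0).sum())
        entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
        print(f"gen {gen:2d}: surviving modes {alive:4d}/{V}, entropy {entropy:.2f}")
```

The absorbing dynamic is what makes the feedback loop one-way: once a predecessor drops a rare mode, no later generation can resample it.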

The parallel to low-background steel clarifies the problem. Atmospheric nuclear testing contaminated all steel produced after 1945; LLM proliferation contaminated all web text produced after 2020. Both contamination events were irreversible, both created demand for pre-contamination stock, and both mean that the most demanding applications (radiation-sensitive instruments in one case, clean training corpora in the other) require materials produced before the contaminating technology existed.

Solutions involve provenance verification, timestamp-gated archives, and data curation practices that privilege sources with clear chains of human authorship. The MIT Data Provenance Initiative and similar efforts aim to bring transparency to training data origins—a necessary step if future models are to avoid training on their own reflections.
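
As one deliberately simplified illustration of timestamp gating, the sketch below filters a corpus by archive date, admitting post-cutoff documents only when they carry an explicit human-authorship attestation. The record fields (`snapshot_date`, `human_authorship_verified`) and the cutoff date are assumptions for the example, not an established schema.

```python
# Sketch of timestamp-gated corpus curation. The record schema and the
# cutoff date are illustrative assumptions, not any standard format.
from datetime import date

LLM_CUTOFF = date(2022, 11, 30)  # assumed watershed, roughly ChatGPT's release

def admissible(doc: dict) -> bool:
    """Admit documents archived before the cutoff, or documents that
    carry an explicit, verified chain of human authorship."""
    snapshot = doc.get("snapshot_date")
    if snapshot is not None and snapshot < LLM_CUTOFF:
        return True                  # predates large-scale LLM output
    return bool(doc.get("human_authorship_verified", False))

corpus = [
    {"text": "pre-LLM blog post", "snapshot_date": date(2019, 6, 1)},
    {"text": "post-cutoff scrape", "snapshot_date": date(2024, 1, 15)},
    {"text": "signed manuscript", "snapshot_date": date(2023, 5, 2),
     "human_authorship_verified": True},
]

clean = [doc for doc in corpus if admissible(doc)]
print(f"kept {len(clean)} of {len(corpus)} documents")  # kept 2 of 3
```

The weak point is the attestation itself: timestamp gating is cheap and reliable, while authorship verification pushes the hard problem back onto provenance infrastructure of the kind the efforts above are building.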
