Recursive centroid subtraction is the engine of Information Chemistry. It is the process that turns dense, overlapping information into distinct elemental components. It works by removing the shared signal of each community to expose what is unique, then repeating the process on the residuals.
You can think of it as iterative distillation. Each pass extracts a purer signal, until you reach a limit where no structured signal remains. That limit is the boundary between meaningful atoms and noise.
The Core Algorithm
- Embed each data unit as a vector.
- Cluster vectors into communities.
- Compute a centroid for each community.
- Subtract the centroid from each member vector.
- Re‑cluster the residual vectors.
- Repeat until convergence.
Each pass is a chemical reaction cycle. The centroid is the shared compound. The subtraction isolates the unique residue. Re‑clustering reveals new compounds formed by residues that share structure.
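As a concrete sketch under some assumptions: the data units are already embedded as rows of a NumPy array, k‑means stands in for the clustering step (any community detection method could be swapped in), and the cluster count, pass limit, and tolerance are placeholders to tune rather than recommended values.

```python
import numpy as np
from sklearn.cluster import KMeans

def recursive_centroid_subtraction(vectors, n_clusters=8, max_passes=5, tol=1e-3):
    """Cluster, subtract each community's centroid, and repeat on the residuals."""
    residuals = np.array(vectors, dtype=float)
    pass_labels = []
    for _ in range(max_passes):
        # Steps 1-2: cluster the current vectors into communities.
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(residuals)
        pass_labels.append(labels)

        # Steps 3-4: compute each community's centroid and subtract it from its members.
        centroid_norms = []
        for c in range(n_clusters):
            members = labels == c
            centroid = residuals[members].mean(axis=0)   # the shared compound
            residuals[members] -= centroid               # keep only the unique residue
            centroid_norms.append(np.linalg.norm(centroid))

        # Steps 5-6: re-clustering happens on the next loop iteration; stop once the
        # centroids carry almost no shared signal, i.e. there is nothing left to subtract.
        if np.mean(centroid_norms) < tol:
            break
    return pass_labels, residuals
```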
Why Subtraction Works
Vectors encode both shared and unique aspects of information. When you subtract the centroid, you remove the shared component and preserve the remainder. This remainder often contains the nuance that makes an item distinct within its community.
Without subtraction, clustering can collapse into obvious themes. With subtraction, you reveal latent structure. This is especially valuable in data where top‑level themes dominate and mask subtler distinctions.
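A toy illustration with hand-made vectors rather than real embeddings: the first two dimensions stand in for the community's shared theme, and subtracting the centroid leaves only the dimensions where the members differ.

```python
import numpy as np

# Two members of one community: a strong shared theme plus a small unique signal each.
a = np.array([1.0, 1.0, 0.2, 0.0])
b = np.array([1.0, 1.0, 0.0, 0.3])

centroid = (a + b) / 2    # the shared compound: [1.0, 1.0, 0.1, 0.15]
print(a - centroid)       # [ 0.   0.   0.1  -0.15]  -> only a's unique residue
print(b - centroid)       # [ 0.   0.  -0.1   0.15]  -> only b's unique residue
```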
Convergence Criteria
Convergence occurs when:
- New clusters stop forming.
- Residuals become uniform or random.
- Community structures become unstable or diffuse.
At that point, you can treat the remaining stable clusters as atomic units. If you continue beyond this point, you are likely extracting noise rather than meaning.
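One way to turn these criteria into a check, assuming you keep the labels from the previous and current pass along with the current residual vectors; both thresholds are illustrative, not canonical.

```python
from sklearn.metrics import adjusted_rand_score, silhouette_score

def has_converged(prev_labels, labels, residuals,
                  ari_threshold=0.95, silhouette_threshold=0.05):
    """Heuristic convergence test applied after each pass."""
    # New clusters stop forming: assignments barely change between passes.
    assignments_stable = adjusted_rand_score(prev_labels, labels) >= ari_threshold
    # Residuals look uniform or random: clusters no longer separate them well.
    residuals_diffuse = silhouette_score(residuals, labels) <= silhouette_threshold
    return assignments_stable or residuals_diffuse
```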
Interpreting the Convergence Frontier
The convergence frontier is not a single point. It can vary depending on:
- The embedding model used.
- The clustering algorithm and its parameters.
- The dataset’s diversity and size.
For small datasets, convergence may happen early, yielding coarse atoms. For large datasets, you can go deeper, revealing finer structures. You can tune the depth of recursion based on your goals: coarse atoms for fast indexing, deeper atoms for research‑grade analysis.
Stabilization and Multi‑Pass Systems
Multi‑pass systems can run until a stability threshold is met. You can measure stability by comparing communities between passes or by evaluating changes in centroid norms. When change falls below a threshold, you can stop.
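A sketch of the centroid-norm variant of that check, assuming the number of communities stays fixed between passes so their norms can be compared in aggregate; the stopping threshold is a placeholder.

```python
import numpy as np

def centroid_drift(prev_centroids, centroids):
    """Mean absolute change in centroid norms between two consecutive passes."""
    # Sort the norms because cluster identities are arbitrary across passes.
    prev_norms = np.sort(np.linalg.norm(prev_centroids, axis=1))
    norms = np.sort(np.linalg.norm(centroids, axis=1))
    return float(np.abs(norms - prev_norms).mean())

# Stop the multi-pass loop once centroid_drift(...) falls below, say, 1e-2.
```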
This mirrors chemical equilibrium. The system stabilizes when reactions no longer produce new compounds. At that point, the information chemistry has reached its equilibrium state for that dataset.
Using Convergence as a Signal
Convergence is not only a stopping condition. It is a signal of information density. If convergence happens quickly, the dataset may be homogeneous or narrow. If it takes many passes, the dataset has deep structure.
You can compare convergence profiles across datasets to measure complexity. This becomes a diagnostic tool: a way to quantify how much informational depth exists within a domain.
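One way to put a number on that, assuming you log the total residual norm before the first pass and after every pass: the decay curve is the dataset's convergence profile, and the cut-off below is only illustrative.

```python
import numpy as np

def convergence_profile(residual_norms):
    """Fraction of the original residual norm remaining at each point in the run."""
    norms = np.asarray(residual_norms, dtype=float)
    return norms / norms[0]

def depth_of_structure(profile, drop_threshold=0.01):
    """Number of passes before a single pass removes less than drop_threshold of the original norm."""
    drops = -np.diff(profile)                     # norm removed by each pass
    below = np.where(drops < drop_threshold)[0]
    return int(below[0]) if below.size else len(drops)
```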
Practical Considerations
- Use consistent embedding models to ensure comparability.
- Record residual norms to track information loss across passes.
- Store centroids as reusable concept vectors for downstream synthesis.
- Monitor for cluster fragmentation that indicates noise.
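A minimal bookkeeping sketch for the first three habits, with illustrative field names: one record per pass keeps the residual norm for tracking information loss and the centroids as reusable concept vectors.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PassRecord:
    """What is worth keeping from one subtraction pass."""
    labels: np.ndarray      # community assignment per vector
    centroids: np.ndarray   # shared compounds, reusable as concept vectors downstream
    residual_norm: float    # total norm remaining, for tracking information loss

def record_pass(labels, centroids, residuals):
    return PassRecord(labels=labels, centroids=centroids,
                      residual_norm=float(np.linalg.norm(residuals)))
```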
Why This Matters
Recursive centroid subtraction enables a rigorous approach to meaning extraction. It replaces subjective categorization with a repeatable process. It makes the decomposition of information explicit and testable. You are not just clustering; you are iteratively stripping away the shared signal to reveal the irreducible core.
Going Deeper
- Experiment with different clustering algorithms to compare stability.
- Explore adaptive depth, where recursion stops at different points for different communities.
- Use residuals as features for anomaly detection or novelty discovery.
- Build tools that visualize convergence as a time‑series of community evolution.