A single tweet can sit almost equidistant between two diametrically opposed clusters in embedding space, and the points in that geometric no-man’s-land are precisely where the most valuable contradictions hide.
Problem
When you ingest the latest 50 Karpathy tweets on self-reflection into your evolving knowledge wiki, simple retrieval via cosine similarity on the embedding vectors readily surfaces semantically related statements. Yet human readers quickly notice internal contradictions: one tweet celebrates the necessity of brutal self-honesty while another implies we should be kinder to our past selves. These contradictions are invisible to standard top-k retrieval because they hide in the geometry of high-dimensional space. The real task for a self-evolving system is to surface these geometric outliers automatically so that a reflective agent can critique them and evolve the knowledge graph. This is where k-means clustering, silhouette score, and visual contradiction detection become essential architectural components.
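To make the baseline concrete, here is a small sketch of top-k retrieval by cosine similarity, the mechanism that surfaces related statements but cannot see contradictions. The toy 2-D vectors and the function names are illustrative stand-ins for real tweet embeddings, not part of any actual system.

```python
# Baseline top-k retrieval by cosine similarity.
# Toy 2-D vectors stand in for real high-dimensional tweet embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, corpus, k):
    # Rank corpus indices by similarity to the query, highest first.
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]

# Two tweets that contradict each other can still sit close in embedding
# space, so both come back for the same query and the tension stays hidden.
corpus = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9]]
query = [1.0, 0.0]
result = top_k(query, corpus, 2)
```

Both near-duplicate vectors are returned together; nothing in the ranking signals that their contents might disagree.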
Concept
k-means clustering partitions points in high-dimensional embedding space into k groups by iteratively assigning each tweet (represented by its embedding vector) to the nearest centroid and then moving each centroid to the mean of its assigned points. Because the space is high-dimensional, we rely on t-SNE projection (which you already know preserves local neighborhoods better than PCA projection) to visualize the result while keeping the semantic cluster structure that cosine similarity reveals in the original space.
Geometric outliers are points whose distance to their assigned centroid significantly exceeds the average intra-cluster distance. These are not noise; in the context of self-reflection tweets they frequently mark conceptual tension — places where the same author’s thinking appears to contradict itself when measured by angle between vectors.
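A minimal sketch of that outlier test, using cosine distance (1 − cosine similarity) as the metric to stay consistent with the angle-based view above. The `factor` multiplier and the function names are assumptions for illustration, not canonical values.

```python
# Flag geometric outliers: points whose cosine distance to their assigned
# centroid significantly exceeds the cluster's average intra-cluster distance.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def geometric_outliers(vectors, labels, centroids, factor=2.0):
    # factor is an assumed tunable multiplier; "significantly exceeds" is
    # operationalized here as more than factor times the cluster mean.
    flagged = []
    for c in set(labels):
        dists = {i: cosine_distance(vectors[i], centroids[c])
                 for i in range(len(vectors)) if labels[i] == c}
        mean_d = sum(dists.values()) / len(dists)
        flagged += [i for i, d in dists.items() if d > factor * mean_d]
    return sorted(flagged)

# Two tight points and one stray, all assigned to the same cluster:
vecs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
outliers = geometric_outliers(vecs, [0, 0, 0], [[1.0, 0.0]])
```

The stray vector is the only one flagged; in the tweet setting it would be handed to the reflective agent as a candidate contradiction.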
The silhouette score quantifies how well each point fits its cluster: for point i it is (b − a) / max(a, b) where a is average dissimilarity to other points in its own cluster and b is the smallest average dissimilarity to any other cluster. A score near +1 means the point is well-clustered; near 0 means it sits on the boundary; negative values indicate it is probably misassigned. When we color nodes by this score we turn an opaque clustering result into an immediately readable map of confidence.
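The per-point formula translates directly into code. This sketch uses cosine distance as the dissimilarity, matching the angle-based metric used throughout; the singleton-cluster convention (score 0) is an assumption borrowed from common practice.

```python
# Silhouette score for one point: (b - a) / max(a, b), where
# a = average dissimilarity to the other points in its own cluster,
# b = smallest average dissimilarity to any other cluster.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def silhouette(i, vectors, labels):
    own = labels[i]

    def avg_dist(cluster):
        members = [j for j in range(len(vectors))
                   if labels[j] == cluster and j != i]
        return sum(cosine_distance(vectors[i], vectors[j])
                   for j in members) / len(members)

    same = [j for j in range(len(vectors)) if labels[j] == own and j != i]
    if not same:
        return 0.0  # singleton cluster: score 0 by convention (assumed)
    a = avg_dist(own)
    b = min(avg_dist(c) for c in set(labels) if c != own)
    return (b - a) / max(a, b)

# A point deep inside a tight cluster scores near +1:
vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
score = silhouette(0, vecs, labels)
```

Running this over every point yields exactly the per-node values used to color the map described above.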
Visual contradiction detection ties these together: any tweet whose silhouette score falls below an adaptive threshold (or whose distance from its centroid exceeds a geometric multiple of the cluster’s radius) is flagged for agent reflection. The geometry itself becomes the signal that retrieval relevance via similarity alone cannot see.
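One way to make the threshold adaptive is to flag any point whose silhouette score falls more than one standard deviation below the mean of all scores. The one-sigma cutoff and the function name are illustrative assumptions, not a prescribed rule.

```python
# Adaptive flagging sketch: send a tweet to the reflective agent when its
# silhouette score is more than one standard deviation below the mean.
import statistics

def flag_for_reflection(silhouette_scores):
    mean = statistics.fmean(silhouette_scores)
    sd = statistics.pstdev(silhouette_scores)
    threshold = mean - sd  # assumed one-sigma cutoff; tune per corpus
    return [i for i, s in enumerate(silhouette_scores) if s < threshold]

# Three well-clustered tweets and one boundary case:
flagged = flag_for_reflection([0.9, 0.85, 0.88, 0.1])
```

Because the threshold moves with the score distribution, a uniformly fuzzy clustering does not flood the agent with flags, while a single sharp outlier still stands out.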
Minimal working example
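The sketch below is reconstructed from the breakdown that follows: pure Python with no libraries, centroids seeded from actual tweet vectors, assignment by cosine similarity, a mean-vector update step, and plain data structures as output. The toy 2-D vectors stand in for real high-dimensional embeddings.

```python
# Minimal k-means over embedding vectors, clustering by cosine similarity.
# No libraries: the point is to expose the geometric mechanism.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def kmeans_cosine(vectors, k, iterations=20):
    # Initialize centroids from actual tweet vectors, not random noise,
    # so early iterations remain semantically meaningful.
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins the centroid it is most
        # similar to by angle (cosine similarity, not Euclidean distance).
        assignments = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the arithmetic mean of its
        # members, a "prototype" vector for the semantic cluster.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                dim = len(members[0])
                centroids[c] = [sum(v[i] for v in members) / len(members)
                                for i in range(dim)]
    # Pure data structures out: labels plus centroids, deliberately kept
    # separate from any visualization step.
    return assignments, centroids

# Toy 2-D "embeddings": two obvious semantic groups.
vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
labels, cents = kmeans_cosine(vecs, k=2)
```

Seeding from the first k vectors keeps the demo deterministic; a production variant would pick seeds more carefully (for example, k-means++-style spreading) before handing the resulting clusters to the silhouette and outlier passes.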
Example breakdown
The minimal example deliberately avoids libraries to expose the geometric mechanism. We initialize centroids from actual tweet vectors rather than random noise so that early iterations remain semantically meaningful. The assignment loop uses cosine similarity (the same angle between vectors you studied last lesson) instead of Euclidean distance because embedding vectors are usually normalized; this choice aligns clustering with the retrieval relevance via similarity used elsewhere in the wiki. The update step computes the arithmetic mean of vectors, which in embedding space corresponds to a “prototype” statement that best represents the semantic cluster. The final silhouette-ready clusters are pure data structures, deliberately separated from visualization so that the same clustering engine can run server-side while t-SNE projection happens only for human review.