Cosine Similarity as Angle Between Knowledge Vectors
The angle between two embedding vectors for Karpathy's tweets—one praising chain-of-thought reasoning and the other advocating test-time compute—is often larger than people expect, even though both ideas come from the same researcher and sit in the same semantic cluster under a PCA or t-SNE projection.
Problem
When a user asks your self-evolving wiki "How does Karpathy currently view scaling test-time compute?", the system must retrieve the most relevant tweet or note from thousands stored as embedding vectors in a high-dimensional space. Keyword search fails because the query uses different phrasing. The dot product alone is dominated by vector length, making long tweets appear artificially similar. You need a rotation-invariant measure that captures only directional alignment—the geometric angle—to surface truly relevant knowledge for Reflective Agent critique and contradiction detection. Without it, the wiki's autonomous evolution pollutes itself with low-relevance memories, breaking the consistency of the knowledge graph.
Concept
The dot product of two vectors A and B yields a scalar: A⋅B = Σᵢ₌₁ⁿ aᵢbᵢ. Geometrically this equals ||A||⋅||B||⋅cos θ, where θ is the angle between them. Normalizing by the product of the magnitudes isolates the cosine term, producing cosine similarity, which ranges from −1 to +1:

cosine similarity(A, B) = A⋅B / (||A||⋅||B||)
This gives the cosine of the angle between the vectors regardless of their lengths. For retrieval relevance via similarity we threshold or rank by this value: a higher cosine means tighter alignment in the semantic direction, enabling precise top-k retrieval even when semantic clusters in embedding space carry noisy variance.
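A quick worked example with two 3-dimensional vectors makes the formula concrete (the numbers are chosen for clean arithmetic, not taken from a real embedding):

A = (1, 2, 2), B = (2, 1, 2)
A⋅B = 1·2 + 2·1 + 2·2 = 8
||A|| = √(1 + 4 + 4) = 3, ||B|| = √(4 + 1 + 4) = 3
cosine similarity(A, B) = 8 / (3·3) = 8/9 ≈ 0.889, so θ = arccos(8/9) ≈ 27.3°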
Drag your mouse left/right across the canvas above to rotate the gold tweet vector and watch how the angle between vectors and cosine similarity change instantly.
Minimal Working Example
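The lesson's interactive code block is not reproduced here, so the following is a minimal TypeScript sketch consistent with the breakdown below. The function names (dotProduct, magnitude, cosineSimilarity) follow the text; the bodies are assumptions, and the ai-sdk embedding call the breakdown mentions is deliberately left out so the geometric core is self-contained and testable.

```typescript
// Minimal sketch (assumed implementation): pure vector math only.
// The async ai-sdk embedding step described in the breakdown would
// wrap these functions; it is omitted here for self-containment.

function dotProduct(a: number[], b: number[]): number {
  // Algebraic definition: sum of pairwise products across dimensions.
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function magnitude(v: number[]): number {
  // Euclidean length ||v||, used to normalize away document-length bias.
  return Math.sqrt(dotProduct(v, v));
}

function cosineSimilarity(a: number[], b: number[]): number {
  const denom = magnitude(a) * magnitude(b);
  // Guard against zero-length vectors to prevent NaN.
  if (denom === 0) return 0;
  return dotProduct(a, b) / denom;
}
```

Orthogonal vectors score 0, parallel vectors score 1, and scaling either input leaves the score unchanged—exactly the length invariance the breakdown below relies on.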
Example Breakdown
Every line is deliberate. dotProduct directly implements the algebraic definition of the dot product, which equals the product of the magnitudes times the cosine of the angle—the geometric foundation visualized earlier. magnitude normalizes for length because tweet length varies wildly; without it, longer posts would dominate retrieval scores in our wiki. The cosineSimilarity wrapper calls the ai-sdk embedder asynchronously and returns pure directional similarity. We guard against zero-length vectors (rare with modern embedding models) to prevent NaN. This minimal form solves exactly one problem: measuring retrieval relevance via similarity between any two pieces of Karpathy knowledge without being misled by magnitude.
Interactive Rotation Step-through
Click Prev/Next to walk through discrete relative angles between the two Karpathy tweet vectors. Observe how quickly cosine similarity drops as the geometric angle increases—this demonstrates why even semantically related ideas from the same author can produce surprisingly low similarity scores in high-dimensional spaces.
Extended Example
The extended version composes the minimal function into a real retrieval engine for our evolving wiki. It maps a natural-language query directly into embedding space, ranks every stored memory by retrieval relevance via similarity, and returns top-k results for the autonomous critique loop. We keep separate embedding calls explicit so future caching or batching layers can be slotted in without changing the geometric core.
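The extended code itself is not shown here; the sketch below captures its described shape under assumed names (WikiEntry and rankByCosine are illustrative, not the lesson's actual identifiers). Embedding stays outside the ranking function, matching the text's point that caching or batching layers can be slotted in without changing the geometric core.

```typescript
// Extended sketch under assumed shapes: rank stored wiki entries by
// cosine similarity against an already-embedded query and keep top-k.

interface WikiEntry {
  id: string;
  text: string;
  embedding: number[]; // computed once when the entry is ingested
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, ai, i) => sum + ai * b[i], 0);

const cosine = (a: number[], b: number[]): number => {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom; // guard zero-length vectors
};

function rankByCosine(
  queryVec: number[],
  entries: WikiEntry[],
  k: number,
): Array<{ entry: WikiEntry; score: number }> {
  return entries
    .map((entry) => ({ entry, score: cosine(entry.embedding, queryVec) }))
    .sort((a, b) => b.score - a.score) // strictly descending cosine
    .slice(0, k);
}
```

The query would be embedded exactly once (via the ai-sdk, per the text) before calling rankByCosine; the ranked scores then feed the autonomous critique loop.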
Mapping Similarity to Retrieval Ranking
Adjust the sliders to change cosine scores between a query and three knowledge entries. Watch the ranking and relevance bars update live, illustrating exactly how retrieval relevance via similarity translates raw geometry into ordered wiki results.
Common Mistakes
- Using the raw dot product instead of cosine: a longer tweet's embedding dominates the ranking even when its direction diverges from the query—easy to spot when similarity scores correlate strongly with tweet character length.
- Treating cosine = 0.4 as "similar" in very high dimensions: random vectors in 1536-D tend to have cosine around 0; set thresholds using empirical distribution from your own embedding vectors.
- Recomputing embeddings on every query: the ai-sdk calls are expensive; cache embeddings per wiki entry and only embed the query once.
- Ignoring negative cosine: in some embedding models opposing ideas can legitimately produce negative scores; discarding negatives can hide contradictions the reflective agent should discover.
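The first pitfall above can be demonstrated in a few lines. The vectors here are hypothetical 2-D stand-ins for embeddings; longNote is just a scaled-up copy of shortNote, mimicking a much longer document pointing the same semantic direction.

```typescript
// Length-bias pitfall: the raw dot product scales with magnitude,
// cosine similarity does not.
const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, ai, i) => sum + ai * b[i], 0);
const cosine = (a: number[], b: number[]): number =>
  dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));

const query = [0.6, 0.8];
const shortNote = [0.8, 0.6]; // well aligned with the query
const longNote = shortNote.map((x) => x * 10); // same direction, 10x magnitude

// Dot product inflates tenfold (~0.96 vs ~9.6) while cosine is
// identical for both notes (~0.96): only direction survives.
console.log(dot(query, shortNote), dot(query, longNote));
console.log(cosine(query, shortNote), cosine(query, longNote));
```

Ranking by the first pair would always surface longNote; ranking by the second treats both notes as equally relevant, which is the behavior the wiki needs.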
Real-World Application
In Karpathy’s self-evolving wiki this geometric similarity engine sits at the heart of the memory retrieval loop. Every new tweet or paper is embedded once, stored with its vector, and later ranked against both user queries and the agent’s own generated critiques. When the wiki detects two entries whose cosine similarity is unexpectedly low despite belonging to the same semantic clustering in embedding space (revealed via occasional PCA projection sanity checks), it surfaces them to the reflective agent for contradiction resolution. The same mechanism powers attention-like weighting inside the agent’s prompt: higher cosine entries receive proportionally more tokens, mimicking how geometric proximity influences focus. Production pitfalls include embedding model drift (newer versions shift angles) and adversarial prompts that force cosine saturation; both are mitigated by periodic re-embedding of the entire knowledge base and statistical outlier detection on similarity histograms. This geometric foundation directly prepares us to move from pairwise angles to full cluster analysis in the next module.
Quiz
Q1: What does the cosine similarity formula geometrically represent?
- (a) The Euclidean distance between two points
- (b) The angle between two normalized vectors
- (c) The sum of every dimension multiplied without normalization
- (d) The projection of one vector onto a random axis
Correct: (b) It isolates the directional alignment independent of magnitude.
Q2: Why is normalization by vector magnitudes required when using the dot product for similarity?
- (a) To make the values always positive
- (b) To remove length bias so that longer documents do not dominate
- (c) To convert from radians to degrees
- (d) To increase numerical stability in 2D only
Correct: (b) Raw dot product favors longer vectors even when directions diverge.
Q3: In the context of retrieval relevance via similarity for a wiki, what does a higher cosine score indicate?
- (a) The two entries will have similar word counts
- (b) Their embedding vectors point in nearly the same semantic direction
- (c) Their t-SNE projections will overlap visually
- (d) The entries were created on the same date
Correct: (b) High cosine means the knowledge is aligned in high-dimensional semantic space.
Q4: Given three tweet embeddings with cosine similarities to a query of 0.91, 0.74 and 0.22 respectively, which entry does the retrieval function surface first when k=1?
- (a) The one with 0.74 because middle values are safer
- (b) The one with 0.22 because low similarity surfaces contradictions
- (c) The one with 0.91 because it has the largest cosine score
- (d) Any of them; order is arbitrary in the code
Correct: (c) Ranking is strictly descending cosine, so the highest score is retrieved first.
Q5: Which common mistake would cause your wiki to repeatedly retrieve the longest Karpathy notes regardless of semantic alignment?
- (a) Using the normalized cosine similarity function
- (b) Sorting by raw dot product rather than cosine similarity
- (c) Setting the similarity threshold to exactly zero
- (d) Embedding both query and documents with the same model
Correct: (b) Raw dot product scales with magnitude, so longer vectors (longer tweets) artificially win relevance contests.