Attention as Learned Geometric Routing
A model that can perfectly retrieve relevant tweets using cosine similarity still fails to reflect on its own contradictions, because retrieval is symmetric while true reflection is asymmetric, directed flow.
Problem
When building a self-evolving knowledge wiki that ingests Andrej Karpathy’s latest tweets on o1-style reasoning, simply retrieving similar statements by embedding similarity is insufficient. The system must actively route information from one knowledge node to another in a directed, selective manner, deciding not just what is similar but what should attend to what. Without this mechanism, the wiki cannot form a reflective attention loop that critiques its own prior statements, detects contradictions that k-means clustering or silhouette scores miss, and iteratively refines its internal knowledge graph. The real task is to turn static high-dimensional embedding vectors into a dynamic routing system that simulates reflective cognition.
Concept
Attention weights are learned scalars (or normalized probability distributions) that quantify how much one token or knowledge node should focus on another. Geometrically, they convert pairwise cosine similarity (an undirected angle) into directed routing probabilities. Information routing via attention treats these weights as flow capacities along directed edges in the embedding space: a node sends a weighted fraction of its own vector to each target, producing an aggregated context vector.
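The asymmetry is easy to see in a few lines of NumPy. This is a minimal sketch, not any library's implementation: the node embeddings are random, and the projection matrices `W_q` and `W_k` stand in for learned weights. Cosine similarity between nodes is necessarily symmetric, but projecting the same vectors through two different linear maps before the dot product produces directed routing weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                        # embedding dimension (illustrative)
X = rng.normal(size=(3, d))  # three knowledge-node embeddings

# Cosine similarity: an undirected angle, so the matrix is symmetric.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
cos_sim = Xn @ Xn.T          # cos_sim[i, j] == cos_sim[j, i]

# Stand-in "learned" projections: queries and keys are different
# linear views of the same embeddings, which breaks the symmetry.
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
Q, K = X @ W_q, X @ W_k

scores = Q @ K.T / np.sqrt(d)                   # scores[i, j] != scores[j, i]
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # each row is a routing distribution

# Each node aggregates a weighted mix of all nodes' vectors:
context = weights @ X

assert np.allclose(cos_sim, cos_sim.T)          # similarity: symmetric
assert not np.allclose(weights, weights.T)      # attention: directed
```

The two assertions at the end are the whole point: the similarity matrix equals its transpose, while the attention matrix does not, so node i can attend strongly to node j even when j barely attends to i.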
Attention as directed geometric flow reframes the dot-product attention formula as movement along vectors. Instead of merely measuring angles, attention decides where to send information next, creating asymmetric pathways that a PCA or t-SNE projection can reveal as curved, directed trails between semantic clusters in embedding space. The reflective attention loop closes the circuit by letting the system attend back to its own previous outputs, enabling iterative self-critique without external prompting.
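The reflective loop can be sketched as a single attention step applied repeatedly over the system's own output history. Everything here is a hypothetical illustration under the same assumptions as before: `W_q` and `W_k` are stand-ins for learned projections, and the update rule mixing the current state with the routed context is one simple choice among many.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W_q = rng.normal(size=(d, d))  # illustrative "learned" projections
W_k = rng.normal(size=(d, d))

def attend(query_vec, memory):
    """Route information from the node's own past outputs back into its state."""
    M = np.stack(memory)                 # (t, d): history of prior outputs
    q = query_vec @ W_q
    scores = (M @ W_k) @ q / np.sqrt(d)  # directed: present attends to past
    w = np.exp(scores - scores.max())
    w /= w.sum()                         # routing distribution over history
    return w @ M                         # context vector drawn from own history

state = rng.normal(size=d)
memory = [state.copy()]                  # the loop starts from one prior output
for _ in range(3):                       # iterative self-critique steps
    context = attend(state, memory)
    state = 0.5 * state + 0.5 * context  # refine state with the routed context
    memory.append(state.copy())          # the new output becomes attendable
```

Each pass through the loop makes the latest output part of the memory it will attend over next, which is the "attend back to its own previous outputs" behavior described above, without any external prompt driving the iteration.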
This architectural pattern succeeds where pure similarity retrieval fails because reflection requires directed, context-sensitive re-weighting of prior embeddings rather than static nearest-neighbor lookup.