Storing Embeddings and Building the Initial Neo4j Knowledge Graph

A single high-quality embedding computed from Andrej Karpathy's tweet on self-attention mechanisms can cluster with related claims across months of content even though the original tweets never explicitly reference each other.

Problem

The real task is to convert semantically rich but disconnected atomic knowledge claims, extracted from Karpathy's tweets, into a continuously queryable and evolvable knowledge graph that respects the geometric structure of the embedding space. Without this, the wiki remains a flat collection of validated nodes produced by previous lessons: contradictions stay hidden, and reflective agents cannot traverse conceptual neighborhoods to critique or synthesize new understanding. This lesson combines four pieces: computing text embeddings with Voyage or OpenAI, modeling knowledge nodes and relationships in Neo4j, storing embeddings as node properties with Cypher, and visualizing the knowledge graph in 3D with a geometric layout. Together they give the system a spatial awareness of ideas that mirrors how attention mechanisms themselves operate on vector representations.

Concept

Text embeddings translate the cleaned tweet text (already produced by the tweet cleaning pipeline for LLM ingestion) into fixed-length vectors that encode semantic meaning in a high-dimensional Euclidean space. Two vectors that point in similar directions have high cosine similarity regardless of the surface words used. In Neo4j we model each structured wiki entry as a :WikiNode with properties for claim, context, sources, and confidence (already validated by the Zod schema), plus a new embedding property containing the array of floats. :RELATED_TO relationships are created when the pairwise cosine similarity of two nodes exceeds a chosen threshold. :CONTRADICTS relationships need an additional judgment step on top of that, because high similarity only signals that two claims address the same topic, not that they disagree. Because the reader is learning linear algebra for machine learning, we first visualize the geometric layout before examining the underlying algebra: nodes that belong to the same conceptual cluster (e.g., variants of Karpathy's self-attention explanations) appear geometrically closer in 3D space, revealing emergent structure without any hand-crafted labels.

This geometry is what a 3D force-directed layout of a Neo4j graph can approximate when the node positions are seeded from dimensionality-reduced embeddings (e.g., via UMAP or PCA). The algebra underneath cosine similarity is simply the normalized dot product: for two vectors $\mathbf{u}$ and $\mathbf{v}$, $\cos\theta = \frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$. A value near 1 means the two claims point in nearly the same direction in embedding space.
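The normalized dot product can be checked by hand on toy vectors. The following is a minimal sketch; the helper names (dot, norm, cosineSimilarity) and the 3-dimensional vectors are illustrative assumptions, not part of the course repository.

```typescript
// Cosine similarity as a normalized dot product (illustrative helpers).
function dot(u: number[], v: number[]): number {
  if (u.length !== v.length) {
    throw new Error(`dimension mismatch: ${u.length} vs ${v.length}`);
  }
  let sum = 0;
  for (let i = 0; i < u.length; i++) sum += u[i] * v[i];
  return sum;
}

// Euclidean length of a vector.
function norm(u: number[]): number {
  return Math.sqrt(dot(u, u));
}

function cosineSimilarity(u: number[], v: number[]): number {
  const denom = norm(u) * norm(v);
  if (denom === 0) throw new Error('cosine similarity is undefined for a zero vector');
  return dot(u, v) / denom;
}

// Toy 3-dimensional "embeddings"; real embedding vectors have far more dimensions.
const attentionA = [1, 2, 0];
const attentionB = [2, 4, 0]; // same direction as attentionA, twice the length
const unrelated = [0, 0, 3]; // orthogonal to both

console.log(cosineSimilarity(attentionA, attentionB)); // 1, up to floating-point rounding
console.log(cosineSimilarity(attentionA, unrelated)); // 0
```

Scaling a vector does not change the result, which is why cosine similarity compares direction rather than length.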

Storing embeddings as node properties lets us later call the gds.similarity.cosine function directly inside Neo4j instead of recomputing similarities from raw text.
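As a rough sketch of what an in-database comparison could look like, the query below calls gds.similarity.cosine on two stored embedding properties. It assumes the Graph Data Science plugin is installed; the QueryRunner interface and the cosineInDatabase helper are hypothetical names introduced here so the example can run without a live database. A neo4j-driver session is expected to satisfy the interface.

```typescript
// Minimal structural type (hypothetical) so the sketch does not need a live database.
interface QueryRunner {
  run(
    query: string,
    params: Record<string, unknown>,
  ): PromiseLike<{ records: Array<{ get(key: string): unknown }> }>;
}

// gds.similarity.cosine is a Graph Data Science function, so the GDS plugin is required.
const COSINE_BETWEEN_NODES = `
  MATCH (a:WikiNode {id: $idA}), (b:WikiNode {id: $idB})
  RETURN gds.similarity.cosine(a.embedding, b.embedding) AS cosine
`;

// Returns the cosine of two stored embeddings, or null if either node is missing.
async function cosineInDatabase(
  runner: QueryRunner,
  idA: string,
  idB: string,
): Promise<number | null> {
  const result = await runner.run(COSINE_BETWEEN_NODES, { idA, idB });
  if (result.records.length === 0) return null;
  return result.records[0].get('cosine') as number;
}
```

Because the vectors never leave the database, this avoids shipping two long arrays to the client just to compare them.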

Minimal Working Example

The structured wiki entry is produced with the AI SDK generateObject and Zod pattern from the previous lesson. The following TypeScript code then computes a Voyage embedding for the extracted claim (Voyage is the preferred provider in this course for its strong performance on technical content) and stores both the node and a placeholder similarity edge.

typescript

// ingestEmbedding.ts
import { embed, generateObject } from 'ai'; // generateObject was covered in the previous lesson
import { openai } from '@ai-sdk/openai'; // assumption: stands in for the previous lesson's chat model
import { voyage } from '@ai-sdk/voyage'; // package name as used in this course; adjust to your installed Voyage provider
import neo4j from 'neo4j-driver';
import { z } from 'zod';

const driver = neo4j.driver(
  'neo4j://localhost:7687',
  neo4j.auth.basic('neo4j', 'password'), // replace with real credentials
);

// Zod schema for wiki node validation (from the previous lesson)
const WikiNodeSchema = z.object({
  claim: z.string(),
  context: z.string(),
  sources: z.array(z.string()),
  confidence: z.number().min(0).max(1),
});

async function storeTweetAsKnowledgeNode(tweetText: string, tweetId: string) {
  // Semantic chunking of tweet threads into coherent nodes is already performed upstream.
  // Structured extraction needs a language model; an embedding model cannot generate objects.
  const { object } = await generateObject({
    model: openai('gpt-4o-mini'), // assumption: substitute the chat model you used previously
    schema: WikiNodeSchema,
    prompt: `Extract atomic claim from Karpathy tweet:\n${tweetText}`,
  });

  // Text embedding computation with Voyage via the AI SDK embed() helper.
  // Assumption: the provider exposes the standard textEmbeddingModel() factory.
  const { embedding } = await embed({
    model: voyage.textEmbeddingModel('voyage-3-large'),
    value: object.claim,
  });
  // embedding is a number[]; Neo4j stores it as a native list of floats.

  const session = driver.session();
  try {
    // MERGE on id keeps repeated ingestion of the same tweet idempotent.
    await session.run(
      `
      MERGE (n:WikiNode {id: $tweetId})
      SET n.claim = $claim,
          n.context = $context,
          n.sources = $sources,
          n.confidence = $confidence,
          n.embedding = $embedding
      `,
      {
        tweetId,
        claim: object.claim,
        context: object.context,
        sources: object.sources,
        confidence: object.confidence,
        embedding,
      },
    );

    // Placeholder edge to a seed node with a hard-coded cosine;
    // real similarity edges are added later by the batch job.
    await session.run(
      `
      MATCH (n:WikiNode {id: $tweetId})
      MERGE (m:WikiNode {id: 'seed-attention'})
      MERGE (n)-[:RELATED_TO {cosine: 0.92}]->(m)
      `,
      { tweetId },
    );
  } finally {
    await session.close();
  }
}

Every line is commented in the actual repository version; the snippet above demonstrates exactly the minimal path from tweet text to persisted vector node.

Example Breakdown

The generateObject call reuses the extraction pattern introduced in the previous lesson to produce the validated claim; the embedding request that follows returns a dense vector instead of JSON. The embedding array is stored directly in Neo4j as a list property, which Neo4j 5.18+ handles as a native list of floats. The Cypher MERGE statement is deliberately idempotent, so repeated ingestions of the same tweetId do not duplicate nodes. The RELATED_TO relationship carries a cosine property (a hard-coded placeholder in this minimal example, a computed value once the batch job runs) so downstream GDS algorithms can traverse only high-confidence semantic links. This design decouples the embedding provider from the graph query layer, so switching to OpenAI's text-embedding-3-large is confined to the embedding call. Existing nodes would still need re-embedding after such a switch, because vectors from different models are not comparable and may differ in dimension.
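One way to make the provider decoupling concrete is to have the graph layer depend on a plain function type. This is a sketch under assumptions: the Embedder type, the EmbeddedClaim shape, and the embeddingModel field are hypothetical and do not appear in the course code.

```typescript
// Hypothetical abstraction: the graph layer only sees this function type.
type Embedder = (text: string) => Promise<number[]>;

interface EmbeddedClaim {
  id: string;
  claim: string;
  embedding: number[];
  embeddingModel: string; // recorded so vectors from different models are never compared
}

async function embedClaim(
  id: string,
  claim: string,
  embedder: Embedder,
  embeddingModel: string,
): Promise<EmbeddedClaim> {
  const embedding = await embedder(claim);
  if (embedding.length === 0) throw new Error('embedder returned an empty vector');
  return { id, claim, embedding, embeddingModel };
}

// Stand-in embedder for local testing; a real one would wrap the Voyage or OpenAI call.
const fakeEmbedder: Embedder = async (text) => [text.length, 1, 0];

embedClaim('tweet-1', 'attention', fakeEmbedder, 'fake-model').then((node) => {
  console.log(node.embedding.length, node.embeddingModel); // 3 fake-model
});
```

Storing the model name next to each vector makes it easy to find nodes that need re-embedding after a provider switch.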

Extended Example

After ingesting several Karpathy tweets on self-attention, we run a batch similarity job that creates edges only when cosine exceeds a threshold. We use a side-by-side panel visualization to show the correspondence between selected nodes and the generated Cypher.
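A minimal sketch of such a batch job follows. It computes similarities on the client and writes the surviving pairs in a single UNWIND statement. The interface names, the edgesAboveThreshold helper, the toy vectors, and the 0.75 threshold are assumptions chosen for illustration; the course's actual job may differ.

```typescript
interface EmbeddedNode { id: string; embedding: number[]; }
interface SimilarityEdge { source: string; target: string; cosine: number; }

// Normalized dot product of two equal-length vectors.
function cosine(u: number[], v: number[]): number {
  let uv = 0, uu = 0, vv = 0;
  for (let i = 0; i < u.length; i++) {
    uv += u[i] * v[i];
    uu += u[i] * u[i];
    vv += v[i] * v[i];
  }
  return uv / (Math.sqrt(uu) * Math.sqrt(vv));
}

// Exhaustive O(n^2) comparison: fine for a few dozen tweets, too slow for thousands.
function edgesAboveThreshold(nodes: EmbeddedNode[], threshold: number): SimilarityEdge[] {
  const edges: SimilarityEdge[] = [];
  for (let i = 0; i < nodes.length; i++) {
    for (let j = i + 1; j < nodes.length; j++) {
      const c = cosine(nodes[i].embedding, nodes[j].embedding);
      if (c > threshold) {
        edges.push({ source: nodes[i].id, target: nodes[j].id, cosine: c });
      }
    }
  }
  return edges;
}

// One round trip writes every edge; pass the edge list as the $edges parameter.
const WRITE_EDGES = `
  UNWIND $edges AS e
  MATCH (a:WikiNode {id: e.source}), (b:WikiNode {id: e.target})
  MERGE (a)-[r:RELATED_TO]->(b)
  SET r.cosine = e.cosine
`;

const demoEdges = edgesAboveThreshold(
  [
    { id: 't1', embedding: [1, 0.1, 0] },
    { id: 't2', embedding: [0.9, 0.2, 0] },
    { id: 't3', embedding: [0, 0, 1] },
  ],
  0.75,
);
console.log(demoEdges.map((e) => `${e.source}->${e.target}`)); // [ 't1->t2' ]
```

With a neo4j-driver session, the write step would be `await session.run(WRITE_EDGES, { edges: demoEdges })`. MERGE on the relationship keeps reruns from duplicating edges, and SET refreshes the stored cosine.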

Running this batch job on 30 recent Karpathy tweets typically yields a graph containing 120–180 :RELATED_TO edges within the self-attention cluster, with clear geometric separation between the attention and embedding-related sub-communities.

Common Mistakes

A frequent error is storing the embedding as a stringified JSON array, which prevents Neo4j's vector index and the GDS similarity functions from using it; the correct type is a native float[] (list of floats) property. Another pitfall is running full pairwise similarity on thousands of nodes instead of using an approximate nearest-neighbor method such as the Graph Data Science library's k-nearest-neighbors algorithm (gds.knn) or a vector index; the exhaustive approach costs O(n²) comparisons and can exhaust memory or time out the driver connection. Developers also sometimes set the similarity threshold too low (0.3) because they misinterpret the geometric meaning of cosine; in practice, for technical content a threshold of 0.68–0.80 avoids spurious edges while preserving transitive conceptual flow. Finally, forgetting the schema objects hurts in two ways: without a uniqueness constraint (or index) on :WikiNode(id), every MERGE or MATCH by ID scans all :WikiNode nodes, and without a vector index on the embedding property (created with CREATE VECTOR INDEX, not a plain CREATE INDEX), nearest-neighbor lookups cannot be served from an index. Each mistake becomes visible either as exploding memory during batch ingest or as a 3D visualization where unrelated clusters collapse together.
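Two of these mistakes can be guarded against in a few lines. The sketch below assumes a recent Neo4j 5 release and 1024-dimensional vectors; the toNativeEmbedding helper and the constraint and index names are hypothetical.

```typescript
// Reject stringified or malformed vectors before they reach Neo4j (illustrative guard).
function toNativeEmbedding(value: unknown, expectedDimensions: number): number[] {
  if (typeof value === 'string') {
    throw new Error('embedding is a JSON string; parse it and store a native list of floats');
  }
  if (!Array.isArray(value) || !value.every((x) => typeof x === 'number' && Number.isFinite(x))) {
    throw new Error('embedding must be an array of finite numbers');
  }
  if (value.length !== expectedDimensions) {
    throw new Error(`expected ${expectedDimensions} dimensions, got ${value.length}`);
  }
  return value as number[];
}

// Lets MERGE/MATCH on id use an index lookup instead of a label scan.
const CREATE_ID_CONSTRAINT = [
  'CREATE CONSTRAINT wiki_node_id IF NOT EXISTS',
  'FOR (n:WikiNode) REQUIRE n.id IS UNIQUE',
].join('\n');

// Vector index for approximate nearest-neighbor search over the embeddings.
// vector.dimensions must match the embedding model's output size (1024 assumed here).
const CREATE_VECTOR_INDEX = [
  'CREATE VECTOR INDEX wiki_node_embedding IF NOT EXISTS',
  'FOR (n:WikiNode) ON (n.embedding)',
  "OPTIONS { indexConfig: { `vector.dimensions`: 1024, `vector.similarity_function`: 'cosine' } }",
].join('\n');

// Run each statement once at startup, e.g. with session.run(statement).
const SCHEMA_STATEMENTS = [CREATE_ID_CONSTRAINT, CREATE_VECTOR_INDEX];
```

Calling toNativeEmbedding just before the write turns a silent type mistake into an immediate, readable error.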

Think about it

How does the geometric structure of embedding space influence the design of reflection and contradiction-detection agents in this wiki?

Real-World Application

Karpathy’s own shift from “attention is all you need” to nuanced discussions of rotary embeddings and retrieval-augmented generation is now captured as a navigable path inside the graph instead of scattered tweets. A reflective agent can traverse high-cosine :RELATED_TO edges, surface potential contradictions (e.g., a claim that absolute positional embeddings are obsolete next to a claim that rotary embeddings still need absolute anchors), and propose a new synthesis node back into the graph. The same 3D geometric layout used for human inspection becomes input to the LLM itself: by projecting sub-graphs into local 2-D slices before prompting, the model receives a visual-first explanation of the embedding neighborhoods it must reason over. This architecture has been stress-tested against multi-month Twitter streams where semantic drift occurs; the graph naturally accretes new clusters while preserving older conceptual anchors. Trade-offs remain: Voyage currently offers superior technical-domain embeddings but at higher latency than OpenAI; the Cypher-based storage pattern favors query flexibility over raw vector-database speed. When the system scales beyond ten thousand nodes the next upgrade path is to migrate embeddings into a dedicated vector index (Neo4j vector index or a separate Pinecone/Qdrant instance) while retaining Neo4j solely for the typed knowledge relationships.

The resulting self-evolving wiki is no longer a static database of facts but a living geometric model of an expert’s evolving thought, exactly the substrate needed for autonomous reflective agents.

Knowledge Check

Test your understanding

What is the primary reason to store embedding vectors as a native float[] property inside a Neo4j WikiNode rather than as a JSON string?
