Turning Tweets into Wiki Nodes with ai-sdk Structured Output
A production system built on top of Twitter API v2 real-time fetching and the tweet-cleaning pipeline for LLM ingestion once ingested one of Andrej Karpathy's dense embedding threads and hallucinated "cosine similarity equals Euclidean distance" as a standalone claim. It confidently linked that claim into the knowledge graph, and every downstream similarity query returned subtly wrong neighbors for two weeks.
Problem
Converting a high-signal but noisy tweet thread into reliable, evolvable knowledge requires turning unstructured text into atomic claims that can be versioned, contradicted, and reflected upon by agents. Without enforceable structure and runtime validation, even the most sophisticated atomic knowledge claim extraction pipeline produces malformed or fabricated fields that silently corrupt the graph. This lesson shows how to define a precise Zod schema for wiki nodes, combine it with ai-sdk’s generateObject for validated LLM output, and apply semantic chunking of tweet threads into coherent nodes — all in TypeScript.
Concept
The mental model is schema-first parsing: the LLM no longer returns free-form text but is forced to conform to a strongly-typed, self-documenting shape that encodes exactly what the wiki needs to function. A structured wiki entry consists of four mandatory fields: claim (the atomic, falsifiable assertion), context (supporting intuition, geometric where relevant), sources (tweet ids and urls), and confidence (numeric score grounded in the model's own uncertainty estimate). The Zod schema for wiki node validation acts as the contract; ai-sdk's generateObject with Zod guarantees that the parsed object is either valid or throws with a clear error before it ever touches the graph. Semantic chunking of tweet threads into coherent nodes replaces naive per-tweet splitting by grouping sentences that belong to the same conceptual unit, preserving Karpathy's geometric explanations across multiple tweets before extracting a single claim.
This live panel lets you select a schema field and instantly see which part of the tweet text it extracts — making the contract between unstructured input and typed output explicit.
Minimal working example
Every line above is purposeful: Zod’s .describe() calls become part of the system prompt, the schema is passed directly, and the returned object is guaranteed to satisfy the type without manual parsing.
Example breakdown
The schema is deliberately minimal yet strict. claim enforces a minimum length because we already performed atomic knowledge claim extraction in the prior stage; the LLM's job now is to surface it cleanly. context intentionally prefers geometric explanations because Karpathy's threads are rich with vector-space intuitions. confidence is not a score we assign after the fact; it is elicited from the model and later used by reflective agents to decide which nodes to critique first.
Semantic chunking of tweet threads into coherent nodes happens before this step: instead of sending each tweet separately, we group them by semantic similarity using embeddings of consecutive sentences (a lightweight DBSCAN-style pass on sentence vectors). The chunk becomes the cleanedThreadText fed to generateObject. This prevents the model from splitting a single geometric argument across multiple nodes.
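As a concrete sketch of the chunking step, here is a simple consecutive-similarity pass rather than the full DBSCAN-style clustering described above; the 0.75 threshold is illustrative, and the embeddings are assumed to come from an upstream sentence-embedding model:

```typescript
// Group consecutive sentences whose embeddings are similar into one chunk.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function chunkBySimilarity(
  sentences: string[],
  embeddings: number[][],
  threshold = 0.75, // illustrative; tune against your embedding model
): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  for (let i = 0; i < sentences.length; i++) {
    // Start a new chunk when consecutive sentence embeddings fall below
    // the similarity threshold, i.e. the topic shifts.
    if (i > 0 && cosine(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current);
      current = [];
    }
    current.push(sentences[i]);
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Each resulting chunk is then joined into the cleanedThreadText string that gets fed to generateObject.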
Extended example
The extended version reveals a key tradeoff. Sending many small tweets triggers more hallucinations; feeding too large a context loses granularity. Semantic chunking sits at the sweet spot. Note that we keep ai-sdk generateObject for structured tweet parsing (from the previous lesson) only for the initial tweet cleaning stage and switch to the stricter WikiNodeSchema here.
This stepper illustrates how semantic chunking of tweet threads into coherent nodes transitions into validated output, showing state at each processing stage.
Common mistakes
- Treating the LLM as a parser instead of an extractor: Calling generateObject without a rich enough context or without the .describe() annotations on the schema leads to generic claims. Fix by making the prompt geometrically explicit (“prefer explanations that can be visualized in vector space”).
- Over-chunking: Splitting mid-geometric argument creates orphaned claims that later agents cannot reconcile. Measure chunk coherence with average intra-cluster cosine before ingestion.
- Ignoring confidence calibration: Using the confidence field only for display rather than agent prioritization. The reflective loop breaks when low-confidence nodes are treated the same as high-confidence ones.
- Schema drift: Updating the Zod schema without bumping the version stored in Neo4j leads to runtime validation errors on older nodes. Version the schema name and keep a migration path.
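The intra-cluster coherence check from the over-chunking bullet can be sketched as follows; the 0.6 floor is an assumed, untuned threshold:

```typescript
// Average pairwise cosine similarity between the sentence embeddings inside
// one chunk. Low coherence flags chunks that were split mid-argument or that
// mix unrelated topics, and should block ingestion.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function chunkCoherence(embeddings: number[][]): number {
  if (embeddings.length < 2) return 1; // a single sentence is trivially coherent
  let sum = 0;
  let pairs = 0;
  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      sum += cosine(embeddings[i], embeddings[j]);
      pairs++;
    }
  }
  return sum / pairs;
}

const COHERENCE_FLOOR = 0.6; // illustrative; tune against your embedding model

function isIngestable(embeddings: number[][]): boolean {
  return chunkCoherence(embeddings) >= COHERENCE_FLOOR;
}
```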
This approach breaks down when the tweet thread mixes multiple unrelated topics (e.g., Karpathy suddenly pivots to a new research paper). In those cases the semantic chunker produces larger, noisier chunks and generateObject's validation rate drops. The next lesson on storing embeddings and building the initial Neo4j knowledge graph will show how we detect these problematic nodes geometrically and trigger a critique agent to split them.