One Karpathy tweet can be decomposed into five atomic claims that contradict each other when fed to an LLM without cleaning, yet the same tweet yields perfectly consistent knowledge when run through a structured extraction pipeline.
Problem
The real task is to ingest Andrej Karpathy’s latest tweets about the geometry of attention in transformers directly into a self-evolving knowledge graph that an LLM wiki can reflect upon and critique. Without a robust ingestion layer, raw Twitter data arrives polluted with emojis, URLs, thread markers, @mentions, and marketing fluff, which leads to fragmented or hallucinated atomic knowledge claims that poison downstream nodes. The system must fetch tweets in real time via the Twitter API v2, parse them into typed structured objects using ai-sdk’s generateObject, and run them through a deterministic cleaning pipeline before they ever touch the knowledge graph.
Concept
Twitter API v2 real-time fetching provides a rule-filtered, low-latency stream of tweets matching specific keywords or user IDs. Atomic knowledge claim extraction is the principle of breaking a single sentence into the smallest self-contained factual units, each asserting exactly one new piece of knowledge (subject-predicate-object style) while preserving provenance. ai-sdk’s generateObject uses the LLM’s structured-output capabilities (a JSON schema plus constrained decoding) to guarantee the model returns TypeScript-typed objects instead of free-form text. The tweet cleaning pipeline for LLM ingestion removes everything that cannot survive deterministic downstream processing (URLs, handles, emojis, markdown, timestamps) while preserving the geometric and mathematical terminology that is crucial for later embedding-similarity calculations.
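To make the atomic-claim principle concrete, here is one hypothetical decomposition; the sentence, the tweet ID, and the claim wording are all invented for illustration, not taken from a real tweet.

```typescript
// One invented tweet sentence broken into atomic claims: each entry asserts
// exactly one fact, and provenance survives via sourceTweetId.
type AtomicClaim = {
  subject: string;
  predicate: string;
  object: string;
  sourceTweetId: string; // provenance pointer back to the source tweet
};

const sentence =
  "Attention heads in late layers concentrate on low-dimensional subspaces, " +
  "which makes their geometry cheap to probe.";

const claims: AtomicClaim[] = [
  {
    subject: "attention heads in late layers",
    predicate: "concentrate on",
    object: "low-dimensional subspaces",
    sourceTweetId: "t-001", // hypothetical ID
  },
  {
    subject: "the geometry of late-layer attention heads",
    predicate: "is cheap to",
    object: "probe",
    sourceTweetId: "t-001",
  },
];

console.log(`${claims.length} atomic claims from 1 sentence (${sentence.length} chars)`);
```

Note that the single sentence yields two claims: a well-formed decomposition never bundles the causal consequence ("cheap to probe") into the same unit as the observation that implies it.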
Minimal working example
Example breakdown
The code is deliberately split into three pure functions: streamKarpathyTweets (the ingestion source), cleanTweet (the deterministic filter), and extractAtomicClaims (the LLM-powered structured extractor). Each line of cleanTweet targets a specific failure mode that would otherwise break downstream tokenization or geometric embedding computations. The ClaimSchema enforces that every returned claim meets a minimum confidence threshold and belongs to a known topic category—preventing the LLM from inventing irrelevant facts. generateObject is used instead of generateText because it guarantees the JSON matches the Zod schema through constrained sampling, eliminating the common retry loop required with plain JSON mode. The generator streamKarpathyTweets yields one processed batch at a time, allowing the caller to pause or apply back-pressure when the knowledge graph write becomes the bottleneck.
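The back-pressure property is easy to verify in isolation. The generator and writer below are hypothetical stand-ins for streamKarpathyTweets and the knowledge-graph write; a minimal sketch, assuming an ESM context with top-level await.

```typescript
// Demonstrates that `for await` over an async generator gives the consumer
// back-pressure: the producer cannot advance until the slow write resolves.
const order: string[] = [];

async function* mockTweetBatches() {
  for (let i = 0; i < 3; i++) {
    order.push(`yield:${i}`); // producer side: runs only when next() is called
    yield [`tweet-${i}`];
  }
}

async function slowGraphWrite(batch: string[], i: number) {
  await new Promise((r) => setTimeout(r, 5)); // simulated slow graph write
  order.push(`write:${i}`); // consumer side
}

// The generator stays paused at `yield` until the await below resolves, so
// yields and writes strictly alternate: yield:0, write:0, yield:1, ...
let i = 0;
for await (const batch of mockTweetBatches()) {
  await slowGraphWrite(batch, i++);
}
```

This alternation is exactly the pause point the breakdown describes: when the knowledge-graph write becomes the bottleneck, the fetch side simply stops being polled.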