One Karpathy tweet can be decomposed into five atomic claims that contradict each other when fed to an LLM without cleaning, yet the same tweet yields perfectly consistent knowledge when run through a structured extraction pipeline.
Problem
The real task is to ingest Andrej Karpathy’s latest tweets about the geometry of attention in transformers directly into a self-evolving knowledge graph that an LLM wiki can reflect upon and critique. Without a robust ingestion layer, raw Twitter data arrives polluted with emojis, URLs, thread markers, @mentions, and marketing fluff, which leads to fragmented or hallucinated atomic knowledge claims that poison downstream nodes. The system must fetch tweets in real time via the Twitter API v2, parse them into typed structured objects using ai-sdk’s generateObject, and run them through a deterministic cleaning pipeline before they ever touch the knowledge graph.
Concept
Twitter API v2 real-time fetching provides a rule-filtered, low-latency stream of tweets matching specific keywords or user IDs. Atomic knowledge claim extraction is the principle of breaking a single sentence into the smallest self-contained factual units, each asserting exactly one new piece of knowledge (subject-predicate-object style) while preserving provenance. ai-sdk’s generateObject uses the LLM’s structured-output capabilities (a JSON schema plus constrained decoding) to guarantee the model returns TypeScript-typed objects instead of free-form text. The tweet cleaning pipeline for LLM ingestion removes everything that cannot survive deterministic downstream processing (URLs, handles, emojis, markdown, timestamps) while preserving the geometric and mathematical terminology that is crucial for later embedding-similarity calculations.
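To make the atomic-claim principle concrete, here is one hypothetical decomposition; the sentence, the tweet ID, and the claim wording are all invented for illustration, not taken from a real tweet.

```typescript
// One invented tweet sentence broken into atomic claims: each entry asserts
// exactly one fact, and provenance survives via sourceTweetId.
type AtomicClaim = {
  subject: string;
  predicate: string;
  object: string;
  sourceTweetId: string; // provenance pointer back to the source tweet
};

const sentence =
  "Attention heads in late layers concentrate on low-dimensional subspaces, " +
  "which makes their geometry cheap to probe.";

const claims: AtomicClaim[] = [
  {
    subject: "attention heads in late layers",
    predicate: "concentrate on",
    object: "low-dimensional subspaces",
    sourceTweetId: "t-001", // hypothetical ID
  },
  {
    subject: "the geometry of late-layer attention heads",
    predicate: "is cheap to",
    object: "probe",
    sourceTweetId: "t-001",
  },
];

console.log(`${claims.length} atomic claims from 1 sentence (${sentence.length} chars)`);
```

Note that the single sentence yields two claims: a well-formed decomposition never bundles the causal consequence ("cheap to probe") into the same unit as the observation that implies it.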
Minimal working example
Example breakdown
The code is deliberately split into three pure functions: streamKarpathyTweets (the ingestion source), cleanTweet (the deterministic filter), and extractAtomicClaims (the LLM-powered structured extractor). Each line of cleanTweet targets a specific failure mode that would otherwise break downstream tokenization or geometric embedding computations. The ClaimSchema enforces that every returned claim meets a minimum confidence threshold and belongs to a known topic category—preventing the LLM from inventing irrelevant facts. generateObject is used instead of generateText because it guarantees the JSON matches the Zod schema through constrained sampling, eliminating the common retry loop required with plain JSON mode. The generator streamKarpathyTweets yields one processed batch at a time, allowing the caller to pause or apply back-pressure when the knowledge graph write becomes the bottleneck.
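The back-pressure property is easy to verify in isolation. The generator and writer below are hypothetical stand-ins for streamKarpathyTweets and the knowledge-graph write; a minimal sketch, assuming an ESM context with top-level await.

```typescript
// Demonstrates that `for await` over an async generator gives the consumer
// back-pressure: the producer cannot advance until the slow write resolves.
const order: string[] = [];

async function* mockTweetBatches() {
  for (let i = 0; i < 3; i++) {
    order.push(`yield:${i}`); // producer side: runs only when next() is called
    yield [`tweet-${i}`];
  }
}

async function slowGraphWrite(batch: string[], i: number) {
  await new Promise((r) => setTimeout(r, 5)); // simulated slow graph write
  order.push(`write:${i}`); // consumer side
}

// The generator stays paused at `yield` until the await below resolves, so
// yields and writes strictly alternate: yield:0, write:0, yield:1, ...
let i = 0;
for await (const batch of mockTweetBatches()) {
  await slowGraphWrite(batch, i++);
}
```

This alternation is exactly the pause point the breakdown describes: when the knowledge-graph write becomes the bottleneck, the fetch side simply stops being polled.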