Happy to answer questions about the pgvector setup, Cloudflare Workers constraints, or the clustering algorithm tuning.
News articles about the same event tend to share named entities (people, places, organizations), numbers, and factual structure even across languages. "EU approves AI regulation" is a factual statement that embeds similarly regardless of language. This is very different from, say, opinion pieces or cultural commentary where idioms and local framing would diverge more.
That said, similarity alone isn't enough. The real reliability comes from non-semantic constraints layered on top:
- Time gap ≤ 18 hours between article and story — prevents "same topic, different month" false merges
- Story age ≤ 36 hours — old stories stop absorbing new articles
- Two-pass design — matching against refined story embeddings (average of recent articles) is more stable than raw article-to-article comparison
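
A minimal sketch of how the constraints above might combine into a single merge check, not the actual implementation. Several things here are assumptions: the similarity threshold value (the exact number isn't stated above), the size of the "recent articles" window, the names `Article`, `Story`, and `can_merge`, and the reading of "time gap" as distance from the story's newest article and "story age" as distance from its oldest.

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta

SIM_THRESHOLD = 0.80                  # hypothetical value, not given in the post
MAX_TIME_GAP = timedelta(hours=18)    # article vs. story's most recent article
MAX_STORY_AGE = timedelta(hours=36)   # article vs. story's first article
RECENT_WINDOW = 5                     # hypothetical: articles averaged per story


@dataclass
class Article:
    embedding: list[float]
    published_at: datetime


@dataclass
class Story:
    articles: list[Article] = field(default_factory=list)

    @property
    def first_seen(self) -> datetime:
        return min(a.published_at for a in self.articles)

    @property
    def last_seen(self) -> datetime:
        return max(a.published_at for a in self.articles)

    def embedding(self) -> list[float]:
        # "Refined" story embedding: mean of the most recent article embeddings.
        recent = sorted(self.articles, key=lambda a: a.published_at)[-RECENT_WINDOW:]
        dims = len(recent[0].embedding)
        return [sum(a.embedding[i] for a in recent) / len(recent) for i in range(dims)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def can_merge(article: Article, story: Story) -> bool:
    # Cheap non-semantic gates first, similarity last.
    if abs(article.published_at - story.last_seen) > MAX_TIME_GAP:
        return False
    if article.published_at - story.first_seen > MAX_STORY_AGE:
        return False
    return cosine(article.embedding, story.embedding()) >= SIM_THRESHOLD
```

Checking the time gates before the similarity means a near-identical article from a day and a half later is rejected no matter how well it embeds, which is the "same topic, different month" case in miniature.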
Where it does break: regional stories with heavy local context. A Japanese domestic politics article and an English wire service summary of the same event sometimes land just below the threshold because the framing is so different. I accept some missed merges there rather than lowering the threshold and getting false positives.
No per-language thresholds so far; the embedding model (Qwen3) seems to normalize well across the languages I cover. But I wouldn't be surprised if that changes when I add languages that are less well represented in the training data.