2 points by ethan_zhao 6 hours ago | 2 comments
  • ethan_zhao 6 hours ago
    Author here. I built this for 3mins.news, an AI news aggregator covering 180+ sources in 17 languages. The trickiest part was figuring out that articles in different languages about the same event share zero tokens — MinHash/LSH gives you Jaccard similarity of 0.
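    To make the zero-token problem concrete, here is a minimal sketch (the token sets are hypothetical): MinHash/LSH estimates Jaccard similarity over surface tokens, and two articles about the same event in different languages can share none.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Same event ("EU approves AI regulation"), different languages:
en = {"eu", "approves", "ai", "regulation"}
ja = {"欧州連合", "人工知能", "規制", "承認"}

print(jaccard(en, ja))  # → 0.0 — no shared tokens, so MinHash/LSH sees nothing
```

    Since MinHash only approximates this set overlap, any LSH bucketing scheme puts these two articles in unrelated buckets, which is why a semantic (embedding-based) comparison is needed instead.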

    Happy to answer questions about the pgvector setup, Cloudflare Workers constraints, or the clustering algorithm tuning.

  • yugoru 6 hours ago
    It's harder than it first appears. Even with good embeddings, semantic similarity across languages often breaks when articles include local context or idioms. Curious whether you found a threshold strategy that works reliably across languages, or if it still needs manual tuning.
    • ethan_zhao 5 hours ago
      Good question. The short answer: a single global threshold (cosine similarity ≥ 0.7) works surprisingly well for news, but it's not because embeddings handle idioms perfectly — it's because news articles are structurally constrained.

      News articles about the same event tend to share named entities (people, places, organizations), numbers, and factual structure even across languages. "EU approves AI regulation" is a factual statement that embeds similarly regardless of language. This is very different from, say, opinion pieces or cultural commentary where idioms and local framing would diverge more.
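      The threshold check itself is simple. A sketch with a plain cosine implementation and hypothetical low-dimensional vectors standing in for the real embedding output (actual embedding vectors are much higher-dimensional, but the comparison is identical):

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

SIM_THRESHOLD = 0.7  # the single global threshold discussed above

# Hypothetical embeddings of an incoming article and an existing story:
article = [0.1, 0.8, 0.3, 0.4]
story = [0.2, 0.7, 0.4, 0.4]

print(cosine(article, story) >= SIM_THRESHOLD)  # → True
```

      In practice pgvector does this comparison server-side (its `<=>` operator returns cosine distance, i.e. 1 − similarity), so the application only sets the threshold.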

      That said, similarity alone isn't enough. The real reliability comes from non-semantic constraints layered on top:

      - Time gap ≤ 18 hours between article and story — prevents "same topic, different month" false merges

      - Story age ≤ 36 hours — old stories stop absorbing new articles

      - Two-pass design — matching against refined story embeddings (average of recent articles) is more stable than raw article-to-article comparison

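      The three constraints above can be sketched as a single merge predicate. This is my own reconstruction, not the production code: the function and field names are hypothetical, and I am assuming "time gap between article and story" means the gap to the story's most recent article.

```python
import math
from dataclasses import dataclass, field

MAX_ARTICLE_GAP_H = 18.0  # article-to-story time gap, in hours
MAX_STORY_AGE_H = 36.0    # story stops absorbing new articles after this
SIM_THRESHOLD = 0.7       # global cosine-similarity threshold

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

@dataclass
class Story:
    first_seen_h: float                 # hours since an arbitrary epoch
    last_article_h: float               # timestamp of most recent article
    article_embeddings: list[list[float]] = field(default_factory=list)

    @property
    def embedding(self) -> list[float]:
        # Refined story embedding: average of member-article embeddings,
        # which is more stable than raw article-to-article comparison.
        n = len(self.article_embeddings)
        return [sum(dim) / n for dim in zip(*self.article_embeddings)]

def can_merge(story: Story, article_emb: list[float],
              article_time_h: float, now_h: float) -> bool:
    if now_h - story.first_seen_h > MAX_STORY_AGE_H:
        return False  # old stories stop absorbing new articles
    if abs(article_time_h - story.last_article_h) > MAX_ARTICLE_GAP_H:
        return False  # "same topic, different month" guard
    return cosine(story.embedding, article_emb) >= SIM_THRESHOLD
```

      The key property is that the temporal gates run before the similarity check, so a semantically identical article from weeks later can never merge no matter how high its score.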
      Where it does break: regional stories with heavy local context. A Japanese domestic politics article and an English wire service summary of the same event sometimes land just below threshold because the framing is so different. I accept some missed merges there rather than lowering the threshold and getting false positives.

      No per-language thresholds so far — the embedding model (Qwen3) seems to normalize well across the languages I cover. But I wouldn't be surprised if that changes when adding languages with less training data representation.