1 point by uulong 3 hours ago | 1 comment
  • uulong 2 hours ago
    The Problem: Everyone is using HNSW (graph indexes) for vector search. It works great for servers, but it introduces build-time latency, memory overhead (edges), and random access patterns that kill performance on consumer hardware.

    The Project: QingMing is a header-only C++ engine that implements exact brute-force search. Instead of pruning the search space, I optimized the memory access pattern to saturate the HBM/GDDR6 bandwidth of consumer GPUs.
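    To make the approach concrete, here is a minimal sketch of exact brute-force k-NN in plain C++ (illustrative only, not QingMing's actual API; `brute_force_knn` is a made-up name). Every database vector is scanned sequentially in row-major order, so the hot loop is a pure streaming read, which is the access pattern the engine is built around:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Exact k-NN by squared L2 distance. db holds n vectors of length dim,
// row-major and contiguous, so the inner loop streams memory sequentially.
std::vector<size_t> brute_force_knn(const std::vector<float>& db,     // n * dim
                                    const std::vector<float>& query,  // dim
                                    size_t dim, size_t k) {
    const size_t n = db.size() / dim;
    // Max-heap of (distance, index): the root is the worst of the current top-k.
    std::priority_queue<std::pair<float, size_t>> heap;
    for (size_t i = 0; i < n; ++i) {
        const float* v = &db[i * dim];
        float dist = 0.0f;
        for (size_t d = 0; d < dim; ++d) {  // contiguous, auto-vectorizable
            float diff = v[d] - query[d];
            dist += diff * diff;
        }
        if (heap.size() < k) {
            heap.emplace(dist, i);
        } else if (dist < heap.top().first) {
            heap.pop();
            heap.emplace(dist, i);
        }
    }
    std::vector<size_t> out;
    while (!heap.empty()) { out.push_back(heap.top().second); heap.pop(); }
    return {out.rbegin(), out.rend()};  // nearest first
}
```

    Because there is no index, there is nothing to build and nothing to tune; the only cost is one linear pass per query, which is exactly what the bandwidth optimization targets.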

    Benchmarks (Consumer Hardware):

      Desktop (NVIDIA RTX 5090D - 24GB)
      ---------------------------------
      Dataset:     SIFT-1M (128-dim)
      Recall:      99.2% @ 1 (FP32 rounding variance), 100% @ 10
      Throughput:  9,354 QPS (Batch=10k)
      Latency:     ~5.5ms (P99)
      Build Time:  0 seconds
    
      Desktop (AMD Radeon 7900 XTX - 24GB)
      ------------------------------------
      Dataset:     SIFT-1M (128-dim)
      Recall:      99.2% @ 1, 100% @ 10
      Throughput:  6,275 QPS (Batch=10k)
      Latency:     ~11.2ms (P99)
      Note:        Running via HIP/ROCm 6.2 on Ubuntu
    
      Mobile (Snapdragon 8 Gen 5)
      ---------------------------
      Scenario:    100k Vectors (128d) for personal knowledge base
      Latency:     ~8ms per query
      Endurance:   Ran 10k consecutive queries with ZERO thermal throttling
                   (due to L3/System Cache residency optimization)
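    The batched throughput and cache-residency numbers above come down to one scheduling idea: tile the database and reuse each tile against every query in the batch while it is still resident. A hedged CPU-side sketch (assumed structure, not the engine's real code; `batch_l2` and the block size are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns an nq x n matrix of squared L2 distances, row-major.
// The database is walked in tiles of `block` vectors; each tile is read
// from memory once and reused for all nq queries while cache-resident.
std::vector<float> batch_l2(const std::vector<float>& db, size_t n,
                            const std::vector<float>& queries, size_t nq,
                            size_t dim, size_t block = 256) {
    std::vector<float> out(nq * n, 0.0f);
    for (size_t b0 = 0; b0 < n; b0 += block) {   // tile over the database
        const size_t b1 = std::min(n, b0 + block);
        for (size_t q = 0; q < nq; ++q) {        // reuse the hot tile per query
            const float* qv = &queries[q * dim];
            for (size_t i = b0; i < b1; ++i) {
                const float* v = &db[i * dim];
                float dist = 0.0f;
                for (size_t d = 0; d < dim; ++d) {
                    float diff = v[d] - qv[d];
                    dist += diff * diff;
                }
                out[q * n + i] = dist;
            }
        }
    }
    return out;
}
```

    Without tiling, a batch of nq queries streams the full database nq times; with it, each tile is fetched once per batch, which is how brute force can approach the bandwidth ceiling instead of thrashing.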
    
    Why use this?

      1. Local RAG: Run high-quality retrieval on your gaming PC or phone.
      2. Simplicity: No hyperparameters to tune (ef_search, M, nprobe).
      3. Deterministic: No approximation errors for critical data.

    Happy to answer questions about the NEON/CUDA/HIP memory coalescing details!
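
    As a teaser for the coalescing discussion: one common trick (assumed here for illustration; I'm not claiming this is QingMing's exact layout) is to store the database "dimension-major" (transposed), so adjacent SIMD lanes or GPU threads read adjacent addresses. In scalar C++ the same idea looks like this, with the inner loop running across vectors instead of within one:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// db_t is dim x n, transposed: component d of vector i lives at db_t[d*n + i].
// For each dimension, the inner loop touches consecutive addresses across
// vectors -- the CPU analogue of coalesced access, where on a GPU each
// thread would own one vector i.
std::vector<float> scan_transposed(const std::vector<float>& db_t, size_t n,
                                   const std::vector<float>& query, size_t dim) {
    std::vector<float> dist(n, 0.0f);
    for (size_t d = 0; d < dim; ++d) {
        const float qd = query[d];
        const float* row = &db_t[d * n];
        for (size_t i = 0; i < n; ++i) {  // contiguous across vectors
            float diff = row[i] - qd;
            dist[i] += diff * diff;
        }
    }
    return dist;  // squared L2 distance to every database vector
}
```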