6 points by kaliades 15 days ago | 1 comment
  • incidentiq 14 days ago
    The historical slowlog persistence is the killer feature here. Lost count of how many times I've had a Redis performance issue, went to check slowlog, and found it already rotated because the buffer filled during the incident. By the time you're investigating, the evidence is gone.

    The pattern analysis ("HGETALL user:* is 80% of your slow queries") is exactly what teams end up doing by hand during postmortems - automating that correlation saves real debugging time.

    Two questions:

    1. How does the Prometheus integration handle high-cardinality key patterns? One of the pain points with Redis metrics is that per-key metrics can explode label cardinality. Are you sampling or aggregating at the pattern level?

    2. For the anomaly detection - what's the baseline learning window? Redis workloads can be very bursty (batch jobs, cache warming after deploy), so false positives on "anomaly" can be noisy if the baseline doesn't account for periodic patterns.

    Good timing on the Valkey support - with the Redis license change, a lot of teams are evaluating migration and will need tooling that supports both.

    • kaliades 14 days ago
      Thanks! Those are exactly the right questions.

      1. Cardinality: We don't do per-key metrics — that's a guaranteed way to blow up Prometheus. All pattern metrics are aggregated at the command pattern level (e.g., HGETALL user:* not HGETALL user:12345). The pattern extraction normalizes keys so you see the shape of your queries, not the individual keys (rough sketch of the normalization below, after answer 2). For cluster slot metrics, we automatically cap at top 100 slots by key count — otherwise you'd get 16,384 slots × 4 metrics = 65k series just from slot stats. The metrics that can grow are client connections by name/user, but those scale with unique client names, not keys. If it becomes an issue, standard Prometheus relabel_configs can aggregate or drop those labels.

      2. Baseline window: We use a rolling circular buffer of 300 samples (5 minutes at 1-second polling). Minimum 30 samples to warm up before detection kicks in. To reduce noise from bursty workloads, we require 3 consecutive samples above threshold before firing, plus a 60-second cooldown between alerts for the same metric. This helps with the "batch job at 2am" scenario — a single spike won't trigger, but sustained deviation will (rough sketch of this gating below). That said, you're right that periodic patterns (daily batch jobs, cache warming after deploy) aren't explicitly modeled yet. It's on the roadmap — likely as configurable "expected variance windows" or integration with deployment events. Would love to hear what approach would work best for your use case.
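
      To make the pattern extraction in (1) concrete, here's a rough Python sketch (not the actual code - the real rules handle more key shapes than plain numeric IDs, e.g. UUIDs and hashes):

        import re

        # Collapse per-key identifiers so metrics are labelled by query shape,
        # not by individual keys. Only numeric segments are handled here; the
        # real extraction covers more ID shapes.
        _ID_SEGMENT = re.compile(r"^\d+$")

        def normalize_key(key):
            return ":".join("*" if _ID_SEGMENT.match(part) else part
                            for part in key.split(":"))

        def command_pattern(command, key):
            return f"{command} {normalize_key(key)}"

        # HGETALL user:12345 and HGETALL user:67890 collapse into one series:
        assert command_pattern("HGETALL", "user:12345") == "HGETALL user:*"
        assert command_pattern("HGETALL", "user:67890") == "HGETALL user:*"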
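
      And a rough sketch of the gating in (2). The threshold rule below (mean + 3*stdev over the window) is simplified for the example; the buffer size, warm-up, streak, and cooldown numbers are the ones described above:

        import statistics
        import time
        from collections import deque

        class AnomalyGate:
            def __init__(self, window=300, warmup=30, consecutive=3, cooldown=60.0):
                self.samples = deque(maxlen=window)  # rolling circular buffer
                self.warmup = warmup                 # min samples before detection
                self.consecutive = consecutive       # samples above threshold to fire
                self.cooldown = cooldown             # seconds between alerts per metric
                self.streak = 0
                self.last_fired = float("-inf")

            def observe(self, value, now=None):
                now = time.monotonic() if now is None else now
                fired = False
                if len(self.samples) >= self.warmup:
                    mean = statistics.fmean(self.samples)
                    stdev = statistics.pstdev(self.samples)
                    # Simplified threshold for the sketch; a single spike only
                    # bumps the streak, sustained deviation is what fires.
                    if value > mean + 3 * stdev:
                        self.streak += 1
                    else:
                        self.streak = 0
                    if (self.streak >= self.consecutive
                            and now - self.last_fired >= self.cooldown):
                        self.last_fired = now
                        fired = True
                # Append after checking so the spike itself doesn't inflate
                # the baseline it is being compared against.
                self.samples.append(value)
                return fired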

      I think the licensing dust has mostly settled (the change was back in 2024) and most teams have already picked a direction, but monitoring and observability keep coming up as the missing piece, over and over.