2 pointsby synthesis5x8 hours ago1 comment
  • synthesis5x11 hours ago
    Meta has a problem in its clusters in Boca Raton, Miami; this is affecting the MNA content delivery network and direct content consumption. This has a regional impact in Latin America, since so far most non-cacheable content is consumed from the clusters in Florida.

    The impact is traceable via ICMP, but also reproducible via TCP and difficult to measure via UDP. This is why monitoring tools are misleading: there is no “slowness” resulting from interface saturation; instead, there is data corruption where packets are discarded at the interface level. Therefore, if network performance is measured using those same data points, it won’t work and you won’t see any alerts.

    The issue can also be replicated from the looking glass. In fact, I will attach images below, although you can also see them on the website attached to the post, as well as a more specific report

    There is packet loss and probably flapping on a BGP instance, OSPF, or some IGP within Meta’s network. I believe it is between 129.134.101.34, 129.134.104.84, and 129.134.101.51. It is possible that it’s a faulty interface in a bundle or some hardware issue that a “show interface status” doesn’t reveal, which is why I’ve failed to report this problem through your NOC.

    How can Meta replicate the failure?

    1: Look for random MNA cluster IPs from your clients. 2: Ping from 157.240.14.15 with a payload larger than 500 bytes (a packet is more likely to get corrupted on a faulty interface if the payload increases). 3: Ping many servers from point 1.

    You will see that once you find the affected upstream or downstream route combination, you will have 10-60% packet loss to the destination host.

    How to fix it? Isolate the port or discard faulty hardware.

    Why didn’t we see it before?

    Simply put, your monitoring tools and troubleshooting protocols don’t work for these problems. The protocol is to attach a HAR file that bases its performance on window scaling and TCP RTT; if both are good, even with data loss, there’s “no problem.” Especially because that HAR file is extracted using QUIC, and QUIC is particularly good at mitigating slowness caused by data loss (since packets are retransmitted without the TCP penalty). You know what uses TCP? WhatsApp Statuses, and those are slow.

    Can an MTR show where the problem is?

    Generally not, this is because:

    In any network route, there is a certain number of hops; for example, suppose there are 5 hops between host A and host B. To perform a traceroute, packets are sent with increasing TTL values (1, 2, 3, etc.). Each time a packet expires before reaching its destination, the transit hop reports a “TTL Time Exceeded” message, which is how the route is mapped. The problem is that these are basically point-to-point probes; it’s like pinging each hop individually. And when there’s a problem on an affected interface in an ECMP or bundle, those P2P connections won’t necessarily take the affected path. So they are unreliable; generally, you will see that the losses are produced by the final host even though the fault is in the middle. check metafixthis.com