People do a speedtest, see low (sub-100 Mbps) numbers, and think that's why their video call is failing. Never mind the fact that Zoom only needs 3 Mbps for 1080p video.
They’d only developed it with sub-millisecond latency to the server, so they never noticed this.
I don’t think it was a coincidence that the team was US-based: in Australia, we’re used to internet stuff having hundreds of milliseconds of latency, since so much of the internet is US-hosted, so I think Australians would be more likely to notice such issues early on. All those studies about people abandoning pages if they take more than two seconds to load… back then, it was a rare page that had even started rendering that soon, because of request waterfalls and high latency. (These days, it’s somewhat more common to have CDNs fix the worst of the problem.)
Having gotten my hands on an experimental 128 kbps link early on, then later moving to the countryside with a really spotty 56 kbps-1 Mbps connection, made me really appreciate local state: every time something blocked on the internet, it was immediately noticeable.
I'm glad there's a push for synchronized, local-first state now, since roaming around on mobile, or with a laptop hopping between wifi networks, can only perform nicely with local state.
This reminds me a ton of the studies showing harm to children's future mental health as a result of growing up poor. I definitely "hoard" data locally because I grew up accessing BBSs at 1200 baud, moving up thru dial-up Internet, and eventually shitty DSL. Services and products that rely on constant access to the Internet strike me as dodgy and likely to fail as a result of my "upbringing".
It's not at all impossible to design fast and responsive sites and single-page applications under these constraints; you just have to be aware of them, and actively target them during the full course of development.
Part of the reason modern software is so crappy is that developers often have the most powerful machines (MacBook Pro class) and don't even realize how resource-hungry and crappy their software is on lower-end devices.
This is by far the worst offender I've seen.
Madness.
I’m usually on an old copper line (16 ms ping to Amsterdam) in the Netherlands (130 ms to San Francisco).
Some sites are just consistently slow. Especially GitHub or cloud dashboards. My theory is that the round trip to the database slows things down.
Jira is so agonisingly slow it’s a wonder anyone actually pays for it. Are the devs who work on it held against their will or something? It’s ludicrous.
GitHub gets worse every day, with the worst sin being their cutesy little homegrown loading bar, which is literally never faster than simply reloading the page.
To ward off potential criticism: I know you can mix client-side updates with server-side updates in LiveView and co. I’ve tried. Maintaining client-side changes on a managed DOM, sort of like maintaining a long-lived topic branch that diverges from master, sucks.
Yes, Bloomberg had fun with latency because of their datacenter locations (about a decade ago they still only had two and a half, close to New York). Pages that would paint acceptably in London would be unacceptable in Tokyo, since poorly designed pages required several round trips to render. Once a page rendered, there was still the matter of updating the prices, which was handled by separately streaming data from servers close to the markets to the terminals. A very different architecture, but rather difficult to test because of the significant terminal-side functionality.
Chrome's developer tools let you disable caching and simulate a specific network type. Most people know those settings restrict the throughput, but they also increase the request latency. They do this on a request-by-request basis, not at the packet level, so it's only an approximation. Still a good test.
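If you want to go further, Linux's tc netem qdisc does the same thing at the packet level, below the application layer. A minimal sketch, assuming your test traffic leaves via eth0:

    # Add 150 ms of delay and a 5 Mbit/s cap to everything leaving eth0
    sudo tc qdisc add dev eth0 root netem delay 150ms rate 5mbit
    # ... run the app under test ...
    sudo tc qdisc del dev eth0 root   # restore normal behaviour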
Could they run their client from your country and operate the UI remotely?
There are more options than moving countries!
You can even ask one of these guys to do the setup for you. They'll do it in a pinch, with a happy face.
I know because I did.
That reminds me of the atrocious performance of Apple's Time Machine with small files. Running backups to an SSD is fast, but wired Ethernet is noticeably worse, and even WiFi 6 is utterly disgraceful.
To my knowledge you can't even go and say "do not include any folder named vendor (PHP) or node_modules (JS)", because (at least on my machine) these stacks tend to be the worst offenders in creating hundreds of thousands of small files.
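As a partial workaround, tmutil can exclude individual paths; a hedged sketch that sweeps a code directory (~/code is an assumption) and marks every such folder:

    # Sticky (xattr-based) exclusions; rerun after starting new projects,
    # since freshly created vendor/node_modules dirs won't be covered yet
    find ~/code -type d \( -name node_modules -o -name vendor \) -prune \
      -exec tmutil addexclusion {} +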
I'm still on the same LTE connection, but everyone kept telling me my speeds were crap and that I should upgrade to a new LTE Cat 21 router. So I got one of the more popular models, the ZTE MF289F, and the speed increased to 50 Mbps up / 75 Mbps down on a speed test. But all my calls suddenly felt very choppy and perceived web browsing was unbearably slow... What happened? Well, the router would just decide, every day or so, to raise its latency to Google.com from 15 ms to 150 ms until it was restarted. But that is not all. Even when the ping latency was fine, it still felt slower than my ancient TP-Link LTE router... So the ZTE went into a drawer, waiting for the day I have time to put Linux on it, and the TP-Link went back on top of my antenna mast.
See also https://github.com/lynxthecat/cake-autorate for an active measurement tool...
However, they tend to use something like the 75th percentile and throw out real data. The Waveform bufferbloat test uses the 95th and supplies whisker charts; Cloudflare's does too.
No web test exercises up + down at the same time, which is the worst-case scenario. crusader and flent.org's rrul test do.
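For reference, a minimal rrul run, assuming flent (and its netperf dependency) is installed and using one of the public bufferbloat.net test servers:

    # Saturate up + down simultaneously for 60 s and plot latency under load
    flent rrul -l 60 -H netperf-eu.bufferbloat.net -p all_scaled -o rrul.png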
Rather than argue with your colleague, why not just slap an OpenWrt box inline as a transparent bridge and configure CAKE SQM?
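The core of it is one line per direction; a sketch assuming a 100/20 Mbit line and eth0 as the WAN port (shape to roughly 95% of the line rate so the queue builds where CAKE can manage it):

    # Egress (upload) shaping on the WAN interface
    tc qdisc replace dev eth0 root cake bandwidth 19mbit diffserv4 nat
    # Ingress needs an IFB redirect; on OpenWrt the sqm-scripts package
    # (luci-app-sqm) sets up both directions for you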
-- Andrew S. Tanenbaum
https://www.bitag.org/latency-explained.php
It's worth a read.
I mean, I bet they do care about litres/100 km for their car AND 0-100 km/h acceleration (and many other stats)
A simple rule of thumb is: If a single user experiences poor performance with your otherwise idle cluster of servers, then adding more servers will not help.
You can't imagine how often I have to have this conversation with developers, devops people, architects, business owners, etc...
"Let's just double the cores and see what happens."
"Let's not, it'll just double your costs and do nothing to improve things."
Also another recent conversation:
"Your network security middlebox doesn't use a proper network stack and is adding 5 milliseconds of latency to all datacentre communications."
"We can scale out by adding more instances if capacity is a concern!"
"That's... not what I said."
Have some laughs: https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-...
I'm semi-retired now...
(edit: I forgot to note that the "let's not" part was always overridden by "You're wrong, this will fix it. Do it!" by management. Then we would eventually find and fix the actual problem (because it didn't go away), but the cluster size -- and the cost -- would remain because "No, it was too slow with so few replicas".)
/s
> Brooks points out this limited divisibility with another example: while it takes one woman nine months to make one baby, "nine women can't make a baby in one month".
taking the population of the earth and the birth rate and doing some math, you land at needing around 12,000 women of reproductive age to all but guarantee a baby is born tomorrow.
12,000 is a lot of women! it's well above Dunbar's number. think about that, next time the 9 women one month baby topic comes up.
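for the curious, a hedged back-of-envelope that lands in the same place (the inputs are rough assumptions):

    # ~140M births/year over ~2B women of reproductive age gives p, the chance
    # a given woman gives birth on a given day; for a ~90% chance of at least
    # one birth tomorrow you need n = ln(10)/p of them
    awk 'BEGIN { p = 140e6/2e9/365; printf "p=%.1e  n=%.0f\n", p, log(10)/p }'
    # -> p=1.9e-04  n=12007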
ISPs at scale do not use software routers. They use ASIC routers (Juniper/Arista/Cisco/etc.) for many reasons: 1) features, 2) capacity, 3) reliability.
ASIC routers are capable of handling 100-1000x the throughput of the most over-provisioned Linux server (and that may even be an understatement). ASIC routers can also route packets with latency between 10 µs (0.01 ms!) and 750 µs (0.75 ms), complemented by multi-second (>1 GB) packet buffers.
QoS is rarely used at scale, if anything only on the access layer, because transit has become so cheap that ISPs have more bandwidth than they know what to do with. These days, if a link is congested, it's not cost saving, but instead poor network planning. QoS also has very limited benefits at >100G scale.
With that said, I feel that this article is definitely missing the full picture.
I recommend using a small box running LibreQoS adjacent to the big router. Large-scale routers based on Application-Specific ICs do a wonderful job, but are hard to change. Having a transparent fix in an inexpensive device now is way better than waiting and hoping that the router vendor can update their ASICs (:-))
I empathised with your problem in a video about the article, at https://vimeo.com/1017926413
Bandwidth is cheap.
A report by the Internet Society (ISOC) on IXPN (Nigeria) and the Kenyan IXP revealed that in early 2020, the port charge at IXPN was US$0.428 per Mbps per month (for a 1 Gbps port), while international IP transit cost US$27.45 per Mbps per month (also for 1 Gbps capacity): roughly a 64x difference.
Let me find more recent estimates. Just note that we aren't all in Ashburn, Frankfurt or Amsterdam.
That was from Uganda, April 2023. And Zimbabwe and Eritrea are more expensive.
Coincidentally, the difference between latency and data rate is also much clearer using these two terms.
"CAKE then added Active Queue Management (AQM), which performs the same kind of bandwidth probing that vanilla TCP does but adds in-band congestion signaling to detect congestion as soon as possible. The transmission rate is slowly raised until a congestion signal is received,[...]"
This appears to suggest that Cake (an in-network AQM process) takes over some of the functionality of TCP (implemented in the endpoints). What's actually happening is that the AQM provides a better signal to allow TCP to do a better job.
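That in-band signaling is ECN: the AQM marks packets instead of dropping them, and the endpoint's TCP backs off when it sees the mark. On a Linux endpoint you can opt in; a one-line sketch (run as root):

    # 0 = off, 1 = request ECN on outgoing and accept it on incoming
    # connections, 2 = accept-only (the usual default)
    sysctl -w net.ipv4.tcp_ecn=1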
The rest of the article is more or less accurate, albeit that it's marketing for one particular tool rather than giving you the level of understanding needed to choose one.
The dig at PIE (another AQM) is also a bit misleading, in that their main complaint is not PIE itself but the lack of all these other features they think necessary. If Cake used PIE instead of CODEL I don't think it would be noticeably different.
Pie had a severe problem in the rate estimator which was fixed in 2018, in Linux, at least:
https://www.sciencedirect.com/science/article/abs/pii/S13891...
Pie's principal advantage is that it is slightly easier to implement in hardware; its disadvantages are that it does tail drop rather than head drop, and that it struggles to be stable at a target of 16 ms, where codel can go down to microseconds and targets 5 ms by default. I haven't really revisited pie since the above paper was published.
COBALT in cake is a codel derivative. It is slightly tighter in some respects (hitting slow start sooner) and looser in others (it never drops the last packet in a queue, which fq_codel does). fq_codel scales to hundreds of instances and tens of thousands of queues, still aiming for 5 ms across all of them, whereas it would be easier to essentially DoS that many instances of cake with tons of flows.
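To make those operating points concrete, the targets map directly onto tc parameters (eth0 and the exact values are illustrative):

    tc qdisc replace dev eth0 root pie target 15ms          # pie: unstable much below ~16 ms
    tc qdisc replace dev eth0 root fq_codel target 5ms ecn  # codel's default target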
I can get 1 Gbps down but only 50 Mbps up. Certain tasks (like uploading a Docker image) I can't do at all from my personal computer.
The layman has no idea of the difference, and even most legislators don't understand the issue ("isn't 1 Gbps fast enough?")
I'm not very happy.
Have you filed a complaint with the FCC? Both times I had to do it things got sorted very quickly.
https://consumercomplaints.fcc.gov/hc/en-us/articles/1150022...
At one time I was experiencing high ping times and near-nonexistent throughput from AT&T Fiber to Online.fr's network. I did 80% of the diagnostics for them, provided the details, and of course a nudge as to what I felt the issue could be.
It's extremely frustrating to be a networking person having to deal with home internet CS.
To my surprise, it actually did get to their networking team who replied saying the peer was fine and try again. The problem with that was that it came 8 months later, long after I'd left the area and didn't even have service with them anymore.
In my experience, it’s much easier to upload code or commits and build/push artifacts in/from the datacenter, whether manually or via CI.
It can be as simple as exporting DOCKER_HOST="ssh://root@host". Docker handles uploading the relevant parts of your cwd to the server.
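A minimal sketch of that workflow, with build-host and the registry name as hypothetical stand-ins:

    # The docker CLI tunnels every command to the remote daemon over SSH
    export DOCKER_HOST="ssh://root@build-host"
    # The build context (your cwd, minus .dockerignore entries) is uploaded
    # once; layers are built and cached on the remote side
    docker build -t registry.example.com/myapp:latest .
    # The push leaves from the datacenter's fat pipe, not your home uplink
    docker push registry.example.com/myapp:latest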
I have a wickedly fast workstation, but spot instances that are way, way faster (and on 10 Gbps symmetric) are pennies. Added bonus: I can use them from a slow computer with no degradation.
For the unlucky, maybe we can take advantage of the fact that most image pushes have a predecessor to which they are 99% similar. With some care about the image contents (nar instead of tar, gzip --rsyncable, etc.) we ought to be able to save a lot of bandwidth by using rsync on top of the previous version instead of transferring each image independently.
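A hedged sketch of that idea with stock tools, where your gzip supports --rsyncable (it keeps compressed output block-aligned so rsync's delta algorithm can match unchanged runs against the previous export already on the server):

    # Export the new image in an rsync-friendly form
    docker save app:v2 | gzip --rsyncable > app.tar.gz
    # rsync only transfers the blocks that differ from the server's copy
    rsync --partial --progress app.tar.gz user@server:/images/app.tar.gz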
As someone who used to work with LLMs, I feel this pain. It would take days for me to upload models. Other community members rent GPU servers to do the training on just so that their data will already be in the cloud, but that's not really a sustainable solution for me since I like tinkering at home.
I have around the same speeds, btw: 1 Gbps down and barely 40 Mbps up. A factor of 25!
The worst part is that block compression doesn't actually help unless it does a significantly good job at both compression AND decompression. My use case involved immediately deploying the models across a few nodes in a live environment at customer sites. Cloud wasn't an option for us, and fiber was also unavailable much of the time.
The fastest transport protocol was someone's car and a workday of wages.
This is actually the entire premise of AWS Snowball: send someone a bunch of storage space, have them copy their data to that storage, then just ship the storage back with the data on it. It can be several orders of magnitude faster and easier than an internet transfer.
Sneakernet really works. https://en.wikipedia.org/wiki/Sneakernet
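The arithmetic still holds up; a back-of-envelope with assumed figures (10 TB of SSDs in the trunk, a 4-hour drive):

    awk 'BEGIN { printf "%.1f Gbit/s door to door\n", 10e12*8/(4*3600)/1e9 }'
    # -> 5.6 Gbit/s of throughput, with latency measured in hours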
They have used deep packet inspection and traffic shaping for ages to screw over over-the-top competition to their own services, or to tier their offerings into higher-priced, slightly-less-artificially-sabotaged package deals.
I really like what the LibreQoS people are aiming for, but let's not pretend ISPs are trying to be great and are just technically hampered (and yes, I'm sure there are exceptions to this rule).
The post was a pretty good explanation about a new distro ISPs can use to help with fair queuing, but this statement is laughably naive.
A distro existing is only a baby first step to an ISP adopting this. They need to train on how to monitor these, scale them, take them out for maintenance, and operate them in a highly available fashion.
It's a huge opex barrier and capex is not why ISPs didn’t bother to solve it in the first place.
We're pretty sure most of those ISPs see reduced opex from support calls.
Capex until the appearance of fq_codel (Preseem, Bequant) or cake (LibreQoS, Paraqum) middleboxes was essentially infinite. Now it's pennies per subscriber, and many just get a suitable box off of eBay.
I agree btw, that how to monitor and scale is a learned thing. For example many naive operators look at "drops" as reported by CAKE as a bad thing, when it is actually needed for good congestion control.
Slapped together as a PoC is different than something production ready. Unless those ISPs are so small they don’t care about uptime, a single Ubuntu box in the only hot path of the network is no bueno.
> We're pretty sure most of those ISPs see reduced opex from support calls.
I highly doubt this. As someone who worked in an ISP, the things that people call their ISP for are really unrelated to the ISP (poor WiFi placement, computer loaded with malware, can’t find their WiFi password, can’t get into their gmail/bank/whatever). When Zoom sucks they don’t even think to blame their ISP, they just think zoom sucks.
There is a tiny fraction of power users who might suspect congestion, but they aren’t the type to go into ISP support for help.
> Capex until the appearance of fq_codel (Preseem, Bequant) or cake (LibreQoS, Paraqum) middleboxes was essentially infinite. Now it's pennies per subscriber, and many just get a suitable box off of eBay.
These tools have been around for a while now. My point is that the ISPs that haven’t done something about this yet aren’t holding out for a cheaper capex option. They are in the mode of not wanting to change anything at all.
So this attitude that you only need to tell them “there is an open source thing you can run on an old server that will help with something that isn’t costing you money anyway” is out of touch with how most ISPs are run.
The ones that care don’t need their customers to tell them. The ones that don’t care aren’t going to do anything that requires change.
I am sorry you are so down on the lack of motivation to care about customer service that many ISPs have. I am thrilled by how much LibreQoS's customer base cares.
That said, many OpenWrt-supported chips have offloads that bypass that, and while speedier and lower-power, they tend to be overbuffered.
Bandwidth is less of a concern for most people now that data rates are 500 Mbps+. That's enough to comfortably stream 5 concurrent 4K streams (at 20 Mbps each).
Latency and jitter have a bigger impact on real-time applications, particularly video conferencing, VoIP, gaming, and to a lesser extent video streaming when you are scrubbing the feed. You can test yours at https://speed.cloudflare.com/. If your video is jittery or laggy and you are having trouble with natural conversation, latency/jitter are likely the issue.
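If you prefer the command line, plain ping gives a rough jitter figure in the mdev field of its summary (8.8.8.8 is just a convenient well-known target):

    ping -c 50 8.8.8.8 | tail -2
    # rtt min/avg/max/mdev = 12.3/15.1/48.7/6.2 ms   (illustrative numbers)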
Data caps are a real concern for most people. At 1 Gbps you move roughly 450 GB per hour, so a saturated connection can burn through a 1-1.5 TB data cap in two to three hours.
Assuming you are around 500 Mbps or more, latency & data caps are the bigger concern.
Assuming you're talking about consumers: How? All that data needs to go somewhere!
Even multiple 4K streams only take a fraction of one gigabit/s, and while downloads can often saturate a connection, the total transmitted amount of data is capped by storage capacities.
That's not to say that data caps are a good thing, but conversely it also doesn't mean that gigabit connections with terabyte-sized data caps are useless.
Gamers tend to have an intuitive understanding of latency, they just use the words "lag" and "ping" instead.
However, people can still have latency issues from their ISP even if their connection isn't fully saturated at home. Bufferbloat is just one situation in which higher latency is created.
Yes, my Zoom call was terrible BECAUSE I was also downloading Diablo and saturating my connection. But my Zoom call could also be terrible without anything else being downloaded, if my ISP is bad or any number of other things.
As someone who worked at a large ISP: if a customer says their bandwidth is terrible but their line tests at full speed, most ISPs will test for latency issues.
Bufferbloat is one of many many reasons why someone's network might be causing them high latency.
However (assuming no prioritisation), if your bandwidth is at least double your video conference bandwidth requirements then a download shouldn’t significantly affect the video conference since TCP tends to be fair between streams.
Even when I was on a 10Mb/s line I found gaming and voice was generally fine even with a download.
However, if you’re using peer-to-peer (like BitTorrent), then that utilizes dozens or hundreds of individual TCP streams, and your video conference stream, getting only an equal share alongside all the others, ends up too slow.
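A quick worked example of why that breaks, with assumed figures: a 50 Mbit/s link and a call that needs 3 Mbit/s.

    # Per-stream fair share: fine against one download, starved against 100 peers
    awk 'BEGIN { link=50; printf "1 peer: %.1f Mbit/s; 100 peers: %.2f Mbit/s\n", link/2, link/101 }'
    # -> 1 peer: 25.0 Mbit/s; 100 peers: 0.50 Mbit/s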
Bufferbloat exacerbates high-utilisation symptoms because it confounds the TCP algorithms, which struggle to find the correct equilibrium due to “erratic” feedback on whether you’re transmitting too much.
It’s like queuing in person at a government office without being able to see past a door or corner how bad the queue really is. If you could see it’s bad, you might come back later; but because you can’t, you stand a while in the queue, only to realize quite a bit later that you’ll have to wait much longer than you initially expected. If you’d known upfront it would be bad, you might have opted to come back when it’s quieter. And most people feel that since they’ve sunk the time already, they may as well wait as long as it takes, further keeping the queue long.
Higher throughput would help, but just knowing ahead that now’s a bad time would help a lot too.
I do wish most consumer ISPs supported deprioritising packets of my choice, which would let you run heavy downloads at low priority while your video call stayed fine.
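If you run your own shaper you can approximate this today: CAKE's diffserv4 mode places CS1-marked traffic in a background tin, so marking your bulk flows deprioritises them. A Linux iptables sketch (the NAS address is hypothetical):

    # Mark all traffic from the backup NAS as background (CS1); cake's
    # diffserv4 bulk tin then yields to interactive traffic
    iptables -t mangle -A FORWARD -s 192.168.1.10 -j DSCP --set-dscp-class CS1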
Also, this doesn’t take into account that the congestion/queueing issue might be upstream. I could have 100G from the customer’s local CO to my core routers, but if the route goes over a saturated 20G link to a local IX, it probably won’t help to have FQ/CoDel at the edge toward the customer.
Also take a look at Measurement Swiss Army-Knife (MSAK) https://netbeez.net/blog/msak/
Yeah yeah, I know that's only 40 m-ish for sound.
Ya. Canada is like that. Lack of choice in ISPs, high costs and horrible uptime performance.
The terrible performance is spotty: that was a particularly glaring example I detected when everything was failing (:-))
So... every ISP that exists then? Networking is one of those fields where the results are just varying shades of terrible no matter how hard you try.
So I'm looking for some opinions; what's your experience? Casual googling seems to suggest that the best way to implement traffic management would be either a dedicated machine running something like OpenWrt or an all-in-one solution (i.e. a Firewalla Gold plus some AP to provide wifi).
Personally, I've moved to OpnSense. Some run it natively on refurbished low-power SFF hardware (6000- or 7000-series Intel should be fine, or some Ryzen), so even in countries with high electricity costs that's feasible these days.
More specifically, I run OpnSense in a qemu/libvirt VM (2C of an E5-2690v4) and do WiFi with popular prosumer APs. Mind that VMs are likely to introduce latency, so if you try this route, make sure to PCIe-passthrough your network devices to the VM. I was prepared to ditch the VM for a dedicated SFF box.
Interestingly, some ARM SoCs that are worse on paper do much better because they hardware-accelerate PPPoE (e.g. recent MediaTek Filogic SoCs or some Qualcomm/Marvell SoCs). Most of those routers also use less power and have great WiFi. The downside (coming back to bufferbloat) is that they may not be able to do multi-gbit SQM.
Ugh.
Serialization delay, queuing delay etc. often dominate, but these have little to do with the actual propagation delay, which also can't be neglected.
> when a customer is yelling at me telling me that the latency should be absolute 0
The speed of light isn't infinity, is it?
So it can be relevant, even as an approximation.
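It isn't zero either, and the propagation floor is easy to estimate; a sketch assuming light in fibre at ~200,000 km/s and the ~8,800 km Amsterdam-San Francisco great circle:

    awk 'BEGIN { d=8800; v=200000; printf "one-way %.0f ms, RTT %.0f ms\n", d/v*1000, 2*d/v*1000 }'
    # -> one-way 44 ms, RTT 88 ms; a measured 130 ms is that floor plus
    #    routing detours, serialization and queuing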