When I finally got round to seeing what he was doing, I was disappointed to find he was attempting to kill the 'system idle' process.
We used to rotate the "point of contact" (POC) each shift, and they were responsible for reaching out to customers and doing initial ticket triage.
One customer kept having a CPU usage alarm go off on their Windows instances not long after midnight. The overnight POC reached out to the customer to let them know that they had investigated and noticed that "system idle processes" were taking up 99% of CPU time and the customer should probably investigate, and then closed the ticket.
I saw the ticket within a minute or two of it reopening as the customer responded with a barely diplomatic message to the tune of "WTF". I picked up that ticket, and within 2 minutes had figured out the high CPU alarm was being caused by the backup service we provided, apologised to the customer and had that ticket closed... but not before someone not in the team saw the ticket and started sharing it around.
I would love to say that particular support staff never lived that incident down, but sadly that particular incident was par for the course with them, and the team spent an inordinate amount of time doing damage control with customers.
He justified the capex by saying if cashiers could scan products faster, customers would spend less time in line and sales would go up.
A little digging showed that the CIO wrote the point-of-sale software himself in an ancient version of Visual Basic.
I didn't know VB, but it didn't take long to find the loops that did nothing except count to large numbers to soak up CPU cycles, since VB didn't have a sleep() function.
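For anyone who hasn't seen this trick, the shape of it is roughly the sketch below (in C rather than VB, with a made-up loop bound; the point is the busy count versus an actual sleep):

#include <unistd.h>

/* Hypothetical sketch of a "delay" that burns CPU by counting. */
void busy_delay(void)
{
    volatile long i;                  /* volatile so the loop isn't optimized away */
    for (i = 0; i < 50000000; i++)    /* pegs a core for the whole wait */
        ;
}

/* What a polite delay looks like when a sleep call is available (POSIX shown). */
void polite_delay(void)
{
    sleep(1);                         /* yields the CPU to the OS for ~1 second */
}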
1. take input from a web form
2. do an expensive database lookup
3. do an expensive network request, wait for response
4. do another expensive network request, wait for response
5. and, of course, another expensive network request, wait for response
6. fuck it, another expensive network request, wait for response
7. a couple more database lookups for customer data
8. store the data in a table
9. store the same data in another table. and, of course, another one.
10. now, check to see if the form was submitted with valid data. if not, repeat all steps above to back-out the data from where it was written.
11. finally, check to see if the customer is a valid/paying customer. if not, once again, repeat all the steps above to back-out the data.
I looked at the logs, and something like 90% of the requests were invalid data from the web form or invalid/non-paying customers (this service was provided only to paying customers).
I was so upset at this dude for convincing management that my server was the problem that I sent an email to pretty much everyone that said, basically, "This code sucks. Here's the problem: check for invalid data/customers first.", and I included a snippet from the code. The dude replied-to-all immediately, claiming I didn't know anything about Java code and I should stay in my lane. Well, throughout the day, other emails started to trickle in, saying, "Yeah, the code is the problem here. Please fix it ASAP." The dude was so upset that he just left. He went completely AWOL and didn't show up to work for a week or so. We were all worried, like he'd jumped off a bridge or something. It turned into an HR incident. When he finally returned, he complained to HR that I had stabbed him in the back and that he couldn't work with me because I was so rude. I didn't really care; I was a kid. Oh yeah, his nickname became AWOL Wang. LOL
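The fix really was just reordering: do the cheap rejection checks before any of the expensive steps. A minimal sketch of the idea (in C with invented names, not the actual Java service):

/* All names here are hypothetical; the point is the ordering. */
struct request { int customer_id; const char *form_data; };

static int form_data_is_valid(const struct request *r) { return r->form_data != 0; }
static int customer_is_paying(int id)                  { return id > 0; }
static int do_expensive_work(const struct request *r)  { (void)r; return 0; }

int handle_request(const struct request *req)
{
    /* Cheap checks first: ~90% of requests failed one of these anyway. */
    if (!form_data_is_valid(req))
        return -1;                        /* invalid form data */
    if (!customer_is_paying(req->customer_id))
        return -2;                        /* not a paying customer */

    /* Only now pay for the network requests and table writes, so there
       is nothing to back out when a request gets rejected. */
    return do_expensive_work(req);
}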
The company’s customer-facing website was servlet based. The main servlet was performing horribly: timeouts, spinners, errors, etc. Our team looked at the code and found that the original team implementing the logic had a problem they couldn’t figure out how to solve, so they decided to apply the big hammer: they synchronized the doService() method… oh dear…
TBF, I don't think a lot of developers at the time (90's) were used to the idea of having to write MT-safe callback code. Nowadays thousands of object allocations per second is nothing to sweat over, so a framework might make a different decision to instantiate callbacks per request by default.
Now, if your language lacks the sleep statement or some other way to yield execution, what should you do instead when your program has no work to do? Actually, I don’t know what the answer is.
(I disagree that you should be sleeping to wait for any OS event; that is what blocking kernel calls do automatically.)
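For example, a process that blocks in the kernel uses no CPU at all until there is work to do; a minimal sketch using poll() on stdin:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };

    for (;;) {
        /* The process consumes no CPU here; the kernel wakes it on input. */
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            char buf[256];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0)
                break;                /* EOF or error: nothing left to wait for */
            fwrite(buf, 1, (size_t)n, stdout);
        }
    }
    return 0;
}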
Silly idle process.
If you've got time for leanin', you've got time for cleanin'
2. If it isn't doing anything and it's just lying to you... when there IS a problem, your tools to diagnose the problem are limited because you can't trust what they're telling you
Fixing it would be gratifying and reassuring too.
Everything worked fine on my Linux install ootb
The only solution was to swap over to an SSD.
Yep, that was "System Idle" that was doing it. They had the best people.
Since Microsoft's response to the bug was to deny and gaslight the affected people, we can't tell for sure what caused it. But several people were in a situation where their computer couldn't finish any work, and Task Manager claimed all of the CPU time was spent on that line item.
Although I can understand how "Please provide data to demonstrate that this is an OS scheduling issue since app bottlenecks are much more likely in our experience" could come across as "denying and gaslighting" to less experienced engineers and layfolk
It worked fine on Mac. On Windows though, if you let it use as many threads as there were CPUs, it would nearly 100% of the time fail before making it through our test suite. Something in scheduling the work would deadlock. It was more likely to fail if anything was open besides the app. Basically, a brush stroke that should complete in a tenth of a second would stall. If you waited 30-60 minutes (yes minutes), it would recover and continue.
I vaguely recall we used the Intel compiler implementation of OpenMP, not what comes with MSVC, so the fault wasn't necessarily a Microsoft issue, but could still be a kernel issue.
I left that company later that year, and MS rolled out Windows 8. No idea how long that bug stuck around.
Well. I wouldn't go that far. Any busy dev team is incentivized to make you run the gauntlet:
1. It's not an issue (you have to prove to me it's an issue)
2. It's not my issue (you have to prove to me it's my issue)
3. It's not that important (you have to prove it has significant business value to fix it)
4. It's not that time sensitive (you have to prove it's worth fixing soon)
It was exactly like this at my last few companies. Microsoft is quite a lot like this as well.
If you have an assigned CSAM, they can help run the gauntlet. That's what they are there for.
See also: The 6 stages of developer realization:
https://www.amazon.com/Panvola-Debugging-Computer-Programmer...
Years ago at a job we were seeing issues with a network card on a VM. One of my coworkers spent 2-3 days working his way through support engineer after support engineer until he finally got on a call with one. He talked the engineer through what was happening: remote VM, can only access it over RDP (well, we could VNC too, but that idea just confuses Microsoft support people for some reason).
The support engineer decided that the way to resolve the problem was to uninstall and re-install the network card driver. Coworker decided to give the support engineer enough rope to hang themselves with, hoping it'd help him escalate faster: "Won't that break the RDP connection?" "No sir, I've done this many times before, trust me" "Okay then...."
Unsurprisingly enough, when you uninstall the network card driver and cause the instance to have no network cards, RDP stops working. Go figure.
Co-worker let the support engineer know that he'd now lost access, and offered a guess as to why. "Oh, yeah. I can see why that might have been a problem"
Co-worker was right though, it did finally let us escalate further up the chain....
What I'm getting at with my post is the dev teams that support has to talk to; support just forwards along the dev teams' responses verbatim.
A lot of MSFT support does suck. There are also some really amazing engineers in the support org.
I did my time in support early in my career (not at MSFT), so I understand well that it's extremely hard to hire good support engineers, and even harder to keep them. The skills they learn on the job make them attractive to other parts of the org, and they get poached.
There is also an industry-wide tendency for developers to treat support as a bunch of knuckle-dragging idiots, but at the same time they don't arm them with detailed information on how stuff works.
But the "support" that the end user sees is that combination, not two different teams (even if they know it's two or more different teams). The point is that the end user reached out for help and was told their own experiences weren't true. The fact that Dave had Doug actually tell them that is irrelevant.
If we're going to call it gaslighting, then gaslighting is typical dev team behavior, which of course flows back down to support. It's a problem with Microsoft just like it is a problem for any other company which makes software.
Almost every software company out there will jump on their customers' complaints and try to fix the issue, even when the root cause is not in their software.
You might be lucky in that you've worked at companies where you are a big enough customer that vendors bend over backwards for you. For example: if you work for Wal-Mart, you probably get this less often. They are usually the biggest fish in whatever pond they are swimming in.
"Windows 10 Task Manager shows 100% CPU but Performance Monitor Shows less than 2%" - https://answers.microsoft.com/en-us/windows/forum/all/window...
Gaslighting customers was Microsoft's standard reaction to bugs until at least 2007, when I last oversaw somebody interacting with them.
Doesn't look like there's a lot of discussion on the mailing list, but I don't know if I'm reading the thread view correctly.
Therefore it cannot really be portable, because other timers in other devices will have different memory maps and different commands for reading.
The fault is with the designers of these timers, who have failed to provide a reliable way to read their value.
It is hard to believe that this still happens in this century, because reading correct values, despite the fact that the timer is incremented or decremented continuously, is an essential goal in the design of any timer that may be read, and how to do it has been well known for more than three quarters of a century.
The only way to make such a workaround somewhat portable is to parametrize it, e.g. with the number of retries for direct reading or with the delay time when reading the auxiliary register. This may be portable between different revisions of the same buggy timer, but the buggy timers in other unrelated CPU designs will need different workarounds anyway.
Don't leave me hanging! How to do it?
Synchronous counters are more expensive in die area than asynchronous counters, especially at high clock frequencies. Moreover, it may be difficult to also synchronize the reading signal with the timer clock. Therefore the second solution may be preferable, which uses a separate capture register for reading the timer value.
This was implemented in the timer described in TFA, but it was done in a wrong way.
The capture register must either ensure that the capture is already complete by the time when it is possible to read its value after giving a capture command, or it must have some extra bit that indicates when its value is valid.
In this case, one can read the capture register until the valid bit is on, having a complete certainty that the end value is correct.
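A sketch of what reading such a well-behaved capture register could look like (register addresses and the valid bit are invented for illustration, not the PXA168's actual layout):

#include <stdint.h>

/* Hypothetical register map -- not the real hardware. */
#define TIMER_CAPTURE_CMD   (*(volatile uint32_t *)0x40000010)
#define TIMER_CAPTURE_VAL   (*(volatile uint32_t *)0x40000014)
#define TIMER_CAPTURE_VALID (1u << 31)          /* assumed "capture complete" bit */

static uint32_t read_timer_capture(void)
{
    uint32_t v;

    TIMER_CAPTURE_CMD = 1;                      /* request a capture */
    do {
        v = TIMER_CAPTURE_VAL;                  /* poll until the hardware says it's done */
    } while (!(v & TIMER_CAPTURE_VALID));

    return v & ~TIMER_CAPTURE_VALID;            /* strip the status bit */
}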
When adding some arbitrary delay between the capture command and reading the capture register, you can never be certain that the delay value is good.
Even when the chosen delay is 100% effective during testing, it can result in failures on other computers or when the ambient temperature is different.
What about different variants, revisions, and speeds of this CPU?
"working at slow clock" part, might explain why some other implementations had different code path for 32.768 KHz clocks. According to docs there are two available clock sources "Fast clock" and "32768 Hz" which could mean that "slow clock" refers to specific hardware functionality is not just a vague phrase.
As for portability concerns, this is already low-level, hardware-specific register access. If Marvell releases a new SoC, not only is there no assurance that it will require the same timing, it might as well have a different set of registers requiring a completely different read and setup procedure, not just different timing.
One thing that slightly confuses me: the old implementation had 100 cycles of cpu_relax(), which is unrelated to the specific timer clock, but neither is reading the TMR_CVWR register. Since 3-5 reads of that register worked better than 100 cycles of cpu_relax(), the read clearly takes more time, unless the cpu_relax() part got completely optimized out. At least I didn't find any references mentioning that the timer clock affects the read time of TMR_CVWR.
> I didn't find any references mentioning that timer clock affects read time of TMR_CVWR.
Reading the register might be related to the timer's internal clock, as it would have to wait for the timer's bus to respond. This is essentially implied if Marvell recommends re-reading this register, or if their reference implementation did so. My main complaint is that it's all guesswork, because Marvell's docs aren't that good.
One can also say Omegamon (or whatever tool) was misreporting, because it didn't account for the processor time of the various supporting systems that dealt with peripheral operations. After all, they also paid for the disk controllers, disks, tape drives, terminal controllers and so on, so they would want to drive those close to 100% as well.
I had my time squeezing the last cycle possible from a Cyber 205 waaaay back in the day.
> This code is more complicated than what I expected to see. I was thinking it would just be a simple register read. Instead, it has to write a 1 to the register, and then delay for a while, and then read back the same register. There was also a very noticeable FIXME in the comment for the function, which definitely raised a red flag in my mind.
Regardless, this was a very nice read and I'm glad they got down to the issue and the problem fixed.
But yeah, nice writeup of the kinds of problem you can run into in embedded systems programming.
u32 val;

/* Read the counter until two consecutive reads agree. */
do {
    val = readl(...);
} while (val != readl(...));
return val;
compiles to a nice 6-instr little function on arm/thumb too, with no delays:

readclock:
LDR R2, =...
1:
LDR R0, [R2]
LDR R1, [R2]
CMP R0, R1
BNE 1b
BX LR
Though he did say a VARIETY of laptops, both Windows and Linux. Can someone be _that_ unlucky?
Plus I'd be surprised if I got the same thing on both linux and windows
Fan speeds should ideally be driven by temperature sensors, and CPU idling is working, albeit with interrupt waits, as pointed out here. The only impact seems to be surprise that the CPU appears to be working harder than it really is when looking at this number.
It's far better to look at the system load (which was 0.0 - already a strong hint this system is working below capacity). It has a formal definition (the average number of tasks running or waiting for the CPU, averaged over 1, 5, and 15 minutes) and succinctly captures the concept of "this machine is under load".
Many years ago, a coworker deployed a bad auditd config. CPU usage was below 10%, but system load was 20x the number of cores. We moved all our alerts to system load and used that instead.
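If you want the same numbers programmatically (rather than parsing uptime output), glibc and the BSDs expose getloadavg(); a minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double load[3];                       /* 1-, 5-, and 15-minute averages */

    if (getloadavg(load, 3) < 3) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("load: %.2f %.2f %.2f\n", load[0], load[1], load[2]);
    return 0;
}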
Why does reading it multiple times fix the issue?
Is it just because reading takes time, so reading multiple times lets enough time pass between the write and the final read? If so, it sounds like a worse solution than just extending the waiting delay, like the author did initially.
If not, then I would like to know the reason.
(Needless to say, a great article!)
When reading directly, the value may be completely wrong, because the timer is incremented continuously and the updating of its bits is not synchronous with the reading signal. Therefore any bit in the value that is read may be wrong, because it has been read exactly during a transition between valid values.
The workaround in this case is to read multiple times and accept as good a value that is approximately the same for multiple reads. The more significant bits of the timer value change much less frequently than the least significant bits, so on most read attempts only a few bits can be wrong. Only seldom is the read value complete garbage, in which case comparing it with the other read values will reject it.
The second reading method was to use a separate capture register. After giving a timer capture command, reading an unchanging value from the capture register should have caused no problems. Except that in this buggy timer, it is unpredictable when the capture is actually completed. This requires the insertion of an empirically determined delay time before reading the capture register, hopefully allowing enough time for the capture to be complete.
It's only incrementing at 3.25MHz, right? Shouldn't you be able to get exactly the same value for multiple reads? That seems both simpler and faster than using this very slow capture register, but maybe I'm missing something.
In general, when reading a timer that increments faster, you may want to mask some of the least significant bits, to ensure that you can have the same values on successive readings.
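For instance, a consecutive-read loop that ignores the low bits which may legitimately tick between the two reads might look like this sketch (address and mask are invented):

#include <stdint.h>

#define TIMER_COUNT (*(volatile uint32_t *)0x40000000)  /* hypothetical address */
#define LSB_MASK    (~(uint32_t)0x3)                    /* ignore the 2 low bits */

static uint32_t read_timer_stable(void)
{
    uint32_t a, b;

    do {
        a = TIMER_COUNT;
        b = TIMER_COUNT;
    } while ((a & LSB_MASK) != (b & LSB_MASK));  /* retry until the high bits agree */

    return b;
}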
I went with the multiple reads because that's what Marvell's own kernel fork does. My reasoning was that people have been using their fork, not only on the PXA168, but on the newer PXAxxxx series, so it would be best to retain Marvell's approach. I could have just increased the delay loop, but I didn't have any way of knowing if the delay I chose would be correct on newer PXAxxx models as well, like the chip used in the OLPC. Really wish they had more/better documentation!
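The shape of that workaround is roughly the sketch below (the register macro, base address, and read count here are placeholders for illustration, not the actual patch):

#include <stdint.h>

/* Hypothetical MMIO access standing in for the kernel's readl()/writel(). */
#define TMR_CVWR_REG (*(volatile uint32_t *)0xd4014040)  /* made-up address */

static uint32_t timer_read(void)
{
    uint32_t val = 0;
    int i;

    TMR_CVWR_REG = 1;            /* latch the free-running count into the capture reg */

    /* Each read goes out to the timer block, so the reads themselves serve
       as the delay; the last one returns the latched value. */
    for (i = 0; i < 4; i++)
        val = TMR_CVWR_REG;

    return val;
}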
However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?
No, because the loop calls cpu_relax(), which is a compiler barrier. It cannot be optimized away.
And yes, reading via the memory bus is much, much slower than a barrier. It's absolutely likely that reading 4 times from main memory on such an old embedded system takes several hundred cycles.
The silliest part of this mess is that the 26 MHz clock for the APB1 bus is derived from the same source as the 13 MHz, 6.5 MHz, 3.25 MHz, and 1 MHz clocks usable by the fast timers.
Looking at the assembly code for both versions of this delay loop might clear it up.
There are really just very few reasons to ever use 'volatile'. In fact, the Linux kernel even has its own documentation on why you should usually not use it:
https://www.kernel.org/doc/html/latest/process/volatile-cons...
Or simpler: don't assume anything what you think you might know about this object, just do as you're told.
And yes, that for instance prohibits putting a value from a memory address into a register for further use, which would be a simple case of data optimization. Instead, a fresh retrieval from memory must be done on each access.
However, whether your system has caching or an MMU in the path is outside of the spec. The compiler does not care. If you tell the compiler to give you the byte at address 0x1000, it will do so. 'volatile' just forbids the compiler to deduce the value from already available knowledge. If a hardware cache or MMU messes with that, that's your problem, not the compiler's.
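A minimal sketch of the usual pattern (the address and bit are invented for illustration):

#include <stdint.h>

/* Hypothetical memory-mapped status register. */
#define STATUS_REG (*(volatile uint32_t *)0x1000)

void wait_until_ready(void)
{
    /* Because of 'volatile', every iteration performs a real load from
       address 0x1000; the compiler may not cache the value in a register
       and turn this into an empty or infinite loop. */
    while ((STATUS_REG & 0x1) == 0)
        ;
}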
> This request requires up to three timer clock cycles. If the selected timer is working at slow clock, the request could take longer.
Let's ignore the weirdly ambiguous second sentence and say for pedagogical purposes it takes up to three timer clock cycles full stop. Timer clock cycles aren't CPU clock cycles, so we can't just do `nop; nop; nop;`. How do we wait three timer clock cycles? Well a timer register read is handled by the timer peripheral which runs at the timer clock, so reading (or writing) a timer register will take until at least the end of the next timer clock.
This is a very common pattern when dealing with memory mapped peripheral registers.
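In its generic form the pattern is just N dummy reads of a register on the peripheral's bus, something like this sketch (address is hypothetical):

#include <stdint.h>

#define PERIPH_STATUS (*(volatile uint32_t *)0x4002000c)  /* made-up register */

/* A read of a peripheral register cannot complete faster than that
   peripheral's own clock, so N reads wait at least N peripheral cycles. */
static inline void wait_peripheral_cycles(int n)
{
    while (n-- > 0)
        (void)PERIPH_STATUS;      /* volatile read: cannot be optimized away */
}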
---
I'm making some reasonable assumptions about how the clock peripheral works. I haven't actually dug into the Marvell documentation.
Yes.
> If so, it sounds like a worse solution than just extending waiting delay longer like the author did initially.
Yeah, it's a judgement call. Previously, the code called cpu_relax() for waiting, which is also dependent on how that is defined (it can be simply a NOP or barrier(), for instance). The reading of the timer register may have the advantage that it is dependent on the actual memory bus speed, but I wouldn't know for sure. Hardware at that level is just messy, and especially niche platforms have their fair share of bugs where you need to do ugly workarounds like these.
What I'm rather wondering is why they didn't try the other solution that was mentioned by the manufacturer: reading the timer directly two times and compare it, until you get a stable output.
So I uninstalled it.
Not having any programs that are not good citizens.
> I opted to use 4 as a middle ground
reminded me of xkcd: Standards
"Task Manager doesn't report memory usage correctly" is another B.S. excuse heard on Windows. It's actually true, but the other way around -- Task Manager underreports the memory usage of most programs.
These days the biggest hog of memory is the browser. Not everyone does this, but a lot of people, myself included, have tens of tabs open at a time (with tab groups and all of that)... all day. The browser is the primary reason I recommend a minimum of 16 GB of RAM to F&F when they ask "the IT guy" what computer to buy.
When my Chrome is happily munching on many gigabytes of ram I don't think a few hundred megs taken by your average Electron app is gonna move the needle.
The situation is a bit different on mobile, but Electron is not a mobile framework so that's not relevant.
PS: Can I rant a bit about how useless the new(ish) Chrome memory saver thing is? What is the point of having tabs open if you're going to remove them from memory and just reload on activation? In the age of fast consumer SSDs I'd expect you to intelligently hibernate the tabs to disk; otherwise what you have are silly bookmarks.
> These days the biggest hog of memory is the browser.
That’s the problem: Electron is another browser instance.
> I don't think a few hundred megs taken by your average Electron app is gonna move the needle.
Low-end machines even in 2025 still come with single-digit GB RAM sizes. A few hundred MB is a substantial portion of an 8GB RAM bank.
Especially when it’s just waste.
My computer: starts 5 seconds slower
1 million computers in the world: start cumulatively 5 million seconds slower
Meanwhile a Microsoft programmer whose postgres via ssh starts 500ms slower: "I think this is a rootkit installed in ssh"
Most GUI toolkits can do layout / graphics / fonts in a much simpler (and sane) way. "Reactive" layout is not a new concept.
HTML/CSS/JS is not an efficient or clean way to do layout in an application. It only exists to shoehorn UI layout into a rich text DOCUMENT format.
Can you imagine if Microsoft or Apple had insisted that GUI application layout be handled the way we do it today back in the 80s and 90s? Straight up C was easier to grok than this garbage we have today. The industry as a whole should be ashamed. It's not easier, it doesn't make things look better, and it wastes billions in developer time and user time, not to mention slowly making the oceans boil.
Every time I have to use a web-based application (which is most of the time nowadays), it infuriates me. The latency is atrocious. The UIs are slow. There are mysterious errors at least once or twice daily. WTF are we doing? When a Windows 95 application ran faster and was more responsive and more reliable than something written 30 years later, we have a serious problem.
Here's some advice: stop throwing your web code into Electron, and start using a cross-platform GUI toolkit. Use local files and/or sqlite databases for storage, and then sync to the cloud in the background. Voila, non-shit applications that stop wasting everybody's effing time.
If your only tool is a hammer, something, something, nails...
My literal several hundreds of tabs are silly bookmarks in practice.
Modern computers are designed to idle at 0% then temporarily boost up when you have work to do. Then once the task is done, they can drop back to idle and cool down again.
> Specify a tolerance for the accuracy of when your timers fire. The system will use this flexibility to shift the execution of timers by small amounts of time—within their tolerances—so that multiple timers can be executed at the same time. Using this approach dramatically increases the amount of time that the processor spends idling…[1]
[0] https://arstechnica.com/gadgets/2013/06/how-os-x-mavericks-w...
[1] https://developer.apple.com/library/archive/documentation/Pe...
The p cores can be activated and deactivated very quickly, on the order of microseconds IIRC, which means the processor always "feels" fast while still conserving battery life.
One of the older ways (in x86 side) to do this was to invoke the HLT instruction https://en.wikipedia.org/wiki/HLT_(x86_instruction) : you stop the processor, and then the processor wakes up when an interrupt wakes it up. An interrupt might come from the sound card, network card, keyboard, GPU, timer (e.g. 100 times a second to schedule an another process, if some process exists that is waiting for CPU), and during the time you wait for the interrupt to happen you just do nothing, thus saving energy.
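In kernel-mode code the classic idle loop built on HLT is roughly this sketch (HLT is privileged, so this is not something a user program can do directly):

/* Sketch of a bare-metal / kernel-mode x86 idle loop. */
static void idle_loop(void)
{
    for (;;) {
        __asm__ volatile ("hlt");   /* stop the core until the next interrupt */
        /* An interrupt (timer, NIC, keyboard, ...) wakes the core here; its
           handler runs, and if there is still nothing to do, we halt again. */
    }
}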
I suspect things are more complicated in the world of multiple CPUs.
A modern CPU also has multiple cores, not all of which may be needed, and it is supported by hardware that can do lots of tasks on its own.
For example, sending out an audio signal isn’t typically done by the main CPU. It tells some hardware to send a buffer of data at some frequency, then prepares the next buffer, and can then sleep or do other stuff until it has to send the new buffer.
However for a CPU with multiple cores, each running at 2+ GHz, there is enough room for idling while seeming active.
I guess it was some _theoretical_ task scheduling stuff.... When you are doing task scheduling, yes, maybe, depends on what you optimize for.
.... but this bug has nothing to do with that. This bug is about an accounting error.