When I finally got round to seeing what he was doing, I was disappointed to find he was attempting to kill the 'system idle' process.
We used to rotate the "point of contact" (POC) each shift, and they were responsible for reaching out to customers and doing initial ticket triage.
One customer kept having a CPU usage alarm go off on their Windows instances not long after midnight. The overnight POC reached out to the customer to let them know that they had investigated and noticed that "system idle processes" were taking up 99% of CPU time and the customer should probably investigate, and then closed the ticket.
I saw the ticket within a minute or two of it reopening as the customer responded with a barely diplomatic message to the tune of "WTF". I picked up that ticket, and within 2 minutes had figured out the high CPU alarm was being caused by the backup service we provided, apologised to the customer and had that ticket closed... but not before someone not in the team saw the ticket and started sharing it around.
I would love to say that particular support staff never lived that incident down, but sadly that particular incident was par for the course with them, and the team spent an inordinate amount of time doing damage control with customers.
He justified the capex by saying if cashiers could scan products faster, customers would spend less time in line and sales would go up.
A little digging showed that the CIO wrote the point-of-sale software himself in an ancient version of Visual Basic.
I didn't know VB, but it didn't take long to find the loops that did nothing except count to large numbers to soak up CPU cycles, since VB didn't have a sleep() function.
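For anyone who hasn't seen this trick, the shape of it is roughly the sketch below (in C rather than VB, with a made-up loop bound; the point is the busy count versus an actual sleep):

#include <unistd.h>

/* Hypothetical sketch of a "delay" that burns CPU by counting. */
void busy_delay(void)
{
    volatile long i;                  /* volatile so the loop isn't optimized away */
    for (i = 0; i < 50000000; i++)    /* pegs a core for the whole wait */
        ;
}

/* What a polite delay looks like when a sleep call is available (POSIX shown). */
void polite_delay(void)
{
    sleep(1);                         /* yields the CPU to the OS for ~1 second */
}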
1. take input from a web form
2. do an expensive database lookup
3. do an expensive network request, wait for response
4. do another expensive network request, wait for response
5. and, of course, another expensive network request, wait for response
6. fuck it, another expensive network request, wait for response
7. a couple more database lookups for customer data
8. store the data in a table
9. store the same data in another table. and, of course, another one.
10. now, check to see if the form was submitted with valid data. if not, repeat all steps above to back-out the data from where it was written.
11. finally, check to see if the customer is a valid/paying customer. if not, once again, repeat all the steps above to back-out the data.
I looked at the logs, and something like 90% of the requests were invalid data from the web form or invalid/non-paying customers (this service was provided only to paying customers).
I was so upset at this dude for convincing management that my server was the problem that I sent an email to pretty much everyone that said, basically, "This code sucks. Here's the problem: check for invalid data/customers first.", and I included a snippet from the code. The dude replied-to-all immediately, claiming I didn't know anything about Java code and I should stay in my lane. Well, throughout the day, other emails started to trickle in, saying, "Yeah, the code is the problem here. Please fix it ASAP." The dude was so upset that he just left. He went completely AWOL and didn't show up to work for a week or so. We were all worried, like he'd jumped off a bridge or something. It turned into an HR incident. When he finally returned, he complained to HR that I had stabbed him in the back and that he couldn't work with me because I was so rude. I didn't really care; I was a kid. Oh yeah, his nickname became AWOL Wang. LOL
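The fix really was just reordering: do the cheap rejection checks before any of the expensive steps. A minimal sketch of the idea (in C with invented names, not the actual Java service):

/* All names here are hypothetical; the point is the ordering. */
struct request { int customer_id; const char *form_data; };

static int form_data_is_valid(const struct request *r) { return r->form_data != 0; }
static int customer_is_paying(int id)                  { return id > 0; }
static int do_expensive_work(const struct request *r)  { (void)r; return 0; }

int handle_request(const struct request *req)
{
    /* Cheap checks first: ~90% of requests failed one of these anyway. */
    if (!form_data_is_valid(req))
        return -1;                        /* invalid form data */
    if (!customer_is_paying(req->customer_id))
        return -2;                        /* not a paying customer */

    /* Only now pay for the network requests and table writes, so there
       is nothing to back out when a request gets rejected. */
    return do_expensive_work(req);
}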
The company’s customer-facing website was servlet based. The main servlet was performing horribly: timeouts, spinners, errors, etc. Our team looked at the code and found that the original team implementing the logic had a problem they couldn’t figure out how to solve, so they decided to apply the big hammer: they synchronized the doService() method… oh dear…
TBF, I don't think a lot of developers at the time (90's) were used to the idea of having to write MT-safe callback code. Nowadays thousands of object allocations per second is nothing to sweat over, so a framework might make a different decision to instantiate callbacks per request by default.
Now, if your language lacks the sleep statement or some other way to yield execution, what should you do instead when your program has no work to do? Actually, I don’t know what the answer is.
(I disagree that you should be sleeping to wait for any OS event; that is what blocking kernel calls do automatically.)
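For example, a process that blocks in the kernel uses no CPU at all until there is work to do; a minimal sketch using poll() on stdin:

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };

    for (;;) {
        /* The process consumes no CPU here; the kernel wakes it on input. */
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            char buf[256];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            if (n <= 0)
                break;                /* EOF or error: nothing left to wait for */
            fwrite(buf, 1, (size_t)n, stdout);
        }
    }
    return 0;
}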
Silly idle process.
If you've got time for leanin', you've got time for cleanin'
2. If it isn't doing anything and it's just lying to you... when there IS a problem, your tools to diagnose the problem are limited because you can't trust what they're telling you
Fixing it would be gratifying and reassuring too.
Everything worked fine on my Linux install ootb
The only solution was to swap over to an SSD.
Yep, that was "System Idle" that was doing it. They had the best people.
Since Microsoft's response to the bug was to deny and gaslight the affected people, we can't tell for sure what caused it. But several people were in a situation where their computer couldn't finish any work, and Task Manager claimed all of the CPU time was spent on that line item.
Although I can understand how "Please provide data to demonstrate that this is an OS scheduling issue since app bottlenecks are much more likely in our experience" could come across as "denying and gaslighting" to less experienced engineers and layfolk
It worked fine on Mac. On Windows though, if you let it use as many threads as there were CPUs, it would nearly 100% of the time fail before making it through our test suite. Something in scheduling the work would deadlock. It was more likely to fail if anything was open besides the app. Basically, a brush stroke that should complete in a tenth of a second would stall. If you waited 30-60 minutes (yes minutes), it would recover and continue.
I vaguely recall we used the Intel compiler implementation of OpenMP, not what comes with MSVC, so the fault wasn't necessarily a Microsoft issue, but could still be a kernel issue.
I left that company later that year, and MS rolled out Windows 8. No idea how long that bug stuck around.
Well. I wouldn't go that far. Any busy dev team is incentivized to make you run the gauntlet:
1. It's not an issue (you have to prove to me it's an issue)
2. It's not my issue (you have to prove to me it's my issue)
3. It's not that important (you have to prove it has significant business value to fix it)
4. It's not that time sensitive (you have to prove it's worth fixing soon)
It was exactly like this at my last few companies. Microsoft is quite a lot like this as well.
If you have an assigned CSAM, they can help run the gauntlet. That's what they are there for.
See also: The 6 stages of developer realization:
https://www.amazon.com/Panvola-Debugging-Computer-Programmer...
Years ago at a job we were seeing issues with a network card on a VM. One of my coworkers spent 2-3 days working his way through support engineer after support engineer until he finally got on a call with one. He talked the engineer through what was happening: remote VM, can only access it over RDP (well, we could VNC too, but that idea just confuses Microsoft support people for some reason).
The support engineer decided that the way to resolve the problem was to uninstall and re-install the network card driver. Coworker decided to give the support engineer enough rope to hang themselves with, hoping it'd help him escalate faster: "Won't that break the RDP connection?" "No sir, I've done this many times before, trust me" "Okay then...."
Unsurprisingly enough, when you uninstall the network card driver and cause the instance to have no network cards, RDP stops working. Go figure.
Co-worker let the support engineer know that he'd now lost access, and offered a guess as to why. "Oh, yeah. I can see why that might have been a problem"
Co-worker was right though, it did finally let us escalate further up the chain....
What I'm getting at with my post is the dev teams that support has to talk to; support just forwards along the dev teams' responses verbatim.
A lot of MSFT support does suck. There are also some really amazing engineers in the support org.
I did my time in support early in my career (not at MSFT), so I understand well that it's extremely hard to hire good support engineers, and even harder to keep them. The skills they learn on the job make them attractive to other parts of the org, and they get poached.
There is also an industry-wide tendency for developers to treat support as a bunch of knuckle-dragging idiots, but at the same time they don't arm them with detailed information on how stuff works.
But the "support" that the end user sees is that combination, not two different teams (even if they know it's two or more different teams). The point is that the end user reached out for help and was told their own experiences weren't true. The fact that Dave had Doug actually tell them that is irrelevant.
If we're going to call it gaslighting, then gaslighting is typical dev team behavior, which of course flows back down to support. It's a problem with Microsoft just like it is a problem for any other company which makes software.
Almost every software company out there will jump on their customers' complaints and try to fix the issue, even when the root cause is not in their software.
You might be lucky in that you've worked at companies where you are a big enough customer that vendors bend over backwards for you. For example: if you work for Wal-Mart, you probably get this less often. They are usually the biggest fish in whatever pond they are swimming in.
"Windows 10 Task Manager shows 100% CPU but Performance Monitor Shows less than 2%" - https://answers.microsoft.com/en-us/windows/forum/all/window...
Gaslighting customers was Microsoft's standard reaction to bugs until at least 2007, when I last oversaw somebody interacting with them.
Doesn't look like there's a lot of discussion on the mailing list, but I don't know if I'm reading the thread view correctly.
Therefore it cannot really be portable, because other timers in other devices will have different memory maps and different commands for reading.
The fault is with the designers of these timers, who have failed to provide a reliable way to read their value.
It is hard to believe that this still happens in this century, because reading correct values, despite the fact that the timer is incremented or decremented continuously, is an essential goal in the design of any timer that may be read, and how to do it has been well known for more than three quarters of a century.
The only way to make such a workaround somewhat portable is to parametrize it, e.g. with the number of retries for direct reading or with the delay time when reading the auxiliary register. This may be portable between different revisions of the same buggy timer, but the buggy timers in other unrelated CPU designs will need different workarounds anyway.
Don't leave me hanging! How to do it?
Synchronous counters are more expensive in die area than asynchronous counters, especially at high clock frequencies. Moreover, it may be difficult to also synchronize the reading signal with the timer clock. Therefore the second solution may be preferable, which uses a separate capture register for reading the timer value.
This was implemented in the timer described in TFA, but it was done in a wrong way.
The capture register must either ensure that the capture is already complete by the time when it is possible to read its value after giving a capture command, or it must have some extra bit that indicates when its value is valid.
In this case, one can read the capture register until the valid bit is on, having a complete certainty that the end value is correct.
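A sketch of what reading such a well-behaved capture register could look like (register addresses and the valid bit are invented for illustration, not the PXA168's actual layout):

#include <stdint.h>

/* Hypothetical register map -- not the real hardware. */
#define TIMER_CAPTURE_CMD   (*(volatile uint32_t *)0x40000010)
#define TIMER_CAPTURE_VAL   (*(volatile uint32_t *)0x40000014)
#define TIMER_CAPTURE_VALID (1u << 31)          /* assumed "capture complete" bit */

static uint32_t read_timer_capture(void)
{
    uint32_t v;

    TIMER_CAPTURE_CMD = 1;                      /* request a capture */
    do {
        v = TIMER_CAPTURE_VAL;                  /* poll until the hardware says it's done */
    } while (!(v & TIMER_CAPTURE_VALID));

    return v & ~TIMER_CAPTURE_VALID;            /* strip the status bit */
}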
When adding some arbitrary delay between the capture command and reading the capture register, you can never be certain that the delay value is good.
Even when the chosen delay is 100% effective during testing, it can result in failures on other computers or when the ambient temperature is different.
What about different variants, revisions, and speeds of this CPU?
"working at slow clock" part, might explain why some other implementations had different code path for 32.768 KHz clocks. According to docs there are two available clock sources "Fast clock" and "32768 Hz" which could mean that "slow clock" refers to specific hardware functionality is not just a vague phrase.
As for portability concerns, this is already low-level, hardware-specific register access. If Marvell releases a new SoC, not only is there no assurance that it will require the same timing, it might as well have a different set of registers requiring a completely different read and setup procedure, not just different timing.
One thing that slightly confuses me: the old implementation had 100 cycles of cpu_relax(), which is unrelated to the specific timer clock, but neither is reading the TMR_CVWR register. Since 3-5 reads of that register worked better than 100 cycles of cpu_relax(), the read clearly takes more time, unless the cpu_relax() part got completely optimized out. At least I didn't find any references mentioning that the timer clock affects the read time of TMR_CVWR.
> I didn't find any references mentioning that timer clock affects read time of TMR_CVWR.
Reading the register might be related to the timer's internal clock, as it would have to wait for the timer's bus to respond. This is essentially implied if Marvell recommends re-reading this register, or if their reference implementation did so. My main complaint is that it's all guesswork, because Marvell's docs aren't that good.
One can also say Omegamon (or whatever tool) was misreporting, because it didn't account for the processor time of the various supporting systems that dealt with peripheral operations. After all, they also paid for the disk controllers, disks, tape drives, terminal controllers and so on, so they would want to drive those close to 100% as well.
I had my time squeezing the last cycle possible from a Cyber 205 waaaay back in the day.
> This code is more complicated than what I expected to see. I was thinking it would just be a simple register read. Instead, it has to write a 1 to the register, and then delay for a while, and then read back the same register. There was also a very noticeable FIXME in the comment for the function, which definitely raised a red flag in my mind.
Regardless, this was a very nice read and I'm glad they got down to the issue and the problem fixed.
But yeah, nice writeup of the kinds of problem you can run into in embedded systems programming.
u32 val;

/* Read the counter until two consecutive reads agree. */
do {
    val = readl(...);
} while (val != readl(...));
return val;
compiles to a nice 6-instr little function on arm/thumb too, with no delays:

readclock:
LDR R2, =...
1:
LDR R0, [R2]
LDR R1, [R2]
CMP R0, R1
BNE 1b
BX LR
Though he did say a VARIETY of laptops, both Windows and Linux. Can someone be _that_ unlucky?
Plus I'd be surprised if I got the same thing on both linux and windows
Fan speeds should ideally be driven by temperature sensors, and CPU idling is working, albeit with interrupt waits, as pointed out here. The only impact seems to be surprise that the CPU appears to be working harder than it really is when looking at this number.
It's far better to look at the system load (which was 0.0 - already a strong hint this system is working below capacity). It has a formal definition (the average number of tasks running or waiting for the CPU, averaged over 1, 5, and 15 minutes) and succinctly captures the concept of "this machine is under load".
Many years ago, a coworker deployed a bad auditd config. CPU usage was below 10%, but system load was 20x the number of cores. We moved all our alerts to system load and used that instead.
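If you want the same numbers programmatically (rather than parsing uptime output), glibc and the BSDs expose getloadavg(); a minimal sketch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double load[3];                       /* 1-, 5-, and 15-minute averages */

    if (getloadavg(load, 3) < 3) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("load: %.2f %.2f %.2f\n", load[0], load[1], load[2]);
    return 0;
}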
Why does reading it multiple times fix the issue?
Is it just because reading takes time, so reading multiple times lets enough time pass between the write and the final read? If so, it sounds like a worse solution than just extending the waiting delay, like the author did initially.
If not, then I would like to know the reason.
(Needless to say, a great article!)
When reading directly, the value may be completely wrong, because the timer is incremented continuously and the updating of its bits is not synchronous with the reading signal. Therefore any bit in the value that is read may be wrong, because it has been read exactly during a transition between valid values.
The workaround in this case is to read multiple times and accept as good a value that is approximately the same for multiple reads. The more significant bits of the timer value change much less frequently than the least significant bits, so on most read attempts only a few bits can be wrong. Only seldom is the read value complete garbage, in which case comparing it with the other read values will reject it.
The second reading method was to use a separate capture register. After giving a timer capture command, reading an unchanging value from the capture register should have caused no problems. Except that in this buggy timer, it is unpredictable when the capture is actually completed. This requires the insertion of an empirically determined delay time before reading the capture register, hopefully allowing enough time for the capture to be complete.
It's only incrementing at 3.25MHz, right? Shouldn't you be able to get exactly the same value for multiple reads? That seems both simpler and faster than using this very slow capture register, but maybe I'm missing something.
In general, when reading a timer that increments faster, you may want to mask some of the least significant bits, to ensure that you can have the same values on successive readings.
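For instance, a consecutive-read loop that ignores the low bits which may legitimately tick between the two reads might look like this sketch (address and mask are invented):

#include <stdint.h>

#define TIMER_COUNT (*(volatile uint32_t *)0x40000000)  /* hypothetical address */
#define LSB_MASK    (~(uint32_t)0x3)                    /* ignore the 2 low bits */

static uint32_t read_timer_stable(void)
{
    uint32_t a, b;

    do {
        a = TIMER_COUNT;
        b = TIMER_COUNT;
    } while ((a & LSB_MASK) != (b & LSB_MASK));  /* retry until the high bits agree */

    return b;
}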
I went with the multiple reads because that's what Marvell's own kernel fork does. My reasoning was that people have been using their fork, not only on the PXA168, but on the newer PXAxxxx series, so it would be best to retain Marvell's approach. I could have just increased the delay loop, but I didn't have any way of knowing if the delay I chose would be correct on newer PXAxxx models as well, like the chip used in the OLPC. Really wish they had more/better documentation!
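The shape of that workaround is roughly the sketch below (the register macro, base address, and read count here are placeholders for illustration, not the actual patch):

#include <stdint.h>

/* Hypothetical MMIO access standing in for the kernel's readl()/writel(). */
#define TMR_CVWR_REG (*(volatile uint32_t *)0xd4014040)  /* made-up address */

static uint32_t timer_read(void)
{
    uint32_t val = 0;
    int i;

    TMR_CVWR_REG = 1;            /* latch the free-running count into the capture reg */

    /* Each read goes out to the timer block, so the reads themselves serve
       as the delay; the last one returns the latched value. */
    for (i = 0; i < 4; i++)
        val = TMR_CVWR_REG;

    return val;
}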
However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?
No, because the loop calls cpu_relax(), which is a compiler barrier. It cannot be optimized away.
And yes, reading via the memory bus is much, much slower than a barrier. It's absolutely likely that reading 4 times from main memory on such an old embedded system takes several hundred cycles.
The silliest part of this mess is that the 26 MHz clock for the APB1 bus is derived from the same source as the 13 MHz, 6.5 MHz, 3.25 MHz, and 1 MHz clocks usable by the fast timers.
Looking at the assembly code for both versions of this delay loop might clear it up.
There are really just very few reasons to ever use 'volatile'. In fact, the Linux kernel even has its own documentation on why you should usually not use it:
https://www.kernel.org/doc/html/latest/process/volatile-cons...
Or simpler: don't assume anything what you think you might know about this object, just do as you're told.
And yes, that for instance prohibits putting a value from a memory address into a register for further use, which would be a simple case of data optimization. Instead, a fresh retrieval from memory must be done on each access.
However, whether your system has caching or an MMU in the path is outside of the spec. The compiler does not care. If you tell the compiler to give you the byte at address 0x1000, it will do so. 'volatile' just forbids the compiler to deduce the value from already available knowledge. If a hardware cache or MMU messes with that, that's your problem, not the compiler's.
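A minimal sketch of the usual pattern (the address and bit are invented for illustration):

#include <stdint.h>

/* Hypothetical memory-mapped status register. */
#define STATUS_REG (*(volatile uint32_t *)0x1000)

void wait_until_ready(void)
{
    /* Because of 'volatile', every iteration performs a real load from
       address 0x1000; the compiler may not cache the value in a register
       and turn this into an empty or infinite loop. */
    while ((STATUS_REG & 0x1) == 0)
        ;
}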
> This request requires up to three timer clock cycles. If the selected timer is working at slow clock, the request could take longer.
Let's ignore the weirdly ambiguous second sentence and say for pedagogical purposes it takes up to three timer clock cycles full stop. Timer clock cycles aren't CPU clock cycles, so we can't just do `nop; nop; nop;`. How do we wait three timer clock cycles? Well a timer register read is handled by the timer peripheral which runs at the timer clock, so reading (or writing) a timer register will take until at least the end of the next timer clock.
This is a very common pattern when dealing with memory mapped peripheral registers.
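In its generic form the pattern is just N dummy reads of a register on the peripheral's bus, something like this sketch (address is hypothetical):

#include <stdint.h>

#define PERIPH_STATUS (*(volatile uint32_t *)0x4002000c)  /* made-up register */

/* A read of a peripheral register cannot complete faster than that
   peripheral's own clock, so N reads wait at least N peripheral cycles. */
static inline void wait_peripheral_cycles(int n)
{
    while (n-- > 0)
        (void)PERIPH_STATUS;      /* volatile read: cannot be optimized away */
}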
---
I'm making some reasonable assumptions about how the clock peripheral works. I haven't actually dug into the Marvell documentation.
Yes.
> If so, it sounds like a worse solution than just extending waiting delay longer like the author did initially.
Yeah, it's a judgement call. Previously, the code called cpu_relax() for waiting, which is also dependent on how that is defined (it can be simply a NOP or barrier(), for instance). The reading of the timer register may have the advantage that it is dependent on the actual memory bus speed, but I wouldn't know for sure. Hardware at that level is just messy, and especially niche platforms have their fair share of bugs where you need to do ugly workarounds like these.
What I'm rather wondering is why they didn't try the other solution that was mentioned by the manufacturer: reading the timer directly two times and compare it, until you get a stable output.
So I uninstalled it.
Not having any programs that are not good citizens.
> I opted to use 4 as a middle ground
reminded me of xkcd: Standards
"Task Manager doesn't report memory usage correctly" is another B.S. excuse heard on Windows. It's actually true, but the other way around -- Task Manager underreports the memory usage of most programs.
These days the biggest hog of memory is the browser. Not everyone does this, but a lot of people, myself included, have tens of tabs open at a time (with tab groups and all of that)... all day. The browser is the primary reason I recommend a minimum of 16 GB of RAM to F&F when they ask "the IT guy" what computer to buy.
When my Chrome is happily munching on many gigabytes of ram I don't think a few hundred megs taken by your average Electron app is gonna move the needle.
The situation is a bit different on mobile, but Electron is not a mobile framework so that's not relevant.
PS: Can I rant a bit about how useless the new(ish) Chrome memory saver thing is? What is the point of having tabs open if you're going to remove them from memory and just reload on activation? In the age of fast consumer SSDs I'd expect you to intelligently hibernate the tabs to disk; otherwise what you have are silly bookmarks.
> These days the biggest hog of memory is the browser.
That’s the problem: Electron is another browser instance.
> I don't think a few hundred megs taken by your average Electron app is gonna move the needle.
Low-end machines even in 2025 still come with single-digit GB RAM sizes. A few hundred MB is a substantial portion of an 8GB RAM bank.
Especially when it’s just waste.
My computer: starts 5 seconds slower
1 million computers in the world: start cumulatively 5 million seconds slower
Meanwhile a Microsoft programmer whose postgres via ssh starts 500ms slower: "I think this is a rootkit installed in ssh"
Most GUI toolkits can do layout / graphics / fonts in a much simpler (and sane) way. "Reactive" layout is not a new concept.
HTML/CSS/JS is not an efficient or clean way to do layout in an application. It only exists to shoehorn UI layout into a rich text DOCUMENT format.
Can you imagine if Microsoft or Apple had insisted that GUI application layout be handled the way we do it today back in the 80s and 90s? Straight up C was easier to grok than this garbage we have today. The industry as a whole should be ashamed. It's not easier, it doesn't make things look better, and it wastes billions in developer time and user time, not to mention slowly making the oceans boil.
Every time I have to use a web-based application (which is most of the time nowadays), it infuriates me. The latency is atrocious. The UIs are slow. There are mysterious errors at least once or twice daily. WTF are we doing? When a Windows 95 application ran faster and was more responsive and more reliable than something written 30 years later, we have a serious problem.
Here's some advice: stop throwing your web code into Electron, and start using a cross-platform GUI toolkit. Use local files and/or sqlite databases for storage, and then sync to the cloud in the background. Voila, non-shit applications that stop wasting everybody's effing time.
If your only tool is a hammer, something, something, nails...
My literal several hundreds of tabs are silly bookmarks in practice.
Modern computers are designed to idle at 0% then temporarily boost up when you have work to do. Then once the task is done, they can drop back to idle and cool down again.
> Specify a tolerance for the accuracy of when your timers fire. The system will use this flexibility to shift the execution of timers by small amounts of time—within their tolerances—so that multiple timers can be executed at the same time. Using this approach dramatically increases the amount of time that the processor spends idling…[1]
[0] https://arstechnica.com/gadgets/2013/06/how-os-x-mavericks-w...
[1] https://developer.apple.com/library/archive/documentation/Pe...
The p cores can be activated and deactivated very quickly, on the order of microseconds IIRC, which means the processor always "feels" fast while still conserving battery life.
One of the older ways (in x86 side) to do this was to invoke the HLT instruction https://en.wikipedia.org/wiki/HLT_(x86_instruction) : you stop the processor, and then the processor wakes up when an interrupt wakes it up. An interrupt might come from the sound card, network card, keyboard, GPU, timer (e.g. 100 times a second to schedule an another process, if some process exists that is waiting for CPU), and during the time you wait for the interrupt to happen you just do nothing, thus saving energy.
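In kernel-mode code the classic idle loop built on HLT is roughly this sketch (HLT is privileged, so this is not something a user program can do directly):

/* Sketch of a bare-metal / kernel-mode x86 idle loop. */
static void idle_loop(void)
{
    for (;;) {
        __asm__ volatile ("hlt");   /* stop the core until the next interrupt */
        /* An interrupt (timer, NIC, keyboard, ...) wakes the core here; its
           handler runs, and if there is still nothing to do, we halt again. */
    }
}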
I suspect things are more complicated in the world of multiple CPUs.
A modern CPU also has multiple cores, not all of which may be needed, and it is supported by hardware that can do lots of tasks on its own.
For example, sending out an audio signal isn’t typically done by the main CPU. It tells some hardware to send a buffer of data at some frequency, then prepares the next buffer, and can then sleep or do other stuff until it has to send the new buffer.
However for a CPU with multiple cores, each running at 2+ GHz, there is enough room for idling while seeming active.
I guess it was some _theoretical_ task scheduling stuff.... When you are doing task scheduling, yes, maybe, depends on what you optimize for.
.... but this bug has nothing to do with that. This bug is about an accounting error.