Unsure if this would be another way to do it, but to save an instruction at the cost of a memory access you could push and then pop the stack size, maybe? Presumably you're doing that pair of moves on function entry and exit anyway. I'm not really sure what the garbage collector is looking for, so maybe that doesn't work, but I'd be interested to hear some takes on it
So that would turn the whole “add constant to SP” sequence into 2 executable instructions, 1 for constructing the immediate and 1 for adding, for a total of 8 bytes, plus a 4-byte data area for the 17-bit immediate: 12 bytes of binary in total, which is 3 executable instructions' worth.
[1] https://developer.arm.com/documentation/dui0801/l/A64-Data-T...
I don't know enough about how people use the go assembler, but I imagine it would be very surprising if `add $imm, rsp, rsp` clobbered an unrelated register when `$imm` is large enough. Especially since what's clobbered is the designated "temporary register", which I imagine is used all the time in handwritten go assembly.
Additionally, they call out interactions with the OS/execution environment. For example, x18 is the "platform register", and it's unspecified what the OS does with it. It's entirely possible that it clobbers it on context switch or during an interrupt or whatever. So don't use that one unless you have a contract with the OS itself.
But locally, i.e. "from instruction to instruction", no such convention exists to my knowledge, and you probably don't want to have registers that pseudo-instructions might trash inadvertently in general, because it means you can't optimally use these registers.
It's possible for pseudo-instructions or generally macros to be documented as, e.g., "this macro uses x3 as a temporary register and trashes it", but in my experience most macros that need additional temporary registers actually ask you to specify them as part of the macro invocation.
E.g. suppose you have a macro "weirdhash" that takes two registers and saves some kind of hash of them in a third register, but that also needs an extra register to perform its work. You would call it with:
weirdhash x9, x10, x11, x0
Where x0 would be the scratch register you don't care about.[1] I'm not familiar with AMD64, but maybe you could use a thread-local (edit: that wouldn't work with M:N threads; you'd need a coroutine-local, which would tie the assembler to golang and thus would, on that alone, be a very bad idea) or reserve space in the stack frame for it, but I don't see those as realistic options
Yes, though that weird stuff with dollars in it is not normal AArch64 assembly!
The article could have mentioned the "stack moves once" rule.
Still, I find it great that Go brought back the 1990s tradition of compiled languages having an assembler as part of their tooling, regardless of the syntax.
That’s because the C ABI supports unwinding with a fairly expressive set of tools for describing stack-pointer state on a per-instruction level. Even the simpler Microsoft ABI essentially uses bytecode for that[1]; and on the more complicated Itanium ABI, you get DWARF CFI instructions, which make the correct way to preserve a(n x86) register in the function prologue look like
push rbx
.cfi_adjust_cfa_offset 8
.cfi_rel_offset rbx, 8
which are impossible to miss when reading compiler-generated assembly because of the sheer amount of annoying noise they create. The Go authors decided to sidestep all of this complexity, which is understandable to a degree, but apparently they did not think through all the ramifications of doing so.
[1] https://learn.microsoft.com/en-us/cpp/build/exception-handli...
[1] https://dwarfstd.org/doc/DWARF5.pdf#page=171
[2] Slightly modified by psABI[3] section 3.7 for x86-64 or the LSB[4] section 11.6 for ARM64, but at this point that’s a drop in the bucket as far as overall complexity is concerned.
[3] https://gitlab.com/x86-psABIs/x86-64-ABI/-/jobs/artifacts/ma...
[4] https://refspecs.linuxfoundation.org/LSB_4.0.0/LSB-Core-gene...
* https://jdebp.uk/FGA/function-perilogues.html#StandardMIPS
I wrote up the x86 equivalent of doing just two read-modify-write operations on the stack pointer over 16 years ago.
There was a userspace thread library I came across a long time ago that used variable length arrays to switch between thread stacks; the scheduler would allocate an array of the right size to bump the stack pointer to the different thread's stack.
See the AT&T vs Intel syntax, since you aren't familiar with assembly.
The 2A03 assemblers that people who write NES code (and later on SNES/GB/etc.) use have some real quirks: $ prefixes a hex value but % is binary, and a # in front of that makes it an immediate literal rather than an address; registers are baked into the opcode (ldx -> load into X), and more.
Playstation folks all just used MIPS dialects which are mostly AT&Tish but the PS2 used an Intel style assembler.
However, it is probably not as problematic, due to the way Go allows assembly to be used directly.
While the JVM and CLR don't allow direct access to assembly code, Go does, so I assume expecting safepoints everywhere is not an option, as any subroutine call can land in code that was manually written.
(Well technically there is a way to inject assembly without the function call overhead. That's what https://pkg.go.dev/runtime/internal/atomic is doing. But you will need to modify the runtime and compiler toolchain for it.)
Whereas when you go through cgo, you get a marshaling layer, similar to how JNI and P/Invoke work, that takes care of those issues.
Does the Go team have a natural language bot or is this just comment.contains(“backport”) type stuff?
(found via https://go.dev/wiki/gopherbot)
HEY GUYS WE JUST FOUND A GOLANG COMPILER BUG AND FATAL PANICS!
Everyone is like “Hmm. I need to fix this now.”
So, 99% probability it’s what it is. 1% it’s some secret defensive thing because there was a bad stupid zero day someone would get fired over or that could leave the world in shambles if uncovered, or maybe something else needed to be swept under the rug, or maybe someone wants to distract while they introduce a new vulnerability.
I don’t think this with CVEs, but when someone’s like “install this patch everybody!” the dim red light flickers on.
This issue, and the fix, has perfectly good visibility. Even if you personally can't understand the code, plenty of others can and do.
All of which makes your claims seem like quite unnecessary paranoia — to a lot of folk... and I suspect that is probably why your comment is getting heavily downvoted.
I found a bug in Turbo Pascal 6 where, if you declared a variable with the same name as the function, the result was random garbage.
For those who don't know Pascal: you assign to the function name to set the result value, so if a local variable with the same name is allowed, you cannot set the return value at all.
Something like this https://godbolt.org/z/s6srhTW66
(* In Turbo Pascal 6 this would compile *)
function Square(num: Integer): Integer;
var
  Square: Integer;
begin
  Square := num * num; (* Here the local variable gets used instead *)
end;
The reporter actually spent the effort to track it down, turns out it _was_ a Go compiler bug. (https://github.com/golang/go/issues/20427)
In the HFT sphere I haven't talked to a company that hasn't reported (bragged about finding) a super weird gcc/clang bug.
Well, also, at my last job we used a snapshot version of the compiler, because... any nanosecond matters.
The thing is, it's quite unlikely that your competitor hits the exact same bug. The cost of having to keep upstream patched and tested isn't justified.
Also in HFT world there are some very similar patterns across competing companies, yet, we just saw TernFS coming out from XTX, with not much fear of competitors benefiting from it more than they do.
Also, it fulfills the marketing objective, because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.
I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.
> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.
Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.
But isn't that the same thing here? The bug occurred in their production workflows, not in some specific debug builds, so it seems pretty reasonable to call it a compiler bug?
As for the actual bug:
Unless you're unwinding the stack by walking the linked list of frames threaded through the frame pointer, each time you unwind a level of the stack you need to consult a table keyed on the instruction pointer to look up how to compute the register contents of the previous frame from the register contents of the current frame. One of the registers you can compute this way is the previous frame's stack pointer.
I haven't looked in depth at what the Go runtime is doing exactly, but at a glance, I don't see mention of frame pointers in the linked article, so I'm guessing Go uses the SP-and-unwind-table approach? If so, the real bug here is that the table didn't have separate entries for the two ADDs and so gave incorrect reconstruction instructions for one of them.
If, however, frame pointers are a load-bearing part of the Go runtime, and that runtime failed to update frame pointer (not just the stack pointer) in the contractually mandatory manner, well, that's a codegen bug and needs a codegen fix.
I guess I just don't like, as a matter of philosophy if not practical engineering, having frame pointers at all. Without the frame pointer, the program already contains all the information you need to unwind, at no runtime cost --- you pay for table lookups only when you unwind, not all the time, on straight-line code.
The purist in me doesn't like burning a register for debugging, but you have to use the right tool for the job I guess.
As an aside, this is the type of problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models, and even if those models could somehow generate code for you, you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.
For the implementation, you can use certified compilers like CompCert [1], but:
- you still have to show your code is correct
- there are still parts of CompCert that are not certified
Interesting to hear Go & ARM in use.
And yeah, a lot of Rust but also a lot of Go.
Nobody's fault really, but bad results ensued.
(You cannot express relaxed atomics in golang, but you could technically add support in the compiler for use in the runtime code)
Uh, the fault is entirely in writing an assembler _that is not an assembler_, but rather something that is _almost_ like one but then 1% like an IR instead. It's an unforced error.
At least in the 90s, there were actually macro assemblers that supported OOP in assembly. Borland Turbo Assembler 5.0 comes to mind; it was kind of fun.
By the way, Embarcadero still has Turbo Assembler.
https://docwiki.embarcadero.com/RADStudio/Athens/en/Turbo_As...
Now a thing of the past, but assemblers for game consoles were also quite powerful in their macro capabilities.
I never liked the UNIX assembly culture, because naturally, as soon as C became a thing, assemblers became the bare minimum required to assemble the output of the C compiler, as another step in the compilation pipeline.
All the niceties of macro assemblers came from the other platforms, like being able to use NASM instead of the platform assembler; neither GNU as nor clang is that great in its abilities as an assembler beyond the basic stuff.
Even then, if the code-gen was written BEFORE the preemption, it was fairly sloppy for those implementing the preemption not to consider the function epilogue; granted, statically adjusting the stack/frame pointer by more than 4 KiB is probably a bit of an edge case.
When you're doing enough transactions you start to see a noise floor of e.g. bit flips from cosmic rays, and looking for issues involves correlating/categorizing possible software failures and distinguishing them from the misbehavior of hardware.
The feedback loop here should be: novel bug comes in ==> determine how existing testing was deficient ==> modify the testing in a general way that would have found this bug ==> run these modified tests in the background to see if anything similar was missed. Bugs should be used as indicators that regions (as large as possible) of bug space have been inadequately covered.
Compiler bugs are actually quite common ( I used to find several a year in gcc ), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.
My background is not networking (it's math, then HPC, then broader stuff) but I keep stumbling on similar problems (including a beautiful one related to Intel NICs a few years ago which led me into a rabbit hole of eBPF and the kernel network layer, and which surfaced later on the Cloudflare blog), and the only tech company for which this seems to be a regular occurrence is Cloudflare. Their space is a bit unknown to me, so I guess I'm having a hard time projecting something onto the job offers.
I’d happily chat to someone working for cloudflare though - I guess this would help me understand what it is that actually happens over there. I guess I’m a bit intimidated by this unknown yet really good looking world :-)
Can't speak to the locations but the stuff you're interested/experienced in seems extremely likely to overlap with what they do. They do a lot of very deep technical things in all kinds of areas.
my recommendation if you want to talk to someone about it: search GitHub/Twitter/LinkedIn for people who work there on stuff you like, and just send them a message and ask for a 20-minute call!
Have done it plenty of times; it has always been extremely positive.
If you have the skills, they have the coin.
They won’t hire some react guy in X country but someone who can find compiler bugs and save them XX+ million dollars a year? Heck yeah.
I'm in a similar position where I'd like to do something a lot more interesting, but the overlap between where the interesting companies have offices and where I'd be willing to live isn't big enough to justify uprooting my life.
(Unless we're talking about "too good to ignore", that's a different story.)
Anyone who can optimize a company’s bottom line will be hired.
Like I said, no random average mid react guy or dime a dozen Java developer is getting hired as a remote employee in some flyover country.
But if someone can provide like 50x value then hell yeah..
I thought that was obvious in my message considering we are discussing compiler optimization
I think there's also quite a big spectrum of skill, even when we're talking about compiler optimization and highly skilled software developers. I'd put myself up there, but still I'm no Lars Bak (for whom Google allegedly created an office in Denmark).
If you're asking what would constitute someone being special, it would depend on the role and skillset. As I said in my earlier comment, someone who is a beast and can find and fix bugs in compilers is a rare person. Especially if that skillset can help the company save boatloads of money that can be deployed elsewhere.
There are probably only a handful of people in the world who understand and can push the AI landscape forward. A lot of them are Chinese immigrants, and yet OpenAI/Meta/etc are paying them boatloads of money.
As for remote roles, I once worked on a project where we hired some dude for like $500/hr as a contractor because he was one of the few people who knew the ins and outs of Postgres and the Oracle RDBMS, and we were doing some very important migration.
I was just puzzled by the middle part of the article, where they start investigating their code but seem to overlook the fact that it only happens on ARM64.
Still, I understand that it’s professional to proceed step by step logically.
Great article, it was a pleasure reading it!
Even for Java, as widespread as it is, I have made half a dozen reports. None in the last several years, though.
Better testing? The sheer scale of software being produced?
But definitely, better engineering and QA practices must also help here.
But I'm not sure that matters, because the unwind code they show uses the stack pointer rather than the frame pointer anyway.
Their repro case required a stack adjustment larger than 1<<12 (4kiB).
Used to have a lot of fun with those 3 decades ago.
I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.
Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.
One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.
On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.
Maybe different people find different things fun.
You wouldn't pay to be given compiler race condition bugs, right?
Hunting bugs that people have given up on or have no ideas on how to tackle is near the top of that list.
This, and now there's Pernosco, which makes everything much easier.
Now, under pressure, this is going to be a nightmare unless you have a high tolerance to stress.
https://heinen.dev/ - I’m Thea “Teddy” Heinen (she/her or they/them)!
Yeah, and that's fun for me. Some of my most fun bugs to debug have been compiler, or even CPU issues.
I don't think I'd be allowed to spend weeks debugging something like this. Credit to Cloudflare's PMs.
https://blog.cloudflare.com/however-improbable-the-story-of-...
> But [the Cloudbleed sensitive information disclosure security incident] wasn’t the only consequence of the bug. Sometimes it could lead to an invalid memory read, causing the NGINX process to crash, and we had metrics showing these crashes in the weeks leading up to the discovery of Cloudbleed. So one of the measures we took to prevent such a problem happening again was to require that every crash be investigated in detail.
Since then, they have a "no crashes go uninvestigated" policy, which for the scale Cloudflare operates at, seems pretty impressive.
That's not been my experience at all FWIW. Tools get things wrong all the time.
Simply that more mature projects with heavy use, like gcc or clang/LLVM, generally tend to have had major bugs stamped out by this point. They do still happen, though.
More nascent language and compiler ecosystems are more likely to run into issues. Especially languages with runtimes.
Hey; it could've been type-3 fun.
I also don’t like many puzzle games, like Sudoku, because to me they feel like this kind of work. Many colleagues of mine have expressed bafflement that I don’t find such puzzles fun and give me all kinds of grief about how I ought to enjoy them, since they do.
It’s the same thing here, just flipped around: this person seems to enjoy the debugging experience; just let them be. Or recruit them, because that temperament is valuable.
Complex engineering isn't something to be avoided by default.
To me this points to a lack of verification, testing, and, most importantly, awareness of the invariants being relied on. If the GC relies on the stack pointer being valid at all times, then the IR needs a way to guarantee that modifications to it are not split into multiple instructions during lowering. That means there should be explicit testing of each kind of stack layout, with tests that look at the real generated code and step through it instruction by instruction to verify that these invariants are never broken...
I like Go, but I don't really like their NIH, replace-everything-with-our-stuff stance, especially on system tools like assemblers and linkers.