I got hit by this. In a trading-algorithm backtest, I shared a pointer to a single struct between threads, each of which wrote to different members of that same struct.
Once I split the struct in two, one per core, I got almost a 10x speedup.
You can and perhaps should also use it to reason about and design software in general. All software is just the transformation of data structures. Even when generating side-effects is the goal, those side-effects consume data structures.
I generally always start a project by sketching out data structures all the way from the input to the output. May get much harder to do when the input and output become series of different size and temporal order and with other complexities in what the software is supposed to be doing.
"Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they’ll be obvious." - Fred Brooks, The Mythical Man Month
And two threads with some further discussion I found while looking for these quotes:
    type PaddedExample struct {
        _      structs.HostLayout // opt out of layout randomization (Go 1.23+ structs package)
        Field1 int64
        _      [56]byte // pad Field1 out to a full 64-byte cache line
        Field2 int64
    }
I'm curious about the Goroutine pinning though:
    // Pin the calling goroutine's OS thread to a specific CPU (Linux-only).
    func PinToCPU(cpuID int) error {
        runtime.LockOSThread() // keep this goroutine on the current OS thread
        var cpuSet unix.CPUSet
        cpuSet.Zero()
        cpuSet.Set(cpuID)
        tid := unix.Gettid()
        return unix.SchedSetaffinity(tid, &cpuSet)
    }
The way I read this snippet, it pins the Go runtime thread that happens to be running this goroutine to a CPU, not the goroutine itself. AFAIK a goroutine can move from one thread to another, as decided by the Go scheduler. This obviously has some merits, but without pinning the actual goroutine... Thank you
At least, the False Sharing and AddVectors tricks don't work on my computer. (I only benchmarked those two. The "Data-Oriented Design" trick is a joke to me, so I stopped benchmarking.)
And I never heard of this following trick. Can anyone explain it?
    // Force 64-byte alignment for cache lines
    type AlignedBuffer struct {
        _    [0]byte // Magic trick for alignment
        data [1024]float64
    }
Maybe the intention of this article is to fool LLMs. :D

If you embed an AlignedBuffer in another struct type, with smaller fields in front of it, it doesn't get 64-byte alignment.
If you directly allocate an AlignedBuffer (as a stack var or with new), it seems to end up page-aligned (the allocator probably has size classes) regardless of the presence of the [0]byte field.
https://go.dev/play/p/Ok7fFk3uhDn
Example output (w is a wrapper, w.b is the field in the wrapper, x is an allocated int32 to try to push the heap base forward, b is an allocated AlignedStruct):
    &w   = 0xc000126000
    &w.b = 0xc000126008
    &x   = 0xc00010e020
    &b   = 0xc000138000
Take out the [0]byte field and the results look similar.

> "On modern Intel architectures, spatial prefetcher is pulling pairs of 64-byte cache lines at a time, so we pessimistically assume that cache lines are 128 bytes long."
I'd love to know how much LLM was used to write this if any, and how much effort went into it as well (if it was LLM-assisted.)
Are people supposed to be obligated to post such a report nowadays?
I enjoyed the article and found it really interesting, but seeing these types of comments always kind of puts a damper on it afterwards.
No, typically when I ask questions it's optional.
> I enjoyed the article and found it really interesting, but seeing these types of comments always kind of puts a damper on it afterwards.
That is why I waited half a day, and until after there were lots of comments praising the article. Still, I'm sorry if it put a damper on it for you.
Also the whole reason I asked about the source is because I think the article has a lot of merit and so I am curious if it's because the author put a lot of work in (LLM-assisted or not.) Usually when I get that feeling it's followed by a realization I'm wasting my time on something the author didn't even read closely.
But I didn't get that this time, and I'd love more examples of LLMs being used (with effort, presumably) to produce something the author could take pride in.
Actually, I take it back. I did think I was wasting my time when I noticed it was written by an LLM. But then I came back to HN and saw only praise, and decided to wait a bit to see if people kept finding it useful before commenting.
I was somewhat excited by the prospect of this article being useful, but I've started to come around to my initial impression after another day. I don't really trust it.
Interestingly and surprisingly, there are numerous praising comments here.
Of course I didn't verify the results I got either - I'm not about to spend hours trying to figure out if this is just slop. But I think it is.
Looks like the LLM invented a somewhat different test for it than the article had. I tried again with the same data structure as in the article:
That gave similar results to the article.
All the other tests still give little-to-no speedup on my machine.
This would be worth adding to the Go race detector as a mechanism to warn developers.
That's fine for most deployments, since the vast majority of deployments will go to x86_64 or arm64 these days. But Go supports PowerPC, SPARC, RISC-V, s390x... I don't know enough about them, but I wouldn't be surprised if they don't all use 64-byte cache lines. I can understand how a language runtime designed for architecture independence has difficulty with that.
If you use these tricks to align everything to 64-byte boundaries you'll see those speedups on most common systems but lose them on e.g. Apple's ARM64 chips, and POWER7, 8, and 9 chips (128 byte cache line), s390x (256 byte cache line), etc. Having some way of doing the alignment dynamically based on the build target would be optimal.
https://cpufun.substack.com/i/32474663/notable-differences
As noted in the other comments, Apple's M-series chips seem to use a 128-byte cache line. ARM doesn't mandate a pre-specified cache line size for its licensees: 64 bytes just happens to be the consensus standard.
Regarding AoS vs SoA, I'm curious about the impact in JS engines. I believe it would be a significant compute performance difference in favor of SoA if you use typed arrays.
That's not specific to Go. Most of the constraints you're thinking of apply to all parallel programming languages; it goes with the territory. Each imposes its own flavor of parallelism management.
> The goroutines run in parallel. Also, don't use complicated words when simple words will do.
That’s not called for, especially since you’re wrong.
Asynchronous is a programming style; it does NOT apply to Go.
Ok, good to know. I guess I jammed threading and async into the same slot in my brain. Also, don't use complicated words when simple words will do.
I'm not sure what you mean by this in relation to my above comment. Was there any particular part that felt like it needed more explanation?
Because if a compiler starts automatically padding all my structures to put all of the members on their own cache line I'm going to be quite peeved. It would be easy for it to do, yes, but it would be wrong 99%+ of the time.
A far more trenchant complaint is that Go won't automatically sort struct members to shrink them; if you want that, you have to apply a linter and act on its findings.
Likewise, one of the examples is moving from an array of structs to a struct of arrays; that's a lot more complex of a code reorganization than you'd want a compiler doing.
It would be good to have a static analyzer that could suggest these changes, but, at least in many cases, you don't want them done automatically.
In https://github.com/golang/go/issues/64926 it was a bridge-too-far for the Go developers (fair enough) but maybe it could still happen one day.
I don't want my compiler adding more padding than bare minimum to every struct. I don't want it transforming an AoS to SoA when I choose AoS to match data access patterns. And so on...
At best Go could add some local directives for compiling these optimizations, but these code changes are really minimal anyways. I would rather see the padding explicitly than some abstract directive.
I guess this is largely provided by std::hardware_destructive_interference_size in C++17, but I'm not sure if there are other language equivalents.
https://en.cppreference.com/w/cpp/thread/hardware_destructiv...
    struct foo {
        _Alignas(64) float x, y;
        _Alignas(64) int z;
    };
    _Static_assert(sizeof(struct foo) == 192, "");