How much slower is random access, really? (samestep.com)

114 points by sestep 7 months ago | 16 comments

andersa7 months ago
Note this is not true random access in the manner it occurs in most programs. By having a contiguous array of indices to look at, that array can be prefetched as it goes, and speculative execution will take care of loading many upcoming indices of the target array in parallel.
A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.
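To make the distinction concrete, here is a minimal Python sketch of the two access patterns. The function names and the tiny arrays are made up for illustration, and Python is far too slow to show the hardware effect; the point is only where each load address comes from.

```python
def independent_sum(values, indices):
    """Article-style access: the indices live in their own contiguous array."""
    total = 0.0
    for i in indices:          # every address is known before the loads happen
        total += values[i]
    return total


def dependent_sum(values, next_index, start, steps):
    """Linked-list-style access: each slot also says where to go next."""
    total = 0.0
    i = start
    for _ in range(steps):
        total += values[i]
        i = next_index[i]      # the next address depends on this load
    return total


values = [10.0, 20.0, 30.0, 40.0]
indices = [2, 0, 3, 1]         # visiting order stored separately
next_index = [3, 2, 0, 1]      # same visiting order, stored as a chain: 2 -> 0 -> 3 -> 1

a = independent_sum(values, indices)
b = dependent_sum(values, next_index, start=2, steps=4)
```

Both loops touch the same slots in the same order; only the second one forces each load to wait for the previous one.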
- wtallis7 months ago
  > A more interesting example might be if each slot in the target array has the next index to go to in addition to the value, then you will introduce a dependency chain preventing this from happening.
  However, on some processors there's a data-dependent prefetcher that will notice the pointer-like value and start prefetching that address before the CPU requests it.
  - less_less7 months ago
    The data-dependent prefetcher is a cool feature, though you do have to be careful with side-channel issues, so some of them can disable it with the Data-Independent Timing bit or similar.
    At this point I'm kinda expecting CPU vendors to stop putting as many Spectre mitigations in the main core, and just have a small crypto core with full-fat arithmetic, less hardware for memory access, less speculation, and careful side-channel hardening. You still have to block Meltdown and other large vulnerabilities on the main cores, but if someone wants to protect elliptic curves from weird attacks? Try to set the DIT bit, trap into the OS, and get sent to the hardened core.
  - gpderetta7 months ago
    Even if the prefetcher were capable of traversing pointers, it wouldn't help. The hypothetical benchmark wouldn't do anything other than chase pointers, and the prefetcher can't really do that any quicker. A traversing prefetcher is useful if the code actually does work for each traversed node; then the prefetcher (or the OoO machinery) could realistically run ahead.
  - deepsun7 months ago
    Could probably overcome that by using integers, but converting them to a pointer after accessing (like '0'+1 is '1').
    wtallis7 months ago
    Do you mean storing the next index/offset and having the pointer value calculated as late as possible by adding the starting address (and maybe multiplying the index by sizeof)? That would probably defeat/mislead Intel's prefetcher, as described at https://www.intel.com/content/www/us/en/developer/articles/t...
- jiggawatts7 months ago
  This is why array random access and linked-list random access have wildly different performance characteristics.
  Another thing I noticed is that the spike on the left hand side of his graphs is the overhead of file access.
  Without this overhead, small array random access should have a lot better per-element cost.
  - sestep7 months ago
    To be clear, the overhead on the left part is only due to file access for the last two graphs (the "direct" summation ones with just one blue line). For all the charts with both blue and yellow lines, there is no file access happening on the left hand side of the graphs, since the file gets read into memory first and then the measurements are run.
- hansvm7 months ago
  Fun fact, that's part of why parsing protobuf is so slow.
  - elcritch7 months ago
    Indirection kills performance nowadays. I did a bunch of benchmarking a couple years back and found that you can parse MessagePack and CBOR faster than Protobuf if you know the types and serialize directly into them. Even if field order isn't known and you use non-allocated field strings.
    Well in a language that allows you to generate compile time specialized serde code like Nim or Zig. Maybe C++ has enough compile time reflection to do it now as well?
    I don't know enough about Rust's serde, but it seems like there'd be a lot of performance overhead with its design and the limits of Rust's macro and compiler system.
    Retr0id7 months ago
    By the way, DAG-CBOR and dCBOR enforce sorted map keys*, in which case you always know the field order.
    *maddeningly, with mutually incompatible sorting rules.
- jltsiren7 months ago
  If the next index is stored in the target array and the indexes are random, you will likely get a cycle of length O(sqrt(n)), which can be cached.
  You can avoid this with two arrays. One contains random query positions, and the target array is also filled with random values. The next index is then a function of the next query position and the previous value read from the target array.
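A rough Python sketch of the two-array scheme described above. The combining function (addition modulo the array length) and the name `chase` are assumptions chosen for illustration; the comment only says the next index is some function of the next query position and the previous value.

```python
import random


def chase(target, queries):
    """Visit target[] so that each index depends on the previous value read."""
    n = len(target)
    prev = 0
    visited = []
    for q in queries:
        i = (q + prev) % n     # cannot be computed until `prev` has been loaded
        visited.append(i)
        prev = target[i]
    return visited


rng = random.Random(1)
n = 1 << 10
target = [rng.randrange(n) for _ in range(n)]        # random values
queries = [rng.randrange(n) for _ in range(4 * n)]   # random query positions
visited = chase(target, queries)
```

Because fresh randomness enters from `queries` at every step, the walk is not confined to a short cycle of the target array, and it can run for as many steps as there are queries.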
  - eru7 months ago
    You could sample from a different random distribution.
    Eg start with every element referencing the next element (i.e. i+1 with wrap-around), and then use a random shuffle. That way, you preserve the full cycle.
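One way to read the suggestion above, sketched in Python: shuffle the visiting order, then link each slot to its successor in that order. The helper names are hypothetical. Note that shuffling the array of next-indices directly would generally produce several shorter cycles, which is why the sketch shuffles the order instead.

```python
import random


def random_full_cycle(n, rng):
    """Return next_index forming a single cycle through all n slots."""
    order = list(range(n))
    rng.shuffle(order)                         # random visiting order
    next_index = [0] * n
    for k in range(n):
        next_index[order[k]] = order[(k + 1) % n]
    return next_index


def cycle_length(next_index, start=0):
    """Count steps until the walk returns to `start`."""
    steps, i = 1, next_index[start]
    while i != start:
        i = next_index[i]
        steps += 1
    return steps


next_index = random_full_cycle(1000, random.Random(2))
length = cycle_length(next_index)
```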
    jltsiren7 months ago
    You can do that, but it's inefficient with larger arrays. Iterating over 10 billion elements in a random order can take up to an hour. Which is probably more than what you are willing to wait for a single case in a benchmark. On the other hand, you will probably find a cycle in a uniformly random array of 10 billion elements within 0.1 seconds, which is not enough to mitigate the noise in the measurements. So you need a way of generating unpredictable query positions for at least a few seconds without wasting too much time setting it up.
    eru7 months ago
    You could also use group theory to help you.
    Basically, pick a large prime number p as the size of your array and a number 0 < x < p. Then visit your array in the order of (i*x) modulo p.
    You can also do something with (x^i) modulo p, if your processor is smart enough to figure out your additive pattern.
    Basically, the idea is to look into the same theory they use to produce PRNG with long cycles.
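A small Python sketch of both orders mentioned above, with hypothetical function names and arbitrarily chosen constants. For prime p and 0 < x < p, (i*x) mod p hits every slot exactly once. The (x^i) mod p variant only reaches the nonzero slots, and only all of them when x is a generator modulo p, which the sketch assumes rather than checks.

```python
def stride_order(p, x):
    """Visiting order (i * x) mod p for i = 0 .. p-1."""
    return [(i * x) % p for i in range(p)]


def power_order(p, x):
    """Visiting order (x ** i) mod p for i = 0 .. p-2, computed incrementally."""
    order, v = [], 1
    for _ in range(p - 1):
        order.append(v)
        v = (v * x) % p
    return order


additive = stride_order(1009, 382)     # 1009 is prime; 382 is an arbitrary multiplier
multiplicative = power_order(7, 3)     # 3 generates the nonzero residues mod 7
```

The additive order has a constant stride modulo p, so a stride prefetcher may well predict it; the multiplicative order has no fixed stride.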
- sestep7 months ago
  Great point thanks, and I agree! I thought about also including another experiment for this "linked list"-style access pattern to see what the difference in performance is, but didn't get around to it. Maybe I'll write a followup post doing that.
- delusional7 months ago
  > By having a contiguous array of indices to look at, that array can be prefetched as it goes
  Does x86-64 actually do this data-dependent single-deref prefetch? Because in that case I have some design assumptions I have to reevaluate.
  - alain940407 months ago
    On modern cpus? Most likely. Those kinds of optimizations are done by the core with no compiler magic needed.
    CPU implementation has become too complex to grasp. The only sure way to know how a CPU will behave for a given workload is to run the workload. It's good to have some basic expectations of performance, instructions/cycle, memory bandwidth, to detect if something is off. I guess I'm trying to say it's hard to keep in your head all the details of what ~1B transistors are doing together to run your code. It's just too big.
  - phi-go7 months ago
    Hardware definitely supports this but it might need compiler support, as in adding instructions to do prefetching. That might be done automatically, or it might require a pragma or a call to a builtin. So it can be implemented in any case.
  - shakna7 months ago
    The compiler probably does [0].
    [0] https://gcc.gnu.org/projects/prefetch.html
    delusional7 months ago
    That list doesn't include any current mainline processors. It's all Itanium, 3DNow!, and MIPS.
    wtallis7 months ago
    Intel added PREFETCHW to their Broadwell processors launched in 2014, years after AMD dropped all 3DNow! instructions except the prefetch instructions. That timeline strongly suggests that the instructions aren't no-ops and likely are used by some popular software.
- kortilla7 months ago
  The article very clearly compares using randomized indexes and sequential. It’s kinda the point of the article.
  - cb3217 months ago
    You seem to misunderstand @andersa's point which I think is well expressed - it doesn't matter if the indices are randomized if the CPU can pre-fetch what they will be. The power of CPU speculative execution to hide latency can be quite surprising the first time you see it.
    This is a very small Nim program to demonstrate for "show me the code" and "it must just not be 'random enough'!" skeptics: https://github.com/c-blake/bu/blob/main/memlat.nim It uses the exact dependency idea @andersa mentions of a random cycle of `x[i] = i` that others else-sub-thread say some CPUs these days are smart enough to "see through". On Intel CPUs I have, the dependency makes things 12x slower at the gigabyte scale (DIMMs).
    EDIT: This same effect makes many a naive hash table microbenchmark { e.g., `for key in keys: lookup(key)` } unrepresentative of performance in real programs where each `key` is often not speculatively pre-computable.
    gpderetta7 months ago
    In the end it depends exactly what you want to measure. Of course a load-load dependency will make everything as slow as the latency of the cache level you are accessing as that becomes the bottleneck.
    Traversing a contiguous list of pointers in L1 is also slower than accessing those pointers by generating their address sequentially, so adding a load-load dependency is not a good way to benchmark random access vs sequential access (it is a good way to benchmark vector traversal vs list traversal of course).
    At the end of the day you have to accept that, just as caching and prefetching speed up sequential access, OoO execution[1] will speed up (to a lesser extent) random access. Instead of memory latency, in this case the bottleneck would be the OoO queue depth, or more likely the maximum number of outstanding L1/L2/L3 (and potentially TLB) misses. As long as the maximum number of outstanding misses is lower than the memory latency for that cache level, then, in first approximation, the cpu can effectively hide the sequential vs random access cost for independent accesses.
    Benchmarking is hard. Making sure that a microbenchmark represents your load effectively, doubly so.
    [1] Even many in-order CPUs have some run-ahead capabilities for memory.
    cb3217 months ago
    All true. No real disagreement and I'm often saying "it all depends.." myself. :-) In this case, there is also some vagueness around "random" (predictable to what subsystem when).
    I still suspect @kortilla is one of today's lucky 10,000 (https://xkcd.com/1053/) or just read/replied too quickly. :-)
    There is a lot written that indicates that the complexity of modern CPUs is ill-disseminated. But there is also wonderful stuff like https://gamozolabs.github.io/metrology/2019/08/19/sushi_roll... { To add a couple links to underwrite my reply in full agreeance. :-) }
    unkulunkulu7 months ago
    But what I don't understand then is the spike at the end that goes away with “direct summation”. I expected there to be no spike for larger files because of the prefetching.
    The only explanation I see is the OS implementation then, i.e. the CPU does a perfect job of prefetching RAM into cache, but the OS does not do a perfect job of prefetching SSD to RAM?
porcoda7 months ago
The RandomAccess (or GUPS) benchmark (see: https://ieeexplore.ieee.org/document/4100365) was looking at measuring machines on this kind of workload. In high performance computing this was important for graph calculations and was one of the things the Cray (formerly Tera) MTA machine was particularly good at. I suppose this benchmark wouldn’t be very widely known outside HPC circles.
- jandrewrogers7 months ago
  I worked on the MTA architectures for years among several other HPC systems but I don’t remember this particular benchmark. I suspect it was replaced by the Graph500 benchmark. Graph500 measures something similar and was introduced only a few years after GUPS.
  - porcoda7 months ago
    The HPCS benchmarks predated Graph500. They were talked about at SC for a few years in the early 2000s but mostly faded into the background. It’s hard to dig up the numbers for the MTA on RandomAccess, but the Eldorado paper from ‘05 by Feo and friends (https://dl.acm.org/doi/10.1145/1062261.1062268) mentions it and you can see the MTA beating the other popular architectures of the time in one of the tables.
    jandrewrogers7 months ago
    Feo was a major MTA stan and proponent, even years later. Honestly, it is probably my favorite computing architecture of all time despite the weaknesses of the implementation. It was extraordinarily efficient in some contexts. Few people could design properly optimized code for them though, which was an additional problem.
    There were proofs of concept by 2010 that the latency-hiding mechanics could be implemented on CPUs in software, which while not as efficient had the advantage of cost and performance, which was a death knell for the MTA. A few attempts to revive that style of architecture have come and gone. It is very difficult to compete with the economics of mass-scale commodity silicon.
    I hold out hope that a modern barrel processor will become available at some point but I’m not sanguine about it.
JonChesterfield7 months ago
Random access is catastrophically slower because of the successive cache misses when the prefetcher fails to guess what you're doing.
One hint in the same article that random access is not cheap, in contrast with the conclusion, was noticing that the shuffle was unacceptably slow on large data sets.
Still, good to see performance measurements, especially where the curves look roughly like you'd hope them to.
- sestep7 months ago
  The shuffle was only unacceptably slow for data too big to fit in memory. For data that fits in memory, Fisher-Yates is totally fine; this is why it's fine for the two-pass shuffle to use buckets that fit in RAM but not in cache.
archi427 months ago
What surprises me is the 24 GB of DDR4 DRAM on a dual channel memory controller? AFAIK there are only 8 GB or 16 GB modules, no 12 GB modules. At least I can only find 12 GB DDR5 modules listed, but not DDR4.
This means: The system likely uses 3x 8 GB modules. As a result, one channel has two modules with 16 GB total, while the other channel has only a single 8 GB module.
Not sure how big this impact is with the given memory access patterns and assuming [mostly] exclusive single-threaded access. It's just something I noted, and could be a source of unexpected artifacts.
- sestep7 months ago
  Yes, sorry for not being more explicit! It's 3x8GiB. Originally 4x, but one of my RAM sticks broke and I never bothered to replace it.
  - archi427 months ago
    I'm not deep into the details of the AMD DRAM controller, but this detail could cause some of your anomalies. If this was an academic paper, the findings would be borderline invalid. You might want to remove the extra module and run the benchmarks again.
    At least once the tests become big enough to have some data in both partitions, the bandwidth will start to matter.
    sestep7 months ago
    Thanks, I may try that.
    Out of curiosity, what do you see when running the same code on your machine?
    archi427 months ago
    I can't run it right now: My AM4 desktop has a broken Linux dual boot since some Microsoft update seems to have nuked something important a few months ago, and my pure Linux Intel machine sits at 1.2 load with services I can't stop.
    If I get around to fixing my dual boot machine, I'll try to remember running the benchmark and dropping you some results by mail.
- CodesInChaos7 months ago
  Could be 2x8 + 2x4. Mine has 2x32 + 2x8, since I upgraded from 16 to 80 instead of 64.
FpUser7 months ago
I did another type of experiment, which evaluates the benefit of branch prediction on an AMD 9950X using a contiguous array with 1,000,000 elements. It calculated a sum, adding an element only if it is bigger than 125 (50% of 256). The difference between random and sorted was 10x. I guess branch prediction plays a huge role as well.
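For readers who want to see the shape of that experiment, here is a hedged Python sketch. It is not the code that was actually run: the function name is invented, and CPython's interpreter overhead will swamp any branch-prediction effect, so it only shows the computation. Reproducing the reported 10x would need a port to a compiled language, where the compiler may also turn the branch into branchless code and hide the effect.

```python
import random


def conditional_sum(data, threshold=125):
    """Sum only the elements larger than `threshold`."""
    total = 0
    for v in data:
        if v > threshold:      # this is the branch the predictor has to guess
            total += v
    return total


rng = random.Random(3)
data = [rng.randrange(256) for _ in range(1_000_000)]

random_total = conditional_sum(data)           # branch outcome is unpredictable
sorted_total = conditional_sum(sorted(data))   # branch flips exactly once
```

Both calls do the same arithmetic and return the same total; only the predictability of the branch differs.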
- Andys7 months ago
  Thanks for sharing that.
  Presumably if you'd split the elements into 16 shares (one for each CPU), summed with 16 threads, and then summed the lot at the end, then random would be faster than sorted?
  - bee_rider7 months ago
    I don’t think random should be faster than contiguous access, if you parallelize both of them.
    Although, it looks like that chip has a 1MB L2 cache for each core. If these are 4 Bytes ints, then I guess they won’t all fit in one core’s L2, but maybe they can all start out in their respective cores’ L2 if it is parallelized (well, depends on how you set it up).
    Maybe it will be closer. Contiguous should still win.
    Andys7 months ago
    What if you factored in time to sort them first?
Animats7 months ago
If, of course, you have the CPU and its caches all to yourself.
- tiluha7 months ago
  This is something I have been thinking about lately. How well do these performance optimizations work in the cloud on a shared system?
  - Animats7 months ago
    Doesn't even have to be a shared system. Cache-dependent optimizations can conflict with other code in your own program that also need cache space. This is a generic problem with extrapolating from microbenchmarks.
  - beng-nl7 months ago
    It’s fair to assume that on a vm in the cloud the cores you get are dedicated to you - otherwise the CSP is risking exposure to headline making security problems.. (In the unpleasant event that someone exploits an unmitigated cpu bug.)
    And of course the headline of getting a cpu you can’t fully use.
    tiluha7 months ago
    I'm pretty sure this is not the case on most providers, where "dedicated" VPSs demand a significant premium over the default "shared" VPSs
forrestthewoods7 months ago
Here’s an older blog post of mine on roughly the same topic:
https://www.forrestthewoods.com/blog/memory-bandwidth-napkin...
I’m not sure I agree with the data presentation format. “time per element” doesn’t seem like the right metric.
- klank7 months ago
  What are your qualms with time per element? I liked it as a metric because it kept the total deviation of results to less than 32 across the entire result set.
  Using something like the overall run length would produce such large variations that only the shape of the graph would be useful (to me), not so much the values themselves.
  If I was showing a chart like this to "leadership" I'd show the overall run length, as I'd care more about them realizing the "real world" impact rather than the per unit impact. But this is written for engineers, so I'd expect a blog like this to focus on per unit impacts.
  However, having said all that, I'd love to hear what your reservations are using it as a metric.
  - forrestthewoods7 months ago
    It’s not wrong per se. I’m just very wary of nano-scale benchmarks. And I think in general you should advertise “velocity” not “time per”.
    Perhaps it’s a long time inspiration from this post: https://randomascii.wordpress.com/2018/02/04/what-we-talk-ab...
    I also just don’t know what to do with “1 ns per element”. The scale of 1 to 4 ns per element is remarkably imprecise. Discussing 250 million to 1 billion elements per second feels like a much wider range. Even if it’s mathematically identical.
    Your graphs have a few odd spikes that weren’t deeply discussed. If it’s under 2ns per element who cares!
    The logarithmic scale also made it really hard to interpret. Should have drawn clearer lines at L1/L2/L3/ram limits.
    On skim I don’t think there’s anything wrong. But as presented it’s a little hard for me as an engineer to extract lessons or use this information for good (or evil).
    There shouldn’t be a Linux vs Mac issue. Ignoring mmap this should be HW.
    I dunno. Those are all just surface level reactions.
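The "time per element" versus "elements per second" point above is a plain reciprocal, which a few lines of Python make explicit. The function name is made up for illustration.

```python
def elements_per_second(ns_per_element):
    """Convert a per-element time in nanoseconds into a throughput."""
    return 1e9 / ns_per_element


fast = elements_per_second(1.0)   # 1 ns per element
slow = elements_per_second(4.0)   # 4 ns per element
ratio = fast / slow
```

The two ends of the 1 to 4 ns range correspond to 1 billion and 250 million elements per second, a 4x spread in either presentation.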
    sestep7 months ago
    Haha, it seems you may have thought the person you were responding to is the post author :) but actually that would be me.
    Agreed that the odd spikes don't matter, that's why I didn't bother discussing them; I was more interested in the data after the array got large enough that random access was actually slower. It looked like all those weird spikes were for arrays small enough to fit in cache anyways.
    I agree that it could have been helpful if I'd drawn lines at L1/L2/L3/RAM limits, but I didn't do that because I don't think it's entirely clear where those lines should have been drawn. Specifically because there are two arrays. Should the line show just where the floating-point array is small enough to fit in cache, or where both arrays together are?
    Not sure I quite follow what you're saying about mmap on Linux vs Mac; only one of the three sets of experiments used mmap, and the third was explicitly to try to tease out that effect. Especially for the first experiment, I agree that there should be no difference for arrays small enough to fit in RAM, since the whole file gets read into memory first.
    forrestthewoods7 months ago
    > Agreed that the odd spikes don't matter, that's why I didn't bother discussing them
    That was sarcasm =P Those spikes are very curious and the choice of presentation makes them seem like noise but there is something there that should be investigated further imho. In the graph it looks like noise. I mean it’s just 1ns. But a 2x throughput difference isn’t noise! That’s huge! Very curious.
    > Not sure I quite follow what you're saying about mmap on Linux vs Mac
    Your 4th conclusion is “On Linux, random order starts getting even slower for arrays over a gigabyte, becoming more than 50x slower than first-to-last order; in contrast, random order on the MacBook seems to just level out as long as everything fits in RAM.”. That doesn’t make sense. There shouldn’t be any OS difference here.
    sestep7 months ago
    Gotcha, sorry for not picking up on the sarcasm. Yeah, I mean, I didn't really bother running the experiments many times for the smaller array sizes, so it could potentially be interesting to see if those artifacts persist when poked.
    Could you clarify why there shouldn't be an OS difference? I was under the impression that it's the OS that handles how swap space is implemented (which was used by the first set of experiments), as well as how memory-mapped files are implemented (which was used by the second set of experiments). Am I mistaken about that?
    forrestthewoods7 months ago
    > Could you clarify why there shouldn't be an OS difference?
    Ignoring mmap. But on a 16 GB system, why would performance degrade at 1 GB? You shouldn’t be hitting the swap. So any differences should be hardware architecture related. M1 unified vs Ryzen. And I wouldn’t expect 1 GB to be a magic threshold.
    I would definitely expect a threshold beyond 16 GB. And I’d expect the swap to come into play at maybe 12 GB. I wouldn’t expect a huge difference between 500 MB and 8 GB. OK, there’s a difference in the probability of hitting the L3 cache. But most of those random accesses will be hitting system RAM so it should be the same.
    Could be wrong! But that’s what I’d expect.
    sestep7 months ago
    Ohh you're right thanks, I mixed up which part of the post you were talking about; sorry for getting confused.
    Yes, I agree that it doesn't make sense for that to be due to the OS. That's just poor writing on my part: in that case I wasn't actually trying to imply that it was due to the operating system specifically, I was just using "Linux" and "the MacBook" as shorthand to refer to my two different computers, which differ in more ways than just the OS. In this case, another commenter suggested it might be due to my physical setup of having three RAM sticks: https://news.ycombinator.com/item?id=44397214
    So yes, in that sentence I should have written "on my desktop" instead of just "on Linux".
    alain940407 months ago
    Could it be a TLB issue? Page size on the Mac is 4x larger than on x86 (16 KiB vs 4 KiB).
- sestep7 months ago
  Great post, thanks for the link! I think you and I were just focusing on different things. You gave a broader discussion of the topic from a few different angles, with more specific basic numbers about CPUs as well as more realistic benchmarks than what I have here. I just wanted to focus on the simplest example I could think of, and run it on as wide a range of different array sizes as I could.
  The reason I chose "time per element" then follows from that different goal, because I was comparing across vastly different array sizes, so no other metric I could think of would have really worked for the charts I was drawing.
- alain940407 months ago
  From your blog post:
  > Random access from the cache is remarkably quick. It's comparable to sequential RAM performance
  That's actually expected once you think about it, it's a natural consequence of prefetching.
  - forrestthewoods7 months ago
    Heh. That line often gets called out.
    Lots of things are expected when you deeply understand a complex system and think about it. But, like, not everyone knows the system that deeply nor have they thought about it!
  - delusional7 months ago
    If that wasn't the case the machine would have to prefetch to register file. I don't know of any CPU that does that.
- petermcneeley7 months ago
  What's most misleading is the data for the smaller sizes (1k)
Adhyyan12527 months ago
Love this analysis! Was expecting random to be much slower. 4x is not bad at all
- Nevermark7 months ago
  There has to be some power hit for all those extra cache fills. No idea if it would be measurable.
o11c7 months ago
Hm, no discussion of cache line size, page size, or the limits of cache associativity?
- sestep7 months ago
  Fair, it probably would have been useful for me to include a link to a page discussing those ideas. Since those theoretical/qualitative ideas are already covered in plenty of places online, I didn't bother to talk about them here since they're easy to look up; I just wanted to focus on quantitative data from actual measurements. But again, I agree I should have at least mentioned them or linked somewhere.
Cold_Miserable7 months ago
Worst case scenario for random access is a multiple level TLB miss, a memory refresh cycle and then a system management mode interrupt all occurring consecutively.
anonymousDan7 months ago
Would a better benchmark not just use some kind of pseudo randomly generated sequence to avoid having two arrays?
- sestep7 months ago
  Unclear if that'd be "better" but definitely something to compare against! Here's a related post by someone else that measures more of what you're talking about: https://lemire.me/blog/2018/03/24/when-shuffling-large-array...
Surac7 months ago
Is this not just a memory test for the burst capacity and access strategy of the DRAM controller?