It's the classic question: why is processing a sorted array faster than an unsorted one?
    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    arr = [random.randint(0, 255) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)
    %timeit f(arr_sorted)
Results (on my machine): 17.5 ms for the unsorted array and 13.5 ms for the sorted one. For comparison, in a compiled language the difference is about 4x.

This depends on the Python version, but if it has the specializing-interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting the specialized `_COMPARE_OP_INT` [1].
This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?
Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]
[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...
[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...
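If you want to check this on your own build, the `dis` module can show the specialized instructions. A minimal sketch, assuming Python 3.11+ (where `dis.dis` gained the `adaptive` flag); the exact specialized instruction names vary by version:

    import dis

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Warm the function up so the adaptive interpreter can specialize it,
    # then disassemble with adaptive=True to see the quickened instructions.
    # COMPARE_OP should show up in a specialized int form here.
    f([1, 200] * 100)
    dis.dis(f, adaptive=True)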
I wonder if you could do Spectre-type vulnerabilities in Python. You would need some way to leak micro-architectural state, so without being particularly clever, Python code could maybe only be used as a gadget or something.
I'm now thinking the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. (CPython preallocates small ints in the range -5..256; anything larger is a separate heap object.) The idea is that the CPU stalls on a memory access, which forces it to speculate much further ahead, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.
The results for a range of 1B instead of 255 are 17.6 ms unsorted / 68.2 ms sorted! We're back to what the original article observed, and it's a much stronger effect than anything branch prediction can offer. So don't sort your arrays; keep them in the order the boxed values were allocated ;)
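For concreteness, the variant I mean is just the original benchmark with a larger value range (a sketch; exact timings will vary by machine):

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Values above 256 fall outside CPython's small-int cache, so every
    # element is its own heap allocation; the list stores pointers to them.
    arr = [random.randint(0, 10**9) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)         # iteration order matches allocation order
    %timeit f(arr_sorted)  # sorting scrambles allocation order -> cache misses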
The larger-range case being slower when sorted makes sense: after sorting, the allocation order no longer matches the iteration order.
Anyway, there is no need for 256 distinct integers; two are enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted).
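Something like this, if anyone wants to reproduce it (a sketch; both values sit in the small-int cache, so no extra allocations are involved):

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Only two distinct values, both preallocated small ints.
    arr = [random.choice((0, 255)) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)
    %timeit f(arr_sorted)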
    if x > 128:
        r = not r
doesn't necessarily correspond to one branch at the ASM instruction level, right? In fact, a priori, it doesn't have to correspond to any branch at the ASM instruction level at all.

That's why I asked about false sharing specifically: it's the main way I can think of that Python code wouldn't be able to observe branch-predictor performance, since the interpreter has so many branches internally that they dominate any branches caused by the Python code itself.
I would be happy to stand corrected if someone knows more about this, though!
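One way to see the gap between the Python-level branch and the machine level is the bytecode. A sketch (instruction names vary across CPython versions):

    import dis

    def body(x, r):
        if x > 128:
            r = not r
        return r

    # The single Python-level `if` expands into several bytecode instructions
    # (COMPARE_OP, a conditional jump, UNARY_NOT, ...), and the interpreter
    # executes many machine-level branches of its own to dispatch each one.
    dis.dis(body)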
I suspect most JavaScript code is JIT-compiled - but most Python, Ruby, etc. is interpreted (via bytecode).
Python sequences are defined as "support[ing] efficient element access using integer indices" [1]. Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously. That means spatial locality is relevant.
If the list type were defined simply as an iterable collection, with no requirement of constant-time random access, the definition would be abstract enough that an implementation might end up being something other than a contiguous array. But if you define the type so that it supports constant-time random access [2], you pretty much end up with cache-friendly sequential access as well.
If you don't define the list type as supporting random access, you also sacrifice asymptotic algorithmic efficiency for lookups by index. Any language that cares about efficiency at all separates collections that support efficient random access from those that don't. (For example, Python has lists and dicts/sets for different access patterns. The Java standard library has separate types for contiguous arrays/lists, hashtable-backed collections and tree-backed collections because the three have different algorithmic efficiencies for different access patterns. In practice it leads to different properties in terms of cache-friendliness as well.)
[1] https://docs.python.org/3/glossary.html#term-sequence
[2] As usual, the Python documentation is a bit vague on the details. It doesn't really say random access has to be constant-time, only that it has to be "efficient". So you might be able to have a non-constant time implementation such as an indexable skiplist while arguing that it's efficient, but you'd have to go out of your way to do that.
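To illustrate the distinction above in code: the `collections.abc` ABCs separate "iterable collection" from "random-access sequence" (a minimal sketch):

    from collections.abc import Iterable, Sequence

    # A list is a Sequence: indexable, sliceable, has a length.
    print(isinstance([1, 2, 3], Sequence))                       # True

    # A generator is merely Iterable: no __getitem__, no len().
    gen = (x for x in range(3))
    print(isinstance(gen, Sequence), isinstance(gen, Iterable))  # False True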
The contents of the dynamically allocated array, in the C implementation, are pointers (PyObject *), since lists are heterogeneous and must be able to store any Python object. That entails indirection, which defeats spatial locality: even if you iterate over cached data, you have to jump around to retrieve the actual objects to do anything with them.
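A quick way to see that indirection from Python itself (a sketch; `id()` returning the object's address is a CPython implementation detail):

    import sys

    # Values outside the small-int cache (-5..256) are separate heap objects.
    arr = [10**9 + i for i in range(5)]

    print(sys.getsizeof(arr))       # the list itself: just an array of pointers
    print([id(x) for x in arr])     # each element lives elsewhere on the heap
    print(sys.getsizeof(arr[0]))    # every boxed int is its own allocation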