It's the classic question: why is processing a sorted array faster than an unsorted one?
    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    arr = [random.randint(0, 255) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)
    %timeit f(arr_sorted)
Results (on my machine): 17.5 ms for the unsorted array and 13.5 ms for the sorted one. For comparison, in a compiled language the difference is about 4x.

This depends on the Python version, but if it has the specializing-interpreter changes, the `COMPARE_OP` comparing the integers there is probably hitting the specialized `_COMPARE_OP_INT` [1].
This specialization has a ternary that does `res = (sign_ish & oparg) ? PyStackRef_True : PyStackRef_False;`. This might be the branch that ends up getting predicted correctly?
Older versions of Python go through a bunch of dynamic dispatch first and then end up with a similar sort of int comparison in `long_richcompare`. [2]
[1] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...
[2] https://github.com/python/cpython/blob/561965fa5c8314dee5b86...
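If you want to check this on your own build, the `dis` module can show the specialized instructions. A minimal sketch, assuming Python 3.11+ (where `dis.dis` gained the `adaptive` flag); the exact specialized instruction names vary by version:

    import dis

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Warm the function up so the adaptive interpreter can specialize it,
    # then disassemble with adaptive=True to see the quickened instructions.
    # COMPARE_OP should show up in a specialized int form here.
    f([1, 200] * 100)
    dis.dis(f, adaptive=True)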
I wonder if you could do Spectre-type vulnerabilities in Python. You would need some way to leak micro-architectural state, so without being particularly clever, Python code could maybe only be used as a gadget or something.
I'm now thinking the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. (CPython preallocates small ints in the range -5..256; anything larger is a separate heap object.) The idea is that the CPU stalls on a memory access, which forces it to speculate much further ahead, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.
The results for a range of 1B instead of 255 are 17.6 ms unsorted / 68.2 ms sorted! We're back to what the original article observed, and it's a much stronger effect than anything branch prediction can offer. So don't sort your arrays; keep them in the order the boxed values were allocated ;)
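For concreteness, the variant I mean is just the original benchmark with a larger value range (a sketch; exact timings will vary by machine):

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Values above 256 fall outside CPython's small-int cache, so every
    # element is its own heap allocation; the list stores pointers to them.
    arr = [random.randint(0, 10**9) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)         # iteration order matches allocation order
    %timeit f(arr_sorted)  # sorting scrambles allocation order -> cache misses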
The larger-range case being slower when sorted makes sense: after sorting, the allocation order no longer matches the iteration order.
Anyway, there is no need for 256 distinct integers; two are enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted).
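Something like this, if anyone wants to reproduce it (a sketch; both values sit in the small-int cache, so no extra allocations are involved):

    import random

    def f(arr):
        r = True
        for x in arr:
            if x > 128:
                r = not r
        return r

    # Only two distinct values, both preallocated small ints.
    arr = [random.choice((0, 255)) for _ in range(1_000_000)]
    arr_sorted = sorted(arr)

    %timeit f(arr)
    %timeit f(arr_sorted)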
    if x > 128:
        r = not r
doesn't necessarily correspond to one branch at the ASM instruction level, right? In fact, a priori, it doesn't have to correspond to any branch at the ASM instruction level at all.

That's why I asked about false sharing specifically: it's the main way I can think of that Python code wouldn't be able to observe branch-predictor performance, since the interpreter has so many branches internally that they dominate any branches caused by the Python code itself.
I would be happy to stand corrected if someone knows more about this, though!
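One way to see the gap between the Python-level branch and the machine level is the bytecode. A sketch (instruction names vary across CPython versions):

    import dis

    def body(x, r):
        if x > 128:
            r = not r
        return r

    # The single Python-level `if` expands into several bytecode instructions
    # (COMPARE_OP, a conditional jump, UNARY_NOT, ...), and the interpreter
    # executes many machine-level branches of its own to dispatch each one.
    dis.dis(body)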
I suspect most JavaScript code is JIT-compiled - but most Python, Ruby, etc. is interpreted (via bytecode).
Python sequences are defined as "support[ing] efficient element access using integer indices" [1]. Python lists are sequences and thus must support random access. In practice that means the implementation is a (dynamically resizing) array allocated contiguously. That means spatial locality is relevant.
If the list type were defined simply as an iterable collection, with no requirement of constant-time random access, the definition would be abstract enough that an implementation might end up being something other than a contiguous array. But if you define the type so that it supports constant-time random access [2], you pretty much end up with cache-friendly sequential access as well.
If you don't define the list type as supporting random access, you also sacrifice asymptotic algorithmic efficiency for lookups by index. Any language that cares about efficiency at all separates collections that support efficient random access from those that don't. (For example, Python has lists and dicts/sets for different access patterns. The Java standard library has separate types for contiguous arrays/lists, hashtable-backed collections and tree-backed collections because the three have different algorithmic efficiencies for different access patterns. In practice it leads to different properties in terms of cache-friendliness as well.)
[1] https://docs.python.org/3/glossary.html#term-sequence
[2] As usual, the Python documentation is a bit vague on the details. It doesn't really say random access has to be constant-time, only that it has to be "efficient". So you might be able to have a non-constant time implementation such as an indexable skiplist while arguing that it's efficient, but you'd have to go out of your way to do that.
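To illustrate the distinction above in code: the `collections.abc` ABCs separate "iterable collection" from "random-access sequence" (a minimal sketch):

    from collections.abc import Iterable, Sequence

    # A list is a Sequence: indexable, sliceable, has a length.
    print(isinstance([1, 2, 3], Sequence))                       # True

    # A generator is merely Iterable: no __getitem__, no len().
    gen = (x for x in range(3))
    print(isinstance(gen, Sequence), isinstance(gen, Iterable))  # False True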
The contents of the dynamically allocated array, in the C implementation, are pointers (PyObject *), since lists are heterogeneous and must be able to store any Python object. That entails indirection, which defeats spatial locality: even if you iterate over cached data, you have to jump around to retrieve the actual objects to do anything with them.
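A quick way to see that indirection from Python itself (a sketch; `id()` returning the object's address is a CPython implementation detail):

    import sys

    # Values outside the small-int cache (-5..256) are separate heap objects.
    arr = [10**9 + i for i in range(5)]

    print(sys.getsizeof(arr))       # the list itself: just an array of pointers
    print([id(x) for x in arr])     # each element lives elsewhere on the heap
    print(sys.getsizeof(arr[0]))    # every boxed int is its own allocation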