So `n % 2 == 1` should probably [1] be replaced with `n % 2 != 0`.
While this may be obvious with experience, if the code says `n % 2 == 0`, then a future developer who is trying to reverse the operation for some reason must know that they need to change the equality operator, not the right operand. Whereas with `(n & 1) == 0`, they can change either safely and get the same result.
This feels problematic because the business logic that necessitated the change may be "do this when odd" and it may feel incorrect to implement "don't do this when even".
I really disfavor writing code that could be easily misinterpreted and modified in the future by less-experienced developers, or maybe just by someone (me) who's tired or rushing. For that reason, and the performance one, I try to stick to the bitwise operator.
[1] Of course, if for some reason you wanted to test for only positive odd numbers, you could use `n % 2 == 1`, but please write a comment noting that you're being clever.
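A minimal C illustration of the footnote (Java and JS share this truncating-division behaviour, per the edit below):

#include <stdio.h>

int main(void) {
    int n = -3;
    /* In C, % takes the sign of the dividend, so -3 % 2 is -1, not 1 */
    printf("-3 %% 2 == %d\n", n % 2);          /* prints -3 % 2 == -1 */
    printf("n %% 2 == 1 -> %d\n", n % 2 == 1); /* 0: wrongly "not odd" */
    printf("n %% 2 != 0 -> %d\n", n % 2 != 0); /* 1: correctly odd */
    return 0;
}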
That's their problem. Otherwise you're just contributing to the decline.
Edit: apparently JS, Java, and C all do this. That’s horrifying.
https://stackoverflow.com/questions/13683563/whats-the-diffe...
For instance if you're making a loop to count the bits that are set in a number, the compiler can recognize the entire loop and turn it into a single popcnt instruction (e.g. https://lemire.me/blog/2016/05/23/the-surprising-cleverness-... )
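For a concrete (hedged) sketch of the kind of loop being pattern-matched: clang's loop-idiom recognition turns this whole function into a single popcnt when the target supports it (e.g. -O2 -mpopcnt); gcc is less eager here.

#include <stdint.h>

int count_set_bits(uint32_t x) {
    int count = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit each iteration */
        count++;
    }
    return count;
}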
To elaborate a bit on the specialness of popcount: It is a generally accepted belief in the computer architecture community that several systems included a popcount iNstruction Solely due to A request from a single "very good customer".
Look at this --beauty-- eww, thing. Should compilers really spend time trying to figure out how to optimise insane code?
# i.e. "is the last digit one of '0', '2', '4', '6', '8'?"
def is_even(n):
    return str(n)[len(str(n))-1] in [str(2*n) for n in range(5)]
I could see that as a novel feedback mechanism for software engineers.
As it stands, I'm glad they design optimizations abstractly, even if that means code I don't like gets the benefits.
tl;dr: there are general optimizations for "this function in a for loop is a constant expression, we don't need to call it 500 times"
or
"this obscure combination of asm instructions is optimal on pentium iii 350 mhz dual core"
not "we need to turn this unholy CS101 student spaghetti code where they do a 500 branch-if into a for loop"
The comment over here is attempting to communicate that as well: https://news.ycombinator.com/item?id=42705758
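A minimal sketch of that first, general kind of optimization (function names are hypothetical): any mainstream compiler at -O2 hoists or folds the loop-invariant call, rather than pattern-matching this exact snippet.

static int answer(void) { return 42; }  /* visible and pure, so foldable */

int call_it_500_times(void) {
    int total = 0;
    for (int i = 0; i < 500; i++) {
        total += answer();  /* loop-invariant: reducible to 500 * 42 at compile time */
    }
    return total;
}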
I've never, ever, heard the idea that compilers are burdened by the workload of maintaining thousands of type-specific optimizations for hilariously bad code, until today. I've been here since 2009, so it is puzzling to me to see it referred to offhand, in a "this is water" manner: https://en.wikipedia.org/wiki/This_Is_Water
I've heard tons of people complain about slow compilers, so even if compiler devs find it easy to architect their compilers to do multiple kinds of optimisations, there's a cost to it that devs running the compilers pay.
Also, if you think about it, optimising code has to follow diminishing returns, so at some point we are putting in too much CPU time for little to no gain. It's also possible to get slower code with more optimisations if they interact poorly, or at least no better code despite spending more CPU time. This is why -O3 exists in gcc and isn't the default: there's a cost to it that's likely not worth paying.
A slow compiler does not imply the compiler is slow because there are thousands of bespoke optimizations for nonsense code being run.
> Also, if you think about it, optimising code has to follow diminishing returns,
Nope, trivially. Though, I'm always eager for a Fermat-style marvelous proof that may have been too big for the initial margin you had. :)
Take a classic case of a buggy compiler generating O(n²) temporary copies due to missed alias analysis. One optimization pass to fix that analysis transforms it to O(n).
> at some point we are putting too much CPU time into little to no gains
It is theoretically possible to design a compiler such that it spends so much time looking for optimizations that the total time spent searching exceeds the runtime of the program being optimized.
For example, an optimizer that is a while loop checking whether the function returns 42, when the function in fact returns 43.
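A sketch of that deliberately silly construction (names hypothetical):

int f(void) { return 43; }

void optimize(void) {
    /* This "search for an optimization" never finds one, so the time spent
       here exceeds any program's runtime. Ironically, a real optimizer may
       assume this side-effect-free loop terminates and delete it entirely. */
    while (f() != 42) { }
}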
I'm not sure what light that sheds.
I'm not sure that implies that compilers have tons of bespoke optimizations for hand-transforming specific instances of absurd string code.
If they do, I would be additionally surprised, because I have never observed that. What I have observed is compilers, universally, optimizing code structures of a certain general form.
> This is why there's -O3 in gcc and it's not the default, there's a cost to it that's likely not worth paying.
The existence of a flag with a higher optimization level than the default does not imply the compiler is slow because thousands of bespoke optimizations for nonsense code are being run. (n.b. -O3 is understood, in practice, to be risky because it might be too aggressive, not because it might not be worth it.)
gcc likes to use `and edi,1` (logical AND between 32-bit edi register and 1). Meanwhile, clang uses `test dil,1` which is similar, except the result isn't stored back in the register, which isn't relevant in my test case (it could be relevant if you want to return an integer value based on the results of the test).
After the logical AND happens, the CPU's ZF (zero) flag is set if the result is zero, and cleared if the result is not zero. You'd then use `jne` (jump if not equal) or maybe `cmovne` (conditional move - move register if not equal). Note again that there is no explicit comparison instruction. If you don't use -O3, the compiler does produce an explicit `cmp` instruction, but it's redundant.
Now, the question is: Which is more efficient, gcc's `and edi,1` or clang's `test dil,1`? The `dil` register was added for x64; it's the same register as `edi` but only the lower 8 bits. I figured `dil` would be more efficient for this reason, because the `1` operand is implied to be 8 bits and not 32 bits. However, `and edi,1` encodes to 3 bytes while `test dil,1` encodes to 4 bytes. I guess the `and` instruction lets you specify the bit size of the operand regardless of the register size.
There is one more option, which neither compiler used: `shr edi,1` will perform a right shift on EDI, which sets the CF (carry) flag if a 1 is shifted out. That instruction only encodes to 2 bytes, so size-wise it's the most efficient.
The right-shift option fascinates me, because I don't think there's really a C representation of "get the bit that was right-shifted out". Both gcc and clang compile `((i >> 1) << 1) == i` the same as `(i & 1) == 0` and `i % 2 == 0`.
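For the curious, the three spellings side by side; at -O2, gcc and clang typically lower all of them to the same low-bit test (a quick sketch, worth re-verifying on godbolt for your target):

#include <stdbool.h>

bool even_mod(int i)   { return i % 2 == 0; }
bool even_and(int i)   { return (i & 1) == 0; }
bool even_shift(int i) { return ((i >> 1) << 1) == i; }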
Which of the above is most efficient on CPU cycles? Who knows, there are too many layers of abstraction nowadays to have a definitive answer without benchmarking for a specific use case.
I code a lot of Motorola 68000 assembly. On m68k, shifting right by 1 and performing a logical AND both take 8 CPU cycles. But the right-shift is 2 bytes smaller, because it doesn't need an extra 16 bits for the operand. That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn. Therefore, at least on m68k, shifting right is the fastest way to test if a value is even.
In isolation it's the smallest, but it's no longer the smallest if you consider that the value, which in this example is the loop counter, needs to be preserved, meaning you'll need at least 2 bytes for another move to make a copy. With a test, the value doesn't get modified.
There's also BTST #0,xx but it wastefully needs an extra 16 bits to say which bit to test (even though the bit can only be from 0-31).
> That makes a difference on Amiga, because (other than size) the DMA might be shared with other chips, so you're saving yourself a memory read that could stall the CPU while it's waiting its turn.
That's a load-bearing "could". If the 68000 has to read/write chip RAM, it gets the even cycles while the custom chips get odd cycles, so it doesn't even notice (unless you're doing something that steals even cycles from the CPU, e.g. the blitter is active and you set BLTPRI, or you have 5+ bitplanes in lowres or 3+ bitplanes in highres)
That reminds me, it's theoretically fastest to do `and d1,d0` e.g. in a loop if d1 is pre-loaded with the value (4 cycles and 1 read). `btst d1,d0` is 6 cycles and 1 read.
> the blitter is active and you set BLTPRI
I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles. But yes, I'm splitting hairs a bit when it comes to DMA performance because I code game/demo stuff targeting stock A500, meaning one of those cases (blitter running or 5+ bitplanes enabled) is very likely to be true.
That's true, although I'd add that ASR/AND are destructive while BTST would be nondestructive, but we're pretty far down a chain of hypotheticals at this point (why would someone even need to test evenness in a loop, when they could unroll the loop to do 2/4/6/8 items at a time with even/odd behaviour baked in)
> I thought BLTPRI enabled meant the blitter takes every even DMA cycle it needs, and when disabled it gives the CPU 1 in every 4 even DMA cycles
Yes, that is true: https://amigadev.elowar.com/read/ADCD_2.1/Hardware_Manual_gu... "If given the chance, the blitter would steal every available Chip memory cycle [...] If DMAF_BLITHOG is a 1, the blitter will keep the bus for every available Chip memory cycle [...] If DMAF_BLITHOG is a 0, the DMA manager will monitor the 68000 cycle requests. If the 68000 is unsatisfied for three consecutive memory cycles, the blitter will release the bus for one cycle."
> one of those cases is very likely to be true
It blew my mind when I realised this is probably why Workbench is 4 colours by default. If it were 8, an unexpanded Amiga would seem a lot slower to application/productivity users.
> I tried both versions (modulo 2 and bitwise AND) and got the same result. I think the optimizer recognizes modulo 2 and converts it to bitwise AND.
Yes, even without specifying optimizations - https://godbolt.org/z/9se9c6qKT
You can see that the output of the compiler is identical whether you use `i%2 == 0` or `(i&1) == 0`. The bitwise AND is instruction 12 in the output.
Using -O3 like in the post actually compiles to SIMD instructions on x86-64 - https://godbolt.org/z/dWbcK947G
A quick check in the compiler explorer (godbolt.org) confirms that this is indeed true for GCC on x86_64 and aarch64, but not for clang on the same (clang does optimize it with -O3).
- Once you have the complete prime factorization, check whether 2 is among its prime factors...
- If 2 is a factor, it’s even; if not, odd.
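A deliberately overwrought C sketch of that approach, with trial division standing in for the elided "complete prime factorization" step:

#include <stdbool.h>

bool is_even_by_factoring(unsigned n) {
    if (n == 0) return true;         /* 0 is even, though the joke hardly cares */
    bool saw_two = false;
    unsigned p = 2;
    while (n > 1) {
        if (n % p == 0) {            /* p divides n: record it, divide it out */
            if (p == 2) saw_two = true;
            n /= p;
        } else {
            p++;                     /* try the next candidate factor */
        }
    }
    return saw_two;                  /* even iff 2 appears in the factorization */
}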
Also, I only use the step-over command in the debugger.
Also, given the halting problem, you could write an algorithm for which it would be impossible to determine whether it loops forever.
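In that spirit, a hedged sketch that ties the check to an open problem (Collatz), so nobody can currently prove the loop finishes for every input, even though the answer is computed on the first line:

#include <stdbool.h>

bool is_even_eventually(unsigned long long n) {
    bool even = (n & 1) == 0;            /* the actual answer */
    unsigned long long m = n ? n : 1;
    while (m != 1) {                     /* termination for all n is the Collatz conjecture */
        m = (m & 1) ? 3 * m + 1 : m / 2; /* (ignoring overflow for huge n) */
    }
    return even;
}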
import random, time

def isEven_modulus(num):
    return num % 2 == 0

def isEven_bit(num):
    return (num & 1) == 0

testSet = random.choices(range(0, 100), k=10)
iterations = range(1000)

print("These are our test numbers: ", testSet)

mod_start = time.perf_counter_ns()
for z in iterations:
    for n in testSet:
        isEven_modulus(n)
mod_end = time.perf_counter_ns()

for z in iterations:
    for n in testSet:
        isEven_bit(n)
bit_end = time.perf_counter_ns()

print("Modulus method: ", mod_end - mod_start, "ns")
print("Bitwise method: ", bit_end - mod_end, "ns")
There's some variance run to run, but for the most part they're close enough to not matter. I do see a very small difference, generally in favour of bitwise, but we're talking about a 60000 ns (0.06 ms) difference occasionally on 1000 runs (or about 60 ns per run). It's unlikely that this will be a significant bottleneck for anyone. An example:
These are our test numbers: [45, 88, 55, 52, 40, 70, 62, 47, 78, 30]
Modulus method: 757341 ns
Bitwise method: 698872 ns
Possibly just a well-understood and well-optimized problem. Things are probably different these days, so maybe that isn't effective.
extension FixedWidthInteger { var isEven: Bool { 0 == 1 & self } }
You don't have to guess; you could turn on -O3 or look at the disassembly.
I'd write it more like this:
#include <stdio.h>

int main(int argc, char **argv) {
    int total = 0;
    for (int i = 2147483647; i; --i) {
        total += i & 1;
    }
    printf("%d\n", total);
    return 0;
}
In a function as simple as this, the version with a branch may be as fast as or faster than a version without, as the CPU has the opportunity to eliminate the register/memory modification via branch prediction.
So even if a compiler does not optimize out this particular if construct, there is a good chance the CPU will.
total += !(i&1);
...and since there's another comment here about Asm, I'd compile the above as (assume edx is i and total in eax, high 24 bits of ebx precleared):

test dl, 1
setz bl
add eax, ebx
>> typeof NaN
<- "number"

Let's see then:

>> (NaN % 2) == 0
<- false

So clearly NaN is odd. /s

(And if you're thinking "you gotta equals harder":

>> (NaN % 2) === 0
<- false

Nope, still odd. Both of the infinities are also odd by the same logic, too, if you were curious. null and false are even. true is odd. [] is even, [0] is even, [1] is odd.)