Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.
Without direct hardware support, it *seemed* like it should be possible to do better than O(log(w)) for w = 64 bits. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w >= 256 bits.
Either way, I gave it a go and the best I could come up with was the following mix of techniques:
```c
uint64_t msb64 (uint64_t n) {
    const uint64_t M1 = 0x1111111111111111; // we need to clear blocks of b=4 bits: log(w/b) >= b
    n |= (n>>1); n |= (n>>2);
    // reverse prefix scan, compiles to 1 mulx
    uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;
    // parallel-reduce each block
    s |= (s>>1); s |= (s>>2);
    // parallel reduce, 1 imul
    uint64_t c = (s&M1)*(M1<<4);   // collect last nibble, compute count - count%4
    c = c >> (64-4-2);             // move last nibble to lowest bits leaving two extra bits
    c &= (0x0F<<2);                // zero the lowest 2 bits

    // add the missing bits; this could be better solved with a bit of foresight
    // by having the sum already stored
    uint8_t b = (n >> c); // & 0x0F;   // no need to zero the bits over the msb

    const uint64_t S = 0x3333333322221100;   // last should give -1ul
    return c | ((S>>(4*b)) & 0x03);
}
```
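As a sanity check, here is a minimal test harness I'd compile together with the function above (it assumes GCC or Clang for `__builtin_clzll` and `__uint128_t`); the random-input generation is just a quick stand-in, not the benchmark driver:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t msb64(uint64_t n);   // the function above, compiled in the same unit

int main(void) {
    srand(1);
    for (long i = 0; i < 1000000; i++) {
        // piece together a 64-bit pseudo-random value from rand()
        uint64_t n = ((uint64_t)rand() << 33) ^ ((uint64_t)rand() << 15) ^ (uint64_t)rand();
        if (n == 0) continue;                                // n == 0 is a separate corner case
        uint64_t expected = 63u - (uint64_t)__builtin_clzll(n);
        if (msb64(n) != expected) {
            printf("mismatch for n=%016llx\n", (unsigned long long)n);
            return 1;
        }
    }
    puts("msb64 agrees with __builtin_clzll on all tested inputs");
    return 0;
}
```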
This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue in modern x86-64 architectures.
I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere. Finding a consistent timing and ranking proved to be much harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution and other CPU shenanigans, which can sometimes overlap the computation of two or more iterations of a loop.
I ran the tests on an AMD Zen using RDTSC, taking a number of precautions such as running a warm-up pass, introducing artificial dependency chains between iterations, and so on.
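To give an idea of what I mean by dependency chains, here is a simplified sketch of such a harness (not the exact code I ran; it assumes GCC/Clang on x86-64 for `__rdtsc`). Each result is folded back into the next input, so consecutive calls can't be overlapped by the out-of-order engine:

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>            // __rdtsc (GCC/Clang, x86-64)

uint64_t msb64(uint64_t n);       // function under test, defined above

int main(void) {
    const long N = 1L << 22;
    uint64_t x = 0x9e3779b97f4a7c15ull;       // arbitrary non-zero seed

    // warm-up pass so caches and branch predictors reach a steady state
    for (long i = 0; i < N; i++) {
        x = x * 6364136223846793005ull + 1;   // cheap LCG to vary the input
        x ^= msb64(x | 1);                    // | 1 sidesteps the n == 0 corner case
    }

    uint64_t t0 = __rdtsc();
    for (long i = 0; i < N; i++) {
        x = x * 6364136223846793005ull + 1;
        // folding the result back into x creates a serial dependency chain,
        // so we measure latency rather than throughput
        x ^= msb64(x | 1);
    }
    uint64_t t1 = __rdtsc();

    printf("%.2f cycles per iteration (x=%llx, includes LCG overhead)\n",
           (double)(t1 - t0) / N, (unsigned long long)x);
    return 0;
}
```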
For a uniform 64-bit pseudo-random distribution of inputs, the results are:
name | cycles | comment |
---|---|---|
clz | 5.16 | builtin intrinsic, fastest |
cast | 5.18 | cast to double, extract exp |
ulog2 | 7.50 | reduction + de Bruijn |
msb64* | 11.26 | this version |
unrolled | 19.12 | varying performance |
obvious | 110.49 | "obviously" slowest for int64 |
Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
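For reference, the other contenders I have in mind look roughly like this (sketches with names of my choosing; the "cast" version assumes IEEE-754 doubles, type-punning through a union, and n != 0):

```c
#include <stdint.h>

// 'clz': the builtin intrinsic (GCC/Clang; result is undefined for n == 0)
static unsigned msb_clz(uint64_t n) {
    return 63u - (unsigned)__builtin_clzll(n);
}

// 'cast': convert to double and read the exponent field; for large inputs the
// conversion may round up to the next power of two, so correct for that case
static unsigned msb_cast(uint64_t n) {
    union { double d; uint64_t u; } v = { .d = (double)n };
    unsigned e = (unsigned)((v.u >> 52) & 0x7FF) - 1023;   // unbiased exponent
    if (e > 63 || (n >> e) == 0) e--;                      // undo a round-up
    return e;
}

// 'obvious': shift one bit at a time; latency depends on the magnitude of n
static unsigned msb_obvious(uint64_t n) {
    unsigned r = 0;
    while (n >>= 1) r++;
    return r;
}
```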
My method is around 50% slower than the de Bruijn approach, but it has the advantage of using no extra memory and having predictable performance. I might try to optimize it further if I ever find the time.
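For comparison, the reduction + de Bruijn contender is along these lines (a sketch using the commonly published 64-bit de Bruijn constant and table, which may differ in detail from the exact "ulog2" I timed); the 64-entry lookup table is the extra memory my version avoids:

```c
#include <stdint.h>

static unsigned msb_debruijn(uint64_t n) {
    // smear the msb downward so n becomes 2^(msb+1) - 1
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16; n |= n >> 32;
    // multiply by a de Bruijn sequence; the top 6 bits index a lookup table
    static const unsigned char table[64] = {
         0, 47,  1, 56, 48, 27,  2, 60,
        57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44,
        38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53,
        34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24,
        13, 18,  8, 12,  7,  6,  5, 63,
    };
    return table[(n * 0x03f79d71b4cb0a89ull) >> 58];
}
```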