Channel: What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C? - Stack Overflow

Answer by paperclip optimizer for What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?

Since I seemingly have nothing else to do, I dedicated an inordinate amount of time to this problem during the weekend.

Without direct hardware support, it seemed like it should be possible to do better than O(log(w)) for w = 64 bits. And indeed, it is possible to do it in O(log log w), except the performance crossover doesn't happen until w >= 256 bits.
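For contrast, the textbook O(log(w)) approach (not from this answer; the function name is mine) halves the search range once per step, so it takes log2(64) = 6 conditional steps for 64 bits:

```c
#include <stdint.h>

// Classic O(log w) binary search for the msb: at each step, test whether the
// upper half of the remaining range is nonzero and shift it down if so.
int msb_binsearch(uint64_t n) {
    int r = 0;
    if (n >> 32) { n >>= 32; r += 32; }
    if (n >> 16) { n >>= 16; r += 16; }
    if (n >> 8)  { n >>= 8;  r += 8;  }
    if (n >> 4)  { n >>= 4;  r += 4;  }
    if (n >> 2)  { n >>= 2;  r += 2;  }
    if (n >> 1)  {           r += 1;  }
    return r; // by this convention, msb_binsearch(0) == 0, same as msb(1)
}
```

Beating this bound asymptotically is what the O(log log w) construction below is about.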

Either way, I gave it a go and the best I could come up with was the following mix of techniques:

    #include <stdint.h>

    uint64_t msb64 (uint64_t n) {
        const uint64_t M1 = 0x1111111111111111;

        // we need to clear blocks of b=4 bits: log(w/b) >= b
        n |= (n>>1); n |= (n>>2);

        // reverse prefix scan, compiles to 1 mulx
        uint64_t s = ((M1<<4)*(__uint128_t)(n&M1))>>64;

        // parallel-reduce each block
        s |= (s>>1);
        s |= (s>>2);

        // parallel reduce, 1 imul
        uint64_t c = (s&M1)*(M1<<4);

        // collect the top nibble and compute count - count%4
        c = c >> (64-4-2); // move last nibble to lowest bits leaving two extra bits
        c &= (0x0F<<2);    // zero the lowest 2 bits

        // add the missing bits; this could be better solved with a bit of foresight
        // by having the sum already stored
        uint8_t b = (n >> c); // & 0x0F; // no need to zero the bits over the msb

        const uint64_t S = 0x3333333322221100; // last should give -1ul
        return c | ((S>>(4*b)) & 0x03);
    }

This solution is branchless and doesn't require an external table that can generate cache misses. The two 64-bit multiplications aren't much of a performance issue on modern x86-64 architectures.

I benchmarked the 64-bit versions of some of the most common solutions presented here and elsewhere. Finding a consistent timing and ranking proved to be much harder than I expected. This has to do not only with the distribution of the inputs, but also with out-of-order execution and other CPU shenanigans, which can sometimes overlap the computation of two or more loop iterations.

I ran the tests on an AMD Zen using RDTSC and taking a number of precautions such as running a warm-up, introducing artificial chain dependencies, and so on.

For a 64-bit pseudorandom even distribution the results are:

    name       cycles   comment
    clz          5.16   builtin intrinsic, fastest
    cast         5.18   cast to double, extract exp
    ulog2        7.50   reduction + de Bruijn
    msb64*      11.26   this version
    unrolled    19.12   varying performance
    obvious    110.49   "obviously" slowest for int64

Casting to double is always surprisingly close to the builtin intrinsic. The "obvious" way of adding the bits one at a time has the largest spread in performance of all, being comparable to the fastest methods for small numbers and 20x slower for the largest ones.
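The "cast" entry refers to the classic trick of converting to double and reading the biased exponent out of the IEEE-754 bit pattern. A sketch of that idea (my reconstruction, not necessarily the exact benchmarked code), with the builtin as a cross-check; one caveat I'm adding: for n >= 2^53 the conversion can round up across a power of two and overstate the result by 1 near the top of the range.

```c
#include <stdint.h>
#include <string.h>

// msb via double conversion: the exponent field of (double)n is msb + 1023.
// Exact for n < 2^53; above that, rounding can overshoot by 1 in rare cases.
static int msb_cast(uint64_t n) {
    double d = (double)n;
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);            // type-pun without aliasing UB
    return (int)((bits >> 52) & 0x7FF) - 1023; // unbias the exponent
}

// reference via the GCC/Clang builtin (undefined for n == 0)
static int msb_clz(uint64_t n) {
    return 63 - __builtin_clzll(n);
}
```

That the cast version sits within a few hundredths of a cycle of the builtin makes sense: it is essentially one int-to-float conversion plus a shift and mask.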

My method is around 50% slower than de Bruijn, but has the advantage of using no extra memory and having predictable performance. I might try to optimize it further if I ever find the time.
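For comparison, the "reduction + de Bruijn" approach in the table works roughly as follows (my sketch, assuming the standard construction rather than the exact benchmarked code): smear the msb downward, isolate it as a power of two, then use a de Bruijn multiply as a perfect hash into a 64-entry table — the table being the "extra memory" mentioned above.

```c
#include <stdint.h>

// 0x03f79d71b4cb0a89 is a well-known 64-bit de Bruijn sequence; its top six
// bits are zero, so the non-cyclic shift below still yields 64 distinct keys.
static const uint64_t DEBRUIJN = UINT64_C(0x03f79d71b4cb0a89);
static int tab[64];

// build the lookup table once from the same constant, so hash and table
// stay consistent by construction
static void debruijn_init(void) {
    for (int i = 0; i < 64; i++)
        tab[(DEBRUIJN << i) >> 58] = i;
}

static int msb_debruijn(uint64_t n) {
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;   // smear the highest set bit
    n |= n >> 8;  n |= n >> 16; n |= n >> 32;  // down through all lower bits
    n -= n >> 1;                               // isolate it: n is now 2^msb
    return tab[(n * DEBRUIJN) >> 58];          // perfect hash into the table
}
```

The table fits in one cache line, but a cold miss on it is exactly the unpredictability the tableless version avoids.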

