For a 4-bit nibble n, floor(log2(n)) can be computed in 3 instructions by using a constant 32-bit unsigned value as a lookup table with 16 entries of two bits each:

    (0xffffaa50 >> (n << 1)) & 3

This can be used to trim the end of the divide-and-conquer methods, making them faster. The following functions have a constant dependency chain length of 16 for the 32-bit version or 20 for the 64-bit version.

    size_t blog32(uint32_t v) {
        size_t r = 0, t;
        t = (0 != (v >> 16)) << 4; v >>= t; r |= t;
        t = (0 != (v >> 8)) << 3; v >>= t; r |= t;
        t = (0 != (v >> 4)) << 2; v >>= t; r |= t;
        return r + ((0xffffaa50 >> (v << 1)) & 3);
    }

For 64-bit, add one more level:

    size_t blog64(uint64_t v) {
        size_t r = 0, t;
        t = (0 != (v >> 32)) << 5; v >>= t; r |= t;
        t = (0 != (v >> 16)) << 4; v >>= t; r |= t;
        t = (0 != (v >> 8)) << 3; v >>= t; r |= t;
        t = (0 != (v >> 4)) << 2; v >>= t; r |= t;
        return r + ((0xffffaa50 >> (v << 1)) & 3);
    }

The same method can be used to process 4 bits at a time for the loop or the tree versions. The naive version for 32 bits becomes:

    size_t logb(uint32_t v) {
        // top 7 nibbles, from the top down
        for (int i = 27; i > 0; i -= 4) {
            // nibble multiplied by 2
            uint32_t n = (v >> i) & 30;
            // i + 1 is a multiple of 4, so OR-ing in the 2-bit result adds it
            if (n) return (i + 1) | ((0xffffaa50 >> n) & 3);
        }
        // Bottom nibble, could add test for v == 0
        return (0xffffaa50 >> (v << 1)) & 3;
    }

The function above will perform very well in tests with uniformly distributed inputs, since it returns on the first pass through the loop for 15/16 of all possible 32-bit values (those whose top nibble is non-zero).
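
As a sanity check on the lookup constant used throughout, the sketch below rebuilds it by packing floor(log2(n)) for each nibble n into a two-bit field at bit position 2n, then compares the result with 0xffffaa50. The helper names (`naive_log2`, `build_nibble_table`, `nibble_log2`, `check_nibble_table`) are illustrative and not part of the code above, and the entry for n = 0 is assumed to be 0, which is what the constant encodes.

```c
#include <assert.h>
#include <stdint.h>

/* Reference floor(log2(v)) by repeated shifting.
   Returns 0 for v == 0 (assumption: matches the table entry for a zero nibble). */
static unsigned naive_log2(uint32_t v) {
    unsigned r = 0;
    while (v >>= 1)
        r++;
    return r;
}

/* Pack floor(log2(n)) for n = 0..15 into 16 two-bit fields,
   with the field for n stored at bit position 2 * n. */
static uint32_t build_nibble_table(void) {
    uint32_t table = 0;
    for (uint32_t n = 0; n < 16; n++)
        table |= (uint32_t)naive_log2(n) << (n << 1);
    return table;
}

/* The three-operation lookup from the text: shift, shift, mask. */
static unsigned nibble_log2(uint32_t n) {
    return (0xffffaa50u >> (n << 1)) & 3;
}

/* Hypothetical self-check: the rebuilt table equals the constant,
   and the lookup agrees with the reference for every nibble. */
static void check_nibble_table(void) {
    assert(build_nibble_table() == 0xffffaa50u);
    for (uint32_t n = 0; n < 16; n++)
        assert(nibble_log2(n) == naive_log2(n));
}
```

Calling `check_nibble_table()` from a test or from `main` should pass silently. The same packing idea would extend to other small per-nibble functions, as long as each result fits in the two bits reserved for it.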