Expanding on Josh's benchmark...one can improve the clz as follows
/***************** clz2 ********************/#define NUM_OF_HIGHESTBITclz2(a) ((a) \ ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \ : 0)
Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.