
Answer by Glenn Slayden for What is the fastest/most efficient way to find the highest set bit (msb) in an integer in C?


Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I build it once in a static constructor and keep it in a static readonly field. Note that the listing below uses an ordinary managed byte[]; allocating the block with Marshal.AllocHGlobal and indexing it through a pointer would be the unmanaged alternative.

    static readonly byte[] msb_tab_15;

    // Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
    // of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
    // The table is compressed by half, so use (value >> 1) for indexing.
    // 'MyStaticInit' stands for the enclosing class: a static constructor
    // must carry the name of the class that contains it.
    static MyStaticInit()
    {
        var p = new byte[0x8000];

        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;

        msb_tab_15 = p;
    }

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
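To make the half-size indexing concrete, here is a minimal sketch that rebuilds the table the same way and compares it with a plain shift loop for every non-zero 16-bit value. The class name TableCheck and the helpers NaiveMsb and CountMismatches are illustrative names, not part of the original code.

```csharp
using System;

public static class TableCheck
{
    // Same construction as in the answer: entry i holds the MSB index of
    // any 16-bit value v for which (v >> 1) == i.
    public static byte[] BuildTable()
    {
        var p = new byte[0x8000];
        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;
        return p;
    }

    // Hypothetical reference implementation: shift until nothing is left.
    public static int NaiveMsb(uint v)
    {
        int r = -1;
        while (v != 0) { v >>= 1; r++; }
        return r;
    }

    // Number of non-zero 16-bit values where table and reference disagree.
    public static int CountMismatches()
    {
        var tab = BuildTable();
        int bad = 0;
        for (uint v = 1; v <= 0xFFFF; v++)
            if (tab[v >> 1] != NaiveMsb(v))
                bad++;
        return bad;
    }
}
```

Dropping the lowest bit loses nothing because values 2k and 2k+1 share the same highest set bit for every k >= 1, and the remaining pair (0, 1) maps to entry 0, which is correct for 1.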

Notice that 0, the sole integer for which the notion of 'highest set bit' is undefined, is reported as -1. Because the table holds bytes it cannot store -1 itself (entry 0 holds 0, which is the correct answer for the value 1), so the 64-bit and 32-bit functions below produce -1 with an early test that also covers the case where the top bit is set. Without further ado, here is the code for each of the various integer primitives:

ulong (64-bit) Version

    /// <summary> Index of the highest set bit in 'v', or -1 for value '0'</summary>
    public static int HighestOne(this ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63

        int j = (int)((0xFFFFFFFFU - v) >> 58) & 0x20;
        j |=    (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }

uint (32-bit) Version

    /// <summary> Index of the highest set bit in 'v', or -1 for value '0'</summary>
    public static int HighestOne(uint v)
    {
        if ((int)v <= 0)
            return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31

        int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }
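The branch-free range selection in these functions may be easier to follow in isolation. The sketch below pulls out the 32-bit step and places a branching equivalent next to it; the names RangeSelectDemo, UpperHalfShift and UpperHalfShiftBranchy are made up for illustration, and the explicit unchecked block is an addition so that the wrapping subtraction also works in a project compiled with overflow checking.

```csharp
using System;

public static class RangeSelectDemo
{
    // Valid only for 0 < v < 0x80000000, i.e. the inputs that remain after
    // the early return. If v > 0xFFFF the unsigned subtraction wraps around
    // and bit 31 of the difference is set; shifting right by 27 moves that
    // bit to position 4, so the result is 16. Otherwise the result is 0.
    public static int UpperHalfShift(uint v)
    {
        unchecked
        {
            return (int)((0x0000FFFFU - v) >> 27) & 0x10;
        }
    }

    // Branching form with the same result on the same input range.
    public static int UpperHalfShiftBranchy(uint v) => v > 0xFFFF ? 16 : 0;
}
```

The 64-bit version applies the same idea twice, first against 0xFFFFFFFF to pick a shift of 32 and then against 0xFFFF on the shifted value to add 16.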

Various overloads for the above

    public static int HighestOne(long v) => HighestOne((ulong)v);
    public static int HighestOne(int v) => HighestOne((uint)v);
    public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
    public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
    public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
    public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
    public static int HighestOne(byte v) => msb_tab_15[v >> 1];
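One thing to watch with the 16-bit and narrower overloads: they index the table directly, and since the table holds bytes, an input of 0 yields 0 rather than -1. If the -1 convention matters for those widths, a guard along the following lines would restore it. This is a sketch, not the author's code; the class name NarrowOverloadSketch is hypothetical, and the extra comparison may cost some of the speed the table was meant to buy.

```csharp
using System;

public static class NarrowOverloadSketch
{
    // Table built exactly as in the answer (half-size, indexed by v >> 1).
    static readonly byte[] msb_tab_15 = Build();

    static byte[] Build()
    {
        var p = new byte[0x8000];
        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;
        return p;
    }

    // Returns -1 for 0, otherwise the index of the highest set bit.
    public static int HighestOne(ushort v) =>
        v == 0 ? -1 : msb_tab_15[v >> 1];
}
```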

This is a complete, working solution. On .NET 4.7.2 it was the fastest of the numerous alternatives that I compared using a specialized performance test harness; some of these are mentioned below. The test inputs were distributed uniformly over every possible result: bit positions 0 ... 31 (or 0 ... 63 for the 64-bit versions) plus the value 0, which produces the result -1, giving 65 cases in the 64-bit tests. The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT optimizations enabled.
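The harness itself is not shown here, so the following is only a guess at how inputs with that distribution could be produced: one value per possible result, with the highest bit pinned and everything below it random. TestValues, WithMsb and OnePerPosition are hypothetical names.

```csharp
using System;

public static class TestValues
{
    // Returns a value whose highest set bit is exactly 'msb' (0..63), with
    // the lower bits taken from 'rng'. A negative 'msb' yields 0.
    public static ulong WithMsb(int msb, Random rng)
    {
        if (msb < 0) return 0;
        var buf = new byte[8];
        rng.NextBytes(buf);
        ulong noise = BitConverter.ToUInt64(buf, 0);
        ulong top = 1UL << msb;
        return top | (noise & (top - 1));   // keep only the bits below 'top'
    }

    // One value for each possible result -1, 0, ..., 63 (65 values).
    public static ulong[] OnePerPosition(int seed)
    {
        var rng = new Random(seed);
        var a = new ulong[65];
        for (int i = 0; i < a.Length; i++)
            a[i] = WithMsb(i - 1, rng);
        return a;
    }
}
```

Sampling uniformly over results rather than over raw 64-bit values matters here, because uniformly random 64-bit inputs would almost always have their highest set bit in the top few positions.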




That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.


The version provided above, coded as Tab16A, was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

     1  candidates.HighestOne_Tab16A               622,496
     2  candidates.HighestOne_Tab16C               628,234
     3  candidates.HighestOne_Tab8A                649,146
     4  candidates.HighestOne_Tab8B                656,847
     5  candidates.HighestOne_Tab16B               657,147
     6  candidates.HighestOne_Tab16D               659,650
     7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
     8  de_Bruijn.IndexOfMSB                       709,672
     9  _old_2.HighestOne_Old2                     715,810
    10  _test_A.HighestOne8                        757,188
    11  _old_1.HighestOne_Old1                     757,925
    12  _test_A.HighestOne5  (unsafe)              760,387
    13  _test_B.HighestOne8  (unsafe)              763,904
    14  _test_A.HighestOne3  (unsafe)              766,433
    15  _test_A.HighestOne1  (unsafe)              767,321
    16  _test_A.HighestOne4  (unsafe)              771,702
    17  _test_B.HighestOne2  (unsafe)              772,136
    18  _test_B.HighestOne1  (unsafe)              772,527
    19  _test_B.HighestOne3  (unsafe)              774,140
    20  _test_A.HighestOne7  (unsafe)              774,581
    21  _test_B.HighestOne7  (unsafe)              775,463
    22  _test_A.HighestOne2  (unsafe)              776,865
    23  candidates.HighestOne_NoTab                777,698
    24  _test_B.HighestOne6  (unsafe)              779,481
    25  _test_A.HighestOne6  (unsafe)              781,553
    26  _test_B.HighestOne4  (unsafe)              785,504
    27  _test_B.HighestOne5  (unsafe)              789,797
    28  _test_A.HighestOne0  (unsafe)              809,566
    29  _test_B.HighestOne0  (unsafe)              814,990
    30  _highest_one_bit.HighestOne                824,345
    30  _bitarray_ext.RtlFindMostSignificantBit    894,069
    31  candidates.HighestOne_Naive                898,865

Notable is the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

    [DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
    public static extern int RtlFindMostSignificantBit(ulong ul);

It's really too bad, because here's the entire actual function:

    RtlFindMostSignificantBit:
        bsr     rdx, rcx
        mov     eax, 0FFFFFFFFh
        movzx   ecx, dl
        cmovne  eax, ecx
        ret

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) direct-lookup tables indexed by a short (16-bit) value over the 128-byte (and 256-byte) tables indexed by a byte (8-bit) value. I thought the following would be more competitive with the 16-bit lookups, but the 16-bit versions consistently outperformed it:

    public static int HighestOne_Tab8A(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;

        int j;
        j =  (int)((0xFFFFFFFFU - v) >> 58) & 32;
        j += (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
        j += (int)((0x000000FFU - (v >> j)) >> 60) & 8;
        return j + msb_tab_8[v >> j];
    }
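The msb_tab_8 table is not listed in this answer. Judging from the final lookup, which indexes it with a value below 256, a directly indexed 256-entry table would fit; the sketch below builds one under that assumption and repeats the function so the pair can be run together. The class name Tab8Sketch, the Build helper and the unchecked block are additions for illustration.

```csharp
using System;

public static class Tab8Sketch
{
    // Assumed layout: 256 entries indexed directly by a byte value.
    // Entry i holds the index of the highest set bit of i (entry 0 is 0).
    public static readonly byte[] msb_tab_8 = Build();

    static byte[] Build()
    {
        var p = new byte[256];
        for (int i = 2; i < 256; i++)
            p[i] = (byte)(p[i >> 1] + 1);   // one position higher than i / 2
        return p;
    }

    // Same logic as HighestOne_Tab8A in the answer, wrapped in 'unchecked'
    // so the wrapping subtractions also work with overflow checking enabled.
    public static int HighestOne_Tab8A(ulong v)
    {
        unchecked
        {
            if ((long)v <= 0)
                return (int)((v >> 57) & 64) - 1;

            int j;
            j =  (int)((0xFFFFFFFFU - v) >> 58) & 32;
            j += (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
            j += (int)((0x000000FFU - (v >> j)) >> 60) & 8;
            return j + msb_tab_8[v >> j];
        }
    }
}
```

Compared with the 16-bit table version, this needs one more subtract/shift/mask step to narrow the value down to a byte, which may be part of why it measured slower.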

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

    const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
                N_bsr64 = 0x03F79D71B4CB0A89;

    readonly public static sbyte[]
    bsf64 =
    {
        63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
        61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
        62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
        56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
    },
    bsr64 =
    {
         0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
    };

    public static int IndexOfLSB(ulong v) =>
        v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;

    public static int IndexOfMSB(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;

        v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
        v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
        return bsr64[(v * N_bsr64) >> 58];
    }

There's much discussion of how superior and great deBruijn methods are at this SO question, and I had tended to agree. My speculation is that, while the deBruijn and direct lookup table methods (the latter being the fastest I found) both do a table lookup and both have very minimal branching, only the deBruijn method has a 64-bit multiply operation. I only tested the IndexOfMSB functions here, not the deBruijn IndexOfLSB, but I expect the latter to fare much better since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.
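For readers wondering what the twelve shift-and-OR operations in IndexOfMSB buy, the sketch below isolates that step. It copies the highest set bit into every lower position, so only 64 distinct non-zero results are possible, and that is what lets a single multiply and a 6-bit table index tell them apart. SmearDemo and Smear are illustrative names, not from the original code.

```csharp
using System;

public static class SmearDemo
{
    // Propagates the highest set bit of v into all lower bit positions.
    // For a non-zero v with highest set bit k, the result has bits 0..k
    // set and nothing above; 0 stays 0.
    public static ulong Smear(ulong v)
    {
        v |= v >> 1; v |= v >> 2;  v |= v >> 4;
        v |= v >> 8; v |= v >> 16; v |= v >> 32;
        return v;
    }
}
```

The LSB variant needs no such loop because v & -v isolates the lowest set bit in two operations, which matches the expectation above that IndexOfLSB should be cheaper.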

