Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.
The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, the listing below builds it once in a static constructor and keeps it in a static readonly managed byte[]. Allocating unmanaged memory with Marshal.AllocHGlobal and reading it through a pointer is an alternative, but it is not what this listing does:
    readonly static byte[] msb_tab_15;

    // Initialize a table of 32768 bytes with the bit position (counting from LSB=0)
    // of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.
    // The table is compressed by half, so use (value >> 1) for indexing.
    static MyStaticInit()
    {
        var p = new byte[0x8000];

        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;

        msb_tab_15 = p;
    }
The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).
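Notice that 0, the sole integer for which the notion of 'highest set bit' is undefined, is reported as -1. In the listings here that value comes from the early-return test at the top of the 64-bit and 32-bit functions rather than from the table: msb_tab_15 is a byte[] indexed by (value >> 1), so its first entry is shared by the values 0 and 1 and holds 0. As a consequence, the 16-bit and narrower overloads below return 0, not -1, for an input of 0. Without further ado, here is the code for each of the various integer primitives: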
ulong (64-bit) Version
    /// <summary> Index of the highest set bit in 'v', or -1 for value '0'</summary>
    public static int HighestOne(this ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63

        int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;
        j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }
uint (32-bit) Version
    /// <summary> Index of the highest set bit in 'v', or -1 for value '0'</summary>
    public static int HighestOne(uint v)
    {
        if ((int)v <= 0)
            return (int)((v >> 26) & 0x20) - 1;      // handles cases v==0 and MSB==31

        int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }
Various overloads for the above
    public static int HighestOne(long v) => HighestOne((ulong)v);
    public static int HighestOne(int v) => HighestOne((uint)v);
    public static int HighestOne(ushort v) => msb_tab_15[v >> 1];
    public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];
    public static int HighestOne(char ch) => msb_tab_15[ch >> 1];
    public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];
    public static int HighestOne(byte v) => msb_tab_15[v >> 1];
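If you want to convince yourself that the table and the shift arithmetic line up, a cross-check against a plain shift loop is easy to write. The following is only a minimal sketch: it copies the table construction and the 32-bit version from above into a made-up class HighestOneCheck, and the helpers NaiveHighestOne and CountMismatches are hypothetical names of mine, not part of the code above. It assumes the default unchecked arithmetic context, as the original listings do.

```csharp
using System;

static class HighestOneCheck
{
    // Same construction as the table above: for i >= 1, entry i holds the
    // MSB index of the 16-bit values 2i and 2i+1; entry 0 holds 0.
    static readonly byte[] msb_tab_15 = BuildTable();

    static byte[] BuildTable()
    {
        var p = new byte[0x8000];
        for (byte n = 0; n < 16; n++)
            for (int c = (1 << n) >> 1, i = 0; i < c; i++)
                p[c + i] = n;
        return p;
    }

    // Copy of the 32-bit version above.
    public static int HighestOne(uint v)
    {
        if ((int)v <= 0)
            return (int)((v >> 26) & 0x20) - 1;   // v==0 gives -1, MSB==31 gives 31
        int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;
        return j + msb_tab_15[v >> (j + 1)];
    }

    // Reference implementation (hypothetical helper): shift until nothing is left.
    public static int NaiveHighestOne(uint v)
    {
        int n = -1;
        while (v != 0) { v >>= 1; n++; }
        return n;
    }

    // Hypothetical helper: counts disagreements over a small set of probe values.
    public static int CountMismatches()
    {
        int bad = 0;
        if (HighestOne(0u) != NaiveHighestOne(0u)) bad++;
        for (int bit = 0; bit < 32; bit++)
        {
            uint lo = 1u << bit;          // only the target bit set
            uint hi = lo | (lo - 1);      // target bit plus every bit below it
            if (HighestOne(lo) != NaiveHighestOne(lo)) bad++;
            if (HighestOne(hi) != NaiveHighestOne(hi)) bad++;
        }
        return bad;
    }
}
```

Calling HighestOneCheck.CountMismatches() should return 0 if the two implementations agree on every probe. The probes deliberately include the lowest and highest value for each bit position, since those are where an off-by-one in the table indexing would show up.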
This is a complete, working solution, and on .NET Framework 4.7.2 it was the fastest of the numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.
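To make those test parameters concrete, here is a rough sketch of how such inputs could be generated for the 64-bit case. This is not the actual harness; the class MsbTestInputs, the method Generate, and the seed parameter are all hypothetical names chosen for illustration. It cycles through the 65 cases so that each one appears equally often, and fills the bits below the target position from a seeded System.Random.

```csharp
using System;

static class MsbTestInputs
{
    // Hypothetical helper: returns 'count' values whose highest set bit cycles
    // evenly through the 65 cases (value 0, then bit positions 0..63).
    public static ulong[] Generate(int count, int seed)
    {
        var rng = new Random(seed);
        var buf = new byte[8];
        var result = new ulong[count];

        for (int i = 0; i < count; i++)
        {
            int msb = (i % 65) - 1;               // -1 stands for "the value 0"
            if (msb < 0)
            {
                result[i] = 0;
                continue;
            }

            rng.NextBytes(buf);
            ulong noise = BitConverter.ToUInt64(buf, 0);
            ulong top = 1UL << msb;
            result[i] = top | (noise & (top - 1)); // target bit set, random bits below it
        }
        return result;
    }
}
```

With count chosen as a multiple of 65, every case gets exactly the same number of samples. A real benchmark would probably also shuffle the array so that the branch predictor cannot learn the cycle, which this sketch does not do.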
That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.
The version provided above, coded as Tab16A, was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.
     1  candidates.HighestOne_Tab16A               622,496
     2  candidates.HighestOne_Tab16C               628,234
     3  candidates.HighestOne_Tab8A                649,146
     4  candidates.HighestOne_Tab8B                656,847
     5  candidates.HighestOne_Tab16B               657,147
     6  candidates.HighestOne_Tab16D               659,650
     7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900
     8  de_Bruijn.IndexOfMSB                       709,672
     9  _old_2.HighestOne_Old2                     715,810
    10  _test_A.HighestOne8                        757,188
    11  _old_1.HighestOne_Old1                     757,925
    12  _test_A.HighestOne5  (unsafe)              760,387
    13  _test_B.HighestOne8  (unsafe)              763,904
    14  _test_A.HighestOne3  (unsafe)              766,433
    15  _test_A.HighestOne1  (unsafe)              767,321
    16  _test_A.HighestOne4  (unsafe)              771,702
    17  _test_B.HighestOne2  (unsafe)              772,136
    18  _test_B.HighestOne1  (unsafe)              772,527
    19  _test_B.HighestOne3  (unsafe)              774,140
    20  _test_A.HighestOne7  (unsafe)              774,581
    21  _test_B.HighestOne7  (unsafe)              775,463
    22  _test_A.HighestOne2  (unsafe)              776,865
    23  candidates.HighestOne_NoTab                777,698
    24  _test_B.HighestOne6  (unsafe)              779,481
    25  _test_A.HighestOne6  (unsafe)              781,553
    26  _test_B.HighestOne4  (unsafe)              785,504
    27  _test_B.HighestOne5  (unsafe)              789,797
    28  _test_A.HighestOne0  (unsafe)              809,566
    29  _test_B.HighestOne0  (unsafe)              814,990
    30  _highest_one_bit.HighestOne                824,345
    30  _bitarray_ext.RtlFindMostSignificantBit    894,069
    31  candidates.HighestOne_Naive                898,865
Notable is the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:
    [DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]
    public static extern int RtlFindMostSignificantBit(ulong ul);
It's really too bad, because here's the entire actual function:
    RtlFindMostSignificantBit:
        bsr rdx, rcx
        mov eax,0FFFFFFFFh
        movzx ecx, dl
        cmovne eax,ecx
        ret
I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:
    public static int HighestOne_Tab8A(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;

        int j;
        j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;
        j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
        j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;
        return j + msb_tab_8[v >> j];
    }
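The initialization of msb_tab_8 is not shown above. Since HighestOne_Tab8A indexes it with (v >> j) directly, without the extra halving shift, a plausible layout is a plain 256-entry table in which entry i holds the highest set bit of i. The sketch below is my assumption of what that table looks like, not the original code; the class name Tab8Sketch and the helpers BuildTable8 and CountMismatches are made up for illustration.

```csharp
using System;

static class Tab8Sketch
{
    // Assumed layout: entry i (1..255) holds the index of the highest set bit of i.
    // Entry 0 stays 0; HighestOne_Tab8A never reads it because v == 0 returns early.
    static readonly byte[] msb_tab_8 = BuildTable8();

    static byte[] BuildTable8()
    {
        var p = new byte[0x100];
        for (byte n = 0; n < 8; n++)
            for (int c = 1 << n, i = 0; i < c; i++)
                p[c + i] = n;
        return p;
    }

    // Copy of the 8-bit-table candidate above.
    public static int HighestOne_Tab8A(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;

        int j;
        j = (int)((0xFFFFFFFFU - v) >> 58) & 32;
        j += (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;
        j += (int)((0x000000FFU - (v >> j)) >> 60) & 8;
        return j + msb_tab_8[v >> j];
    }

    // Hypothetical helper: probes the lowest and highest value for each bit position.
    public static int CountMismatches()
    {
        int bad = 0;
        if (HighestOne_Tab8A(0UL) != -1) bad++;
        for (int bit = 0; bit < 64; bit++)
        {
            ulong lo = 1UL << bit;        // only the target bit set
            ulong hi = lo | (lo - 1);     // target bit plus every bit below it
            if (HighestOne_Tab8A(lo) != bit) bad++;
            if (HighestOne_Tab8A(hi) != bit) bad++;
        }
        return bad;
    }
}
```

Compared with the 15-bit table, this variant needs one more subtract/shift/mask step to narrow the value down to a single byte, which may be part of why it measured slightly slower despite the much smaller table.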
The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:
    const ulong N_bsf64 = 0x07EDD5E59A4E28C2,
                N_bsr64 = 0x03F79D71B4CB0A89;

    readonly public static sbyte[]
    bsf64 =
    {
        63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,
        61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,
        62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,
        56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,
    },
    bsr64 =
    {
         0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,
        54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,
        46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,
        25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,
    };

    public static int IndexOfLSB(ulong v) =>
        v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;

    public static int IndexOfMSB(ulong v)
    {
        if ((long)v <= 0)
            return (int)((v >> 57) & 64) - 1;

        v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better
        v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?
        return bsr64[(v * N_bsr64) >> 58];
    }
There's much discussion of how superior and great deBruijn methods are at this SO question, and I had tended to agree. My speculation is that, while the deBruijn and direct lookup table methods (the latter being the ones I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here, not the deBruijn IndexOfLSB, but I expect the latter to fare much better since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.