Publication: 2024-08-07

Small bit lookup tables for parsers

Today is a small article about one code pattern I like to use when doing parsing logic: small lookup tables.

For instance, if I want to detect whether a string is an identifier, I could write this code:

bool isIdentifier(char const* str, int64_t size)
{
    if (size <= 0)
        return false;
    if (str[0] >= '0' && str[0] <= '9')
        return false; // Must not start by digit

    for (int64_t i = 0; i < size; ++i) {
        unsigned char c = str[i];
        // HERE:
        if (!((c >= '0' && c <= '9')
              || (c >= 'a' && c <= 'z')
              || (c >= 'A' && c <= 'Z')
              || c == '_')
        )
            return false;
    }
    return true;
}

The condition marked with HERE is readable but a bit big, and involves quite a few operations. Moreover, a parser will include a lot of these checks, each different.


I now prefer to use small lookup tables, where each bit corresponds to whether one byte passes the condition or not. These lookup tables require 256 bits = four 64-bit integers, so they are pretty cheap. In C, this becomes:

static uint64_t const LUT_Identifier[4] = {
    0x03ff000000000000, 0x07fffffe87fffffe, 0x0000000000000000, 0x0000000000000000
};
#define LUT_lookup(pTable, idx) ((pTable[(idx) / 64] >> ((idx) % 64)) & 1)

bool isIdentifierLUT(char const* str, int64_t size)
{
    if (size <= 0)
        return false;
    if (str[0] >= '0' && str[0] <= '9')
        return false; // Must not start by digit

    for (int64_t i = 0; i < size; ++i) {
        unsigned char c = str[i];
        if (!LUT_lookup(LUT_Identifier, c))
            return false;
    }
    return true;
}

I find the new condition easier to read, especially when such conditions are nested inside the decoding of more complex structures such as strings supporting escape sequences. Also, there are less CPU instructions and branches in this version (Compiler Explorer).


Remains the question of how to generate these lookup tables. I use a Python snippet, which I simply copy-paste in an interactive terminal.

def gen_bit_lookup(patterns: str, name: str):
    idxs = set()
    for p in patterns.split(" "):
        if len(p) == 3:
            assert p[1] == '-'
            idxs |= set(range(ord(p[0]), ord(p[2]) + 1))
        else:
            assert len(p) == 1
            idxs |= {ord(p)}
    bits = sum(2**idx for idx in idxs)
    u64s = [ (bits // 2**(64*i)) % 2**64 for i in range(4)]
    print(f"// gen_bit_lookup({repr(patterns)}, {repr(name)})")
    print(f"static uint64_t const LUT_{name}[4] = {{")
    print("   ", ", ".join(format(u64, "#018x") for u64 in u64s))
    print("};")

# Usage:
# >>> gen_bit_lookup('a-z A-Z 0-9 _', 'Identifier')
# // gen_bit_lookup('a-z A-Z 0-9 _', 'Identifier')
# static uint64_t const LUT_Identifier[4] = {
#     0x03ff000000000000, 0x07fffffe87fffffe, 0x0000000000000000, 0x0000000000000000
# };

Some notes for reading the script:

  • The pattern argument consists of a space-delimited sequence of either a single character to consider TRUE, or a range of characters (eg. A-Z), each to be considered TRUE.

  • Non-ASCII bytes can be obtained with escape sequences, eg. \x80-\x9F.

  • ord(c) is the codepoint of the character.

  • 2**idx is 2 raised to the idx-th power. Note that Python support arbitrarily-big integers.

  • Because idxs is a set, each idx is unique. Thus, the sum of all 2**idx is equivalent to a bitwise-or.

  • (bits // 2**(64*i)) % 2**64 extracts the i-th chunk of 64 bits.


One more tool available for writing code 😎

Thanks for reading, have a nice day!