3

I have fixed-length arrays like this:

char x[10] = "abcdefghij";   // could also contain non-printable char

How to efficiently test if this array is present or not in a fixed set of 10,000 arrays of the same length? (note: these arrays are either all of them binary data arrays, or all of them null-terminated strings, this is fixed at the beginning)

In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):

s = "abcdefghij"
S = {"foo1234567", "bar1234567", ..., "baz9876543"}
print(s in S)  # True or False

How to do the equivalent in C? (not C++)


Note: the linked question How to check if a string is in an array of strings in C? is about a naive way to do it, with no performance requirement (they loop over all strings and use strcmp). Here it's different since there are 10k arrays, one needs to use another method for performance (a hashtable maybe?).

17
  • 1
    There is no standard hashtable/set/dictionary in C or any similar alternative. You have a choice: a) implement it yourself from scratch (hashtable, trie, whatever); b) use an array of strings and manually check whether the needle is somewhere in it; c) find a library which provides a "set of strings" abstraction efficiently. Commented May 14, 2022 at 17:49
  • Does this answer your question? How to check if a string is in an array of strings in C? Commented May 14, 2022 at 17:50
  • @niry No, this linked question is a naive way to do it, with no performance requirement (they loop over all strings and strcmp). Here it's really different, since there are 10k strings, one needs to use another method for performance (hashtable maybe?) Commented May 14, 2022 at 17:57
  • 1
    Note that char x[10] = "abcdefghij"; lacks a NUL terminator, so it isn't a "string". It would have one if you didn't restrict the array size - don't do that. You can't use string handling functions on this array. Commented May 14, 2022 at 17:59
  • 3
    A binary search of 10000 elements only needs about 13 data comparisons, so it's not going to make a huge difference. Commented May 14, 2022 at 18:41

3 Answers 3

2
+100

Note about storing both null-terminated strings and binary data in an array

In your question, you specify that the fixed-length arrays could represent

  • null-terminated strings or
  • binary data.

If your program has no way of knowing which one of these two possibilities is represented by the array contents, then, if the array contents is supposed to represent a null-terminated string, you will have to fill the remainder of the array with null characters (for example using the function strncpy). It won't be sufficient to only write a single null terminating character.

This is important, because if you don't do this, your program will have no way of knowing how to interpret the remaining data after it encounters a byte with the value 0. It won't know whether to

  • interpret this byte as a null-terminating character of a string and ignore all remaining bytes of the string, or

  • interpret this byte as binary data and treat the remaining bytes also as binary data (i.e. not ignore the remaining bytes).

However, if you always fill all remaining bytes after a terminating null character of a string with null characters, then your program won't habe to worry about whether the array contents represents a string or binary data, because it can treat both the same way.

Therefore, in the remainder of my answer, I will assume that if the array contents represents a string, then all remaining bytes after the end of the string will be filled with bytes with the value 0. That way, I can assume that no bytes should ever be ignored.

Hash table solution

In Python, I would use a set/hashtable/dict and not a list (because of very fast O(1) lookup):

It is also possible to use a hash table in C, although you will have to program it yourself or use an already existing library. The C standard library does not provide any hashing functions.

The following code will randomly generate an array of 10,000 fixed-length (not null-terminated) strings of length 10 with the characters a to z, A to Z and 0 to 9, but it will also place three hard-coded words into the array in different places, so you can search for these words later, in order to test the search function. The program will then insert all words into a hash table, and then perform several hard-coded lookups into the hash table.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10
#define HASH_TABLE_SIZE 10000

struct hash_table_entry
{
    struct hash_table_entry *next;
    unsigned char data[ARRAY_LENGTH];
};

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
size_t hash_array( const unsigned char array[ARRAY_LENGTH] );
void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //declare hash table and initialize all fields to zero
    static struct hash_table_entry *hash_table[HASH_TABLE_SIZE] = {NULL};

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //build hash table
    printf( "Building hash table..." );
    fflush( stdout );
    build_hash_table_from_data( data, NUM_ARRAYS, hash_table );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_lookup( (unsigned char *)"uvwxyz0123", hash_table );
    verbose_lookup( (unsigned char *)"hsdkesldhg", hash_table );
    verbose_lookup( (unsigned char *)"abcdefghij", hash_table );
    verbose_lookup( (unsigned char *)"erlsodn3ls", hash_table );
    verbose_lookup( (unsigned char *)"klmnopqrst", hash_table );

    //cleanup
    printf( "Performing cleanup..." );
    fflush( stdout );
    cleanup_hash_table( hash_table );
    printf( "done\n" );
    fflush( stdout );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

//This is a simple hash function. Depending on the type of data, a
//more complex hashing function may be required, in order to prevent
//hash collisions.
size_t hash_array( const unsigned char array[ARRAY_LENGTH] )
{
    size_t hash = 0;

    for ( size_t i = 0; i < ARRAY_LENGTH; i++ )
    {
        hash = hash * 7 + array[i];
    }

    return hash % HASH_TABLE_SIZE;
}

void build_hash_table_from_data( unsigned char data[][ARRAY_LENGTH], size_t data_length, struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < data_length; i++ )
    {
        struct hash_table_entry **pp, *p;

        //hash the array and store the result
        size_t hash = hash_array( data[i] );

        //perform hash table lookup
        pp = &hash_table[hash];

        //make pp point to the last null pointer in the list
        while ( ( p = *pp ) != NULL )
            pp = &p->next;

        //allocate memory for linked list node
        p = malloc( sizeof *p );
        if ( p == NULL )
        {
            fprintf( stderr, "Memory allocation failure!\n" );
            exit( EXIT_FAILURE );
        }

        //fill in data for new node
        p->next = NULL;
        memcpy( p->data, data[i], ARRAY_LENGTH );

        //insert new node into linked list
        *pp = p;
    }
}

void cleanup_hash_table( struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    for ( size_t i = 0; i < HASH_TABLE_SIZE; i++ )
    {
        struct hash_table_entry *p;

        p = hash_table[i];

        while ( p != NULL )
        {
            struct hash_table_entry *q = p;

            p = p->next;

            free( q );
        }
    }
}

bool lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    size_t hash;
    struct hash_table_entry *p;

    //find hash
    hash = hash_array( array );

    //index into hash table
    p = hash_table[hash];

    //attempt to find array in linked list
    while ( p != NULL )
    {
        if ( memcmp( array, p->data, ARRAY_LENGTH ) == 0 )
        {
            //array was found
            return true;
        }

        p = p->next;
    }

    //array was not found
    return false;
}

//This function serves as a wrapper function for the
//function "lookup". It prints to "stdout" what it is
//looking for and what the result of the lookup was.
bool verbose_lookup( const unsigned char array[ARRAY_LENGTH], struct hash_table_entry *hash_table[HASH_TABLE_SIZE] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( lookup( array, hash_table ) )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

This program has the following output:

Generating random data...done
Building hash table...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found
Performing cleanup...done

As you can see, it found all 3 strings that were explicitly inserted, and no other strings were found.

In the code, I am using unsigned char instead of char, because in C, a char * is usually used for passing null-terminated strings. Therefore, it seemed more appropriate to use unsigned char * for passing data that could be binary or not null-terminated.

bsearch solution

For comparison, here is a solution which generates the input array in the same way, but uses bsearch instead of a hash table for searching it:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <stdbool.h>

#define NUM_ARRAYS 10000
#define ARRAY_LENGTH 10

//function prototype declarations
void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );
int compare_func( const void *a, const void *b );
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] );

int main( void )
{
    //declare 2D array for the input data
    static unsigned char data[NUM_ARRAYS][ARRAY_LENGTH];

    //seed random number generator
    srand( (unsigned)time( NULL ) );

    //fill array "data" with random data
    printf( "Generating random data..." );
    fflush( stdout );
    fill_with_random_data( data );
    printf( "done\n" );
    fflush( stdout );

    //overwrite a few strings, so that we can search for
    //them later
    _Static_assert( NUM_ARRAYS > 500, "unexpected value" );
    _Static_assert( ARRAY_LENGTH == 10, "unexpected value" );
    memcpy( data[47], "abcdefghij", 10 );
    memcpy( data[218], "klmnopqrst", 10 );
    memcpy( data[419], "uvwxyz0123", 10 );

    //sort the array
    printf( "Sorting array..." );
    fflush( stdout );
    qsort( data, NUM_ARRAYS, ARRAY_LENGTH, compare_func );
    printf( "done\n" );
    fflush( stdout );

    //perform lookups
    verbose_search( (unsigned char *)"uvwxyz0123", data );
    verbose_search( (unsigned char *)"hsdkesldhg", data );
    verbose_search( (unsigned char *)"abcdefghij", data );
    verbose_search( (unsigned char *)"erlsodn3ls", data );
    verbose_search( (unsigned char *)"klmnopqrst", data );
}

void fill_with_random_data( unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    static const unsigned char random_chars[62] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789";

    for ( size_t i = 0; i < NUM_ARRAYS; i++ )
    {
        for ( size_t j = 0; j < ARRAY_LENGTH; j++ )
        {
            data[i][j] = random_chars[rand()%sizeof random_chars];
        }
    }
}

int compare_func( const void *a, const void *b )
{
    return memcmp( a, b, ARRAY_LENGTH );
}

//This function serves as a wrapper function for the
//function "bsearch". It prints to "stdout" what it is
//looking for and what the result of the search was.
bool verbose_search( const unsigned char array[ARRAY_LENGTH], unsigned char data[NUM_ARRAYS][ARRAY_LENGTH] )
{
    //print message
    printf( "Searching for %.*s...", ARRAY_LENGTH, array );
    fflush( stdout );

    if ( bsearch( array, data, NUM_ARRAYS, ARRAY_LENGTH, compare_func ) != NULL )
    {
        printf( "found\n" );
        fflush( stdout );
        return true;
    }
    else
    {
        printf( "not found\n" );
        fflush( stdout );
        return false;
    }
}

This program has the following output:

Generating random data...done
Sorting array...done
Searching for uvwxyz0123...found
Searching for hsdkesldhg...not found
Searching for abcdefghij...found
Searching for erlsodn3ls...not found
Searching for klmnopqrst...found

The hash table solution is probably faster than the binary search solution though, assuming that a good hash function is selected for the input, so that the number of hash collisions is minimal.

Sign up to request clarification or add additional context in comments.

13 Comments

Thanks a lot. I'm going to test this! PS: the arrays are either all of them binary data, or all of them null-terminated strings, this is a fixed decision at the beginning, so this is no problem in reality.
@Basj: I have now changed struct hash_table_entry to contain the actual data, instead of only a pointer to the data. That way, this pointer does not have to be dereferenced. This should improve performance a bit.
Thanks @AndreasWenzel. (In my use case, I only need to test if it is present or not very fast. In the case it is present, I can perform another longer test to actually retrieve the array)
@Basj: Your idea of using a probabalistic data structure such as a bloom filter is already how a hash table works. If there is no hash table entry (i.e. the hash table pointer points to NULL for that hash value), then you can be sure that the word does not exist in the hash table (no chance of a false negative). However, if the hash table pointer points to non-NULL for that hash value, then that word may exist in the hash table, but there is a chance of a false positive. [continued in next comment]
@Basj: [continued from previous comment] That is why it is necessary to traverse the entire linked list for that hash value in order to determine whether it is a false positive or not. Using a bloom filter has the advantage that it requires less space than a hash table.
|
1

The signature of the bsearch function is as follows:

   void *bsearch(const void *key, const void *base,
                 size_t nmemb, size_t size,
                 int (*compar)(const void *, const void *));

Here, key would be either t1 or t2 and base would be T. nmemb and size would be 9 and 4 respectively.

compar is a pointer to a callback function to do the comparison. This can just be a wrapper around memcmp:

int compare_char4(const void *p1, const void *p2)
{
    return memcmp(p1, p2, 4);
}

Then you call bsearch with these parameters:

printf("%s\n", bsearch(t1, T, 9, 4, compare_char4) ? "found" : "not found");
printf("%s\n", bsearch(t2, T, 9, 4, compare_char4) ? "found" : "not found");

1 Comment

Thanks @dbush! It works great, and much simpler than my solution (which tries to reinvent the wheel)! In terms of performance it seems to be similar to my solution.
0

I think the following binary search works (thanks to @WeatherVane) if the array of arrays is sorted, then the complexity is O(log n). We can use memcmp for the comparison.

int binarysearch(unsigned char *T, unsigned char *t, int num, int size)  // https://en.wikipedia.org/wiki/Binary_search_algorithm#Algorithm
{
    int a = 0;
    int b = num-1;
    int m, res;
    while (a <= b) {
        m = (a + b) / 2;
        res = memcmp(&T[0] + m*size, &t[0], size);
        if (res < 0) {
            a = m + 1;
        } else if (res > 0) {
            b = m - 1;
        } else { 
            return 1;
        } 
    }
    return 0;
}
int main()
{
    unsigned char T[][4] = { 
        { 0x00, 0x6e, 0xab, 0xcd },
        { 0xea, 0x6e, 0xab, 0xcd },
        { 0xeb, 0x6e, 0xab, 0xcd },
        { 0xec, 0x6e, 0xab, 0xcd },
        { 0xed, 0x6e, 0xab, 0xcd },
        { 0xef, 0x6e, 0xab, 0xcd },
        { 0xfa, 0x6e, 0xab, 0xcd },
        { 0xfb, 0x6e, 0xab, 0xcd },
        { 0xff, 0x6e, 0xab, 0xcd }
    };
    unsigned char t1[4] = { 0x33, 0x6e, 0xab, 0xcd };
    unsigned char t2[4] = { 0xfb, 0x6e, 0xab, 0xcd };
    printf("%s\n", binarysearch(T[0], t1, 9, 4) ? "found" : "not found");
    printf("%s\n", binarysearch(T[0], t2, 9, 4) ? "found" : "not found");
}    

Result:

not found
found

2 Comments

consider using a function bsearch from the standard library
I didn't know it was in stdlib, can you post an answer with it @tstanisl for future reference?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.