4

Suppose I am a genome scientist trying to store extremely long strings of characters, each of which represents two bits of information (i.e. each element is either G, A, T, or C). Because the strings are incredibly long, I need to be able to store a string of length N in precisely 2N bits (or rather, N/4 bytes).

With that motivation in mind, I am looking for a generalization of std::bitset (or boost::dynamic_bitset<>) that works on two-bit values instead of single-bit values. I want to store N such two-bit values, each of which can be 0, 1, 2, or 3. I need the data packed as closely as possible in memory, so vector<char> will not work (as it wastes a factor of 4 of memory).

What is the best way to achieve my goal? One option is to wrap the existing bitset templates with customized operator[], iterators, etc., but I'd prefer to use an existing library if at all possible.

20
  • Hmm, what's actually wrong for you with std::bitset<2>?!? No way to use less than an uint8_t IMHO. No doubts that std::bitset will use this as underlying type for decent implementations? Commented Jun 20, 2014 at 0:38
  • 1
    As for your question, I think there probably isn't (but I would like to be proven wrong). Fortunately, isn't not that complicated to write (wrap an array of chars or std::vector<char>, do the bit-fiddling yourself, and provide the interface) Commented Jun 20, 2014 at 0:56
  • 1
    I don't know of any existing libraries, but you should be able to make bitset<2N> work. You could provide an interface that does the necessary index translation. Commented Jun 20, 2014 at 0:56
  • 1
    Meant s.th like std::array<std::bitset<2>,N>?? Commented Jun 20, 2014 at 1:04
  • 1
    'or find' not 'and find' -- lack of opinions is not relevant. Asking 'how do I do X' is ok, asking 'find me a library that does X' is not. Commented Jun 20, 2014 at 1:26

3 Answers 3

2

std::bitset<> is fixed length and you probably do not want that.

I think you should go ahead and wrap std::vector<bool>.

Note that std::vector<bool> is optimised for space, but has the benefit that it is dynamic in size. Presumably you need to read the genome of arbitrary length on from somewhere.

Have a think about whether you need much of an API to access it; you might only need a couple of methods.

@Jefffrey's answer already covers the relevant code, if for bitset<>.

[ I am not familiar with boost::dynamic_bitset<> and what it might give over vector.]

One further thought is whether it might be convenient for you to work with quads of letters, a quad nicely filling a char in space.

class Genome
{
public:
    enum class Letter {A,C,G,T};
    Genome(const std::string& source)
    {
        code_.resize(source.size() * 2);
        for (unsigned index = 0; index != source.size(); ++index)
        {
            char text = source[index];
            Letter letter = textToLetter(text);
            set(index, letter);
        }
    }  
    static Letter textToLetter(char text)
    {
        // Or search through the array `letterText`.
        // Or come up with a neat but unintelligible one liner ...
        Letter letter = Letter::A;
        switch (text)
        {
        case 'A':
            letter = Letter::A;
            break;
        case 'C':
            letter = Letter::C;
            break;
        case 'G':
            letter = Letter::G;
            break;
        case 'T':
            letter = Letter::T;
            break;
        default:
            // Invalid - handle error.
            break;
        }
        return letter;
    }
    static char letterToText(Letter l) 
    {
        return letterText[(unsigned)l];
    }
    // Add bounds checking
    Letter get(unsigned index) const
    {
        unsigned distance = index * 2;
        char numeric = code_[distance] + code_[distance + 1] * 2;
        return Letter(numeric);
    }
    // Add bounds checking
    void set(unsigned index, Letter value)
    {
        unsigned distance = index * 2;
        bool low = (unsigned)value & 1;
        bool high = (bool)((unsigned)value & 2);
        code_[distance] = low;
        code_[distance + 1]  = high;
    }
    unsigned size()
    {
        return code_.size() / 2;
    }
    // Extend by numLetters, initially set to 'A'
    void extend(unsigned numLetters)
    {
        code_.resize(code_.size() + numLetters * 2);
    }
private:

    static char letterText[4];
    std::vector<bool> code_;
};

char Genome::letterText [4] = { 'A', 'C', 'G', 'T' };

int main()
{
    Genome g("GATT");
    g.extend(3);
    g.set(5, Genome::Letter::C);
    for (unsigned i = 0; i != g.size(); ++i)
        std::cout << Genome::letterToText(g.get(i));
    std::cout << std::endl;
    return 0;
}
Sign up to request clarification or add additional context in comments.

1 Comment

boost::dynamic_bitset has no tangible benefit over std::vector<bool>, it's just to satisfy people who can't live with a Boolean vector named std::vector<bool>.
1

You have two choices.

Given:

enum class nucleobase { a, c, g, t };

You have two choices. You can:

  • use a single std::bitset and play with indexing
  • use std::bitset in combination with another container

For the first, you can just define a couple of functions that target the correct number of bits per set/get:

template<std::size_t N>
void set(std::bitset<N>& bits, std::size_t i, nucleobase x) {
    switch (x) {
        case nucleobase::a: bits.set(i * 2, 0); bits.set(i * 2 + 1, 0); break;
        case nucleobase::c: bits.set(i * 2, 0); bits.set(i * 2 + 1, 1); break;
        case nucleobase::g: bits.set(i * 2, 1); bits.set(i * 2 + 1, 0); break;
        case nucleobase::t: bits.set(i * 2, 1); bits.set(i * 2 + 1, 1); break;
    }
}

template<std::size_t N>
nucleobase get(const std::bitset<N>& bits, std::size_t i) {
    if (!bits[i * 2])
        if (!bits[i * 2 + 1]) return nucleobase::a;
        else                  return nucleobase::c;
    else
        if (!bits[i * 2 + 1]) return nucleobase::g;
        else                  return nucleobase::t;
}

Live demo

The above is just an example and a terrible one (it's almost 4AM here and I really need to sleep).

For the second you just need to map alleles and bits:

bit_pair bits_for(nucleobase x) {
    switch (x) {
        case nucleobase::a: return bit_pair("00"); break;
        case nucleobase::c: return bit_pair("10"); break;
        case nucleobase::g: return bit_pair("01"); break;
        case nucleobase::t: return bit_pair("11"); break;
    }
}

nucleobase nucleobase_for(bit_pair x) {
    switch (x.to_ulong()) {
        case 0: return nucleobase::a; break;
        case 1: return nucleobase::c; break;
        case 2: return nucleobase::g; break;
        case 3: return nucleobase::t; break;
        default: return nucleobase::a; break; // just for the warning
    }
}

Live demo

Of course if you need runtime length you can just use boost::dynamic_bitset and std::vector.

Comments

0

Here's what I use for fixed-length k-mers.

#include <cstdint>
#include <cstdlib>
#include <ostream>

enum class nucleotide { A, C, G, T };

inline std::ostream&
operator<<(std::ostream& pOut, nucleotide pNt)
{
    switch (pNt) {
        case nucleotide::A: pOut << 'A'; break;
        case nucleotide::C: pOut << 'C'; break;
        case nucleotide::G: pOut << 'G'; break;
        case nucleotide::T: pOut << 'T'; break;
    }
    return pOut;
}

class kmer_base;

class nucleotide_proxy {
public:
    operator nucleotide() const {
        return nucleotide((*mWord >> (mPosition * 2)) & 3);
    };

    nucleotide_proxy& operator=(nucleotide pNt) {
        uint64_t word = *mWord;
        word &= ~(uint64_t(3) << (mPosition*2));
        word |= uint64_t(pNt) << (mPosition*2);
        *mWord = word;

        return *this;
    };

private:
    friend class kmer_base;

    nucleotide_proxy(uint64_t* pWord, uint8_t pPosition)
        : mWord(pWord), mPosition(pPosition)
    {
    }

    uint64_t* mWord;
    uint8_t mPosition;
};


class kmer_base {
protected:
    nucleotide_proxy access(uint64_t* pWord, size_t pPosition)
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }

    const nucleotide_proxy access(uint64_t* pWord, size_t pPosition) const
    {
        return nucleotide_proxy(pWord + (pPosition / 32), (pPosition & 31));
    }
};


template<int K>
class kmer : public kmer_base
{
    enum { Words = (K + 31) / 32 };
public:
    nucleotide_proxy operator[](size_t pOutdex) {
        return access(mWords, pOutdex);
    }

    const nucleotide_proxy operator[](size_t pOutdex) const {
        return access(mWords, pOutdex);
    }

private:
    uint64_t mWords[Words];
};

Extending this to dynamic-length k-mere is left as an exercise; it's pretty easy once you have nucleotide_proxy at your disposal. Implementing the reverse complement operator efficiently is also left as an exercise.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.