4

In my project i have to iterate over a large string starting from index=0 and get length k substring. I've implemented string::substr() and wonder if there are other efficient methods.

For example :

std::string S ="ABCDEFGHIJKLMN"

i need to get all substrings of length =5 starting from the beginning of S.Just like "ABCDE", "BCDEF", "CDEFG" and so on..

My implementation is like below:

    void geekfunc(std::string &str)
{
    unsigned int index=0;
    for (; index<=(str.size()-K);++index)
    {
        ++myseqmap[str.substr(index,K)];
    }
}

This function is being called ten million times and I welcome other methods to try.

3
  • I don't get what you are trying to achieve: do you want all the substrings of length k of your input string? Commented Mar 21, 2017 at 8:42
  • 2
    What is the actual problem you want to solve with this function? Please take some time to read about the XY problem and think about how it might relate to your question. Commented Mar 21, 2017 at 8:43
  • I ve edited the question for more clarity. Commented Mar 21, 2017 at 8:48

3 Answers 3

6

If you are using C++17, you can use string_view as your parameter and map key type. This way you won't make copies of the string contents every time you call substr. Just make sure the string you pass to the function is not destroyed or modified while your map it still in use.

std::map<std::string_view, std::size_t> myseqmap;

void geekfunc(std::string_view str)
{
    unsigned int index=0;
    for (; index<=(str.size()-K);++index)
    {
        ++myseqmap[str.substr(index,K)];
    }
}
Sign up to request clarification or add additional context in comments.

Comments

3

Provided you actually need to create copies of the substrings (which string::substr does) I believe you cannot solve this issue with less than Omega(m) calls to the memory manager and Omega(m * k) copying steps total, where m = n - k + 1. This is because the standard requires each string to manage its own memory. Sharing (such as with the copy-on-write idiom) is not permitted, thus each substring will copy its contents from the original.

If a copy is not required and your compiler already supplies std::string_view you could try using that. Unlike string, a string_view only holds a pointer to a character and a size (which is exactly what you are creating your substrings from anyways). The required pointer can be acquired using string::data.

However, when using string_view you have to make sure the original string remains in scope for as long as the container holding the substrings and that it is not altered after the substrings have been created, as this may invalidate the pointers held by the string_views. These can be adressed by wrapping everything in a class together like this:

struct substrings{
    const std::string original;
    container<string_view> substrings;
};

Where container is any container of your choice.

Comments

0

You are searching for K-mers for any given string.

static vector<string> find_kmers(string Text, int k)
{
    vector<string> kmers;
    int n = Text.length();;

    for (int i = 0; i < n-k+1; i++)
       kmers.push_back(Text.substr(i, k));               
    return kmers;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.