C++ substring matching implementation

Question

I have two strings, "hello" an "eo" for instance, and I wish to find duplicate characaters between the two strings, that is: 'e' and 'o' in this example.

My algorithm would go this way

 void find_duplicate(char* str_1, char* str_2, int len1, int len2)
 {
     char c ;

     if(len1 < len2)
     {
        int* idx_1 = new int[len1]; // record elements in little string
        // that are matched in big string
        for(int k = 0 ; k < len1 ; k++)
              idx_1[k] = 0;

        int* idx_2 = new int[len2]; // record if element in str_2 has been 
        // matched already or not
        for(int k = 0 ; k < len2 ; k++)
              idx_2[k] = 0;

        for(int i = 0 ; i < len2 ; i++)
        {     
            c = str_1[i];

            for(int j = 0 ; j < len1 ; j++)
            {
                 if(str_2[j] == c)
                 {
                    if(idx_2[j] == 0) // this element in str_2 has not been matched yet
                    {
                         idx_1[i] = j + 1; // mark ith element in idx as matched in string 2 at pos j
                         idx_2[j] = 1;
                    }
                 }
             }
         }

         // now idx_1 and idx_2 contain matches info, let's remove matches.
         char* str_1_new = new char[len1];
         char* str_2_new = new char[len2];
         int kn = 0;
         for(int k = 0 ; k < len1 ; k++)
         {
            if(idx_1[k] > 0)
            {
                 str_1_new[kn] = str_1[k];
                 kn++;
            }
         }

         kn = 0;
         for(int k = 0 ; k < len2 ; k++)
         {
             if(idx_2[k] > 0)
             {
                 str_2_new[kn] = str_2[k];
                 kn++;
             }
         }
      }
      else
      {
            // same here, switching roles (do it yourself)
       }
  }

i feel my solution is awkward: - symetry of both cases in first if/else and code duplication - time complexity: 2*len1*len2 operations for finding duplicates, then len1 + len2 operations for removal - space complexity: two len1 and two len2 char*.

What if len1 and len2 are not given (with and without resort to STL vector) ?

could you provide your implementation of this algo ?

thanks

I would sort the characters in the strings and the compare them in one sweep. — Bo Persson
– Bo Persson, Commented Oct 6, 2012 at 9:37

Zaroth · Accepted Answer · 2012-10-06 10:27:30Z

3

First of all, it isn't substring matching problem - it is problem of finding common characters between two strings.

Your solution works in O(n*m), where n=len1 and m=len2 in your code. You could easily solve the same problem in O(n+m+c) time by counting characters in each of strings (where c is equal to size of character set). This algorithm is called counting sort.

Sample code implementing this in your case:

#include <iostream>
#include <cstring> // for strlen and memset

const int CHARLEN = 256; //number of possible chars

using namespace std;

// returns table of char duplicates
char* find_duplicates(const char* str_1, const char* str_2, const int len1, const int len2)
{
  int *count_1 = new int[CHARLEN];
  int *count_2 = new int[CHARLEN];
  char *duplicates = new char[CHARLEN+1]; // we hold duplicate chars here
  int dupl_len = 0; // length of duplicates table, we insert '\0' at the end
  memset(count_1,0,sizeof(int)*CHARLEN);
  memset(count_2,0,sizeof(int)*CHARLEN);
  for (int i=0; i<len1; ++i)
  {
    ++count_1[str_1[i]];
  }
  for (int i=0; i<len2; ++i)
  {
    ++count_2[str_2[i]];
  }

  for (int i=0; i<CHARLEN; ++i)
  {
    if (count_1[i] > 0 && count_2[i] > 0)
    {
      duplicates[dupl_len] = i;
      ++dupl_len;
    }
  }
  duplicates[dupl_len]='\0';
  delete count_1;
  delete count_2;
  return duplicates;
}

int main()
{
  const char* str_1 = "foobar";
  const char* str_2 = "xro";
  char* dup =   find_duplicates(str_1, str_2, strlen(str_1), strlen(str_2));
  cout << "str_1: \"" << str_1 << "\" str_2: \"" << str_2 << "\"\n";
  cout << "duplicates: \"" << dup << "\"\n";
  delete dup;
  return 0;
}

Please note that I am also sorting the output here. If you do not want to do that, you can skip the counting of characters in second string and just start comparing duplicates on the go.

If you, however, intend to be able to detect multiple duplicates of the same letter (e.g. if "banana" and "arena" should output "aan" instead of "an"), then you can just substract the number of counts in current solution and adjust the output accordingly.

edited Oct 6, 2012 at 10:27

answered Oct 6, 2012 at 10:10

Zaroth

5787 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

10 Comments

Karoly Horvath Over a year ago

O(n+m+c), where c is the size of the charset

log0 Over a year ago

@Zaroth Note that new int[CHARLEN](); would initialize your array with 0 elements. Also you could simplify a lot your code using std::vector (that are 0-initialized and don't need to be explicitly deleted)

Zaroth Over a year ago

@log0 the () trick is a bit voodooish; since this is an educational answer, I tried to make my code as readable as possible. Also, please note that OP requested a solution without using std::vector.

log0 Over a year ago

@KarolyHorvath/Zaroth I think you can just use one count vector and simply add the character any time you notice the count is different to 0. In this case the complexity is indeed O(len1+len2) with memory footprint ~256

log0 Over a year ago

@Zaroth I don't think the OP is opposed to the used of st::vector he was wondering if it was changing the complexity.

|

log0 · Accepted Answer · 2012-10-06 10:36:41Z

0

std::vector<char> duplicates;
for (auto c1: std::string(str_1))
  for (auto c2: std::string(str_2))
    if (c1 == c2)
      duplicates.push_back(c1);

or if you don't have a c++11 compliant compiler.

std::vector<char> duplicates;
std::string s1(str_1);
std::string s2(str_2);
for (std::size_t i = 0; i < s1.size(); i++)
  for (std::size_t j = 0; j < s2.size(); j++)
    if (s1[i] == s2[j])
      duplicates.push_back(s1[i]);

-- based on Zaroth answer

std::vector<int> count(256,0);
for (auto c : std::string(str_1))
 count[c] += 1;
for (auto c : std::string(str_2))
 if (count[c] > 0)
   duplicates.push_back(c);

edited Oct 6, 2012 at 10:36

answered Oct 6, 2012 at 9:37

log0

11k4 gold badges32 silver badges68 bronze badges

9 Comments

kiriloff Over a year ago

for (auto c1: std::string(str_1)), this is C++11, right? what if iam with C++ for vs2010

kiriloff Over a year ago

does this mean that at runtime std::size_t is "known to be" size of a character in string?

log0 Over a year ago

@fonjibe vs2010 support the auto keyword blogs.msdn.com/b/vcblog/archive/2010/04/06/…

user213313 Over a year ago

You have i++ instead of j++ in the inner loop.

log0 Over a year ago

@fonjibe for (auto c1: std::string(str_1)) means: for every element c1 in a temporary std::string constructed with str_1.

|

Collectives™ on Stack Overflow

C++ substring matching implementation

2 Answers 2

10 Comments

9 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

10 Comments

9 Comments

Your Answer

Sign up or log in

Post as a guest

Related