Revisions to Find substring count without string functions

added 39 characters in body

Source Link

edited Dec 11, 2016 at 22:52

34.1k
4
77
145

Now, what we'd like to do next would be to jump ahead to the next position where a match is possible. We can look at the character there (a b, if I'm counting correctly). Looking at our needle, we know that only occurs at the second position, so we can jump ahead three more, and look at that character. That's a q, which doesn't occur in the needle, so we can jump ahead 5 more characters after that. LatheLather, rinse, repeat.

This is where the variants start to show up. The original Boyer-Moore built a rather complex 2D table with all those possible jumps encoded into it. The variants simplify that to at least some degree, gaining quite a bit in simplicity in exchange for simpler codea loss of efficiency (that's usually fairly minor).

Now, what we'd like to do next would be to jump ahead to the next position where a match is possible. We can look at the character there (a b, if I'm counting correctly). Looking at our needle, we know that only occurs at the second position, so we can jump ahead three more, and look at that character. That's a q, which doesn't occur in the needle, so we can jump ahead 5 more characters after that. Lathe, rinse, repeat.

This is where the variants start to show up. The original Boyer-Moore built a rather complex 2D table with all those possible jumps encoded into it. The variants simplify that to at least some degree, gaining quite a bit in simplicity in exchange for simpler code.

Now, what we'd like to do next would be to jump ahead to the next position where a match is possible. We can look at the character there (a b, if I'm counting correctly). Looking at our needle, we know that only occurs at the second position, so we can jump ahead three more, and look at that character. That's a q, which doesn't occur in the needle, so we can jump ahead 5 more characters after that. Lather, rinse, repeat.

This is where the variants start to show up. The original Boyer-Moore built a rather complex 2D table with all those possible jumps encoded into it. The variants simplify that to at least some degree, gaining quite a bit in simplicity in exchange for a loss of efficiency (that's usually fairly minor).

added 580 characters in body

Source Link

edited Dec 11, 2016 at 20:56

Jerry Coffin

34.1k
4
77
145

Efficiency

If you want to badly enough, I'm pretty sure you can improve the efficiency of your current code, at least for typical cases--but I'm not sure thatI don't think the result will be \$O(N)\$ even if you do.

Looking at overall complexity, this can be a lot lower, especially if our needle is long and consists of unique characters. On the other hand, if it's short and consists entirely of repeated characters (e.g., cc) we're not going to gain much at all. At the same time, if the needle is short, the \$O(MN)\$ method is already close to the \$O(N)\$ we'd like. It's only when \$M\$ is large that the \$O(MN)\$ method becomes a serious problem--and that's exactly when a Boyer-Moore (or variant) search is most likely to give us the biggest gain (the primary exception being when/if the needle is long, but contains lots of repetition). Since big-O deals with the worst case, I think this is still \$O(MN)\$--but that only arises if the needle consists entirely of repetitions of a single character, and the haystack consists almost entirely of that same character as well. That's fairly degenerate (and probably unusual case, but to use @Coderodde's example, if you're searching for cc in cccc, all the work is mostly wasted, because you're going to do a full comparison at every candidate position in the haystack, so it still ends up as \$O(MN)\$.

Your code

If you want to badly enough, I'm pretty sure you can improve the efficiency of your current code--but I'm not sure that the result will be \$O(N)\$ even if you do.

Looking at overall complexity, this can be a lot lower, especially if our needle is long and consists of unique characters. On the other hand, if it's short and consists entirely of repeated characters (e.g., cc) we're not going to gain much at all. At the same time, if the needle is short, the \$O(MN)\$ method is already close to the \$O(N)\$ we'd like. It's only when \$M\$ is large that the \$O(MN)\$ method becomes a serious problem--and that's exactly when a Boyer-Moore (or variant) search is most likely to give us the biggest gain (the primary exception being when/if the needle is long, but contains lots of repetition).

Efficiency

If you want to badly enough, I'm pretty sure you can improve the efficiency of your current code, at least for typical cases--but I don't think the result will be \$O(N)\$ even if you do.

Looking at overall complexity, this can be a lot lower, especially if our needle is long and consists of unique characters. On the other hand, if it's short and consists entirely of repeated characters (e.g., cc) we're not going to gain much at all. At the same time, if the needle is short, the \$O(MN)\$ method is already close to the \$O(N)\$ we'd like. It's only when \$M\$ is large that the \$O(MN)\$ method becomes a serious problem--and that's exactly when a Boyer-Moore (or variant) search is most likely to give us the biggest gain (the primary exception being when/if the needle is long, but contains lots of repetition). Since big-O deals with the worst case, I think this is still \$O(MN)\$--but that only arises if the needle consists entirely of repetitions of a single character, and the haystack consists almost entirely of that same character as well. That's fairly degenerate (and probably unusual case, but to use @Coderodde's example, if you're searching for cc in cccc, all the work is mostly wasted, because you're going to do a full comparison at every candidate position in the haystack, so it still ends up as \$O(MN)\$.

Your code

added 540 characters in body

Source Link

edited Dec 11, 2016 at 20:44

Jerry Coffin

34.1k
4
77
145

Looking at your code, I'm most puzzled by your use of the found buffer. You're basically copying characters from the input to found as long as the two match, then checking the length of found to see if you've gotten a match.

Instead of that, I think I'd move the check for a match at the current position into a separate function, which would implement an algorithm on this general order:

for pos = 0 to needle.length()
    if (hay_stack[start_pos + pos] != needle[pos])
        return false
return true

Looking at your code, I'm most puzzled by your use of the found buffer. You're basically copying characters from the input to found as long as the two match, then checking the length of found to see if you've gotten a match.

Instead of that, I think I'd move the check for a match at the current position into a separate function, which would implement an algorithm on this general order:

for pos = 0 to needle.length()
    if (hay_stack[start_pos + pos] != needle[pos])
        return false
return true

Source Link

answered Dec 11, 2016 at 20:32

Jerry Coffin

34.1k
4
77
145

Loading

Stack Exchange Network

Return to Answer

Efficiency

Your code

Efficiency

Your code