Count the occurrences of substrings within a string without counting duplicates

Question

For example, I have a list of terms and a string:

var terms = new[] { "programming language", "programming", "language" };

var content = "A programming language is a formal language that "
    + "specifies a set of instructions that can be used to "
    + "produce various kinds of output.";

I can call Regex.Matches(content, term).Count for each term and sum the results, which tells me the terms appear 4 times in the string:

"programming language": 1 time
"programming": 1 time
"language": 2 times

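But these counts contain duplicates: the single occurrence of "programming language" is also counted as "programming" and as "language". There should be only 2 occurrences.

My current solution is to save the start index and end index of each occurrence, then check each new occurrence against the saved ones to see whether it falls inside a range that has already been counted. Is there a better way that does not use start and end indexes?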
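
For reference, the index-based approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (OverlapCounter, CountNonOverlapping) are hypothetical, and processing the longest terms first is one possible choice, not necessarily what the asker's code does.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class OverlapCounter
{
    // Hypothetical helper: counts matches of all terms, skipping any match
    // whose character range overlaps a match that was already accepted.
    public static int CountNonOverlapping(string content, IEnumerable<string> terms)
    {
        var accepted = new List<(int Start, int End)>(); // End is exclusive

        // Assumption: longest terms go first, so "programming language" claims
        // its range before "programming" and "language" are tried.
        foreach (var term in terms.OrderByDescending(t => t.Length))
        {
            foreach (Match m in Regex.Matches(content, Regex.Escape(term)))
            {
                int start = m.Index, end = m.Index + m.Length;
                bool overlaps = accepted.Any(r => start < r.End && r.Start < end);
                if (!overlaps) accepted.Add((start, end));
            }
        }
        return accepted.Count;
    }

    static void Main()
    {
        var terms = new[] { "programming language", "programming", "language" };
        var content = "A programming language is a formal language that "
            + "specifies a set of instructions that can be used to "
            + "produce various kinds of output.";
        Console.WriteLine(CountNonOverlapping(content, terms)); // prints 2
    }
}
```

The bookkeeping of ranges is exactly the part the question hopes to avoid.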

How do you build your regex? (programming language|programming|language) should do what you want, if you do it right.
– Rawling, Commented Jul 6, 2017 at 14:57
Got you. Are you running the regex in one go, or are you splitting it? If you are splitting it, it's simple: run the more specific regex first and maintain a hash set of the terms already found. Do not run a regex if what it is looking for is contained in the hash set. If it all runs as one regex then I can't help you, although I'm sure there's probably a way.
– Thomas Cook, Commented Jul 6, 2017 at 15:06
@TimSchmelter Because programming language counts as one term. If I remove it, the current example would return 3 occurrences, not the 2 I expect.
– MiP, Commented Jul 6, 2017 at 15:18
@TimSchmelter I thought Count worked like an accumulator: programming occurs once and language occurs twice, so doesn't the sum come to 3?
– MiP, Commented Jul 6, 2017 at 15:23

MiP · Accepted Answer · 2017-07-06 15:50:13Z

Following the suggestions in the comments, I have a simple solution using a single regex. It matches exact whole words only, i.e. programming language is counted but programming languages is not:

var pattern = @"(?<!\S)programming language(?![^\s])|(?<!\S)programming(?![^\s])|(?<!\S)language(?![^\s])";
var count = Regex.Matches(content, pattern).Count;

Note: this pattern only works when programming language is listed before programming and language, because the regex engine tries the alternatives from left to right and takes the first one that matches. If anyone can contribute a better solution, please do so.
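
One way to avoid ordering the terms by hand might be to sort them by length before joining them, so that a longer term always comes before any term it starts with. The sketch below assumes that approach; TermPattern and Build are hypothetical names, and it groups the alternatives so the two lookarounds are written only once.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TermPattern
{
    // Hypothetical helper: joins the terms into one alternation, longest first.
    public static string Build(string[] terms)
    {
        var alternatives = terms
            .OrderByDescending(t => t.Length) // longer terms before their prefixes
            .Select(Regex.Escape);            // treat each term as literal text
        return @"(?<!\S)(?:" + string.Join("|", alternatives) + @")(?!\S)";
    }

    static void Main()
    {
        // Deliberately listed in the "wrong" order.
        var terms = new[] { "language", "programming", "programming language" };
        var content = "A programming language is a formal language that "
            + "specifies a set of instructions that can be used to "
            + "produce various kinds of output.";
        var count = Regex.Matches(content, Build(terms)).Count;
        Console.WriteLine(count); // prints 2
    }
}
```

Regex.Escape is there in case a term contains characters that are special in a regex, such as C++ or C#.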

edited Jul 6, 2017 at 15:50

answered Jul 6, 2017 at 15:32


3 Comments

Rawling Over a year ago

You can probably use \b instead of (?<!\S) or (?![^\s]) to detect word edges. Other than that, all you have left to do is find a way to automatically order the search terms...

MiP Over a year ago

@Rawling I'm new at regex, could you please write an example using \b to detect edges?

Rawling Over a year ago

Something like \b(xy|y|z)\b. \b matches the point between a word character (letter, number, underscore) and a non-word character (anything else).
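
As a hedged illustration of that suggestion, the sketch below runs the same three terms through a \b pattern and through the whitespace lookarounds from the answer. The class and method names are made up for the example, and the sample sentence is shortened so that a term sits directly before a full stop, which is where the two patterns differ.

```csharp
using System;
using System.Text.RegularExpressions;

static class BoundaryDemo
{
    // Longest term first, as the answer's note requires.
    const string Terms = "programming language|programming|language";

    // \b only needs a word/non-word transition, so punctuation ends a word.
    public static int CountWithWordBoundary(string content) =>
        Regex.Matches(content, @"\b(?:" + Terms + @")\b").Count;

    // The lookarounds require whitespace (or the string edge) on both sides.
    public static int CountWithWhitespaceLookarounds(string content) =>
        Regex.Matches(content, @"(?<!\S)(?:" + Terms + @")(?!\S)").Count;

    static void Main()
    {
        var content = "A programming language is a formal language.";
        Console.WriteLine(CountWithWordBoundary(content));          // prints 2
        Console.WriteLine(CountWithWhitespaceLookarounds(content)); // prints 1
    }
}
```

The second count is lower because "language." is followed by a full stop rather than whitespace, so which pattern fits depends on whether terms next to punctuation should be counted.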
