For example, I have a list of terms and a string:

var terms = new[] { "programming language", "programming", "language" };

var content = "A programming language is a formal language that "
    + "specifies a set of instructions that can be used to "
    + "produce various kinds of output.";

Calling Regex.Matches(content, term).Count for each term and summing the results tells me that the terms appear 4 times in the string:

  • "programming language": 1 time
  • "programming": 1 time
  • "language": 2 times

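But these counts overlap: the "programming" match and the first "language" match both lie inside the "programming language" match. Without the duplicates, there should be only 2 occurrences.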
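To make the double counting concrete, here is a minimal sketch of the per-term count described above. The class and method names (NaiveTermCount, CountPerTerm) are made up for illustration, and Regex.Escape is added on the assumption that the terms are meant as literal text rather than patterns.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NaiveTermCount
{
    // Sums the match count of each term separately, so text covered by
    // several terms is counted once per term.
    public static int CountPerTerm(string content, string[] terms)
    {
        return terms.Sum(term => Regex.Matches(content, Regex.Escape(term)).Count);
    }

    public static void Main()
    {
        var terms = new[] { "programming language", "programming", "language" };
        var content = "A programming language is a formal language that "
            + "specifies a set of instructions that can be used to "
            + "produce various kinds of output.";

        // 1 ("programming language") + 1 ("programming") + 2 ("language")
        Console.WriteLine(CountPerTerm(content, terms)); // prints 4
    }
}
```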

My current solution is to save the start and end index of each occurrence, then check each new match against the saved occurrences to see whether it falls inside a range that has already been counted. Is there a better way that does not involve tracking start and end indexes?

  • How do you build your regex? (programming language|programming|language) should do what you want, if you do it right. Commented Jul 6, 2017 at 14:57
  • Show what you've tried before. Commented Jul 6, 2017 at 14:57
  • Got you, OK. Are you running the regex in one go or are you splitting it? If you are splitting it, then it's simple: run the more specific regex first and maintain a hash set of already found terms. Do not run a regex if what it's looking for is contained in the hash set. If it's all running as part of one regex then I can't help you, although I'm sure there's probably a way. Commented Jul 6, 2017 at 15:06
  • @TimSchmelter Because programming language counts as one term. If I remove it, the current example would return 3 occurrences, not the 2 I expect. Commented Jul 6, 2017 at 15:18
  • @TimSchmelter I thought Count worked like an accumulator, so programming occurs once and language occurs twice. After summing, doesn't it return 3? Commented Jul 6, 2017 at 15:23

1 Answer

Following the suggestions in the comments, I have a simple regex solution. It matches exact whole terms only, i.e. programming language is counted as that term but programming languages is not:

var pattern = @"(?<!\S)(?:programming language|programming|language)(?!\S)";
var count = Regex.Matches(content, pattern).Count;

Note: this pattern only works when programming language is listed before programming and language, because the regex engine tries the alternatives from left to right and takes the first one that matches. If anyone can contribute a better solution, please do so.
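One possible way to lift that ordering restriction is to build the pattern from the term list and sort the terms by length, longest first, so that any term that is a prefix of another is always tried later. The following is only a sketch: TermCounter, BuildPattern and CountOccurrences are hypothetical names, and it assumes the terms are non-empty literal strings (hence Regex.Escape).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TermCounter
{
    // Builds a single alternation with the longest terms first, wrapped in
    // the same whitespace lookarounds as the hand-written pattern.
    public static string BuildPattern(string[] terms)
    {
        var alternatives = terms
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t)); // treat each term as literal text
        return @"(?<!\S)(?:" + string.Join("|", alternatives) + @")(?!\S)";
    }

    // Counts non-overlapping matches of any term in a single pass.
    public static int CountOccurrences(string content, string[] terms)
    {
        return Regex.Matches(content, BuildPattern(terms)).Count;
    }

    public static void Main()
    {
        // Deliberately listed shortest first to show that input order no longer matters.
        var terms = new[] { "language", "programming", "programming language" };
        var content = "A programming language is a formal language that "
            + "specifies a set of instructions that can be used to "
            + "produce various kinds of output.";

        Console.WriteLine(CountOccurrences(content, terms)); // prints 2
    }
}
```

Because Regex.Matches returns non-overlapping matches, text consumed by "programming language" is not offered again to "programming" or "language", so no index bookkeeping is needed.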

3 Comments

You can probably use \b instead of (?<!\S) or (?![^\s]) to detect word edges. Other than that, all you have left to do is find a way to automatically order the search terms...
@Rawling I'm new at regex, could you please write an example using \b to detect edges?
Something like \b(xy|y|z)\b. \b matches the point between a word character (letter, number, underscore) and a non-word character (anything else).
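The \b suggestion from these comments could look roughly like the sketch below, applied to the terms from the question. WordBoundaryCounter and Count are hypothetical names, and the longest-first ordering is an added assumption so that the input order of the terms does not matter. Note that \b behaves differently from the whitespace lookarounds next to punctuation: a term directly followed by a period still counts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordBoundaryCounter
{
    // Counts non-overlapping matches of any term, using \b for the word edges.
    // Assumes every term starts and ends with a word character.
    public static int Count(string content, string[] terms)
    {
        var alternatives = terms
            .OrderByDescending(t => t.Length)
            .Select(t => Regex.Escape(t));
        var pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";
        return Regex.Matches(content, pattern).Count;
    }

    public static void Main()
    {
        var terms = new[] { "programming language", "programming", "language" };
        var content = "A programming language is a formal language that "
            + "specifies a set of instructions that can be used to "
            + "produce various kinds of output.";

        Console.WriteLine(Count(content, terms));            // prints 2
        Console.WriteLine(Count("C is a language.", terms)); // prints 1
    }
}
```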
