3

I need to automatically generate tags for a text string. In this case, I'll use this string:

var text = 'This text talks about loyalty in the Royal Family with Príncipe Charles';

My current implementation generates tags for words that are 6+ characters long, and it works fine.

// Strip everything except ASCII letters and whitespace.
var words = text.replace(/[^a-zA-Z\s]/g, '');
// Keep only words of six or more characters.
words = words.match(/\w{6,}/g);
console.log(words);

This will return:

["loyalty","Family","Prncipe","Charles"]

The problem is that sometimes, a tag should be a specific set of words. I need the result to be:

["loyalty","Royal Family","Príncipe Charles"]

That means the replace/match code should test for:

  1. words that are 6 characters long (or more); and/or
  2. if a set of words starts with an uppercase letter, those words should be joined together in the same array element. It doesn't matter if some of the words are less than 6 characters long - but at least one of them has to be 6+, e.g.: "Stop at The UK Guardián in London" should return ["The UK Guardián", "London"]

I'm obviously having trouble with the second requirement. Any ideas? Thanks!

2 Answers

7
var text = 'This text talks about loyalty in the Royal Family with Prince Charles. Stop at The UK Guardian in London';

text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g)

will return

["loyalty", "Royal Family ", "Prince Charles", "The UK Guardian ", "London"]

To fulfill the second requirement, it's better to run another regexp over the matches found:

var text = 'This is a Short Set Of Words about the Royal Family';

var matches = text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g);
// Keep only the matches that contain at least one word of 6+ characters.
matches = matches.filter(function(m) {
    return /\w{6,}/.test(m);
});
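
The match and filter steps above can be combined into one function. This is only a sketch: the name extractTags and the final trim step (which drops the trailing space left by \s*) are my own additions, not part of the answer.

```javascript
// Hypothetical helper: extractTags is an illustrative name, not from the answer.
function extractTags(text) {
    // Step 1: runs of two or more capitalized words, or single words of 6+ characters.
    var candidates = text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g) || [];
    return candidates
        // Step 2: keep only candidates that contain at least one 6+ character word.
        .filter(function (m) { return /\w{6,}/.test(m); })
        // Step 3 (assumption): remove the trailing whitespace captured by \s*.
        .map(function (m) { return m.trim(); });
}

console.log(extractTags('This is a Short Set Of Words about the Royal Family'));
// ["Royal Family"]
console.log(extractTags('I Am Cool'));
// []
```

Falling back to an empty array when match returns null keeps the function safe for strings with no candidates at all.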

6 Comments

This seems to work, but it will also match 'I Am Cool', which is not a match as none of the words have >= 6 characters.
+1, good job with that update. This seems to work just as the OP wants :-)
great solution! just one important thing, the solution should consider special characters. For example, "Princé Hermione" is returning ["Hermione"]; and "superhábilmente" is returning ["superh","bilmente"]
@andufo: that's true. \w, \d and friends are not unicode-aware in javascript (what a shame!)
@andufo: you could replace \w with an explicit character class, like [\wéáè]
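
As an alternative to listing accented letters by hand, newer JavaScript engines support Unicode property escapes. The sketch below assumes an ES2018-capable engine (the u flag with \p{...}); the function name extractTagsUnicode and the NFC normalization step are my own assumptions, not something proposed in the comments.

```javascript
// Assumes ES2018 Unicode property escapes: \p{Lu} = uppercase letter, \p{L} = any letter.
// extractTagsUnicode is a hypothetical name used only for illustration.
function extractTagsUnicode(text) {
    // Normalize so that accented letters are single code points (assumption: NFC input is wanted).
    var normalized = text.normalize('NFC');
    // Runs of two or more capitalized words, or single words of 6+ letters.
    var pattern = /(?:\p{Lu}\p{L}*\s*){2,}|\p{L}{6,}/gu;
    var candidates = normalized.match(pattern) || [];
    return candidates
        // Keep only candidates with at least one word of 6+ letters.
        .filter(function (m) { return /\p{L}{6,}/u.test(m); })
        // Remove the trailing whitespace captured by \s*.
        .map(function (m) { return m.trim(); });
}

console.log(extractTagsUnicode('Princé Hermione'));
// ["Princé Hermione"]
console.log(extractTagsUnicode('superhábilmente'));
// ["superhábilmente"]
```

Unlike \w, \p{L} matches letters only, so digits and underscores no longer count toward the six-character minimum.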
0

Okay, here's an idea. This is probably not the very best way to do this, but it might be a good start for you.

To match strings like Royal Family and Prince Charles, or perhaps even The United Kingdom, you could write a regex that looks for a run of consecutive words that each start with a capital letter.

This might look like this: ([A-Z][a-z]{5,} )+

You could then use the replace function to generate a new string with the matches removed and then use your original regex to match single words of a minimum length.

Update: In response to the comment on the other user's answer, I have added the {5,} quantifier, so the pattern matches a capital letter followed by five or more lowercase letters and a space, one or more times.
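
A rough sketch of this two-pass idea follows. It deliberately departs from the pattern given above: instead of requiring every word in a run to have 6+ letters, it accepts any run of capitalized words and then checks that at least one word is long enough, which is closer to what the question asks for. The function name and both patterns are assumptions of mine, and the result lists phrases first and single words second rather than in text order.

```javascript
// Hypothetical helper illustrating the two-pass approach; not the answer's exact regex.
function extractTagsTwoPass(text) {
    // Pass 1: runs of two or more capitalized ASCII words separated by single spaces.
    var phrasePattern = /\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)+/g;
    var phrases = (text.match(phrasePattern) || []).filter(function (p) {
        // Keep a run only if at least one of its words has 6+ letters.
        return /[A-Za-z]{6,}/.test(p);
    });
    // Pass 2: remove every capitalized run, then collect the remaining 6+ letter words.
    var rest = text.replace(phrasePattern, ' ');
    var words = rest.match(/[A-Za-z]{6,}/g) || [];
    return phrases.concat(words);
}

console.log(extractTagsTwoPass('Stop at The UK Guardian in London'));
// ["The UK Guardian", "London"]
```

Removing the runs before the second pass prevents a word such as Family from being reported twice, once inside its phrase and once on its own.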
