5

Suppose I have a long string containing newlines and tabs, such as:

var x = "This is a long string.\n\t This is another one on next line.";

How can I split this string into tokens using a regular expression?

I don't want to use .split(' ') because I want to learn JavaScript's regex.

A more complicated string could be this:

var y = "This @is a #long $string. Alright, lets split this.";

Now I want to extract only the valid words from this string, without special characters or punctuation, i.e. I want these:

var xwords = ["This", "is", "a", "long", "string", "This", "is", "another", "one", "on", "next", "line"];

var ywords = ["This", "is", "a", "long", "string", "Alright", "lets", "split", "this"];
2
  • What do you want to split it on? You said s.split(' ') but also you mentioned newlines and tabs. You seem to be looking for a regex tutorial, which isn't really Stack Overflow's focus. Commented Dec 9, 2011 at 6:37
  • @nnnnnn: I'm reading this doc from MDN. But at the same time, I'm doing some experiments. This is my first attempt to split a sentence into words. Commented Dec 9, 2011 at 6:39

6 Answers

9

Here is a jsfiddle example of what you asked: http://jsfiddle.net/ayezutov/BjXw5/1/

Basically, the code is very simple:

var y = "This @is a #long $string. Alright, lets split this.";
var regex = /[^\s]+/g; // One or more non-whitespace characters; the g flag finds every match in the string, not just the first

var match = y.match(regex);
for (var i = 0; i<match.length; i++)
{
    document.write(match[i]);
    document.write('<br>');
}

UPDATE: Basically you can expand the list of separator characters: http://jsfiddle.net/ayezutov/BjXw5/2/

var regex = /[^\s\.,!?]+/g;

UPDATE 2: Only word characters (letters, digits, and underscore): http://jsfiddle.net/ayezutov/BjXw5/3/

var regex = /\w+/g;
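
Applied to both strings from the question, the last pattern can be sketched as follows. The helper name `extractWords` is an assumption for illustration and is not part of the original fiddle; because `match` returns `null` when nothing matches, the sketch falls back to an empty array.

```javascript
// Hypothetical helper (not from the original answer): returns every run of
// word characters (letters, digits, underscore) found in the text.
function extractWords(text) {
    // With the g flag, match() returns all matches, or null if there are none.
    return text.match(/\w+/g) || [];
}

var x = "This is a long string.\n\t This is another one on next line.";
var y = "This @is a #long $string. Alright, lets split this.";

var xwords = extractWords(x);
var ywords = extractWords(y);

console.log(xwords.join(", "));
console.log(ywords.join(", "));
```

The `|| []` fallback also makes it safe to loop over the result for an empty or punctuation-only input.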

5 Comments

Both of your examples give the wrong result: the output still contains the special characters.
Hey, I thought this was your intention :) If you want only letters in the output: jsfiddle.net/ayezutov/BjXw5/3. var regex = /\w+/g;
+1. That is good. It seems this can be written in many different ways.
Yes, you are right. Basically, for English, \w is a shorter form of [a-zA-Z0-9_]. Note that in JavaScript \w is ASCII-only, so it does not match letters from other languages.
It looks like \S is equivalent to [^\s].
2

Use \s+ to tokenize the string.
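
As a minimal sketch of this suggestion, `\s+` can be passed to `split` as the separator. The second input string below is made up for illustration. Note that punctuation stays attached to the tokens, and that leading or trailing whitespace produces empty strings unless they are filtered out.

```javascript
var x = "This is a long string.\n\t This is another one on next line.";

// split() with a regex separator treats each run of whitespace
// (spaces, tabs, newlines) as a single delimiter.
var tokens = x.split(/\s+/);

// Leading or trailing whitespace yields empty strings at the ends,
// so filter them out when the input is not trimmed.
var padded = "  padded input\n".split(/\s+/).filter(Boolean);

console.log(tokens);
console.log(padded);
```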

4 Comments

That doesn't seem to work. I did var re = /\s+/; var words = re.exec(x); What am I doing wrong?
@Nawaz var words = x.split(/\s+/);
@Nawaz Also try var words = y.split(/[^A-Za-z0-9]+/); to strip out punctuation, too.
@Kai: OK, that helped for the first string. But it doesn't work with the second string y.
2

exec can loop through the matches to remove non-word (\W) characters.

var A= [], str= "This @is a #long $string. Alright, let's split this.",
rx=/\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g, words;

while((words= rx.exec(str))!= null){
    A.push(words[1]);
}
A.join(', ')

/*  returned value: (String)
This, is, a, long, string, Alright, let's, split, this
*/
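
In newer engines, the same loop can be written more compactly with `String.prototype.matchAll` (ES2020). This is only a sketch of an alternative, assuming an environment that supports `matchAll`; it reuses the regex from the answer unchanged.

```javascript
var str = "This @is a #long $string. Alright, let's split this.";
var rx = /\W*([a-zA-Z][a-zA-Z']*)(\W+|$)/g;

// matchAll() requires the g flag and yields one match object per hit;
// index 1 of each match is the first capture group (the word itself).
var words = Array.from(str.matchAll(rx), function (m) { return m[1]; });

console.log(words.join(", "));
```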

2

Here is a solution using regex groups to tokenise the text using different types of tokens.

You can test the code here https://jsfiddle.net/u3mvca6q/5/

/*
Basic Regex explanation:
/                   Regex start
(\w+)               First group, words     \w means an ASCII word character (letter, digit, or underscore)     + means 1 or more of them
|                   or
(,|!)               Second group, punctuation
|                   or
(\s)                Third group, white spaces
/                   Regex end
g                   "global", enables looping over the string to capture one element at a time

Regex result:
result[0] : default group : any match
result[1] : group1 : words
result[2] : group2 : punctuation , !
result[3] : group3 : whitespace
*/
var basicRegex = /(\w+)|(,|!)|(\s)/g;

/*
Advanced Regex explanation:
[a-zA-Z\u0080-\u00FF] instead of \w     Supports some Unicode letters instead of ASCII letters only. Find Unicode ranges here https://apps.timwhitlock.info/js/regex

(\.\.\.|\.|,|!|\?)                      Identify ellipsis (...) and points as separate entities

You can improve it by adding ranges for special punctuation and so on
*/
var advancedRegex = /([a-zA-Z\u0080-\u00FF]+)|(\.\.\.|\.|,|!|\?)|(\s)/g;

var basicString = "Hello, this is a random message!";
var advancedString = "Et en français ? Avec des caractères spéciaux ... With one point at the end.";

console.log("------------------");
var result = null;
do {
    result = basicRegex.exec(basicString)
    console.log(result);
} while(result != null)

console.log("------------------");
var result = null;
do {
    result = advancedRegex.exec(advancedString)
    console.log(result);
} while(result != null)

/*
Output:
Array [ "Hello",        "Hello",        undefined,  undefined ]
Array [ ",",            undefined,      ",",        undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "this",         "this",         undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "is",           "is",           undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "a",            "a",            undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "random",       "random",       undefined,  undefined ]
Array [ " ",            undefined,      undefined,  " "       ]
Array [ "message",      "message",      undefined,  undefined ]
Array [ "!",            undefined,      "!",        undefined ]
null
*/
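
Building on the group layout above, one possible way to use the result is to label each token by the group that matched. The function name `tokenize` and the type labels are assumptions made for this sketch; they do not come from the linked fiddle.

```javascript
// Hypothetical helper: labels each token by the capture group that matched.
function tokenize(text) {
    // A fresh regex per call avoids sharing lastIndex state between calls.
    var regex = /(\w+)|(,|!)|(\s)/g;
    var tokens = [];
    var result;
    while ((result = regex.exec(text)) !== null) {
        var type = result[1] !== undefined ? "word"
                 : result[2] !== undefined ? "punctuation"
                 : "whitespace";
        tokens.push({ type: type, value: result[0] });
    }
    return tokens;
}

var tokens = tokenize("Hello, this is a random message!");

// Keep only the words, dropping punctuation and whitespace tokens.
var wordsOnly = tokens
    .filter(function (t) { return t.type === "word"; })
    .map(function (t) { return t.value; });

console.log(wordsOnly);
```

Characters that match none of the three groups are silently skipped by `exec`, so a production tokenizer would probably want a catch-all group as well.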

1
var words = y.split(/[^A-Za-z0-9]+/);
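
One caveat with this approach: when the string starts or ends with a separator, as y does with its final period, `split` returns an empty string at that position. A small sketch of filtering those out:

```javascript
var y = "This @is a #long $string. Alright, lets split this.";

// The trailing "." is a separator, so the raw result ends with "".
var raw = y.split(/[^A-Za-z0-9]+/);

// filter(Boolean) drops the empty strings and keeps the real words.
var words = raw.filter(Boolean);

console.log(raw.length, words.length);
console.log(words);
```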

0

To extract only word characters, use the \w symbol. Whether it matches Unicode characters is implementation-dependent, and you can use this reference to see what the case is for your language/library.

Please see Alexander Yezutov's answer (update 2) on how to apply this into an expression.
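
In JavaScript specifically, \w covers only ASCII letters, digits, and underscore. As a sketch of one alternative, assuming an engine with ES2018 Unicode property escapes, `\p{L}` with the `u` flag matches letters from any script. The sample text is written with \u escapes so the accented characters are unambiguous.

```javascript
// "Et en français ? Avec des caractères spéciaux ..."
var text = "Et en fran\u00e7ais ? Avec des caract\u00e8res sp\u00e9ciaux ...";

// \w stops at accented letters, so some words are cut into pieces.
var asciiWords = text.match(/\w+/g) || [];

// \p{L} (any Unicode letter) needs the u flag to be recognized.
var unicodeWords = text.match(/\p{L}+/gu) || [];

console.log(asciiWords);
console.log(unicodeWords);
```

With `\w+` the word "français" comes out as "fran" and "ais", while the `\p{L}+` version keeps it whole.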
