13

I need a tokenizer that, given a string with arbitrary whitespace between words, creates an array of words without empty substrings.

For example, given a string:

" I dont know what you mean by glory Alice said."

I use:

str2.split(" ")

This also returns empty sub-strings:

["", "I", "dont", "know", "what", "you", "mean", "by", "glory", "", "Alice", "said."]

How can I filter out empty strings from an array?

7 Answers

18

You probably don't even need to filter, just split using this Regular Expression:

"   I dont know what you mean by glory Alice said.".split(/\b\s+/)

4 Comments

Off-topic: what does \b mean in a regex?
It matches a word boundary: the position between a word character (\w) and a non-word character such as a space, newline, or punctuation character, or the start or end of the string next to a word character (developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions). It might not be the perfect regex, but it works for that example.
@Mustafa yeah, I know. But it is just a curiosity.
I like the regex, but how do I account for ", " (comma, space)?
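
On the comma question above, here is a minimal sketch of one possible approach, not necessarily what the answerer had in mind. The `sample` string and variable names are made up for illustration. It shows that `/\b\s+/` only splits on whitespace that directly follows a word character, and that splitting on runs of whitespace and/or commas and then dropping empty strings handles both the leading spaces and ", ".

```javascript
// Hypothetical sample input: leading spaces plus a comma-and-spaces separator.
var sample = "  by glory,  Alice said.";

// \b\s+ needs a word character right before the whitespace, so neither the
// leading spaces nor the spaces after the comma are treated as separators.
var viaBoundary = sample.split(/\b\s+/);

// Split on runs of whitespace and/or commas, then drop any empty strings
// (one appears at the front because the string starts with a separator).
var tokens = sample.split(/[\s,]+/).filter(function (s) {
  return s !== "";
});

console.log(viaBoundary);
console.log(tokens);
```

With this input, `viaBoundary` keeps "  by" and "glory,  Alice" as single pieces, while `tokens` contains only the four words.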
11
 str.match(/\S+/g) 

returns a list of non-space sequences ["I", "dont", "know", "what", "you", "mean", "by", "glory", "Alice", "said."] (note that this includes the dot in "said.")

 str.match(/\w+/g) 

returns a list of all words: ["I", "dont", "know", "what", "you", "mean", "by", "glory", "Alice", "said"]

docs on match()
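
One caveat worth sketching: with the `g` flag, `match()` returns `null` rather than an empty array when nothing matches (for example, on an empty or all-whitespace string). The `tokenize` helper below is a hypothetical name, shown only to illustrate a fallback to `[]`.

```javascript
// Hypothetical helper: returns an array of non-whitespace runs,
// falling back to [] because match() yields null when there is no match.
function tokenize(str) {
  return str.match(/\S+/g) || [];
}

console.log(tokenize("  I dont know  "));
console.log(tokenize("   "));
```

This way callers can always use array methods such as `.length` or `.forEach` on the result without a null check.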

1 Comment

Good answer. For others' reference, /\S+/ matches groups of characters that are not whitespace, whereas /\w+/ matches groups of characters that are alphanumeric or underscore. That's why the period (.) character matches in one but not the other.
7

You should trim the string before using split.

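var str = " I dont know what you mean by glory Alice said.";
// Strip leading and trailing whitespace (same effect as str.trim() in ES5+)
var trimmed = str.replace(/^\s+|\s+$/g, '');
// Split the trimmed string on runs of whitespace so repeated spaces don't yield empty strings
var words = trimmed.split(/\s+/);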


2

I recommend .match:

str.match(/\b\w+\b/g);

This matches runs of word characters between word boundaries, so whitespace is never matched and therefore never ends up in the resulting array.

2 Comments

This works even better: >>> str2 "Humpty Dumpty smiled contemptuously Of course you dont—till I tell you I meant theres a nice knock-down argument for you! " Using: str3 = str2.match(/\b\w+\b/g); Results in: >>> str3 ["Humpty", "Dumpty", "smiled", "contemptuously", "Of", "course", "you", "dont", "till", "I", "tell", "you", "I", "meant", "theres", "a", "nice", "knock", "down", "argument", "for", "you"] So \w+ also copes with "—" (it is treated as a separator).
@dokondr: What do you count as word characters? If it's everything except spaces, you may want to just use [^ ] instead of \w.
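
To illustrate the point about choosing what counts as a word character, here is a small sketch. The `text` sample and the character class `[\w'-]` are assumptions made for this example, not something proposed in the thread; the idea is simply to widen the set of allowed characters so apostrophes and hyphens stay inside a word.

```javascript
// Hypothetical sample containing an apostrophe, an em dash, and a hyphen.
var text = "you don't—till knock-down argument!";

// \w+ stops at every non-word character, so "don't" and "knock-down" are split.
var plainWords = text.match(/\w+/g) || [];

// Widening the class to include ' and - keeps those words whole,
// while the em dash and "!" still act as separators.
var widerWords = text.match(/[\w'-]+/g) || [];

console.log(plainWords);
console.log(widerWords);
```

Which characters belong in the class depends on the text being tokenized, so this should be adjusted rather than copied as is.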
0

See the filter method:

http://www.hunlock.com/blogs/Mastering_Javascript_Arrays#quickIDX13
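
A minimal sketch of what using `filter` could look like for the array in the question. The double space between "glory" and "Alice" in the sample is an assumption, inferred from the empty string in the question's output; `Array.prototype.filter` requires an ES5-capable environment.

```javascript
// Sample string modelled on the question (double space after "glory" assumed).
var str2 = " I dont know what you mean by glory  Alice said.";

// split(" ") produces empty strings for the leading space and the double space;
// filter keeps only the elements for which the callback returns true.
var words = str2.split(" ").filter(function (word) {
  return word.length > 0;
});

console.log(words);
```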


0

I think the empty substrings appear because there are multiple consecutive spaces. You can call replace() in a loop to collapse each run of spaces into a single space, then use split() to split the text on a single space, like this:

// Get the full program text from the div
var program = document.getElementById("ans").textContent;
// Trim the ends so leading/trailing spaces don't produce empty strings
var res = program.trim();
// Collapse runs of spaces: each pass replaces the first double space found
while (res.indexOf("  ") !== -1) {
  res = res.replace("  ", " ");
}
// Split into words using a single space as the separator
var result = res.split(" ");


0

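This is all we need (splitting on /\s+/ rather than ' ' so that repeated spaces inside the string don't produce empty strings):

str.trim().split(/\s+/)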
