I have a string test = "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."

I am trying to split the string into words and punctuation marks using JavaScript. I can separate the words, but the punctuation marks disappear when I use this regex:

var result= test.match(/\b(\w|')+\b/g);

So my expected output is

hello
how 
are 
you
all
doing
,
I
hope
that
it's
good
!
and 
fine
.
Looking
forward
to
see
you

2 Answers

14

Simple approach

This first approach works if your definition of "word" matches JavaScript's. A more customizable approach is below.

Try test.split(/\s*\b\s*/). It splits on word boundaries (\b) and eats whitespace.

"hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."
    .split(/\s*\b\s*/);
// Returns:
["hello",
"how",
"are",
"you",
"all",
"doing",
",",
"I",
"hope",
"that",
"it",
"'",
"s",
"good",
"!",
"and",
"fine",
".",
"Looking",
"forward",
"to",
"see",
"you",
"."]

How it works.

var test = "This is. A test?"; // Test string.

// First consider splitting on word boundaries (\b).
test.split(/\b/); //=> ["This"," ","is",". ","A"," ","test","?"]
// This almost works but there is some unwanted whitespace.

// So we change the split regex to gobble the whitespace using \s*
test.split(/\s*\b\s*/) //=> ["This","is",".","A","test","?"]
// Now the whitespace is included in the separator
// and not included in the result.
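
As a quick check on where JavaScript's built-in notion of a word stops, here is a small sketch. The helper name splitOnBoundaries and the sample strings are made up for illustration and are not part of the question. The point is that \w covers only ASCII letters, digits and the underscore, so \b also fires next to accented letters and hyphens.

```javascript
// Hypothetical helper: split on word boundaries and swallow the
// surrounding whitespace, using the same regex as above.
function splitOnBoundaries(text) {
    return text.split(/\s*\b\s*/);
}

// Illustrative inputs (assumptions, not from the question).
var accented = splitOnBoundaries("caf\u00e9 au lait");
var hyphenated = splitOnBoundaries("one-thousand");

console.log(accented);   // ["caf", "é", "au", "lait"]
console.log(hyphenated); // ["one", "-", "thousand"]
```

If either of these results is not what you want, the split-on-\b approach is probably too coarse for your text.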

More involved solution.

If you want words like "isn't" and "one-thousand" to be treated as single words, even though JavaScript's \b splits each of them at the apostrophe or hyphen, you will need to create your own definition of a word.

test.match(/[\w'-]+|[^\w\s]+/g) //=> ["This","is",".","A","test","?"]

How it works

This matches the actual words and punctuation characters separately using an alternation. The first half of the regex, [\w'-]+, matches whatever you consider to be a word (the hyphen is placed last in the class so that it is taken literally), and the second half, [^\w\s]+, matches whatever you consider punctuation. In this example I just used whatever isn't a word character or whitespace. I also put a + on the end so that multi-character punctuation (such as ?!, which is properly written ‽) is treated as a single token; if you don't want that, remove the +.
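
To make the effect of the custom word definition concrete, here is a minimal sketch that runs the alternation on the string from the question and then compares the pattern with and without the trailing +. The character class is written as [\w'-], with the hyphen last so that it is unambiguously literal. The "Really?! one-thousand" input is an assumption added for illustration.

```javascript
var test = "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you.";

// A "word" is any run of word characters, apostrophes and hyphens;
// "punctuation" is any run of characters that are neither word
// characters nor whitespace.
var tokens = test.match(/[\w'-]+|[^\w\s]+/g);
console.log(tokens.indexOf("it's") !== -1); // true: "it's" stays in one piece

// Illustrative input (not from the question) with two adjacent marks.
var sample = "Really?! one-thousand";
var grouped = sample.match(/[\w'-]+|[^\w\s]+/g); // with +
var single = sample.match(/[\w'-]+|[^\w\s]/g);   // without +

console.log(grouped); // ["Really", "?!", "one-thousand"]
console.log(single);  // ["Really", "?", "!", "one-thousand"]
```

One caveat of this word definition: a hyphen or apostrophe standing on its own is matched by the first alternative, so it is reported as a "word" rather than as punctuation.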

5 Comments

Hey, it worked, but can you explain what exactly /\s*\b\s*/ does?
Check the character classes. You could try something like /[^\w\s]/, where you are saying anything that is not a word character and not whitespace. It's not perfect, but you can add more exceptions to it. Or just list everything you consider punctuation: /[!@#$%^&*(),.;:'"/?\\]/.
Hey, I got all of it, thanks. Just one last thing: I have the word isn't, and it is broken into isn, ' and t. I don't want that; I want it to be one complete word: isn't.
I added a second approach that can use a custom definition of "word".
How would you modify this to keep all the whitespace?
4

Use this:

[,.!?;:]|\b[a-z']+\b

See the matches in the demo.

For instance, in JavaScript:

resultArray = yourString.match(/[,.!?;:]|\b[a-z']+\b/ig);

Explanation

  • The character class [,.!?;:] matches one character from inside the brackets
  • OR (alternation |)
  • \b matches a word boundary
  • [a-z']+ matches one or more letters or apostrophes (the i flag makes this case-insensitive)
  • \b matches a closing word boundary
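
The pieces above can be sketched on the string from the question as follows. The second pattern, which adds a hyphen to the character class so that hyphenated words stay together, is an assumption of this sketch and not part of the answer; the "co-education is fine." input is likewise illustrative.

```javascript
var yourString = "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you.";

// One listed punctuation mark, OR a run of letters/apostrophes
// delimited by word boundaries (case-insensitive, global).
var resultArray = yourString.match(/[,.!?;:]|\b[a-z']+\b/ig);
console.log(resultArray.length); // 22

// Assumed variant: also allow a hyphen inside a word.
// The hyphen goes last in the class so it is taken literally.
var hyphenated = "co-education is fine.".match(/[,.!?;:]|\b[a-z'-]+\b/ig);
console.log(hyphenated); // ["co-education", "is", "fine", "."]
```

Note that this pattern only reports the punctuation marks listed in the first character class, and only letters count as word characters; anything else, such as digits, quotes or parentheses, is silently skipped.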

2 Comments

FYI: Added demo and explanation. :)
Hey bro, it separates co-education into two separate words. Can I have it as one word?
