This is the regular expression I was using for this piece of text:

(?![!',:;?\-\d])(\w[A-Za-z']+)

The flavour of regexp is ECMAScript (JavaScript)

The sample text:

This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some.   ABBREVIATIONS.Too.

This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some.   ABBREVIATIONS.Too.

A.S.A.P.?

Ctrl+Alt+Delete  

Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.

It does everything I want, except that I can't finish the regexp so that it also matches the single letters a b c d e F G H I J, which is [a-zA-Z] in regexp terms.

I don't want text such as U.S.A to be matched, and this is where I'm having trouble.

I've tried the solution from "How to include character in regular expression", but I couldn't get it to work because my issue is more complex.

My goal is to wrap each matching item with some arbitrary text.

Here's the link for the same regular expression example: https://regex101.com/r/Qdq4AY/4

  • You might rule out everything that you don't want to match and capture what you want to keep: \.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*) regex101.com/r/8O8GG6/1 Commented Nov 27, 2019 at 15:05
  • I want to also match the single-letter words, like a and a b c d e F G H I J. I don't want to remove U.S.A. from the text; I just don't want it to be matched. Commented Nov 27, 2019 at 15:19
  • What is the regex flavor / tool / language? regex101.com/r/lYdw5i/1 Commented Nov 27, 2019 at 15:23
  • I've updated the OP to include that. It's ECMAScript (JavaScript). Commented Nov 27, 2019 at 15:36
  • Currently you are getting separate matches, which I think you could also get using the capturing group version ideone.com/8ZnCvz Commented Nov 27, 2019 at 15:44

1 Answer

A few notes about the pattern you tried

  • The pattern (?![!',:;?\-\d])(\w[A-Za-z']+) will not match a single character, because the part \w[A-Za-z']+ matches at least 2 characters due to the + quantifier.
  • The negative lookahead (?![!',:;?\-\d]) asserts that the next character is not one of !',:;?- or a digit, and then \w matches a word character. Of the characters in that class, \w can only match a digit, so only the \d has any effect and the rest of the class is redundant.
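Both notes can be checked with a small sketch. The two short sample strings are fragments of the text in the question, and the `simplified` pattern and the `words` helper are only names made up for this illustration:

```javascript
// The pattern from the question, with the g flag so matchAll can be used.
const original = /(?![!',:;?\-\d])(\w[A-Za-z']+)/g;
// Assumed simplification: the same pattern with only \d left in the lookahead.
const simplified = /(?!\d)(\w[A-Za-z']+)/g;

// Collect the group 1 value of every match.
const words = (re, text) => [...text.matchAll(re)].map((m) => m[1]);

// Single letters are skipped because \w[A-Za-z']+ needs at least 2 characters.
const tail = "as a b c d e F G H I J business analyst.";
const tailWords = words(original, tail);
console.log(tailWords); // [ 'as', 'business', 'analyst' ]

// Both lookaheads give the same matches on this fragment.
const fragment = "1.5 million dollars, isn't it?";
const sameResult =
  JSON.stringify(words(original, fragment)) ===
  JSON.stringify(words(simplified, fragment));
console.log(sameResult); // true
```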

One option is to match what you don't want to keep, and to capture what you do want to keep:

\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)

In parts

  • \.? Match an optional dot
  • [a-zA-Z](?:\.[a-zA-Z])+\.? Match a single char a-zA-Z, followed by 1+ repetitions of a dot and a single char a-zA-Z, and then an optional dot
  • | Or
  • \.[a-zA-Z]\. Match a char a-zA-Z between 2 dots
  • | Or
  • (?!\d) Assert what is on the right is not a digit
  • (\w[A-Za-z']*) Capture in group 1: match a single word char, then repeat 0+ times any of the chars listed in the character class
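As a rough illustration of how the alternation behaves, the sketch below sorts the matches on a short, made-up sample string into those that set group 1 and those that do not. The names `kept` and `skipped` are only for this example:

```javascript
const pattern = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;

// Made-up sample: ordinary words, a dotted abbreviation and single letters.
const text = "the U.S.A but a b c";

const kept = [];    // matches of the last alternative (group 1 is set)
const skipped = []; // matches of the first two alternatives (no group 1)

for (const m of text.matchAll(pattern)) {
  if (m[1] !== undefined) {
    kept.push(m[1]);
  } else {
    skipped.push(m[0]);
  }
}

console.log(kept);    // [ 'the', 'but', 'a', 'b', 'c' ]
console.log(skipped); // [ 'U.S.A' ]
```

The abbreviation is still matched, which is what stops the last alternative from picking up its individual letters, but it is told apart from the wanted words by the missing group 1.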

Regex demo

For example

const regex = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;
const str = `This.Sentence.Has.Some.Funky.Stuff.U.S.S.R.Going.On.And.Contains.Some.   ABBREVIATIONS.Too.
 
This.Sentence.Has.Some.Funky.Stuff .U.S.S.R. Going.On.And.Contains.Some.   ABBREVIATIONS.Too.
 
A.S.A.P.?
 
Ctrl+Alt+Delete
 
Mr.Smith bought google.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? A.d.a.m Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't. Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer! He also worked at craigslist.org as a b c d e F G H I J business analyst.`;
let m;

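while ((m = regex.exec(str)) !== null) {
  // Safety net against infinite loops on zero-width matches
  // (every alternative of this pattern consumes at least one character)
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // Group 1 is only set by the last alternative, i.e. for the words to keep
  if (m[1] !== undefined) {
    console.log(m[1]);
  }
}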
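Since the goal in the question is to wrap the matched items, one possible way to do that is `String.prototype.replace` with a callback that only wraps when group 1 is set. This is a minimal sketch: the `wrapWords` helper and the `<b>` / `</b>` wrapper are assumptions for illustration, not something taken from the question.

```javascript
// Hypothetical helper: wrap every group 1 match, leave everything else as is.
function wrapWords(text, open = "<b>", close = "</b>") {
  const pattern = /\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)/g;
  return text.replace(pattern, (match, word) =>
    // word is undefined when one of the abbreviation alternatives matched,
    // in which case the matched text is put back unchanged.
    word !== undefined ? open + word + close : match
  );
}

const wrapped = wrapWords("born in the U.S.A but a b c.");
console.log(wrapped);
// <b>born</b> <b>in</b> <b>the</b> U.S.A <b>but</b> <b>a</b> <b>b</b> <b>c</b>.
```

Returning `match` for the unwanted alternatives keeps text such as U.S.A in the output without wrapping it.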

1 Comment

Perfect, thanks so much. I've been using regexp for years but this one had me stumped. I've just added digits to this so that it wraps them as well, like so: (\d?\.?\d+|\.?[a-zA-Z](?:\.[a-zA-Z])+\.?|\.[a-zA-Z]\.|(?!\d)(\w[A-Za-z']*)) but you've answered my original question. Thanks again.
