3

I have a small Node script that scrapes a web page. From that page I extract an array of strings.

I am trying to clean up those strings (currently with regex and string.replace).

One example string looks like this:

2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n

My cleaning code looks like this.

string.replace(/\+/g, '').replace(/\*/g, '').replace('\n', '').replace(/(^\d+)/g, '').trim()

The first call removes every "+", the second removes every "*", the third removes the newline, and the last one removes the leading number.

Most things work fine, but I have some edge cases. This is my result:

2  Glücklich sind die,die seine Erinnerungen beachten,die mit ganzem Herzen nach ihm suchen.

Problems:

  1. The leading number was not removed. (A number with two or more digits always gets removed; I have no idea why a single digit stays.)
  2. The first * was removed, but because no whitespace followed it, the two words now run together ;(. The second * was followed by whitespace, so no problem there.
  3. Same issue with the "+": no whitespace follows it, so the words stick together.

My goal is to parse every string correctly. I have thousands of strings with different combinations, but the only special characters are "+", "*", "\n", and the leading number.

The string should look like this:

Glücklich sind die, die seine Erinnerungen beachten, die mit ganzem Herzen nach ihm suchen.

Hopefully someone has an idea how to accomplish that.

2
  • The ^\d+ pattern should replace a single digit... Is it possible there is a leading space? Maybe try doing the .trim() first. Also, if you know a + or * should always leave a space behind when it is replaced, you could do this: .replace(/\s*(\+|\*)\s*/g, ' '). That way any existing spaces are removed together with the * or + and the whole match is replaced with a single space. Commented Jan 31, 2020 at 16:17
  • Wow, super. Putting the trim first fixed the digit replacement ;). Maybe there was whitespace, like you said. And no ;( a "+" is not always followed by whitespace ;(. Commented Jan 31, 2020 at 20:24
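
The two suggestions from the first comment can be combined into one chain. This is only a minimal sketch: the function name cleanVerse is hypothetical, and it assumes that newlines occur only at the very start or end of a string and that every "+" or "*" may safely become a single space.

```javascript
// Hypothetical helper combining the suggestions from the comment above.
function cleanVerse(raw) {
  return raw
    .trim()                        // strip surrounding whitespace (including "\n") so ^ sees the digits
    .replace(/^\d+/, "")           // drop the leading number
    .replace(/\s*[+*]\s*/g, " ")   // swap each marker, plus any adjacent whitespace, for one space
    .trim();                       // remove the space left where a marker started or ended the string
}

const sample = "2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n";
console.log(cleanVerse(sample));
```

Trimming first matters because ^\d+ only matches when the digits are the very first characters of the string.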

3 Answers

2

You could use an alternation | to match either one of the characters in the character class [+*\n], or 1+ digits at the start of the string with ^\d+.

[+*\n]|^\d+

In the replacement, use a space. Afterwards, replace every run of 2 or more spaces with a single space.

let pattern = /[+*\n]|^\d+/g;
let string = "2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n";
string = string
  .replace(pattern, " ")
  .replace(/[ ]{2,}/g, " ")
  .trim();

console.log(string);


If the digits at the start of the string can be preceded by optional whitespace chars, you could match those as well by matching 0+ times a whitespace char other than a carriage return or newline: ^[^\S\r\n]*\d+

let pattern = /[+*\n]|^[^\S\r\n]*\d+/g;
let string = "  2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n";
string = string
  .replace(pattern, " ")
  .replace(/[ ]{2,}/g, " ")
  .trim();

console.log(string);
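
Since the question mentions an array with thousands of strings, the same pattern can be wrapped in a function and mapped over the array. This is a sketch only: the names MARKERS, cleanLine, and scraped are hypothetical, and the two sample entries are simply the strings from the snippets above.

```javascript
// Pattern from the answer above: a marker or newline anywhere, or optional
// horizontal whitespace followed by digits at the start of the string.
const MARKERS = /[+*\n]|^[^\S\r\n]*\d+/g;

// Hypothetical helper: clean one scraped string.
function cleanLine(raw) {
  return raw
    .replace(MARKERS, " ")     // every match becomes a space
    .replace(/ {2,}/g, " ")    // collapse runs of spaces into one
    .trim();                   // drop the spaces left at both ends
}

// Stand-in for the scraped array (assumption: the real data has this shape).
const scraped = [
  "2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n",
  "  2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n",
];

const cleaned = scraped.map(cleanLine);
console.log(cleaned);
```

Reusing one global regex across calls is safe here because String.prototype.replace resets lastIndex before it starts matching.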

1

You can achieve all your goals with a fairly short regex, and a single call to String.prototype.replace:

let cleanStr = str => str.replace(/^[0-9\s]*|[+*\r\n]/g, '');

console.log(cleanStr('2  Glücklich sind die,die seine Erinnerungen beachten,+die mit ganzem Herzen nach ihm suchen.+\n'));

This regex detects either ^[0-9\s]* or [+*\r\n] (and these sequences will be replaced with the empty string).

^[0-9\s]* replaces any number of consecutive digit or whitespace characters at the beginning of the string.

[+*\r\n] removes any "+", "*", or newline characters (including \r, which could be significant in Windows environments) which occur anywhere in the string.
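
When a marker sits directly between two words, as in die,*die, replacing it with the empty string glues the words together. A possible variation that still uses a single replace call is to pass a replacement function, so the leading number is deleted while markers become one space. This is a sketch under the assumption that every marker may be turned into a space; the capture group and the callback are additions, not part of the regex above.

```javascript
// Variation on the single-call idea: the first alternative is captured so the
// callback can tell the leading number apart from a marker.
let cleanStr = str =>
  str
    .replace(/(^[\d\s]+)|\s*[+*\r\n]+\s*/g, (match, lead) =>
      lead !== undefined ? "" : " ")   // leading digits/whitespace vanish, markers become one space
    .trim();                           // drop the space left by a marker at either end

console.log(cleanStr('2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n'));
```

Including \s* on both sides of the marker group absorbs any whitespace that already surrounds a marker, so no second pass is needed to collapse double spaces.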

3 Comments

Yes the replacing works, but with your solution the whitespace after the "," is missing
The input is missing the space
ohhhh ;) the char at this position is * but the markdown editor formatted the text accordingly....
0

Perhaps this?

let str = `2  Glücklich sind die,*die seine Erinnerungen* beachten,+die mit ganzem Herzen nach ihm suchen.+\n`

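str = str.replace(/[*+]/g, " ")     // turn every "*" and "+" into a space
         .replace(/^\d+\s*/, "")    // drop the leading number and the whitespace after it
         .replace(/\n/g, "")        // remove every newline (/\n?/ would match the empty string at index 0 and remove nothing)
         .replace(/\s{2,}/g, " ")   // collapse runs of whitespace into one space
         .trim();                   // strip the space left where the trailing "+" was
console.log(str);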
