I have a tokenizer function that takes a string, a regex pattern to split on, and an arbitrary list of regex patterns to be protected from tokenization. To achieve that, I use the placeholder ____SSS____ to keep those patterns from being split:

function tokenize(str, default_pattern, protected_patterns) {
    // Build one alternation out of all protected patterns.
    const screen = new RegExp('(?:' + protected_patterns.map(s => '(?:' + s + ')').join('|') + ')', "gi");
    const screened = [];
    // Swap each protected match for an indexed placeholder.
    str = str.replace(screen, s => {
        const i = screened.push(s) - 1;
        return '____SSS____' + i + '____SSS____'; // chose a non-separator as screener, so that these placeholders don't get split.
    });
    // Split, then restore every placeholder (the "g" flag handles several per token).
    const res = str.split(default_pattern).map(s => s.replace(/____SSS____(\d+)____SSS____/g, (_, i) => screened[i]));
    return res;
}

For example, to prevent the pattern yo-ho from being split, I call:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/i, ["\\byo-ho\\b"])
(8) ["Podia", "ser", "yo-ho", "mi", "amor", "ahora", "ya", "acabó"]

Of course, I have to add the placeholder format ____SSS____(\d+)____SSS____ to the regex; otherwise the placeholder itself gets split:

tokenize("Podia ser yo-ho, mi amor ahora ya acabó", /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/i, ["\\byo-ho\\b"])
(9) ["Podia", "ser", "SSS", "SSS", "mi", "amor", "ahora", "ya", "acabó"]
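One thing worth checking here: inside square brackets the placeholder text is not matched as a sequence. Each of its characters (the underscore, S, the parentheses, \d and +) is simply added to the set of non-separators. A small sketch of that side effect, using a simplified rule and made-up sample strings that are not part of my real data:

```javascript
// Simplified split rules (assumed for illustration only).
const plain = /[^a-z]+/i;
const withPlaceholder = /[^a-z____SSS____(\d+)____SSS____]+/i;

// With the plain rule, digits act as separators.
const digitsSplit = 'abc123def'.split(plain);

// With the placeholder characters in the class, \d makes digits non-separators,
// so the string stays in one piece.
const digitsKept = 'abc123def'.split(withPlaceholder);

// The placeholder itself survives the split, which is the intended effect.
const placeholderKept = 'ser ____SSS____0____SSS____, mi'.split(withPlaceholder);

console.log(digitsSplit, digitsKept, placeholderKept);
```

So this trick protects the placeholder, but it also stops digits, underscores, parentheses and plus signs from acting as separators anywhere else in the input.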

Now, for different languages I may have different split rules, such as:

{
    "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
    "fr" : /[^a-z0-9äâàéèëêïîöôùüûœç]+/i
}

and I would like to dynamically add ____SSS____(\d+)____SSS____ to each of them, but I can't find the right way to do it. The result should look like:

 {
      "es" : /[^a-zA-Zá-úÁ-ÚñÑüÜ____SSS____(\d+)____SSS____]+/,
      "fr" :  /[^a-z0-9äâàéèëêïîöôùüûœç____SSS____(\d+)____SSS____]+/i
 }

so that the tokenizer works properly with protected patterns.

1 Answer

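You can capture the existing split rule with a pattern like this:
(.+)(\].*)
and insert your placeholder between the first and second capture groups. The greedy (.+) takes everything up to the last closing bracket, and (\].*) takes that bracket plus any trailing quantifier.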

https://regex101.com/r/QCFnLS/1
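A minimal sketch of how that could be applied in JavaScript, assuming each rule is a RegExp whose character class is the last bracketed group in the pattern. The names addPlaceholder, PLACEHOLDER_CHARS and protectedRules are made up for this example:

```javascript
// Characters to insert before the closing bracket of each split rule.
const PLACEHOLDER_CHARS = '____SSS____(\\d+)____SSS____';

// Hypothetical helper: rebuild a split rule with the placeholder characters
// added to its character class, keeping the original flags.
function addPlaceholder(rule) {
  // rule.source is the pattern text without the slashes or flags.
  const source = rule.source.replace(
    /(.+)(\].*)/,
    (_, head, tail) => head + PLACEHOLDER_CHARS + tail
  );
  return new RegExp(source, rule.flags);
}

// Split rules taken from the question.
const rules = {
  es: /[^a-zA-Zá-úÁ-ÚñÑüÜ]+/,
  fr: /[^a-z0-9äâàéèëêïîöôùüûœç]+/i,
};

// Apply the helper to every language.
const protectedRules = Object.fromEntries(
  Object.entries(rules).map(([lang, rule]) => [lang, addPlaceholder(rule)])
);

console.log(protectedRules.es.source, protectedRules.fr.flags);
```

The replacement is passed as a function rather than a string so that nothing in the placeholder text can be misread as a special `$` replacement pattern. A rule with more than one bracketed class would only get the placeholder in the last one, since (.+) is greedy.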
