Split a string that contains HTML tags

Question

I have this HTML string:

this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too

I want to split it and get a result array like this:

this simple 
the<b>html string</b>
text test 
that<b>need</b>to<b>spl</b>it
it too

I tried this:

    var string = 'this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too';
    var regex = XRegExp('((?:[\\p{L}\\p{Mn}]+|)<\\s*.*?[^>]*>.*?<\/.*?>(?:[\\p{L}\\p{Mn}]+|))', "g");

    var result = string.split(regex);

It didn't work. I don't want to split word by word. Is there a way to do this?

Yes, I want to match whole words that contain one or more tags, and split the string as shown in the array I provided.
– جومارت ميرزا, Commented Aug 15, 2020 at 16:00
That makes no sense, you have the word "the" in two "object arrays" that have no tags around it. And it
– ikiK, Commented Aug 15, 2020 at 16:15
string.split(/(?:^|\s+)([^\s<>]+(?:\s+[^\s<>]+)*)(?:\s+|$)/).filter(Boolean)
– Wiktor Stribiżew, Commented Aug 15, 2020 at 16:30
string.split(/((?<=\s)\w+<\w>.*?<\/\w>.*?(?=\s))/); - You can also try this.
– rootkonda, Commented Aug 15, 2020 at 16:31

Ryszard Czech · Accepted Answer · 2020-08-15 19:47:59Z

Use

string.split(/\s*(?<!\S)([^\s<>]+(?:\s+[^\s<>]+)*)(?!\S)\s*/).filter(Boolean);

The capturing group makes split keep the text it matched as items in the resulting array. filter(Boolean) then drops the empty strings that split leaves at the start and end.
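To see the effect of the capturing group in isolation, here is a minimal sketch on a made-up string (the sample input and variable names are illustrative, not from the question):

```javascript
// Illustrative input, not taken from the question.
const sample = 'a1b2c';

// Without a capturing group, the separators are discarded.
const withoutGroup = sample.split(/\d/);

// With a capturing group, each separator match is kept in the result.
const withGroup = sample.split(/(\d)/);

console.log(withoutGroup); // [ 'a', 'b', 'c' ]
console.log(withGroup);    // [ 'a', '1', 'b', '2', 'c' ]
```

In the answer's pattern the "separator" is a run of tag-free words, so keeping it is what puts those runs into the output next to the tagged pieces.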

REGEX EXPLANATION

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?<!                     look behind to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-behind
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^\s<>]+                 any character except: whitespace (\n,
                             \r, \t, \f, and " "), '<', '>' (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \s+                      whitespace (\n, \r, \t, \f, and " ")
                               (1 or more times (matching the most
                               amount possible))
--------------------------------------------------------------------------------
      [^\s<>]+                 any character except: whitespace (\n,
                               \r, \t, \f, and " "), '<', '>' (1 or
                               more times (matching the most amount
                               possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    \S                       non-whitespace (all but \n, \r, \t, \f,
                             and " ")
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))

JavaScript:

const string = 'this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too';
const regex = /\s*(?<!\S)([^\s<>]+(?:\s+[^\s<>]+)*)(?!\S)\s*/;
console.log(string.split(regex).filter(Boolean));

Output:

[
  "this simple",
  "the<b>html string</b>",
  "text test",
  "that<b>need</b>to<b>spl</b>it",
  "it too"
]
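The pattern above relies on a lookbehind, (?<!\S), which was only added to JavaScript in ES2018, so engines that predate it reject the regex. As a sketch, assuming such an engine has to be supported, the lookbehind-free pattern from Wiktor Stribiżew's comment on the question can be swapped in. The variable names here are illustrative:

```javascript
const input = 'this simple the<b>html string</b> text test that<b>need</b>to<b>spl</b>it it too';

// Same idea without lookbehind: a run of tag-free words must start at the
// beginning of the string or after whitespace, and must end before
// whitespace or at the end of the string.
const noLookbehind = /(?:^|\s+)([^\s<>]+(?:\s+[^\s<>]+)*)(?:\s+|$)/;

// filter(Boolean) removes the empty strings split leaves at the edges.
const parts = input.split(noLookbehind).filter(Boolean);
console.log(parts);
```

For this input it produces the same five items as the output shown above.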

6 Comments

جومارت ميرزا Over a year ago

What if the tag contains attributes, like: "the<b style ='color:red'>html string</b>"?

جومارت ميرزا Over a year ago

Also, what if the whole string is just this: "the<b class ='test test2>html string</b>"? I want the regex to match that too.

Wiktor Stribiżew Over a year ago

@جومارتميرزا Try string.split(/\s*((?:[^\s<]*<\w[^>]*>[\s\S]*?<\/\w[^>]*>)+[^\s<]*)\s*/)
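A quick check of the pattern in this comment against the two cases raised just before it. This is only a sketch: the variable names are illustrative, the class attribute is written with its closing quote, and filter(Boolean) is added on the assumption that empty strings at the edges are unwanted:

```javascript
// Pattern from the comment: capture whole tokens that contain one or more
// tag pairs, allowing attributes inside the opening tag.
const tagAware = /\s*((?:[^\s<]*<\w[^>]*>[\s\S]*?<\/\w[^>]*>)+[^\s<]*)\s*/;

// Case 1: a tag with an attribute in the middle of plain text.
const withAttribute = "this simple the<b style ='color:red'>html string</b> text test";
const parts = withAttribute.split(tagAware).filter(Boolean);
console.log(parts);
// [ 'this simple', "the<b style ='color:red'>html string</b>", 'text test' ]

// Case 2: the string consists of a single tagged token only.
const onlyTagged = "the<b class ='test test2'>html string</b>";
const partsOnly = onlyTagged.split(tagAware).filter(Boolean);
console.log(partsOnly);
// [ "the<b class ='test test2'>html string</b>" ]
```

Unlike the answer's pattern, this one captures the tagged tokens rather than the plain-word runs, so the plain text comes back as the pieces between matches.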

جومارت ميرزا Over a year ago

Thank you again for your help. This is exactly what I wanted, and it solved a big problem for me.

جومارت ميرزا Over a year ago

I need help, please: what if the match contains a newline, like this: regex101.com/r/20zEyO/3
