Transform a string into HTML text with a regex

Question

How can I correctly transform a string like this:

html attr = "value" attr2 = 'UnmatchInSubstrings' some \escapedTag content subtag subcontent /subtag br / /html

into:

<html attr = "value" attr2 = 'UnmatchInSubstrings'> some escapedTag content <subtag>subcontent</subtag> <br /> </html>

Requirements:

Do not match tags inside quoted substrings (text in "" and '').
Use the \ character to escape a tag that should be treated as normal text. The escaped tag loses its backslash.
Match unclosed (self-closing) tags like br /

I have tried a regex like the following, which does not work as expected:

/([^\\]\S+[\s[\"|\']+\s\S[\"|\']+]*)+/g

I'm testing my attempts on regex101.com.

Thank you in advance, and sorry if this is not very clear :)

DON'T try parsing HTML with regexes. Are you absolutely certain this is how it has to be done? Do you have control over how the string you want to parse is formatted?
– Cerbrus, Commented Feb 5, 2014 at 10:31
For those who'd say "it's not HTML he's parsing": Same difference. He's trying to parse something that represents HTML. Same complexity (or even worse, as there are no <>).
– Cerbrus, Commented Feb 5, 2014 at 10:33
How will you know that some isn't an attribute? Or that br is a tag rather than part of the content? You'll need AI, I think.
– DontVoteMeDown, Commented Feb 5, 2014 at 10:33
How are you meant to use an HTML parser if it's not valid HTML that he's parsing?
– Vasili Syrakis, Commented Feb 5, 2014 at 10:37
@ManuelDiIorio: Do you have any access to the way the input string is built? @VasiliSyrakis: That's not what I'm saying.
– Cerbrus, Commented Feb 5, 2014 at 10:40

npinti · Accepted Answer · 2014-02-05 11:00:16Z

1

To do what you want, you would need to write your own mapper. In short, you would keep a list of keywords, such as html, table, etc., and match the words of your string against it.

Ideally you would also have a stack onto which you push/pop keywords as you find open/close tags. The parser would also need to be intelligent enough to exclude your escape sequences as well as strings within quotation marks, so that you won't end up with "I know <html>".
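As a rough illustration of the keyword list plus stack idea, here is a minimal JavaScript sketch. Everything in it is an assumption made for the example and not part of the original answer: the `KNOWN_TAGS` whitelist, the helper names `tokenize`, `isQuoted` and `toHtml`, and the rule that attributes always have the form `name = "value"` or `name = 'value'`. Quoted strings are kept as single tokens, so tag names inside them are never inspected.

```javascript
// Hypothetical whitelist of tag names; the answer only says "a list of keywords".
const KNOWN_TAGS = new Set(["html", "subtag", "br", "escapedTag"]);

// Split the input into quoted strings, "=" signs and bare words.
function tokenize(input) {
  return input.match(/"[^"]*"|'[^']*'|=|[^\s"'=]+/g) || [];
}

// True when the token is a complete "..." or '...' string.
function isQuoted(token) {
  return /^(?:"[^"]*"|'[^']*')$/.test(token);
}

function toHtml(input) {
  const tokens = tokenize(input);
  const out = [];
  const open = []; // stack of currently open tag names
  let i = 0;
  while (i < tokens.length) {
    const tok = tokens[i++];
    if (tok.startsWith("\\")) {
      // Escaped word: drop the backslash and keep the rest as plain text.
      out.push(tok.slice(1));
    } else if (tok.startsWith("/") && KNOWN_TAGS.has(tok.slice(1))) {
      // Closing tag: it has to match the most recently opened tag.
      const name = tok.slice(1);
      if (open.pop() !== name) throw new Error("Unexpected closing tag: " + name);
      out.push("</" + name + ">");
    } else if (KNOWN_TAGS.has(tok)) {
      // Opening tag: consume any `name = quoted` triples that follow it.
      let tag = "<" + tok;
      while (tokens[i + 1] === "=" && i + 2 < tokens.length && isQuoted(tokens[i + 2])) {
        tag += " " + tokens[i] + " = " + tokens[i + 2];
        i += 3;
      }
      if (tokens[i] === "/") {
        // A lone "/" right after the tag marks it as self-closing (br /).
        out.push(tag + " />");
        i++;
      } else {
        open.push(tok);
        out.push(tag + ">");
      }
    } else {
      // Anything else, including quoted strings, is plain text.
      out.push(tok);
    }
  }
  if (open.length > 0) throw new Error("Unclosed tag: " + open.pop());
  return out.join(" ");
}

const sample = String.raw`html attr = "value" attr2 = 'UnmatchInSubstrings' some \escapedTag content subtag subcontent /subtag br / /html`;
console.log(toHtml(sample));
// <html attr = "value" attr2 = 'UnmatchInSubstrings'> some escapedTag content <subtag> subcontent </subtag> <br /> </html>
```

The sketch joins every piece with a single space, so its output has spaces around subcontent that the expected output in the question does not have. It also cannot tell a content word from a tag unless the word is missing from the whitelist or escaped with \, which is the ambiguity raised in the comments on the question.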

edited Feb 5, 2014 at 11:00

answered Feb 5, 2014 at 10:40


2 Comments

Manuel Di Iorio Over a year ago

I think that char-by-char parsing is the only solution... just the thing I wanted to avoid (a lot of lines of code). Thanks anyway.

npinti Over a year ago

@ManuelDiIorio: Word by word should be enough I think. Replacing br / with br/ and splitting by space should reduce the complexity as well.
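A small sketch of that word-by-word preprocessing, in JavaScript. The `VOID_TAGS` list and the `splitWords` name are made up for the example. It assumes quoted values contain no spaces, since a plain split on whitespace would break such values apart.

```javascript
// Hypothetical list of tags that are written as "name /" in the input.
const VOID_TAGS = ["br"];

function splitWords(input) {
  // Turn "br /" into the single word "br/". The lookahead keeps "br /html" untouched.
  const pattern = new RegExp("\\b(" + VOID_TAGS.join("|") + ")\\s+/(?=\\s|$)", "g");
  // Split on runs of whitespace and drop empty strings from leading or trailing spaces.
  return input.replace(pattern, "$1/").split(/\s+/).filter(Boolean);
}

console.log(splitWords("content br / /html"));
// [ 'content', 'br/', '/html' ]
```

After this step each word can be checked on its own: a trailing / marks a self-closing tag, a leading / a closing tag, and a leading \ an escaped word.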
