How can I correctly transform a string like this:

html attr = "value" attr2 = 'UnmatchInSubstrings' some \escapedTag content subtag subcontent /subtag br / /html

into:

<html attr = "value" attr2 = 'UnmatchInSubstrings'> some escapedTag content <subtag>subcontent</subtag> <br /> </html>

Requirements:

  1. Do not match tags inside quoted substrings (text in "" or '').
  2. Use the \ character to escape a tag that should be treated as normal text. The escaped tag loses its backslash.
  3. Match unclosed (self-closing) tags like br /

I have tried a regex like the following, which does not work as expected:

/([^\\]\S+[\s[\"|\']+\s\S[\"|\']+]*)+/g


I'm testing my attempts on regex101.com.

Thank you in advance, and sorry if this is not very clear :)

  • DON'T try parsing HTML with regexes. Are you absolutely certain this is how it has to be done? Do you have control over how the string you want to parse is formatted? Commented Feb 5, 2014 at 10:31
  • For those who'd say "it's not HTML he's parsing": same difference. He's trying to parse something that represents HTML. Same complexity (or even worse, as there are no <>). Commented Feb 5, 2014 at 10:33
  • How will you know that some isn't an attribute? Or that br is a tag and not part of the content? You'll need AI, I think. Commented Feb 5, 2014 at 10:33
  • How are you meant to use an HTML parser if it's not valid HTML that he's parsing? Commented Feb 5, 2014 at 10:37
  • ManuelDiIorio: Do you have any access to the way the input string is built? @VasiliSyrakis: That's not what I'm saying. Commented Feb 5, 2014 at 10:40

1 Answer

To do what you want, you would need to write your own mapper. In short, you would keep a list of keywords (html, table, etc.) and match the words of your string against it.

Ideally you would also keep a stack: push a keyword when you find an opening tag and pop it when you find the matching closing tag. The parser would also need to be smart enough to skip your escape sequences as well as strings within quotation marks, so that you won't end up with "I know <html>".
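
The approach described above could be sketched roughly as follows in Python. This is only an illustration under several assumptions that the question does not confirm: KNOWN_TAGS is a hypothetical keyword list, convert and TOKEN are made-up names, and an attribute is assumed to always have the form name = "value" or name = 'value', which is how the sketch tells an attribute apart from ordinary content.

```python
import re

# Hypothetical keyword list; the question does not define the real set of tags.
KNOWN_TAGS = {"html", "subtag", "br"}

# A token is a quoted string, a lone "=", or a run of other non-space characters.
# Quoted strings become single tokens, so tags inside them are never matched.
TOKEN = re.compile(r'"[^"]*"' r"|'[^']*'" r"|=|[^\s=]+")


def convert(text, known_tags=KNOWN_TAGS):
    tokens = TOKEN.findall(text)
    out, stack, i = [], [], 0
    while i < len(tokens):
        tok = tokens[i]
        i += 1
        if tok.startswith("\\"):
            # Escaped word: emit it as plain text without the backslash.
            out.append(tok[1:])
        elif tok.startswith("/") and tok[1:] in known_tags:
            # Closing tag: it must match the most recently opened tag.
            name = tok[1:]
            if not stack or stack[-1] != name:
                raise ValueError(f"unexpected closing tag: {name}")
            stack.pop()
            out.append(f"</{name}>")
        elif tok in known_tags:
            parts = [tok]
            # Assumption: attributes look like  name = "value"  or  name = 'value'.
            while (i + 2 < len(tokens) and tokens[i + 1] == "="
                   and tokens[i + 2][0] in "\"'"):
                parts.append(f"{tokens[i]} = {tokens[i + 2]}")
                i += 3
            if i < len(tokens) and tokens[i] == "/":
                # A lone "/" right after the tag marks it as self-closing.
                out.append("<" + " ".join(parts) + " />")
                i += 1
            else:
                stack.append(tok)
                out.append("<" + " ".join(parts) + ">")
        else:
            # Anything else, including quoted strings, is plain content.
            out.append(tok)
    if stack:
        raise ValueError(f"unclosed tags: {stack}")
    return " ".join(out)


example = ("html attr = \"value\" attr2 = 'UnmatchInSubstrings' some "
           "\\escapedTag content subtag subcontent /subtag br / /html")
print(convert(example))
```

For the string from the question this prints <html attr = "value" attr2 = 'UnmatchInSubstrings'> some escapedTag content <subtag> subcontent </subtag> <br /> </html>. The sketch joins every piece with a single space, so the spacing around subcontent differs from the desired output; preserving the original whitespace would need extra bookkeeping.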


2 Comments

I think that char-by-char parsing is the only solution... exactly what I wanted to avoid (a lot of lines of code). Thanks anyway.
@ManuelDiIorio: Word by word should be enough I think. Replacing br / with br/ and splitting by space should reduce the complexity as well.
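
A minimal sketch of that pre-processing step, in Python, might look like this. The function name split_words is made up for illustration, and the sketch deliberately ignores quoted strings, so it is only safe when quoted values contain no spaces and no standalone slash.

```python
import re


def split_words(text):
    # Glue a standalone "/" onto the preceding word: "br /" becomes "br/".
    # The lookahead keeps closing tags such as "/html" untouched.
    merged = re.sub(r"(\S)\s+/(?=\s|$)", r"\1/", text)
    # Plain whitespace split; quoted strings are NOT handled here.
    return merged.split()


print(split_words("subtag subcontent /subtag br / /html"))
```

After this step a self-closing tag is a single word ending in "/", so a word-by-word loop no longer needs to look ahead to recognise it.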
