2

I'm stripping out all style attributes from some HTML. I could use the regex

/style=("[^"]*"|'[^']*')/

But I wonder whether this is inefficient (because of the negated character classes). I also know it breaks on style values (e.g. background-image) that can themselves contain quotes.

Is there a regex I can use to match valid style strings or, like parsing html with regex, is this a task too difficult for a regex to perform in general?

Edit: here is (I think) the trickiest style string in the HTML I'm scraping:

style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: 'Verdana','sans-serif'; mso-fareast-font-family: 'Times New Roman'"

5 Comments

  • stackoverflow.com/questions/1732348/… Commented Apr 17, 2012 at 10:39
  • @Sibster I'm aware of that question & answer, but my question is a lot narrower than that Commented Apr 17, 2012 at 10:44
  • You may want to check out my updated answer. Commented Apr 17, 2012 at 11:09
  • @wheresrhys You can also have attributes w/o quotes: style=font-weight:bold is valid. Commented Apr 17, 2012 at 11:13
  • @Boldewyn If it were up to me there wouldn't be any style attributes at all... unfortunately though, I'm having to scrape the html from a third party so have no control over whether or not the quotes are there Commented Apr 17, 2012 at 22:07

4 Answers

2

I don't think negated matching is necessarily slow. After all, once you anchor the match with style=, the bytes that follow have to be compared against the pattern anyway.

You must, however, cater for the case where attribute values are not enclosed in quotes.

/style=(".*?"|'.*?'|[^"'][^\s]*)/s

should match all productions of the HTML attribute syntax. Make sure, however, that the dot matches every character, including newlines, in your regex engine (hence the /s flag). I also used the non-greedy quantifier *?, which some engines may not implement either.

There is also the special case of style= without any value following it, which the pattern above leaves out to keep things simple.
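
As a rough sketch of how the pattern above behaves, here is a small JavaScript check. The function name stripStyleAttributes and the sample markup are made up for illustration, the leading \s+ is my addition so that the whitespace before the attribute is removed too, and the s flag assumes an ES2018-capable engine.

```javascript
// Pattern from the answer, with a leading \s+ added (an assumption, not part
// of the original) so the space before "style" is removed as well.
// Flags: g = every match, i = case-insensitive, s = dot also matches newlines.
const STYLE_ATTR = /\s+style=(".*?"|'.*?'|[^"'][^\s]*)/gis;

// Hypothetical helper: returns the markup with all matched style attributes removed.
function stripStyleAttributes(html) {
  return html.replace(STYLE_ATTR, '');
}

// Sample markup covering the three branches: double-quoted, single-quoted, unquoted.
const sample =
  '<p style="FONT-SIZE: 10pt; FONT-FAMILY: \'Verdana\',\'sans-serif\'">a</p>' +
  "<b style='FONT-FAMILY: \"Verdana\"'>b</b>" +
  '<i style=font-weight:bold class=x>c</i>';

console.log(stripStyleAttributes(sample));
// <p>a</p><b>b</b><i class=x>c</i>
```

One limitation worth knowing: the unquoted branch stops only at whitespace, so an unquoted style value that is the last attribute in a tag (style=font-weight:bold>) would swallow the closing > and whatever follows it up to the next whitespace. Adding > to the two negated classes is one possible way around that.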

0

Try / style\=[\"\']?([a-zA-Z0-9 \:\-\#\(\)\.\_\/\;\'\,]+)\;?[\"\']? /ig

It is supposed to match every style attribute I know of.

You can check it here: http://jsfiddle.net/DULyx/3/

3 Comments

  • URLs might be quoted, though.
  • Good effort, but it fails on style='FONT-FAMILY: "Verdana"'. In general I think a regex would have to be of the form /("[allvalidchars and ']+"|'[allvalidchars and "]+')/ to avoid this pitfall, which is very irritating, as it means either a) duplicating the character class or b) storing it as a string elsewhere and having to escape things properly before concatenating and passing it into new RegExp(). And even then it's vulnerable to e.g. style='FONT-FAMILY: \'Verdana\''.
  • Given the cases you suggest, no regexp can do that. If you want to define a rule for searching, that rule must be obeyed by whoever writes the CSS. Once the input doesn't follow the rule, how can you search through it?

0

You shouldn't be processing HTML as a string. In JS, all you need is elt.style = ''; on each element. If you have the chance to run your stuff through XSLT, it's a one-liner.
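
A minimal sketch of the DOM route, assuming the markup has already been parsed into a DOM tree (in a browser, DOMParser can do that for a scraped string). The function name removeInlineStyles is made up for illustration, and it uses removeAttribute rather than the assignment above so that no empty style attribute is left behind.

```javascript
// Hypothetical helper: removes the style attribute from `root` and from every
// descendant that carries one. `root` is assumed to be a DOM Element.
function removeInlineStyles(root) {
  if (root.hasAttribute('style')) {
    root.removeAttribute('style');
  }
  // The attribute selector matches only elements that have a style attribute.
  root.querySelectorAll('[style]').forEach(function (el) {
    el.removeAttribute('style');
  });
  return root;
}

// Only runs where a DOM is available (e.g. in a browser).
if (typeof document !== 'undefined') {
  removeInlineStyles(document.body);
}
```

Because the parser has already dealt with quoting, none of the quoted, unquoted, or nested-quote cases from the question need special handling here.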

0
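// Strips leading and trailing whitespace.
function trim(str) {
    return str.replace(/^\s+/, '').replace(/\s+$/, '');
}

// Returns the element's inline style as a { property: value } object.
function getStyle(element) {
    // getAttribute returns null when there is no style attribute
    return parseRules(element.getAttribute('style') || '');
}

function parseRules(rules) {
    var parsed_rules = {};
    // Note: a plain split on ';' still breaks on values that contain a semicolon
    rules.split(';').map(function (rule) {
        // Split on the first colon only, so values such as url(http://...) stay intact
        var index = rule.indexOf(':');
        if (index === -1) {
            return [trim(rule)];
        }
        // HERE YOU CAN TRY TO CLEAN THE RULES
        return [trim(rule.slice(0, index)), trim(rule.slice(index + 1))];
    }).filter(function (rule) {
        // HERE YOU CAN TEST THAT THE RULE IS VALID
        // (both the property and the value must be non-empty)
        return rule.length == 2 && rule[0] != "" && rule[1] != "";
    }).forEach(function (rule) {
        parsed_rules[rule[0]] = rule[1];
    });

    return parsed_rules;
}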
