Regex replace whitespace in HTML document

Question

I have seen many similar questions but still haven't found an answer.
What should a regex look like that removes all whitespace (including newlines) in HTML but leaves the tags alone?

Currently I use Regex.Replace(content, @"\s+", ""); but it also removes the spaces in the JavaScript on the page, and then the page stops working.
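To make the breakage concrete, here is a minimal sketch that applies the same call to a small fragment. The class and method names (NaiveStripDemo, StripAll) are hypothetical and exist only for illustration.

```csharp
using System.Text.RegularExpressions;

public static class NaiveStripDemo
{
    // Same call as in the question: deletes every run of whitespace,
    // regardless of whether it sits in a tag, a script, or plain text.
    public static string StripAll(string content)
    {
        return Regex.Replace(content, @"\s+", "");
    }
}
```

Calling StripAll on `<a href="/home">Home</a>` followed by a newline and `<script>var total = 1;</script>` yields `<ahref="/home">Home</a><script>vartotal=1;</script>`: the space between the tag name and its attribute is gone, and so is the space after the `var` keyword.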

Thank you.

EDIT: After some questions in the responses, here are a few more details. What I'm building is an HTTP module that "minifies" the HTML output of our site. We have a website with very dynamic content that comes from many different sources. The final goal is to reduce page size and network traffic. It is a heavily loaded website, so it is important for us to get this done.

We currently use the MbCompression library for JS and CSS minification, but it does not support minifying HTML output (at least I haven't found a way).

@jrummell We are using it, but we remove the whitespace before compression, and in addition compression is not always supported.
– Alex Dn, Commented Oct 15, 2012 at 14:12
Removing redundant whitespace before compression saves very little. It would be better not to produce it at all, but removing it after the fact and then gzipping anyway will not save you any measurable amount.
– perh, Commented Oct 15, 2012 at 15:12

Michal Klouda · Accepted Answer · 2012-10-15 13:43:12Z

2

There is really no way to write a single (reasonable) regexp to do this. Especially not if you want to support javascript and css. You need to have a real parser.

edited Oct 15, 2012 at 13:43

Michal Klouda

answered Oct 15, 2012 at 13:38

perh

2 Comments

Alex Dn Over a year ago

Can you advise any parser that can do it?

perh Over a year ago

htmlagilitypack.codeplex.com perhaps? Parse the HTML into a DOM tree, and then do the whitespace trimming on textnodes.

Mike Samuel · Accepted Answer · 2012-10-15 13:48:22Z

1

If you can find a decent HTML parser, I would do it via DOM manipulation. If you can't, then something like

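// Attributes of a tag: plain characters or quoted values (which may contain '>').
const string attrs = @"(?:[^>""']|""[^""]*""|'[^']*')*";
string pattern = @"(?is)("
    + @"<script\b" + attrs + @">.*?</script\s*>|"
    + @"<style\b" + attrs + @">.*?</style\s*>|"
    + @"<textarea\b" + attrs + @">.*?</textarea\s*>|"
    + @"</?[a-z]" + attrs + @">|"
    + @"[^\s<]+|<"
    + @")|\s+";
content = Regex.Replace(content, pattern, "$1");

should do it. Everything captured by group 1 is copied through unchanged, so it will not remove whitespace inside tags, inside embedded JS or CSS, or inside textareas, but it will remove all whitespace in text nodes, newlines included.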

answered Oct 15, 2012 at 13:48

Mike Samuel

3 Comments

Alex Dn Over a year ago

Now that I think about it, we already use HtmlDocument from the Agility Pack. Do you know whether it supports such an option?

Mike Samuel Over a year ago

@AlexDn, stackoverflow.com/questions/846994/how-to-use-html-agility-pack suggests that htmlDoc.DocumentNode.SelectSingleNode("//body") will get you the body, and then you can traverse that to find all text nodes not inside <script> elements and the like, and elide white-space however you like.

Alex Dn Over a year ago

OK, thanks. It looks like I will go with the HtmlDocument traversal approach.
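The traversal discussed in these comments could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced. The class name, the Minify method, and the list of protected elements are hypothetical choices for illustration, not something the thread prescribes. It collapses each whitespace run to a single space instead of deleting it, so adjacent words in text nodes are not glued together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is available

public static class HtmlWhitespaceMinifier
{
    // Elements whose text content is left untouched (an illustrative choice).
    private static readonly HashSet<string> Preserved =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "script", "style", "textarea", "pre" };

    public static string Minify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect the text nodes up front, then edit them one by one.
        List<HtmlTextNode> textNodes = doc.DocumentNode
            .Descendants()
            .OfType<HtmlTextNode>()
            .ToList();

        foreach (HtmlTextNode node in textNodes)
        {
            // Skip text that lives inside a protected element.
            if (node.Ancestors().Any(a => Preserved.Contains(a.Name)))
                continue;

            // Collapse every whitespace run to a single space.
            node.Text = Regex.Replace(node.Text, @"\s+", " ");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Because the parser decides what is a tag, an attribute, or script content, the regex only ever sees plain text nodes, which is the main advantage over a single pattern applied to the whole document.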

ZZ-bb · Accepted Answer · 2012-10-15 13:47:28Z

1

What's your goal? Browsers ignore a lot of whitespace when rendering pages so I'm guessing you want to clean up your source code. If so, check if the program you use offers some solution to this. For example Dreamweaver has a tool to reformat source code.

Tidy could be one option but it looks like it's a bit more than a simple code formatting tool.

answered Oct 15, 2012 at 13:47

ZZ-bb

Chris · Accepted Answer · 2012-10-15 13:41:22Z

Surely you should be replacing it with a space at least, not just removing whitespace entirely. For HTML that should be fine but if you are talking about having strings in javascript with multiple spaces not being collapsed then you need to think of another method since regular expressions won't work out easily whether you are in script, in a string, etc.
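As a rough sketch of the "replace with a space" idea, the following collapses each whitespace run to one space while copying tags and a few sensitive elements through unchanged. The class and method names are hypothetical, and the pattern makes a simplifying assumption: attribute values never contain a `>` character. It is not a substitute for a real parser.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceCollapser
{
    // Group 1 captures spans to keep verbatim: whole script/style/textarea/pre
    // elements, or any single tag. The bare \s+ alternative matches whitespace
    // everywhere else. Assumes no '>' inside attribute values.
    private static readonly Regex Pattern = new Regex(
        @"(<(script|style|textarea|pre)\b[^>]*>.*?</\2\s*>|<[^>]+>)|\s+",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Collapse(string html)
    {
        // Keep captured spans as they are; turn any other whitespace run into one space.
        return Pattern.Replace(html, m => m.Groups[1].Success ? m.Value : " ");
    }
}
```

With this, text such as `a`, several spaces and newlines, then `b` becomes `a b`, while the contents of a script element, including any multiple spaces inside string literals, are left as they were.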

That said, I'm not sure there is a good reason to do this. If you are worried about file size, just tell your server to use compression, which I suspect every browser supports well enough by now: pages are essentially zipped by the server and unzipped on the client. It's a bit more work for the server, so it depends on whether you care more about bandwidth or CPU.
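One way to check how much whitespace removal actually buys once compression is in play is to compare gzip sizes of the page before and after stripping. The helper below is a small sketch using GZipStream from the .NET base class library; the class and method names are hypothetical, and it makes no claim about what the numbers will be for any particular page.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipSize
{
    // Returns the number of bytes the text occupies after gzip compression.
    public static int CompressedLength(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps the buffer readable after the gzip stream is closed.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return (int)buffer.Length;
        }
    }
}
```

Running CompressedLength on the original and on the minified output of a representative page gives a direct measure of the bandwidth difference to weigh against the extra CPU work.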

Jashwant · Accepted Answer · 2012-10-15 13:45:35Z

0

Regex.Replace(document.body.innerHTML, @"\s+", "");

using document.body.innerHTML instead may work. I am not sure.

edited Oct 15, 2012 at 13:45

Jashwant

answered Oct 15, 2012 at 13:40

mmuratusta

1 Comment

Alex Dn Over a year ago

I need it in C# (server side)

sainiuc · Accepted Answer · 2012-10-15 16:40:54Z

0

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

There are risks related to tags, unclosed tags etc. I hope you have some control over the 'dynamic content that comes from different sources' as you've put it. I also hope that you've tried everything else and this comes as a last resort.

answered Oct 15, 2012 at 16:40

sainiuc
