0

I have seen many similar questions, but still haven't found an answer.
What should a regex look like that removes all whitespace (including newlines) in HTML, but ignores the tags?

Currently I use Regex.Replace(content, @"\s+", ""); but it also removes spaces in the JavaScript that exists on the page, and then the page no longer works.

Thank you.

EDIT: After some questions in the responses, here are a few more details: what I'm building is an HTTP module that "minifies" the HTML output of our site. We have a web site with very dynamic content that comes from many different sources. The final goal is to reduce page size and network traffic. It's a highly loaded web site, so it's important for us to get this done.

We are actually using the MbCompression library for JS and CSS minification, but it does not support minifying HTML output (at least I didn't find a way).

6
  • Are you asking about JavaScript, or C#? Commented Oct 15, 2012 at 13:42
  • Have a look here, a famous SO question Commented Oct 15, 2012 at 13:45
  • Why not GZIP instead? Commented Oct 15, 2012 at 14:03
  • @jrummell We are using it, but we remove the whitespace before compression, and in addition compression is not always supported. Commented Oct 15, 2012 at 14:12
  • Removing redundant whitespace before compression saves very little. It would be better to not produce it at all, but removing it after the fact when you then go ahead and gzip anyway will not save you any measurable amount. Commented Oct 15, 2012 at 15:12

6 Answers

2

There is really no way to write a single (reasonable) regexp to do this. Especially not if you want to support javascript and css. You need to have a real parser.


2 Comments

Can you advise any parser that can do it?
htmlagilitypack.codeplex.com perhaps? Parse the HTML into a DOM tree, and then do the whitespace trimming on textnodes.
1

If you can find a decent HTML parser, I would do it via DOM manipulation. If you can't, then something like

Regex.Replace(content, "(?i)(<script(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</script\\s*>|<style(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</style\\s*>|<textarea(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</textarea\\s*>|</?[a-z](?:[^>\"']|\"[^\"]*\"|'[^']*')*>|[^<\\s]+)|\\s+", "$1");

should do it. It will not remove whitespace inside tags, inside embedded JS or CSS, or inside textareas, but it will remove whitespace (including newlines) in text nodes.

3 Comments

As I'm thinking now, we also use HtmlDocument from AgilityPack. Do you know if it supports such option?
@AlexDn, stackoverflow.com/questions/846994/how-to-use-html-agility-pack suggests that htmlDoc.DocumentNode.SelectSingleNode("//body") will get you the body, and then you can traverse that to find all text nodes not inside <script> elements and the like, and elide white-space however you like.
Ok, thanks, looks like I will use the solution with HtmlDocument traverse.
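
The traversal suggested in these comments could look roughly like the sketch below. It is not the asker's actual code: the class name HtmlTextNodeMinifier, the method Minify and the list of preserved elements are assumptions made for illustration, and it needs the HtmlAgilityPack NuGet package. It collapses each whitespace run in a text node to a single space rather than deleting it, so words do not get glued together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

// Hypothetical helper, named here for illustration only.
public static class HtmlTextNodeMinifier
{
    // Elements whose text content must be left exactly as written (assumed list).
    private static readonly HashSet<string> Preserved =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "script", "style", "pre", "textarea"
        };

    public static string Minify(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot of the text nodes before changing any of them.
        var textNodes = doc.DocumentNode
            .Descendants()
            .OfType<HtmlTextNode>()
            .ToList();

        foreach (var node in textNodes)
        {
            // Skip text that lives inside a preserved element.
            if (node.Ancestors().Any(a => Preserved.Contains(a.Name)))
            {
                continue;
            }

            // Collapse every whitespace run (newlines included) to one space.
            node.Text = Regex.Replace(node.Text, @"\s+", " ");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling HtmlTextNodeMinifier.Minify(content) from the HTTP module would then return the rewritten markup. Parsing every response has a CPU cost, so on a highly loaded site it is probably worth measuring before rolling it out.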
1

What's your goal? Browsers ignore a lot of whitespace when rendering pages so I'm guessing you want to clean up your source code. If so, check if the program you use offers some solution to this. For example Dreamweaver has a tool to reformat source code.

Tidy could be one option but it looks like it's a bit more than a simple code formatting tool.


0

Surely you should be replacing it with a space at least, not just removing whitespace entirely. For HTML that should be fine, but if you have strings in JavaScript with multiple spaces that must not be collapsed, then you need to think of another method, since regular expressions cannot easily work out whether you are in a script, in a string, etc.
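
As a rough illustration of replacing with a space instead of removing, here is a minimal sketch that uses only System.Text.RegularExpressions. The names HtmlWhitespace and Collapse are made up for this example. It copies whole script, style, pre and textarea elements through untouched and collapses every other whitespace run to one space. It is still a regex rather than a parser, so it assumes well-formed markup, no ">" inside attribute values of those elements, and no literal closing tag inside a script string; it also collapses repeated spaces inside ordinary tags and attribute values.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class HtmlWhitespace
{
    // Alternative 1 (group "keep"): a whole <script>, <style>, <pre> or
    // <textarea> element, which is copied through unchanged.
    // Alternative 2: any run of whitespace, which becomes a single space.
    private static readonly Regex Pattern = new Regex(
        @"(?<keep><(?<tag>script|style|pre|textarea)\b[^>]*>.*?</\k<tag>\s*>)|\s+",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Collapse(string html)
    {
        return Pattern.Replace(
            html,
            m => m.Groups["keep"].Success ? m.Value : " ");
    }
}
```

Because the protected elements are matched first at each position, the whitespace alternative never gets a chance to look inside them.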

That having been said, I'm not sure there is a good reason to do this. If you are worried about the size of the file, then just tell your server to use compression, which I suspect every browser supports well enough by now; the pages will basically be zipped by the server and unzipped on the client. It's a bit more work for the server, so it depends on whether you care more about bandwidth or CPU.
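
One way to check how much whitespace removal is worth once compression is on is to gzip a sample page in both forms and compare the sizes. The sketch below uses GZipStream from the .NET base library; the names GzipSizeCheck, GzipLength and Report are assumptions for illustration. The crude \s+ collapse in Report is only there for measuring and would break scripts if used on real output.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class GzipSizeCheck
{
    // Returns the number of bytes the text occupies after gzip compression.
    public static long GzipLength(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen keeps the buffer usable after the gzip stream is
            // disposed, which is what flushes the final compressed bytes.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return buffer.Length;
        }
    }

    // Prints raw and gzipped sizes for the page as-is and with whitespace collapsed.
    public static void Report(string html)
    {
        string collapsed = Regex.Replace(html, @"\s+", " ");
        Console.WriteLine("original:  {0} bytes raw, {1} bytes gzipped",
            Encoding.UTF8.GetByteCount(html), GzipLength(html));
        Console.WriteLine("collapsed: {0} bytes raw, {1} bytes gzipped",
            Encoding.UTF8.GetByteCount(collapsed), GzipLength(collapsed));
    }
}
```

Running Report on a few representative pages from the site would show whether the extra minification step pays for itself after gzip.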


0
Regex.Replace(document.body.innerHTML, @"\s+", "");

using document.body.innerHTML instead may work. I am not sure.

1 Comment

I need it in C# (server side)
0
Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

There are risks related to tags, unclosed tags etc. I hope you have some control over the 'dynamic content that comes from different sources' as you've put it. I also hope that you've tried everything else and this comes as a last resort.

