I have the following HTML:

<h1>Text Text</h1>      <h2>Text Text</h2>

I am still getting a handle on regular expressions and am trying to write one that removes the whitespace between the tags.

I would like the final result to be:

<h1>Text Text</h1><h2>Text Text</h2>

Any help would be greatly appreciated!

UPDATE

I would like to strip out all whitespace between tags: spaces, tabs and newlines. So if I have:

<div>    <h1>Text Text</h1>      <h2>Text Text</h2>     </div>

I would like it to end up as:

<div><h1>Text Text</h1><h2>Text Text</h2></div>
2
  • All whitespace or only spaces and tabs? If you preserve newlines do you still want to eliminate spaces and tabs? For all tag names or specifically h1 then h2? Commented Sep 1, 2009 at 14:58
  • Good point! I just want to eliminate the white spaces, new lines and tabs. Commented Sep 1, 2009 at 15:07

3 Answers

1

If it's just this specific case, here's a suitable regex to find all the spaces:

Regex regexForBreaks = new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);

However, I think a regex is the wrong approach here if this is a more general case. For example, it's possible for tags to be nested within other tags, and then your problem needs a little more detail to figure out the right answer. As Jamie Zawinski said, "Some people, when confronted with a problem, think, 'I know, I'll use regular expressions.' Now they have two problems."
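For the specific h1/h2 case, applying that pattern could look roughly like the sketch below. The class name HeadingGap and the method name Collapse are made up for illustration; only the pattern comes from the answer.

```csharp
using System.Text.RegularExpressions;

public static class HeadingGap
{
    // Same pattern as in the answer: the tail of the h1 tag, any whitespace,
    // then the start of the h2 tag.
    private static readonly Regex RegexForBreaks =
        new Regex(@"h1>[\s]*<h2", RegexOptions.Compiled);

    // Hypothetical helper: replaces each match with the two fragments joined directly.
    public static string Collapse(string html)
    {
        return RegexForBreaks.Replace(html, "h1><h2");
    }
}
```

Calling HeadingGap.Collapse on the first sample from the question should give the desired single-line result, but it does nothing for other tag names such as the div in the update.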

3 Comments

Not sure I understand that last bit. Remove h1 and h2 and you've got the general case; what additional problem do you perceive?
Good point! I just want to eliminate the white spaces, new lines and tabs.
@AnthonyWJones: You can't do that. Imagine this case: "<pre><div>foo</div> bar <div>baz</div></pre>". The whitespace is intentional here and removing it will change the meaning.
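If a regex is used anyway, one way to respect the pre concern is to match whole pre blocks first and leave them alone. This is only a sketch under assumptions: PreAwareStripper and Strip are hypothetical names, pre elements are assumed not to be nested, and other whitespace-sensitive content (textarea, inline elements separated by a space) is not handled.

```csharp
using System.Text.RegularExpressions;

public static class PreAwareStripper
{
    // Alternative 1 captures a complete <pre>...</pre> block.
    // Alternative 2 matches whitespace that sits directly between '>' and '<'.
    // Lookarounds keep the '<' of a following <pre> tag from being consumed.
    private static readonly Regex Pattern = new Regex(
        @"(<pre\b[^>]*>.*?</pre\s*>)|(?<=>)\s+(?=<)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static string Strip(string html)
    {
        // Keep pre blocks verbatim; drop inter-tag whitespace everywhere else.
        return Pattern.Replace(html, m => m.Groups[1].Success ? m.Value : "");
    }
}
```

The match evaluator is what makes this work: because the pre alternative is tried first at each position, anything inside a pre block is consumed as one match and returned unchanged.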
0

One alternative to using a regex or string replace is the Html Agility Pack.

If you do want a regex, here's a rough guess for the specific h1/h2 case:

/// <summary>
///  Regular expression built for C# on: Tue, Sep 1, 2009, 03:56:27 PM
///  Using Expresso Version: 3.0.2766, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  <h1>
///      <h1>
///  [1]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h1>
///      </h1>
///  Match expression but don't capture it. [\s*]
///      Whitespace, any number of repetitions
///  <h2>
///      <h2>
///  [2]: A numbered capture group. [.+]
///      Any character, one or more repetitions
///  </h2>
///      </h2>
///  
///
/// </summary>
public static Regex regex = new Regex(
      "<h1>(.+)</h1>(?:\\s*)<h2>(.+)</h2>",
    RegexOptions.Singleline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );


// This is the replacement string
public static string regexReplace = 
      "<h1>$1</h1><h2>$2</h2>";
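A minimal sketch of how the pattern and replacement string might be put together follows. The names HeadingPairCleaner and Clean are hypothetical. One deliberate deviation: the sketch uses lazy quantifiers (.+?) rather than the greedy (.+) above, on the assumption that the input may contain more than one h1/h2 pair; with Singleline and greedy quantifiers a single match could stretch from the first h1 to the last h2.

```csharp
using System.Text.RegularExpressions;

public static class HeadingPairCleaner
{
    // Assumption: lazy (.+?) instead of greedy (.+), so each match stays
    // within one <h1>...</h1><h2>...</h2> pair.
    private static readonly Regex Pattern = new Regex(
        "<h1>(.+?)</h1>(?:\\s*)<h2>(.+?)</h2>",
        RegexOptions.Singleline | RegexOptions.CultureInvariant);

    // $1 and $2 re-insert the captured heading contents without the gap.
    private const string Replacement = "<h1>$1</h1><h2>$2</h2>";

    public static string Clean(string html)
    {
        return Pattern.Replace(html, Replacement);
    }
}
```

Whitespace outside a matched pair, such as a newline between two pairs, is left as it is.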

0

How about: Regex.Replace(str, @">\s+<","><")

4 Comments

Misses situations where you have legitimate angle bracket characters in between elements: <element> > </element>
Addendum: By "misses", I mean it's overzealous. It will remove the space between > and </element> even though it should not.
Is "<element> > </element>" even valid HTML? Don't you have to use a character reference (&gt;) for angle brackets inside the text of an element?
The closing bracket (>) is valid as literal text; the opening bracket (<) isn't.
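One possible way around the overzealous match is to require that the whitespace follow a complete tag rather than a bare > character. The sketch below relies on .NET's support for variable-length lookbehind; TagGapStripper and Strip are hypothetical names, and it assumes no attribute value or comment contains a literal < or >.

```csharp
using System.Text.RegularExpressions;

public static class TagGapStripper
{
    // (?<=<[^<>]+>) : the whitespace must directly follow a complete tag, not a bare '>'.
    // (?=<)         : and must be directly followed by the start of another tag.
    private static readonly Regex Gap = new Regex(@"(?<=<[^<>]+>)\s+(?=<)");

    public static string Strip(string html)
    {
        // Only the whitespace itself is matched, so it is replaced with nothing.
        return Gap.Replace(html, "");
    }
}
```

With this pattern the div example from the question collapses as requested, while "<element> > </element>" is left untouched because the literal > is not the end of a tag.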
