0

I have a simple requirement: extract text from HTML. Suppose the HTML is

<h1>hello</h1> ... <img moduleType="calendar" /> ...<h2>bye</h2> 

I want to split it into three parts:

<h1>hello</h1> 
<img moduleType="calendar" />
<h2>bye</h2> 

The aim is to split the text into two categories: plain HTML, and special tags of the form <img moduleType="calendar" />.

5
  • /me sigh... another "how to parse html with regex" question... Commented Apr 22, 2010 at 19:11
  • What language are you coding in? There's likely a better solution than regular expressions, many languages have DOM parsers. Also, you might want to accept answers on some of your other questions to improve the quality/quantity of future answers. Commented Apr 22, 2010 at 19:12
  • 5
    stackoverflow.com/questions/1732348/… Commented Apr 22, 2010 at 19:16
  • Check the answers. Commented Apr 23, 2010 at 0:06
  • stackoverflow.com/questions/1732348/… Commented Apr 23, 2010 at 14:25

3 Answers

1

Don't do that; HTML can be broken in many beautiful ways. Use Beautiful Soup instead.
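To illustrate the parser-based route, here is a minimal sketch that uses Python's standard-library `html.parser` as a stand-in for Beautiful Soup (a swap made only so the example has no third-party dependency). The names `ModuleSplitter` and `split_modules` are hypothetical, and the sketch assumes that any `<img>` carrying a `moduleType` attribute counts as a special tag.

```python
from html.parser import HTMLParser


class ModuleSplitter(HTMLParser):
    """Split markup into plain-HTML chunks and <img moduleType=...> tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []   # list of (kind, text) tuples, in document order
        self._plain = []  # raw markup collected since the last module tag

    def _flush(self):
        # Emit the collected plain markup as one chunk, if there is any.
        text = "".join(self._plain).strip()
        if text:
            self.parts.append(("html", text))
        self._plain = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        # html.parser lowercases attribute names: moduleType -> moduletype.
        if tag == "img" and any(name == "moduletype" for name, _ in attrs):
            self._flush()
            self.parts.append(("module", raw))
        else:
            self._plain.append(raw)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> arrive here.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # The raw end tag is not exposed, so it is rebuilt in lowercase.
        self._plain.append(f"</{tag}>")

    def handle_data(self, data):
        self._plain.append(data)

    def handle_entityref(self, name):
        self._plain.append(f"&{name};")

    def handle_charref(self, name):
        self._plain.append(f"&#{name};")

    def close(self):
        super().close()
        self._flush()


def split_modules(html):
    parser = ModuleSplitter()
    parser.feed(html)
    parser.close()
    return parser.parts
```

Calling `split_modules('<h1>hello</h1> <img moduleType="calendar" /> <h2>bye</h2>')` yields three parts: an `html` chunk, a `module` tag, and another `html` chunk. Comments and doctype declarations are dropped in this sketch, so it is not a lossless round trip.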


0

It depends on the language and context you are using. I do something similar in my CMS: my approach is to find the tags first and then their attributes.

Get tags

"<img (.*?)/>"

Then I search through the result for specific attributes

'title="(.*?)"'

If you want to find all attributes, you could replace the literal title with a pattern such as [a-z]+ (or a run of non-whitespace characters), and then loop through those results as well.
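The two-step idea above might look like the following in Python. This is only a sketch: the function name `img_attributes` is hypothetical, and it assumes self-closing `<img ... />` tags with double-quoted attribute values. An attribute value containing `/>`, a single-quoted value, or an unquoted value would defeat it, which is the usual weakness of regex-based HTML handling.

```python
import re

# Step 1: capture the attribute text of every self-closing <img ... /> tag.
IMG_TAG = re.compile(r"<img\s+(.*?)/>", re.IGNORECASE | re.DOTALL)

# Step 2: pull name="value" pairs out of that attribute text.
ATTRIBUTE = re.compile(r'([\w:-]+)\s*=\s*"(.*?)"', re.DOTALL)


def img_attributes(html):
    """Return one dict of attributes per <img ... /> tag found in html."""
    return [
        dict(ATTRIBUTE.findall(match.group(1)))
        for match in IMG_TAG.finditer(html)
    ]
```

For example, `img_attributes('<img moduleType="calendar" title="May" />')` returns `[{'moduleType': 'calendar', 'title': 'May'}]`.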

1 Comment

Fighting against the downvotes you'll get -- Welcome to SO ;-) Include known problems/limitations in your answer. HTML parsing with regular expressions is almost always stomped on.
0

I am actually trying to do something similar to what the ASP.NET compiler does when it compiles markup into a server control tree; the ASP.NET compiler relies heavily on regular expressions. I have a temporary solution. It is not nice, but it seems to work.

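using System;
using System.Text.RegularExpressions;

//string source = "<h1>hello</h1>";
string source = "<h1>hello<img moduleType=\"calendar\" /></h1> <p> <img moduleType=\"calendar\" /> </p> <h2>bye</h2> <img moduleType=\"calendar\" /> <p>sss</p>";
// Group 1: plain HTML before the tag (may be empty). Group 2: the <img ... /> tag itself.
// Singleline lets '.' match newlines, so multi-line markup is handled too.
Regex exImg = new Regex("(.*?)(<img.*?/>)", RegexOptions.Singleline);

var match = exImg.Match(source);
int lastEnd = 0;
while (match.Success)
{
    // Group 1 is empty when the source starts with <img or two tags are adjacent.
    if (match.Groups[1].Length > 0)
        Console.WriteLine(match.Groups[1].Value);
    Console.WriteLine(match.Groups[2].Value);
    lastEnd = match.Index + match.Length;
    match = match.NextMatch();
}
// Whatever follows the last <img> tag is plain HTML.
Console.WriteLine(source.Substring(lastEnd));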
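For comparison, the same split can be sketched in Python with `re.split`, which keeps the separators when the pattern has a capturing group, so no manual bookkeeping of the last match position is needed. The function name `split_on_img` is hypothetical, and the sketch assumes self-closing `<img ... />` tags whose attribute values do not contain `/>`.

```python
import re

# The capturing group makes re.split return the <img ... /> tags as well
# as the text between them.
IMG_TAG = re.compile(r"(<img\b.*?/>)", re.IGNORECASE | re.DOTALL)


def split_on_img(source):
    """Split source into alternating plain-HTML pieces and <img ... /> tags."""
    pieces = IMG_TAG.split(source)
    # Drop the empty or whitespace-only pieces that appear around adjacent tags.
    return [piece.strip() for piece in pieces if piece.strip()]
```

Applied to the markup from the question, `split_on_img('<h1>hello</h1> ... <img moduleType="calendar" /> ...<h2>bye</h2>')` returns the three parts `'<h1>hello</h1> ...'`, `'<img moduleType="calendar" />'`, and `'...<h2>bye</h2>'`.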

