I am doing some HTML parsing in PHP, and this is the code I have right now:

function get_tag($htmlelement, $attr, $value, $xml, $arr) {
    $attr = preg_quote($attr);
    $value = preg_quote($value);
    if ($attr != '' && $value != '') {
        $tag_regex = '/<'.$htmlelement.'[^>]*'.$attr.'="'.$value.'">(.*?)<\\/'.$htmlelement.'>/si';
        preg_match($tag_regex, $xml, $matches);
    } else {
        $tag_regex = '/'.$htmlelement.'[^>]*"(.*?)\/'.$htmlelement.'/i';
        preg_match_all($tag_regex, $xml, $matches);
    }
    if ($arr) {
        return $matches;
    } else {
        return $matches[1];
    }
}
$htmlcontent = file_get_contents("doc.html");
$extract = get_tag('tbody','id', 'open', $htmlcontent,false);

$trows = get_tag('tr','', '', $htmlcontent,false);

The rows that have to be parsed (the content of $extract) can be viewed here: http://pastebin.com/ydiAdiuC.

Basically, I am reading the HTML content and extracting the tbody tag from it. Now I want to take each tr and its td values in the tbody and use them in my page. Any idea how to do this? I think I am not using preg_match_all the right way.

1 Answer

Use PHP's DOM parser (DOMDocument) for this, not regular expressions.

A quick approach:

  • Load in the HTML
  • Get the tbody tag.
  • Get the tr tags within.
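The three steps above could be sketched roughly as follows. This is only an illustration, not the answerer's own code: the helper name `get_table_rows` and the sample markup are made up for the example, and it assumes the `id` value contains no double quotes and that the table has no nested tables.

```php
<?php
// Hypothetical helper (not from the original post): returns the trimmed text
// of every td, grouped per tr, inside the tbody with the given id.
function get_table_rows(string $html, string $tbodyId): array
{
    // 1. Load in the HTML.
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // 2. Get the tbody tag with the requested id.
    // Assumption: $tbodyId contains no double quotes.
    $xpath = new DOMXPath($doc);
    $tbody = $xpath->query('//tbody[@id="' . $tbodyId . '"]')->item(0);
    if ($tbody === null) {
        return [];
    }

    // 3. Get the tr tags within, then the text of each td.
    $rows = [];
    foreach ($tbody->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Made-up sample markup standing in for the contents of doc.html.
$html = '<table><tbody id="open">'
      . '<tr><td>a</td><td>b</td></tr>'
      . '<tr><td>c</td><td>d</td></tr>'
      . '</tbody></table>';
$rows = get_table_rows($html, 'open');
print_r($rows);
```

With the real file, the call would presumably look like `get_table_rows(file_get_contents("doc.html"), 'open')`.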

3 Comments

Could you give me a short code example? The HTML tags aren't closed properly and I have no control over the HTML content.
@joza: run Tidy over it first in case it's totally broken. Otherwise tell DOMDocument to ignore errors.
@joza, invalid markup will be an issue. See hakre's comment for a way to get around this. Invalid markup would be a nightmare for regular expressions and one of the main reasons they have trouble parsing HTML.
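One way the two suggestions in these comments might look in practice is sketched below. The helper name `load_lenient_html` and the sample markup are hypothetical. The Tidy step is optional and only runs if the tidy extension is installed. "Ignoring errors" is done here with `libxml_use_internal_errors()`, which is one common approach rather than necessarily what the commenter had in mind.

```php
<?php
// Hypothetical helper (not from the original post): parses possibly invalid
// HTML without emitting a warning for every markup problem.
function load_lenient_html(string $html, bool $useTidy = false): DOMDocument
{
    // Optional clean-up pass; skipped when the tidy extension is unavailable.
    if ($useTidy && function_exists('tidy_repair_string')) {
        $repaired = tidy_repair_string($html);
        if (is_string($repaired) && $repaired !== '') {
            $html = $repaired;
        }
    }

    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();                  // discard the collected errors
    libxml_use_internal_errors($previous);  // restore the earlier setting

    return $doc;
}

// Made-up sample with unclosed td and tr tags.
$broken = '<table><tbody id="open"><tr><td>a<td>b<tr><td>c</tbody></table>';
$doc = load_lenient_html($broken);
$rowCount = $doc->getElementsByTagName('tr')->length;
$cellCount = $doc->getElementsByTagName('td')->length;
echo $rowCount, ' rows, ', $cellCount, " cells\n";
```

The parser closes the dangling tags on its own, so the resulting document can be queried for tbody, tr and td elements like any other.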
