0

I have over 500 static pages whose content is structured this way:

<section>
Some text 
<strong>Dynamic Title (Different on each page)</strong> 
<strong>Author name (Different on each page)</strong> 
<strong>Category</strong>
(<b>Content</b> <b>MORE TEXT HERE</b>)
</section> 

I need to extract the data in the format below, using PHP Simple HTML DOM Parser:

$title = <strong>Dynamic Title (Different on each page)</strong> 
$author = <strong>Author name (Different on each page)</strong> 
$category = <strong>Category</strong>
$content = (<b>Content</b> <b>MORE TEXT HERE</b>)

I have failed so far and can't get my head around it. I would appreciate any advice or a code snippet to help me get going.

EDIT 1: I have now solved the part with the strong tags using:

$html = file_get_html($url);
$content = array();
foreach($html->find('strong') as $a) {
 $content[] = $a->innertext;
}

$title= $content[0];                
$author= $content[1];

The only remaining issue: how do I extract the content within the parentheses? Can I use a similar method?

5
  • 1
    What code have you made so far? Commented Jun 10, 2014 at 12:42
  • 2
    What code have you used so far that is failing? There might be a chance you almost had it. If you post it, folks here might be able to troubleshoot it or point out the problem. Commented Jun 10, 2014 at 12:42
  • The first problem is how to loop through those strong tags. I have this code, but it selects a random one: $html = file_get_html($url); foreach($html->find('strong') as $e) $field = $e->outertext; echo $field; Commented Jun 10, 2014 at 12:51
  • Don't post code in comments. Include it in your question! Commented Jun 10, 2014 at 12:59
  • I have edited my answer to address your last question. Commented Jun 10, 2014 at 13:55

3 Answers

2

OK, first you want to get all of the <section> tags. Then you want to search through those again for the <strong> tags and <b> tags. Something like this:

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all <section> elements
foreach ($html->find('section') as $section) {
    $strong = array();

    // Collect the inner HTML of each <strong> tag inside this <section>
    foreach ($section->find('strong') as $tag) {
        $strong[] = $tag->innertext;
    }

    $title = $strong[0];
    $author = $strong[1];
    $category = $strong[2];
}

To get the parts in parentheses, just get the <b> tag text and then add the () brackets yourself. Or, if you're asking how to get the part between the brackets, use explode and then remove the closing bracket:

$pieces = explode("(", $title);
$different_on_each_page = str_replace(")","",$pieces[1]);
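
A slightly more defensive version of the same explode idea might look like the sketch below. The helper name text_in_parens is made up for illustration (it is not part of Simple HTML DOM), and it assumes you only care about the first pair of brackets in the string:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the text between
// the first "(" and the next ")" in $text, or null if there is no such pair.
function text_in_parens($text)
{
    // Split once on the first opening bracket
    $pieces = explode('(', $text, 2);
    if (count($pieces) < 2) {
        return null; // no opening bracket found
    }

    // Cut everything from the closing bracket onwards
    $end = strpos($pieces[1], ')');
    if ($end === false) {
        return null; // no closing bracket found
    }

    return substr($pieces[1], 0, $end);
}

echo text_in_parens('Dynamic Title (Different on each page)'), "\n";
// prints: Different on each page
```

Returning null instead of reading $pieces[1] blindly avoids an "undefined offset" notice on strings that contain no brackets at all.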

1 Comment

Thank you. Just before you posted your answer, I ended up with the code posted in my question above.
0
$html_code = 'html';
$dom = new \DOMDocument();
$dom->loadHTML($html_code);
$xpath = new \DOMXPath($dom); // pass the local $dom; there is no $this outside a class
$nodelist = $xpath->query("//strong");
for($i = 0; $i < $nodelist->length; $i++){
    $nodelist->item($i)->nodeValue; //gives you the text inside
}
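
For a self-contained run of this approach, here is a rough sketch. The sample markup is only modelled on the question and the real pages may differ. The XPath expressions assume the strong and b tags are direct children of section, and the libxml error suppression is there because the built-in parser may warn about the HTML5 section tag:

```php
<?php
// Sample markup modelled on the question; the real pages may differ.
$html_code = '<section>Some text <strong>Dynamic Title</strong> '
    . '<strong>Author name</strong> <strong>Category</strong> '
    . '(<b>Content</b> <b>MORE TEXT HERE</b>)</section>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html_code);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Text of every <strong> that is a direct child of <section>
$strong = array();
foreach ($xpath->query('//section/strong') as $node) {
    $strong[] = $node->nodeValue;
}

// Text of every <b> that is a direct child of <section>
$bold = array();
foreach ($xpath->query('//section/b') as $node) {
    $bold[] = $node->nodeValue;
}

print_r($strong);
print_r($bold);
```

Querying the b tags directly sidesteps the bracket problem, since the parentheses sit outside the tags in the sample markup.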

2 Comments

That isn't PHP Simple HTML DOM.
That is PHP. It uses the DOMDocument class built into PHP: php.net/manual/en/class.domdocument.php. Just put it into a PHP file, substitute your own string for the HTML, and put an echo in front of $nodelist->item($i)->nodeValue;. You will see it echo the contents of every strong tag to the screen.
0

My final code that works now looks like this.

$html = file_get_html($url);
$content = array();
foreach($html->find('strong') as $a) {
 $content[] = $a->innertext;
}

$title= $content[0];                
$author= $content[1];
$category = $content[2];


// Reuse the DOM parsed above instead of fetching the page a second time
$input = $html->plaintext;
preg_match_all("/\(.*?\)/", $input, $matches);
print_r($matches[0]);
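
If the brackets themselves are not wanted in the result, a capture group can strip them. This is only a sketch: the $input string below is a made-up stand-in for what ->plaintext might return on one page, and the s modifier is added on the assumption that the bracketed text could span several lines:

```php
<?php
// Made-up sample, standing in for the plain text of one page.
$input = 'Some text Dynamic Title Author name Category (Content MORE TEXT HERE)';

// The inner (.*?) captures the text between the brackets, non-greedily,
// so several bracketed groups on one page are matched separately.
preg_match_all('/\((.*?)\)/s', $input, $matches);

print_r($matches[0]); // full matches, brackets included
print_r($matches[1]); // captured text only, brackets stripped
```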
