
First, the idea: almost every CMS or simple website has repeating blocks, for example the list of articles on a WordPress front page, where each article is shown as a block of information: title, author, content, date, and so on. The question is how to find and separate such blocks of HTML and append each of them to an array. My first thought is that the blocks need to be cleared of classes, ids, and styles. Step 1:

<div id="box1">
    <h3 class="title_style">Title1</h3>
    <p>content for box1</p>
    <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
    <h3 class="title_style">Title2</h3>
    <p>content for box2</p>
    <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>

to

<div>
    <h3>Title1</h3>
    <p>content for box1</p>
    <div>Author Name1<span>date1</span>any text</div>
</div>
<div>
    <h3>Title2</h3>
    <p>content for box2</p>
    <div>Author Name2<span>date2</span>any text2</div>
</div>
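A minimal sketch of step 1, assuming PHP's bundled DOM extension is available. The function name `stripPresentationAttributes` and the temporary wrapper `<div>` are made up for illustration, and the sample input is shortened; this is one possible way to do it, not the only one.

```php
<?php
// Hypothetical helper: removes id, class and style attributes from every element.
function stripPresentationAttributes(string $html): string
{
    $dom = new DOMDocument();
    // Wrap the fragment in a temporary <div> so it can be read back out afterwards.
    // The LIBXML_* flags silence warnings about imperfect markup.
    $dom->loadHTML('<div>' . $html . '</div>', LIBXML_NOERROR | LIBXML_NOWARNING);

    // loadHTML() adds <html><body>; the wrapper is the first child of <body>.
    $wrapper = $dom->getElementsByTagName('body')->item(0)->firstChild;

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('.//*[@id or @class or @style]', $wrapper) as $element) {
        foreach (['id', 'class', 'style'] as $name) {
            $element->removeAttribute($name);
        }
    }

    // Serialize only the wrapper's children, i.e. the original fragment.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$clean = stripPresentationAttributes(
    '<div id="box1"><h3 class="title_style">Title1</h3><p style="color:red">content for box1</p></div>'
);
echo $clean, PHP_EOL;
```

Note that stripping attributes is optional for the comparison itself: the tag structure can be compared directly while the attributes are simply ignored.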

Step 2: I need to find each block and store it in an array. (Note that such blocks are present on almost any site, so it doesn't matter which tags they use; they just repeat with different content and attributes, and only the structure stays the same.) Each block should then become one row of a table, like this:

<table>
    <tr id="block1">
        <td>Title1</td>
        <td>content for box1</td>
        <td>Author Name1</td>
        <td>date1</td>
        <td>any text</td>
    </tr>
    <tr id="block2">
        <td>Title2</td>
        <td>content for box2</td>
        <td>Author Name2</td>
        <td>date2</td>
        <td>any text2</td>
    </tr>
</table>
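One possible logic for "only the structure is the same" is to give every element a signature built from its tag name and the tags nested inside it, and then treat siblings that share a signature as repeated blocks. The sketch below assumes PHP 7.4+ and the DOM extension; `signature` and `findRepeatedBlocks` are hypothetical names, and the code only looks at the direct children of one parent. A real page would need the same check applied while walking the whole tree, plus some tolerance for optional elements.

```php
<?php
// Hypothetical helper: describes an element by its tag and the tags nested inside it.
function signature(DOMElement $el): string
{
    $parts = [];
    foreach ($el->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $parts[] = signature($child);
        }
    }
    return $el->tagName . '(' . implode(',', $parts) . ')';
}

// Hypothetical helper: groups sibling elements that have identical signatures.
function findRepeatedBlocks(DOMElement $parent, int $minRepeats = 2): array
{
    $groups = [];
    foreach ($parent->childNodes as $child) {
        // Only elements that contain other elements are treated as candidate blocks.
        if ($child instanceof DOMElement && $child->getElementsByTagName('*')->length > 0) {
            $groups[signature($child)][] = $child;
        }
    }
    return array_filter($groups, fn(array $g) => count($g) >= $minRepeats);
}

$html = <<<HTML
<div id="box1">
  <h3 class="title_style">Title1</h3>
  <p>content for box1</p>
  <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
  <h3 class="title_style">Title2</h3>
  <p>content for box2</p>
  <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$body = $dom->getElementsByTagName('body')->item(0);

$blocks = findRepeatedBlocks($body);
foreach ($blocks as $sig => $elements) {
    echo $sig, ' x ', count($elements), PHP_EOL;
}
```

Because the signature ignores attributes and text, both boxes map to the same string even though their ids and content differ.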

Any ideas? I need the logic for how to do this, not the code itself.

  • SimpleXML or a similar library should do the trick. It will yield an array or a data structure containing all the nodes in the HTML...you can simply loop over that and output it in any format you like. Commented Feb 25, 2013 at 12:17

2 Answers


You can walk the DOM of the document using PHP's DOMDocument class.

So you can do something like this:

    $str = <<<STR
      <div id="box1">
        <h3 class="title_style">Title1</h3>
        <p>content for box1</p>
        <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
      </div>
      <div id="box2">
       <h3 class="title_style">Title2</h3>
       <p>content for box2</p>
       <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
      </div>
    STR;

    $dom = new DOMDocument();
    $dom->loadHTML($str);

    $divs = $dom->getElementsByTagName('div');

    foreach ($divs as $div) {
        // read child elements
    }
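As a rough sketch of what "read child elements" could look like, the example below collects the non-empty text nodes of each block in document order and prints one table row per block. It assumes the blocks are the top-level `<div>` elements of the sample; `$rows` and `$cells` are illustrative names. The XPath `/html/body/div` is used instead of `getElementsByTagName('div')` because the latter would also return the nested author `<div>`.

```php
<?php
$html = <<<HTML
<div id="box1">
  <h3 class="title_style">Title1</h3>
  <p>content for box1</p>
  <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
  <h3 class="title_style">Title2</h3>
  <p>content for box2</p>
  <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($dom);

$rows = [];
// Select only the top-level blocks directly under <body>.
foreach ($xpath->query('/html/body/div') as $block) {
    $cells = [];
    // Every text node that is not just whitespace becomes one cell.
    foreach ($xpath->query('.//text()[normalize-space()]', $block) as $text) {
        $cells[] = trim($text->nodeValue);
    }
    $rows[] = $cells;
}

// Render one <tr> per block.
echo "<table>\n";
foreach ($rows as $i => $cells) {
    echo '    <tr id="block', $i + 1, '">';
    foreach ($cells as $cell) {
        echo '<td>', htmlspecialchars($cell), '</td>';
    }
    echo "</tr>\n";
}
echo "</table>\n";
```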

4 Comments

The block can start with any tag, like <p> or <h1> or anything else, so I can't rely on the <div> tag.
The main problem is how to find where a block starts and where it ends.
@user1844923 Try to do it yourself first, or you should pay him to do this simple thing.
Well if each "block" always contains an <h3 class="title_style"> you can easily use XPath to find the enclosing block element. If you actually try it and update your question to show how far you got, some people here will help you get the XPath expression right (since it can be a little tricky).
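A sketch of the XPath idea from the last comment, assuming every block contains an `<h3>` whose class list includes `title_style` and that the heading is a direct child of the block. The sample markup with a `<div>` block and an `<li>` block is invented here to show that the wrapping tag does not matter.

```php
<?php
$html = <<<HTML
<div id="box1">
  <h3 class="title_style">Title1</h3>
  <p>content for box1</p>
</div>
<ul>
  <li>
    <h3 class="big title_style">Title2</h3>
    <p>content for box2</p>
  </li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$xpath = new DOMXPath($dom);

// Match the class as a whole word inside the class attribute,
// then step up to the parent element, whatever tag it is.
$query = "//h3[contains(concat(' ', normalize-space(@class), ' '), ' title_style ')]/parent::*";

$tags = [];
$blocks = [];
foreach ($xpath->query($query) as $block) {
    $tags[] = $block->tagName;           // tag of the enclosing block
    $blocks[] = $dom->saveHTML($block);  // the block's HTML, ready to store in an array
}

print_r($tags);
```

If the heading can sit deeper inside the block, `parent::*` would have to be replaced by an `ancestor::` step with some condition that identifies the block root.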

Try the Simple HTML DOM Parser library.
