Find and separate the HTML blocks to an array

Question

First, let me describe the idea. Almost any CMS or simple website contains repeating blocks of content, for example the list of articles on a WordPress front page, where each article is shown in a block of information: title, author, content, date, etc. The question is how to find and separate such blocks of HTML and append each of them to an array. My first thought is that I need to strip them of classes, ids, and styles. Step 1:

<div id="box1">
    <h3 class="title_style">Title1</h3>
    <p>content for box1</p>
    <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
    <h3 class="title_style">Title2</h3>
    <p>content for box2</p>
    <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>

to

<div>
    <h3>Title1</h3>
    <p>content for box1</p>
    <div>Author Name1<span>date1</span>any text</div>
</div>
<div>
    <h3>Title2</h3>
    <p>content for box2</p>
    <div>Author Name2<span>date2</span>any text2</div>
</div>

Step 2: I need to find each block and write it to an array, so that I can put each block into a table row like the one below. (Note that such blocks are present on almost any site, so it doesn't matter which tags they use: the blocks repeat with different content and attributes, and only the structure is the same.)

<table>
    <tr id="block1">
        <td>Title1</td>
        <td>content for box1</td>
        <td>Author Name1</td>
        <td>date1</td>
        <td>any text</td>
    </tr>
    <tr id="block2">
        <td>Title2</td>
        <td>content for box2</td>
        <td>Author Name2</td>
        <td>date2</td>
        <td>any text2</td>
    </tr>
</table>

Any ideas? I need the logic for how to do this, not the code itself.

SimpleXML or a similar library should do the trick. It will yield an array or a data structure containing all the nodes in the HTML... you can simply loop over that and output it in any format you like.
– Till Helge, Commented Feb 25, 2013 at 12:17

ebohlman · Accepted Answer · 2013-03-09 19:56:38Z

2

You can walk the DOM of the document using PHP's DOMDocument class.

So you can do something like this:

    $str = <<<STR
      <div id="box1">
        <h3 class="title_style">Title1</h3>
        <p>content for box1</p>
        <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
      </div>
      <div id="box2">
        <h3 class="title_style">Title2</h3>
        <p>content for box2</p>
        <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
      </div>
    STR;

    $dom = new DOMDocument();
    $dom->loadHTML($str);

    // Note: this also matches nested <div>s such as <div class="author">
    $divs = $dom->getElementsByTagName('div');

    foreach ($divs as $div) {
        // read child elements
        foreach ($div->childNodes as $child) {
            if ($child->nodeType === XML_ELEMENT_NODE) {
                echo $child->nodeName, ': ', trim($child->textContent), "\n";
            }
        }
    }
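
To get from the DOM walk above to the rows the question asks for, one possible approach is to treat every direct child of `<body>` as a block and collect its non-empty text nodes in document order. This is only a sketch under that assumption; the function name `extractRows` and the variable `$sample` are made up for illustration, and real pages will usually need a more specific XPath expression to locate the blocks.

```php
<?php
// Sketch: turn each top-level block into an array of its text fragments.
// Assumption: every block is a direct child of <body>, whatever its tag.
function extractRows(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);            // libxml wraps the fragment in <html><body>
    $xpath = new DOMXPath($dom);

    $rows = [];
    foreach ($xpath->query('/html/body/*') as $block) {
        $cells = [];
        // Non-whitespace text nodes anywhere inside the block, in document order.
        foreach ($xpath->query('.//text()[normalize-space()]', $block) as $text) {
            $cells[] = trim($text->nodeValue);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Sample input copied from the question.
$sample = <<<HTML
<div id="box1">
  <h3 class="title_style">Title1</h3>
  <p>content for box1</p>
  <div class="author">Author Name1<span class="style_date">date1</span>any text</div>
</div>
<div id="box2">
  <h3 class="title_style">Title2</h3>
  <p>content for box2</p>
  <div class="author">Author Name2<span class="style_date">date2</span>any text2</div>
</div>
HTML;

$rows = extractRows($sample);
print_r($rows);
```

Because only text nodes are read, classes, ids, and styles never reach the result, so the separate cleanup step from the question is not needed here. Each inner array can then be written out as one `<tr>` with one `<td>` per entry.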

edited Mar 9, 2013 at 19:56

ebohlman

answered Feb 25, 2013 at 12:20

Husman

4 Comments

user1844923 Over a year ago

The block can start with any tag, such as <p> or <h1> or any other, so I can't rely on the <div> tag.

user1844923 Over a year ago

The main problem is how to find where a block starts and where it ends.

Anirudha Over a year ago

@user1844923 Try to do it first, or you should pay him to do this simple thing.

ebohlman Over a year ago

Well, if each "block" always contains an <h3 class="title_style">, you can easily use XPath to find the enclosing block element. If you actually try it and update your question to show how far you got, some people here will help you get the XPath expression right (since it can be a little tricky).
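
As a rough illustration of the XPath idea in the comment above, the sketch below looks up every `<h3>` whose class list contains `title_style` and takes its parent as the block, regardless of the parent's tag. It assumes the enclosing block is the direct parent of the heading, which holds for the sample in the question but may not hold on other pages. The function name `findBlocksByTitle` and the variable `$page` are hypothetical.

```php
<?php
// Sketch: locate blocks by a known marker element instead of by tag name.
// Assumption: the block is the direct parent of <h3 class="title_style">.
function findBlocksByTitle(string $html): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    // The concat/normalize-space idiom matches "title_style" as a whole
    // class token, even when the attribute lists several classes.
    $query = '//h3[contains(concat(" ", normalize-space(@class), " "), " title_style ")]/parent::*';

    $blocks = [];
    foreach ($xpath->query($query) as $block) {
        $blocks[] = $dom->saveHTML($block);   // serialized HTML of one block
    }
    return $blocks;
}

// Shortened sample based on the question.
$page = <<<HTML
<div id="box1">
  <h3 class="title_style">Title1</h3>
  <p>content for box1</p>
</div>
<div id="box2">
  <h3 class="title_style">Title2</h3>
  <p>content for box2</p>
</div>
HTML;

$blocks = findBlocksByTitle($page);
echo count($blocks), " blocks found\n";
```

If the block boundary sits further up the tree, `parent::*` could be swapped for an `ancestor::` step with a suitable predicate, but which predicate fits depends on the markup of the page being parsed.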

Dino Babu · Accepted Answer · 2013-02-25 12:18:06Z

1

Try the Simple HTML DOM Parser library.

answered Feb 25, 2013 at 12:18

Dino Babu
