2

I need to split HTML content in PHP into a (JSON) array. That is, starting from this markup:

<p>Text Level 0</p>
<section class="box_1">
    <header class="trigger"><h2>Title</h2></header>
    <div class="content">
        <div class="box_2">
            <div class="class"></div>
            <div class="content">
                <p>Text Level 2</p>
                <p>More Text Level 2</p>
            </div>
        </div>
        <div class="box_2">
            <div class="class"></div>
            <div class="content">
                <p>Text Level 2</p>
                <div class="box_3">
                    <div class="content">
                        <p>Text Level 3</p>
                    </div>
                </div>
            </div>
        </div>
    </div>
</section>
<p>Another Text</p>

I want this result:

0: "Text Level 0"; 2: "Text Level 2\nMore Text Level 2"; 2: "Text Level 2"; 3: "Text Level 3"; 0: "Another Text";

That means I need the "level" of each text and the text itself, but I don't know how to do that. Should I use a regex, or should I parse the content (e.g. with simple_html_dom.php)?

Something like:

  • Find every p-element inside a "content"-class element
  • Look up the closest "box_*"-class parent -> level number
  • Combine all p-elements of the same "content" container
  • If a p-element is not in a "content" container -> Level 0

But how can I do that in php?

4
  • 1
    Why not use javascript?? Also if you're echoing that content in a php page. Why not just build the array without caring where the elements are in the dom? I guess my real question is....what context are you getting that DOM from that you need to parse it with PHP? Commented Jun 29, 2014 at 8:31
  • 1
    Your title mentions splitting into an array. But your desired result isn't a valid array -- you can't have the index 2 multiple times in the same array. Commented Jun 29, 2014 at 8:43
  • You are right, the example isn't a correct array. But this should only be the content... {"id":0, "level":"0", "content": "Text Level 0"} and so on... Commented Jun 29, 2014 at 9:07
  • @Kylek: The data comes from a SQL DB, and the content should be transformed and sent via JSON to a mobile app. That's why JavaScript isn't an option. Commented Jun 29, 2014 at 9:13

2 Answers

1

A lot of people here distrust parsing HTML with regex, and with good reason in most cases. The preferred solution is a DOM parser. That said, if you want to handle this specific input with regex, it is entirely possible. Here is one of several ways to do it:

(?s)<p>\K.*?(?=</p>)

Sample PHP Code

(See the output at the bottom of the online demo):

$regex = '~(?s)<p>\K.*?(?=</p>)~';
preg_match_all($regex, $yourstring, $matches);
print_r($matches[0]);

$matches[0] is the array of matches (see output). You can then transform it to whatever other format you like.

Output:

[0] => Text Level 0
[1] => Text Level 2
[2] => More Text Level 2
[3] => Text Level 2
[4] => Text Level 3
[5] => Another Text
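
For the JSON that the question is ultimately after, the matches can be run through `json_encode`. The following is only a minimal sketch on a shortened input: the `id` field is simply the match index, the `$records` name is made up for illustration, and the level is not available at this point because the regex does not look at the surrounding containers.

```php
<?php
// Shortened sample input; the real string would be the full markup.
$yourstring = '<p>Text Level 0</p><p>Another Text</p>';

preg_match_all('~(?s)<p>\K.*?(?=</p>)~', $yourstring, $matches);

// Wrap each match in a small record; "id" is just the match index.
$records = [];
foreach ($matches[0] as $id => $content) {
    $records[] = ['id' => $id, 'content' => $content];
}

echo json_encode($records);
// prints [{"id":0,"content":"Text Level 0"},{"id":1,"content":"Another Text"}]
```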

Explanation

  • <p> matches the opening tag
  • The \K tells the engine to drop what was matched so far from the final match it returns
  • .*? lazily matches any chars (this is the match) up to...
  • a point where the lookahead (?=</p>) can assert that what follows is the closing tag.
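
The DOM-parser route mentioned at the top of this answer could look roughly like the sketch below. It uses PHP's bundled `DOMDocument` and `DOMXPath` rather than simple_html_dom.php, and rests on a few assumptions: the function name `extractLevels` is made up for illustration, the level is taken from the nearest ancestor whose `class` attribute starts with `box_`, and paragraphs are merged only when they follow each other inside the same parent element.

```php
<?php
// Hypothetical helper: returns a list of ['level' => int, 'content' => string].
function extractLevels(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // keep warnings about HTML5 tags quiet
    $doc->loadHTML($html);              // libxml adds <html><body> around the fragment
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $result = [];
    $lastContainer = null;              // parent of the previously seen <p>

    foreach ($xpath->query('//p') as $p) {
        // Nearest ancestor whose class attribute starts with "box_", if any.
        $box = $xpath->query('ancestor::*[starts-with(@class, "box_")][1]', $p)->item(0);
        $level = 0;
        if ($box !== null && preg_match('/^box_(\d+)/', $box->getAttribute('class'), $m)) {
            $level = (int) $m[1];
        }

        $container = $p->parentNode;
        $text = trim($p->textContent);

        // Merge consecutive paragraphs that share the same parent container.
        if ($level > 0 && $lastContainer !== null && $container->isSameNode($lastContainer)) {
            $result[count($result) - 1]['content'] .= "\n" . $text;
        } else {
            $result[] = ['level' => $level, 'content' => $text];
        }
        $lastContainer = $container;
    }
    return $result;
}

$html = <<<'HTML'
<p>Text Level 0</p>
<section class="box_1"><div class="content">
  <div class="box_2"><div class="content"><p>Text Level 2</p><p>More Text Level 2</p></div></div>
  <div class="box_2"><div class="content"><p>Text Level 2</p>
    <div class="box_3"><div class="content"><p>Text Level 3</p></div></div>
  </div></div>
</div></section>
<p>Another Text</p>
HTML;

$result = extractLevels($html);
echo json_encode($result);
```

For the sample markup this should produce five entries with levels 0, 2, 2, 3 and 0, the two adjacent level-2 paragraphs being joined with a newline. An element carrying several classes (say `class="foo box_2"`) would not be picked up by `starts-with`, so the XPath would need adjusting for such markup.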

5 Comments

But elements [1] and [2] should be combined into "Text Level 2\nMore Text Level 2" because both belong to the same "content" container...
Entirely feasible, but I don't have the time right now. If you think this may be helpful and a step in the right direction, let me know and I might look at it again later.
If you think a regex solution is possible... that would be great. The elements of the same content-container should be put together (as mentioned above) and I need the number of the closest parent with the class "box_*". I would need the *. So this information would be my output data: the box-number (=level) and the content.
Not understanding which number you want. From <section class="box_1"> or from <div class="box_2"> ?
Always the first/nearest class. So for the second match: box_2. For "Text Level 3" -> box_3
1

Regex

[\w\s\d]+(?=\<\/p)

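$re = '/[\w\s\d]+(?=\<\/p)/'; // single quotes keep the backslashes literal
$str = "<p>Text Level 0</p>"; // sample from your larger string

preg_match_all($re, $str, $matches);
print_r($matches[0]); // the text immediately before each </p>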

Demo

The OP doesn't need this in JS, but I hope someone can help by converting it into PHP. I am not very proficient in PHP.

var domString = '<p>Text Level 0</p><section class="box_1"><div class="content"><div class="box_2"><div class="class"></div><div class="content"><p>Text Level 2</p><p>More Text Level 2</p></div></div><div class="box_2"><div class="class"></div><div class="content"><p>Text Level 2</p><div class="box_3"><div class="content"><p>Text Level 3</p></div></div></div></div></div></section><p>Another Text</p>'

var result = domString.match(/[\w\s\d]+(?=\<\/p)/g)

var parentTagSubString = function(str,startTagStr,endTagStr,refSearchStr) {
    var posRefSearchStr = str.indexOf(refSearchStr); // declared locally to avoid an implicit global
    var posStartParentTag = str.lastIndexOf(startTagStr, posRefSearchStr)
    var posEndParentTag = str.indexOf(endTagStr, posRefSearchStr)
    return str.substring(posStartParentTag,posEndParentTag + endTagStr.length)
}
//explanation parentTagSubString function
// given a string - "refSearchStr"
// Search towards its left for "startTagStr"
// and
// search towards right for "endTagStr"
// within the string - "str"

for(var i=0;i<result.length;i++) {
    var found = parentTagSubString(domString, "box_", "<p>", result[i])
    //If p-element is not in "content" -> Level 0
    //as mentioned by OP
    if((found.indexOf(result[i]) == 3) || (found.indexOf(result[i]) == -1)) {
        console.log('level is 0 : ', result[i])
    } else {
        //we searched backward till Box and if box found
        //it must be at starting point
        if(found.indexOf("box_") == 0) {
            //search for immediate number after "box_"
            console.log("Level is: ", found.match(/[\d]+/).join(''), " ", result[i])
        }
    }
}

//Sample Output
//level is 0 :  Text Level 0
//Level is:  2   Text Level 2
//Level is:  2   More Text Level 2
//Level is:  2   Text Level 2
//Level is:  3   Text Level 3
//level is 0 :  Another Text 
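
Since the question needs PHP, here is a rough server-side sketch in the same regex-driven spirit. It is not a literal port of the JavaScript above: instead of searching backwards for the nearest "box_" string, it scans the tags from left to right and keeps a stack of open `div`/`section` elements, so a box that has already been closed is not mistaken for a parent. The function name `levelsByScanning` is made up, and the sketch assumes well-nested markup, `<p>` tags without attributes, and classes written exactly as `class="box_N"`.

```php
<?php
// Hypothetical helper: returns a list of ['level' => int, 'content' => string].
function levelsByScanning(string $html): array
{
    // Either an opening/closing div or section tag, or a whole <p>...</p>.
    $pattern = '~<(/?)(div|section)\b([^>]*)>|<p>(.*?)</p>~s';
    preg_match_all($pattern, $html, $tokens, PREG_SET_ORDER);

    $stack = [];      // open containers as [id, level or null]
    $nextId = 1;
    $lastId = null;   // container id of the previous paragraph
    $result = [];

    foreach ($tokens as $t) {
        if ($t[2] !== '') {                       // a div/section tag
            if ($t[1] === '/') {
                array_pop($stack);                // closing tag
            } else {
                $level = preg_match('~class="box_(\d+)"~', $t[3], $m) ? (int) $m[1] : null;
                $stack[] = [$nextId++, $level];   // opening tag
            }
            continue;
        }

        // A paragraph: its level comes from the nearest open box, else 0.
        $level = 0;
        for ($i = count($stack) - 1; $i >= 0; $i--) {
            if ($stack[$i][1] !== null) {
                $level = $stack[$i][1];
                break;
            }
        }
        $id = $stack ? $stack[count($stack) - 1][0] : 0;
        $text = trim($t[4] ?? '');

        // Join paragraphs that follow each other in the same container.
        if ($level > 0 && $id === $lastId) {
            $result[count($result) - 1]['content'] .= "\n" . $text;
        } else {
            $result[] = ['level' => $level, 'content' => $text];
        }
        $lastId = $id;
    }
    return $result;
}

$domString = '<p>Text Level 0</p><section class="box_1"><div class="content"><div class="box_2"><div class="class"></div><div class="content"><p>Text Level 2</p><p>More Text Level 2</p></div></div><div class="box_2"><div class="class"></div><div class="content"><p>Text Level 2</p><div class="box_3"><div class="content"><p>Text Level 3</p></div></div></div></div></div></section><p>Another Text</p>';

$levels = levelsByScanning($domString);
foreach ($levels as $row) {
    echo $row['level'], ': ', json_encode($row['content']), "\n";
}
```

Like any regex-based scanner, this breaks down on markup that is not well nested or that uses other container tags, which is where a real DOM parser is the safer choice.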

6 Comments

But how do I get the level-number of the element?
OK, but the definition of "level" is not clear from your question. On what basis should I assign the levels?
For each match, I have to get the closest parent with the class "box_*", and the * gives the level. If there is no parent with such a "box_*" class, then the level is "0". E.g. the second match has the parent "box_2" -> Level = 2
I tried to fetch levels but RegEx and output are becoming complex and I am not so familiar with php. Can JavaScript help you?
Unfortunately not :-( I need this in PHP