I found some related threads here on SO, but none of them covers exactly what I am looking for.

Here is a sample of text that I have:

Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook

The desired output:

2012-12-13
Peter Novak
books,cinema,facebook

I need to save this information to our database, but I don't know how to detect the label between the <b> tags (e.g. Date) and then the value that immediately follows it (in this case, 2012-12-13).

I would be grateful for any help with this, thank you!

  • It's going to be messy because the HTML isn't semantic. Is there any other way of retrieving the information? XML? Commented Jan 4, 2013 at 19:29
  • Is there a parent element surrounding the HTML you've posted? Commented Jan 4, 2013 at 20:17

4 Answers

1

Since there's not much DOM to traverse, there's not much a DOM traversal tool can do with this.

This should work:

1) Remove everything before the b tag.

2) Remove the b tags. A DOM traversal tool can do this, but if they are pure text, even a regex can do it, and it can remove the colon and the subsequent whitespace in the same pass: <b\s*>[^<]+</b\s*>:\s*

3) Change sequences of br tags to bare newlines (do you really want to?). The DOM traversal tool can do this, but so can regexes: (?:<br\s*/?>)+

// Drop everything before the first <b> tag (the lookahead keeps the tag itself).
$html = preg_replace('#^.*?(?=<b\s*>)#s', "", $html);
$html = preg_replace('#<b\s*>[^<]+</b\s*>:\s*#', "", $html);
$html = preg_replace('#(?:<br\s*/?>)+#', "\n", $html);
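
If the labels are needed as well as the values (for example, to map each value to a database column), the same pattern can capture instead of delete. The following is a minimal sketch, assuming the input always has the `<b>Label</b>: value` shape from the question; the `$fields` array name is just an illustration.

```php
<?php
// Sample input copied from the question.
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Group 1: the label inside <b>...</b>.
// Group 2: the text after the colon, up to the next tag or the end of the string.
preg_match_all('#<b\s*>([^<]+)</b\s*>:\s*([^<]*)#', $html, $matches, PREG_SET_ORDER);

$fields = [];
foreach ($matches as $m) {
    $fields[trim($m[1])] = trim($m[2]);
}

print_r($fields);
```

With the sample text, `$fields` holds the keys Date, Name and Hobby with their corresponding values.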

0

If <b>Date</b>, <b>Name</b>, <b>Hobby</b> and the <br /> tags always appear in exactly this form, I suggest you use strpos() and substr().

For instance, to get the date:

// Get start position, +13 because of "<b>Date</b>: "
$dateStartPos = strpos($yourText, "<b>Date</b>") + 13;
// Get end position, use dateStartPos as offset
$dateEndPos = strpos($yourText, "<br />", $dateStartPos);
// Cut out the date, the length is the end position minus the start position
$date = substr($yourText, $dateStartPos, ($dateEndPos - $dateStartPos));
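
The same idea can be wrapped in a small function so it works for every label and does not depend on a hard-coded offset. This is only a sketch: `extractField` is a hypothetical helper name, it assumes PHP 7.1 or later for the nullable return type, and it assumes the exact `<b>Label</b>: ` and `<br />` spelling shown in the question.

```php
<?php
/**
 * Hypothetical helper: returns the text between "<b>$label</b>: " and the
 * next "<br />", or null when the label is not present.
 */
function extractField(string $text, string $label): ?string
{
    $marker = "<b>{$label}</b>: ";
    $start = strpos($text, $marker);
    if ($start === false) {
        return null;
    }
    // Skip past the marker itself instead of adding a fixed number.
    $start += strlen($marker);
    $end = strpos($text, '<br />', $start);
    // The last field has no trailing <br />, so read to the end of the string.
    return $end === false ? substr($text, $start) : substr($text, $start, $end - $start);
}

$yourText = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

$date  = extractField($yourText, 'Date');
$name  = extractField($yourText, 'Name');
$hobby = extractField($yourText, 'Hobby');
```

Checking the result of strpos() against `false` with `===` matters here, because a match at position 0 would otherwise be mistaken for "not found".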

0

Assuming the format is consistent, explode() can work for you:

<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";
$tokenized = explode(': ', $text);
$tokenized[1] = explode("<br", $tokenized[1]);
$tokenized[2] = explode("<br", $tokenized[2]);
$tokenized[3] = explode("<br", $tokenized[3]);

$date = $tokenized[1][0];
$name = $tokenized[2][0];
$hobby = $tokenized[3][0];

echo $date, PHP_EOL;
echo $name, PHP_EOL;
echo $hobby, PHP_EOL;

?>
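
Splitting on ': ' first breaks as soon as the introductory text or a value contains a colon followed by a space. One possible variation, sketched below under the assumption that every field sits between literal `<br />` separators, is to split on the line breaks first and only then on the first ': ' of each piece. The `$fields` array is an illustrative name.

```php
<?php
$text = "Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook";

$fields = [];
foreach (explode('<br />', $text) as $chunk) {
    // A limit of 2 keeps any later ": " inside the value.
    $parts = explode(': ', $chunk, 2);
    if (count($parts) === 2) {
        // strip_tags() turns "<b>Date</b>" into "Date".
        $fields[strip_tags($parts[0])] = $parts[1];
    }
}

print_r($fields);
```

Pieces without a ': ' (the introductory text and the empty piece between the two consecutive `<br />` tags) are skipped. If the introductory text itself contained ': ', it would still be picked up as a field, so this remains a format-dependent approach.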

0

Using PHP Simple HTML DOM Parser, you can do this easily with jQuery-like selectors.

include('simple_html_dom.php');
$html = str_get_html('Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook');

Or

$html = file_get_html('http://your_page.com/');

then

foreach($html->find('text') as $t){
    if(substr($t, 0, 1)==':')
    {
        // do whatever you want; trim() drops the space left after the colon
        echo trim(substr($t, 1)).'<br />';
    }
}

The output of the example is given below

2012-12-13
Peter Novak
books,cinema,facebook
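
If adding a third-party library is not an option, a similar walk can be sketched with PHP's built-in DOM extension instead of Simple HTML DOM. This is a different technique from the one above, shown only as an illustration: it assumes the DOM extension is enabled, wraps the fragment in a `<div>` purely for parsing, and pairs each `<b>` element with the text node that directly follows it.

```php
<?php
$html = 'Some text bla bla bla bla<br /><b>Date</b>: 2012-12-13<br /><br /><b>Name</b>: Peter Novak<br /><b>Hobby</b>: books,cinema,facebook';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');

$fields = [];
foreach ($doc->getElementsByTagName('b') as $b) {
    $next = $b->nextSibling;
    // Only accept a text node immediately after the <b> element.
    if ($next !== null && $next->nodeType === XML_TEXT_NODE) {
        // Strip the leading ": " and any surrounding whitespace.
        $fields[trim($b->textContent)] = trim(ltrim($next->nodeValue, ': '));
    }
}

print_r($fields);
```

Here the label comes from the element and the value from its sibling, so no position in the string has to be assumed.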
