How to extract data from this string

Question

I am not good at writing patterns to extract data. I have a long document, and below is the specific string that I need to extract data from.

<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>

I want to extract the XXXX, YYYYY, and ZZZZZ values.

My first step is to get XXXX<a href="YYYYY" target="_blank"><span>¥ZZZZZ

$pattern = '/<p><span id="minPrice">^</span></a></span>/';
preg_match($pattern, $data, $matches);
echo ($matches[1]);

But it does not work. How can I extract XXXX, YYYYY, and ZZZZZ?

The document I have is full of badly encoded characters, so I cannot use loadHTML. It just returns an error.

UPDATE 1: So I am able to do

        var_dump(libxml_use_internal_errors(true));
        $DOM = new DOMDocument;
        $DOM->loadHTML($data);
        $items = $DOM->getElementById('minPrice');

And $items is

 DOMElement Object
(
    [tagName] => span
    [schemaTypeInfo] => 
    [nodeName] => span
    [nodeValue] => 最安価格(税込)：¥131,649
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => 
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => span
    [baseURI] => 
    [textContent] => 最安価格(税込)：¥131,649
)

The html is

<span id="minPrice">
    最安価格(税込)：
    <a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank">
        <span>&yen;131,649</span>
    </a>
</span>

How can I extract http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku and 131,649?

Regex is not the correct tool for parsing HTML/XML; you can use DOMDocument instead.
– Narendrasingh Sisodia, Commented Mar 18, 2016 at 8:55
@John: Did you try calling libxml_use_internal_errors(true); when reading the HTML in?
– Wiktor Stribiżew, Commented Mar 18, 2016 at 8:57
@John have a look at this, it might help you approach it correctly
– DevDonkey, Commented Mar 18, 2016 at 8:57

Wiktor Stribiżew · Accepted Answer · 2016-03-18 09:44:29Z

You can use the following line to enable libxml's internal error handling, so that the DOM parser collects errors on malformed markup instead of emitting warnings:

libxml_use_internal_errors(true);
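
As a rough sketch of what that call changes (the sample markup and the variable names below are made up for illustration): with internal errors enabled, loadHTML() stays quiet on malformed markup, and whatever the parser complained about can be inspected afterwards with libxml_get_errors().

```php
<?php
// Deliberately malformed markup (illustrative): an unescaped "&" and a stray closing tag.
$html = '<p><span id="minPrice">XXXX<a href="?a=1&b=2">link</a></span></b>';

// Collect parser errors internally instead of emitting PHP warnings.
// The call returns the previous setting, so it can be restored later.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser reported, then empty the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// The document is still usable even if errors were reported.
$xpath = new DOMXPath($dom);
$span = $xpath->query('//span[@id="minPrice"]')->item(0);
echo $span->textContent, "\n";
```

Which messages get reported, if any, depends on the libxml version in use, so the loop above is only there for inspection.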

Then, you can access the data you need with this sample code:

$html = <<<DATA
<p><span id="minPrice">最安価格(税込)：<a href="http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku" target="_blank"><span>&yen;131,649</span></a></span>
DATA;

$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$spans = $xpath->query('//span[@id="minPrice"]');   // Get all spans with ID=minPrice
$a = array();
foreach($spans as $span) { 
    foreach($span->childNodes as $child) {          // Check the child nodes
        if ($child->nodeName == "a") {
            array_push($a, $child->getAttribute("href"));
        }
    }
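    // Take the trailing number from the whole span's text, not from whichever child happened to be last
    array_push($a, preg_replace('~^.*?(\d+(?:,\d+)*)$~us', '$1', trim($span->textContent)));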
}

print_r($a);

Result:

Array
(
    [0] => http://kakaku.com/shop/1115/?pdid=K0000693648&lid=shop_itemview_saiyasukakaku
    [1] => 131,649
)

I used a regex to extract the number at the end of the string, but inside the outer loop you can also split the text on the yen symbol with explode:

$num = explode(html_entity_decode("&yen;"), $span->textContent)[1];
array_push($a, $num);
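
A possible variant, sketched under the assumption that the link and the price always sit at span#minPrice > a and span#minPrice > a > span: XPath can address both values directly, which avoids looping over child nodes. The URL is shortened here for illustration and the variable names are arbitrary.

```php
<?php
// Sample markup modelled on the question; the URL is shortened for illustration.
$html = '<p><span id="minPrice">XXXX<a href="http://kakaku.com/shop/1115/?pdid=K0000693648" target="_blank"><span>&yen;131,649</span></a></span></p>';

libxml_use_internal_errors(true);   // keep the parser quiet on malformed input
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string(...) makes evaluate() return a plain string (empty if nothing matches).
$href      = $xpath->evaluate('string(//span[@id="minPrice"]/a/@href)');
$priceText = $xpath->evaluate('string(//span[@id="minPrice"]/a/span)');

// Keep only digits and thousands separators, dropping the yen sign.
$price = preg_replace('/[^\d,]+/u', '', $priceText);

echo $href, "\n";
echo $price, "\n";
```

If the element is missing, both strings come back empty, so checking for an empty string is enough to detect that case.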


apokryfos · Accepted Answer · 2016-03-18 09:00:06Z

This can be done with regular expressions. The regular expression that gets that exact match is:

$regex = "/<p><span id=\"minPrice\">(.*?)<a href=\"(.*?)\" target=\"_blank\"><span>&yen;(.*)<\/span><\/a>/";
preg_match($regex, $data, $matches);

However, as mentioned in the comments, regex is not an appropriate tool to do this task. This regex will probably fail if the document is long and nests these matchable patterns (i.e. if XXXX is another one of these paragraphs). You should probably see how you can fix this document to make it proper XHTML and then use a proper XML parser. You can mitigate this by running this regex on each line of input (assuming it's split into lines properly), but still, not ideal.
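
For reference, the pattern in the question fails for two reasons: ^ is a start-of-string anchor rather than a wildcard, and the pattern contains no capture group, so $matches[1] is never set. Below is a minimal sketch of a somewhat stricter pattern, assuming the markup appears exactly in the form shown in the question. The group names label, href, and price are illustrative, and the same caveats about regex on HTML apply.

```php
<?php
$data = '<p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>';

// Negated character classes ([^<], [^"], [^>]) stop each group at the next
// delimiter, so a match cannot run across several tags the way .* can.
$regex = '~<span id="minPrice">(?<label>[^<]*)<a href="(?<href>[^"]*)"[^>]*>\s*<span>&yen;(?<price>[^<]*)</span>~';

$result = null;
// preg_match() returns 1 on a match, 0 on no match, false on error.
if (preg_match($regex, $data, $m)) {
    $result = array(
        'label' => $m['label'],
        'href'  => $m['href'],
        'price' => $m['price'],
    );
}

print_r($result);
```

Checking the return value of preg_match() before reading the groups avoids undefined-index notices when the markup does not match.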


Rahul · Accepted Answer · 2016-03-18 09:02:30Z

Use this regexp:

/<p><span.*id=\"minPrice\">(.*)<a.*href="(.*?)".*>.*<span>.*;(.*?)<\/span>.*/

Result:

XXXX
YYYYY
ZZZZZ

edited Mar 18, 2016 at 9:02

answered Mar 18, 2016 at 8:57


Пётр Литвинович · Accepted Answer · 2016-03-18 10:17:04Z

As an alternative, you can use the PHP Simple HTML DOM Parser library (download the library first). Sorry for my bad English! Code:

require_once __DIR__ . '/simple_html_dom.php'; // adjust the path to where you saved the library

// Parse an HTML string (a whole page or just the relevant block)
$html = str_get_html('<p><span id="minPrice">最安価格(税込)<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>');
// Or load the HTML from a file instead:
// $html = file_get_html('example.html');

// Select the links inside the span and take the first one
$link = $html->find('p span#minPrice a', 0);
// Get the URL
$href = $link->href;
// Get the text of the span inside the link
$span_in_link = $link->find('span', 0)->plaintext;
// Remove the <a>...</a> element so that only the label text remains
$link->outertext = '';
// Get the remaining text of span#minPrice
$span_id_minPrice = $html->find('p span#minPrice', 0)->plaintext;
// Strip the &yen; entity from the price
$span_in_link = str_replace('&yen;', '', $span_in_link);

// Result
echo $span_id_minPrice . '<br>'; // 最安価格(税込)
echo $href . '<br>';             // YYYYY
echo $span_in_link . '<br>';     // ZZZZZ

If the document contains more than one such element, use this:

// Select all matching spans
$html = str_get_html('
    <p><span id="minPrice">XXXX<a href="YYYYY" target="_blank"><span>&yen;ZZZZZ</span></a></span>
    <p><span id="minPrice">XXXX2<a href="YYYYY2" target="_blank"><span>&yen;ZZZZZ2</span></a></span>
');
$all_span = $html->find('p span#minPrice');
$data = array();
foreach ($all_span as $element) {
    $array = array();
    // Select the links inside this span and take the first one
    $link = $element->find('a', 0);
    // Get the URL
    $href = $link->href;
    // Get the text inside the link
    $span_in_link = $link->plaintext;
    // Empty the link so that only the label text remains
    $link->innertext = '';
    // Get the remaining text of span#minPrice
    $span_id_minPrice = $element->plaintext;
    // Strip the &yen; entity from the price
    $span_in_link = str_replace('&yen;', '', $span_in_link);

    $array['span#minPrice'] = $span_id_minPrice;
    $array['href'] = $href;
    $array['span_in_link'] = $span_in_link;

    $data[] = $array;
}

echo '<pre>';
print_r($data);

Result:

Array (

[0] => Array
    (
        [span#minPrice] => XXXX 
        [href] => YYYYY
        [span_in_link] => ZZZZZ 
    )

[1] => Array
    (
        [span#minPrice] => XXXX2 
        [href] => YYYYY2
        [span_in_link] => ZZZZZ2 
    )

)
