Guys, I'm working on a script that parses the HTML output fetched from a link with curl.

Here is the HTML DOM parser - http://simplehtmldom.sourceforge.net

Let me show you my parser:

<?PHP
include_once('./simple_html_dom.php');
$url = "http://www.sportsdirect.com/muddyfox-cycling-short-sleeved-jersey-mens-636266?colcode=63626622";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_SSLVERSION, 3);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13');
$str = curl_exec($curl);
curl_close($curl);

$html = str_get_html($str);

$SIZEID = 'UK: 8-13 Kids / EU: 25-32 Kids';
$occurencies = preg_match_all('/(?<=\"SizeName\":\"' . preg_quote($SIZEID, "/") . '")\S+/i', $str, $match);

foreach ($html->find('#ulColourImages li') as $selectnocolor) {
    // A hyphenated attribute name needs the brace syntax; ->colvar-id would be parsed as a subtraction.
    $colvarid = $selectnocolor->{'colvar-id'};
    $tooltiptext = $selectnocolor->tooltiptext;
}

echo "$tooltiptext - $colvarid";

When I fetch the page I need, I get plain text from which I have to extract specific parts.

Here is the complete text: http://pastebin.com/FwK9Z8CP

Let me describe what I need.

The text contains 3 occurrences of the word ColVarId.

After every ColVarId there are several "SellPrice":"PRICEHERE" entries.

For example, the text contains "SellPrice":"£4.49", and this SellPrice entry gives me the price. That is all I want to achieve in the end: get the price contained in a specific "SellPrice":"MYTargetText".

What I want to do, but don't know how:

I want to take all the text after the second occurrence of ColVarId and then, from that extracted text, select for example the third occurrence of SellPrice, which looks like "SellPrice":"£4.49". In this example the price is 4.49, and that is the value I want to get. How can I do that?

I hope I have described my question well and that you understand what I want to achieve.

Thanks in advance.

1
  • 1
    Seems like a json string?! Commented Mar 31, 2015 at 14:10

4 Answers

2

Since this is PHP, how about using json_decode instead? While the regular expressions look reliable, json_decode will be a lot more dependable and will make it much easier to access other properties of the object if you need to in the future.

In the solution below, I use preg_replace to strip out the JavaScript assignment at the beginning of the string. I then decode the remaining JSON so I have the data as an object.

// Strip the leading "var colourVariantsInitialData=" and any trailing semicolon.
$json = preg_replace('/^[^=]+=/', '', $colourVariantsInitialData);
$json = rtrim(trim($json), ';');

$data = json_decode($json);

print_r($data[0]->SizeVariants[0]->ProdSizePrices->SellPrice);
print_r($data[0]->SizeVariants[1]->ProdSizePrices->SellPrice);
print_r($data[0]->SizeVariants[2]->ProdSizePrices->SellPrice);
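
To tie this back to the question (the third SellPrice after the second ColVarId), here is a minimal sketch. The sample string is invented and only mirrors the field names seen in this thread, and the helper name sellPriceAt is hypothetical; the real page data will contain many more fields.

```php
<?php
// Invented sample shaped like the page's JavaScript assignment (not real data).
$str = 'var colourVariantsInitialData=[' .
    '{"ColVarId":"A1","SizeVariants":[{"SizeName":"S","ProdSizePrices":{"SellPrice":"£5.99"}}]},' .
    '{"ColVarId":"B2","SizeVariants":[' .
        '{"SizeName":"S","ProdSizePrices":{"SellPrice":"£6.49"}},' .
        '{"SizeName":"M","ProdSizePrices":{"SellPrice":"£6.99"}},' .
        '{"SizeName":"L","ProdSizePrices":{"SellPrice":"£4.49"}}]}];';

// Hypothetical helper: zero-based colour index and size index, null if anything is missing.
function sellPriceAt(string $js, int $colourIndex, int $sizeIndex): ?string
{
    // Drop "var name=" at the start and the ";" at the end, leaving plain JSON.
    $json = rtrim(trim((string) preg_replace('/^[^=]+=/', '', $js)), ';');
    $data = json_decode($json);
    if (!is_array($data)) {
        return null; // input was not a JSON array
    }
    return $data[$colourIndex]->SizeVariants[$sizeIndex]->ProdSizePrices->SellPrice ?? null;
}

// Second ColVarId (index 1), third SellPrice (index 2).
echo sellPriceAt($str, 1, 2), "\n";
```

With the sample above this prints £4.49. Indexes are zero-based, so "second occurrence" becomes 1 and "third occurrence" becomes 2.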

If you need the numeric value, instead of the currency formatted as in the sample data you can use NumberFormatter to extract the value.

$formatter = new NumberFormatter("en-GB", \NumberFormatter::CURRENCY);
$priceRaw = $data[0]->SizeVariants[0]->ProdSizePrices->SellPrice;

print_r($formatter->parse($priceRaw)); 
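
If the intl extension that provides NumberFormatter is not available, a cruder alternative is to strip everything except digits and the decimal point. This is a different technique from the one above, shown only as a sketch: priceToFloat is a hypothetical helper, and it assumes prices use a dot as the decimal separator and no thousands separator.

```php
<?php
// Hypothetical helper: keep only digits and the decimal point, then cast to float.
// Assumes a dot decimal separator and no thousands separator, as in "£4.49".
function priceToFloat(string $price): float
{
    return (float) preg_replace('/[^0-9.]/', '', $price);
}

var_dump(priceToFloat('£4.49'));
```

Because it discards every non-numeric byte, it also copes with a currency symbol that arrives with a broken encoding.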

Full Gist


4 Comments

Thanks for the advice. But when I do what you suggest I get results like £4.99. How can I remove these symbols and get only the number, like 4.99? Thanks!
I adjusted my answer.
It seems okay, but now I have another problem to resolve before I can use your advice. When I use $colourVariantsInitialData='My text here'; there is no problem and everything is fine, but when I use $colourVariantsInitialData=$str; it gives me no results. Why? It seems that preg_match_all can search $str but json_decode can't. P.S. You can see what $str is in my question. Complete code where I get a blank result: pastebin.com/9WBCDmgJ
Aahh. I didn't look closely enough at your sample text to see that it was a full JavaScript assignment, not just JSON. Well, it's simple enough to strip out the assignment.
1

First, try to avoid simple_html_dom: it is the worst parser ever (the slowest) and not so simple. Take the time to learn how to use DOMDocument and DOMXPath (there are tons of tutorials about XPath 1.0) to do the same kind of job (note that once you learn it for PHP, you can use it in a lot of other languages, since it is implemented everywhere).

The second step consists of extracting the JSON string and building a JSON object.

A general piece of advice: when you have formatted data right under your nose, working with that format is handier than a string approach.

$url = 'http://www.samplehost.com/samplepage.php';

// discard notices and warnings about badly formatted html
libxml_use_internal_errors(true);
$dom = new DOMDocument; 
// or get the file content via curl and use $dom->loadHTML($content);
$dom->loadHTMLFile($url); 

$xp = new DOMXPath($dom);
// '//' means everywhere in the DOM tree, 'script' is the target node,
// and [...] encloses conditions about this node:
// normalize-space is used here to trim leading spaces,
// the dot refers to the current node content
$qry = '//script[starts-with(normalize-space(.), "var colourVariantsInitialData")]';

// an xpath query returns a nodeList, to get the first (and unique here)
// item of the list, you need to use ->item(0)
$rawtxt = $xp->query($qry)->item(0)->nodeValue;

// extraction of the json string and creation of a json object 
$jsonStart = strpos($rawtxt, '[');
$jsonEnd = strrpos($rawtxt, ']');

$collections = json_decode(substr($rawtxt, $jsonStart, $jsonEnd - $jsonStart + 1));

// Then you can easily extract what you want from the json object 
echo "collection id: " . $collections[1]->ColVarId . "\n";

foreach ($collections[1]->SizeVariants as $item) {
    printf("%-30s\t%s\n", $item->SizeName, $item->ProdSizePrices->SellPrice);
}
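
To get a single price for one size name instead of printing every row, a loop with an early return is enough. This is only a sketch: findSellPrice is a hypothetical helper name, and the sample data is invented to mirror the structure used above.

```php
<?php
// Hypothetical helper: return the SellPrice for one SizeName within one colour variant,
// or null if that size is not listed.
function findSellPrice(object $collection, string $sizeName): ?string
{
    foreach ($collection->SizeVariants ?? [] as $item) {
        if (($item->SizeName ?? null) === $sizeName) {
            return $item->ProdSizePrices->SellPrice ?? null;
        }
    }
    return null;
}

// Invented sample standing in for one decoded element of $collections.
$collection = json_decode(
    '{"ColVarId":"X1","SizeVariants":[' .
    '{"SizeName":"6 (39)","ProdSizePrices":{"SellPrice":"£7.99"}},' .
    '{"SizeName":"7 (41)","ProdSizePrices":{"SellPrice":"£8.49"}}]}'
);

var_dump(findSellPrice($collection, '7 (41)'));
```

The comparison is exact, so the size name must match the JSON value character for character, including spaces and parentheses.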

3 Comments

Thank you so much for the detailed description. It seems that this is the way I'll follow to achieve my goal. I have one more question about your answer: how can I print only the price value contained in "SellPrice":"THE_VALUE_I_NEED" for a specific SizeName? For example, how can I take the price after "SizeName":"7 (41)"? Here is the script I have after your answer: pastebin.com/92587qLi. The $url points to a correct URL whose HTML output contains "SizeName":"7 (41)". Thanks once again!
Please check out my new version of the script: pastebin.com/eubaGDGv. Copy and paste it and it should work. But I still don't know how to get only one result for a specific SizeName :)
@TonyStark: display the json object to see how it looks: var_dump($collections);
1

The example you linked to at Pastebin looks like JavaScript, not HTML. Completely different language. You absolutely should not use a regex to parse a data format that is natively supported by PHP.

Ideally it should be parsed in JavaScript. If you must parse it in PHP, then strip off the JavaScript portions (var colourVariantsInitialData= at the beginning, and the semicolon at the end), and slurp the JSON part into a PHP array using json_decode(). For example:

<?php

$s = file_get_contents("http://example.com/path/to/data.json");

// The s modifier lets . match newlines, in case the data spans several lines.
preg_match('/^[^=]+=\s*(.*);\s*$/s', $s, $a);

$output = json_decode($a[1]);

// Now simply go find SellPrice inside ColVarId.
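
json_decode() returns null without any message when its input is not valid JSON (for instance, when it is handed a whole page instead of the bare array), so it can help to fail loudly. A minimal sketch, assuming the input is a single "var name=[...];" assignment; decodeAssignment is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decode a "var name=[...];" assignment into PHP arrays,
// throwing instead of silently returning null on bad input.
function decodeAssignment(string $js): array
{
    // Capture everything between the first "=" and an optional trailing ";".
    if (!preg_match('/^[^=]+=\s*(.*?);?\s*$/s', $js, $m)) {
        throw new InvalidArgumentException('No assignment found');
    }
    // true => associative arrays rather than stdClass objects.
    $data = json_decode($m[1], true);
    if (!is_array($data)) {
        throw new InvalidArgumentException('Expected a JSON array or object: ' . json_last_error_msg());
    }
    return $data;
}

$rows = decodeAssignment('var colourVariantsInitialData=[{"ColVarId":"A1"}];');
echo $rows[0]['ColVarId'], "\n";
```

Passing true as the second argument is what actually yields PHP arrays; without it json_decode() returns objects.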

2 Comments

Please check out my question. I've added my HTML parser.
Ah, so you're scraping HTML for a JSON object inside the page. Can I ask, if you've already attained permission (per their T&C document) from SportsDirect.com Retail Ltd to use their data, can they not provide you with an API that will allow you to access it more directly?
0

DISCLAIMER: This will only work with PHP, and only if you really have to parse it using regex.

Here is your regex that extracts 3 "SellPrice":"" strings:

 ColVarId.*?\K("SellPrice":"[^"]+")

Here is a demo.

Using \K is possible in PHP because it uses the PCRE library. \K drops everything matched up to that point from the reported match, so you receive only the SellPrice details.
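
A short sketch of how the pattern could be used with preg_match_all. The sample string is invented, and the capture group around the price plus the s modifier are additions for illustration, not part of the pattern above.

```php
<?php
// Invented one-line sample; the real text is the JavaScript assignment from the page.
$str = '{"ColVarId":"A1","SellPrice":"£5.99","SellPrice":"£6.49"},'
     . '{"ColVarId":"B2","SellPrice":"£4.49"}';

// \K discards "ColVarId..." from the match; group 1 holds just the price.
// The s modifier lets . cross line breaks in case the real data spans several lines.
preg_match_all('/ColVarId.*?\K"SellPrice":"([^"]+)"/s', $str, $matches);

print_r($matches[1]);
```

Note that this yields one price per ColVarId, namely the first SellPrice that follows it (£5.99 and £4.49 for the sample), so reaching the third SellPrice of a given ColVarId would need a different pattern or the json_decode approach.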
