
I'm learning RegEx and site crawling, and have the following question which, if answered, should speed my learning process up significantly.

I have fetched the form element from a web site in htmlencoded format. That is to say, I have the $content string with all the tags intact, like so:

$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
...
</select>
</form>';

I would like to fetch all the options on the site, in this manner:

array("One town" => "one", "Another town" => "two", "Yet Another town" => "three" ...);

Now, I know this can be done by manipulating the string: slicing and dicing it, searching for substrings within each string, and so on, until I have everything I need. But I'm certain there must be a simpler way of doing it with regex, one that fetches all the results from a given string at once. Can anyone help me find a shortcut for this? I have searched the web's finest regex sites, but to no avail.

Many thanks


5 Answers

6

See Best methods to parse HTML. Find the DOM solution below:

$dom = new DOMDocument;
$dom->loadHTMLFile('http://example.com');
$options = array();
foreach($dom->getElementsByTagName('option') as $option) {
    $options[$option->nodeValue] = $option->getAttribute('value');
}

This can be done with regex too, but I don't find it practical to write a reliable HTML parser with regex when there are plenty of native and third-party parsers readily available for PHP.
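Since the question already has the markup in a string, here is a minimal sketch of the same DOM approach using loadHTML() instead of loadHTMLFile(). The $content value is copied from the question; the trim() call and the libxml_use_internal_errors() line are my additions, not part of the answer above.

```php
<?php
// Assumption: $content holds the form markup shown in the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($content); // parse a string rather than fetching a URL

$options = array();
foreach ($dom->getElementsByTagName('option') as $option) {
    // visible label => value attribute
    $options[trim($option->nodeValue)] = $option->getAttribute('value');
}

print_r($options);
```

Note that two options with the same label would overwrite each other, because the label is used as the array key.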

1 Comment

While the above method did not work as well as I expected it to, using Zend_Dom suggested in the post you linked was the way to go, since I build projects in ZF anyway. Excellent, thank you very much!
0

I think it would be easier to use DOMXPath rather than regular expressions for this. You could try something like this (not tested, so it might need some tweaks):

<?php
$content = '<form name="sth" action="">
            <select name="city">
            <option value="one">One town</option>
            <option value="two">Another town</option>
            <option value="three">Yet Another town</option>
            </select>
            </form>';

$doc = new DOMDocument;
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);
// Select every <option> element anywhere under <body>.
$options = $xpath->evaluate("/html/body//option");
$values = array(); // initialise so var_dump() works even with no matches
for ($i = 0; $i < $options->length; $i++) {
    $option = $options->item($i);
    $values[] = $option->getAttribute('value');
}
var_dump($values);
?>
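The snippet above collects only the value attributes. As a rough sketch of how the same XPath approach could produce the label => value array the question asks for, the query below is narrowed to the select named "city"; that narrowing and the $cities variable name are my assumptions, not part of the answer.

```php
<?php
// Assumption: $content holds the form markup shown in the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

libxml_use_internal_errors(true); // collect parser warnings quietly

$doc = new DOMDocument;
$doc->loadHTML($content);
$xpath = new DOMXPath($doc);

$cities = array();
// Only the options that belong to <select name="city">.
foreach ($xpath->query('//select[@name="city"]/option') as $option) {
    $cities[trim($option->textContent)] = $option->getAttribute('value');
}

var_dump($cities);
```

Restricting the query to one select keeps options from other drop-downs on the same page out of the result.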


0
<?php

$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

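// [^"]* and [^<]* stop at the closing quote and the next tag, so a match cannot run past one option.
preg_match_all('@<option value="([^"]*)">([^<]*)</option>@', $content, $matches);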

echo "<pre>";
print_r($matches);
?>

Now $matches contains the arrays you are looking for ($matches[1] holds the values and $matches[2] holds the labels), and you can easily combine them into the result you want.
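One possible way to do that last step is array_combine(), sketched below. The pattern here uses [^"]* and [^<]* rather than .*, which is my substitution to keep each match inside a single option; the $result name is also just illustrative.

```php
<?php
// Assumption: $content holds the form markup shown in the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

preg_match_all('@<option value="([^"]*)">([^<]*)</option>@', $content, $matches);

// With the default PREG_PATTERN_ORDER flag:
//   $matches[1] = all captured values, $matches[2] = all captured labels.
// array_combine(keys, values) pairs them up as label => value.
$result = array_combine($matches[2], $matches[1]);

print_r($result);
```

This still only works while every option tag has exactly the shape the pattern expects.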

2 Comments

Using regex is not advisable. Above code fails with <option selected="selected" value="xyz">hello, world</option>
Not advisable - Yes, but I thought from Swader's post that he wants a regex example.
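To illustrate the failure mentioned in the first comment, here is a small sketch that runs the answer's pattern against that exact option tag, next to a more tolerant pattern that allows other attributes around value. The tolerant pattern is my own assumption of what a fix might look like; it still breaks on single-quoted or unquoted attributes, which is why a parser remains the safer choice.

```php
<?php
// The input from the comment: an extra attribute before "value".
$html = '<option selected="selected" value="xyz">hello, world</option>';

// Pattern from the answer: expects "value" directly after "<option ".
$strict = preg_match_all('@<option value="(.*)">(.*)</option>@', $html, $a);

// Hypothetical, more tolerant pattern: any attributes may appear
// before or after value="...", as long as value is double-quoted.
$tolerant = preg_match_all(
    '@<option\b[^>]*\svalue="([^"]*)"[^>]*>([^<]*)</option>@i',
    $html,
    $b
);

var_dump($strict);   // number of matches for the strict pattern
var_dump($tolerant); // number of matches for the tolerant pattern
print_r($b);
```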
0

Using SimpleXML:

libxml_use_internal_errors(true);
$load = simplexml_load_string($content);
foreach ($load->xpath('//select/option') as $path)
    var_dump((string)$path[0]);
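The loop above prints only the labels. A minimal sketch of building the label => value array with SimpleXML could look like this; it assumes $content is well-formed XML (as the sample in the question is), which real-world HTML often is not. The $options name and the false check are my additions.

```php
<?php
// Assumption: $content holds the form markup shown in the question.
$content = '<form name="sth" action="">
<select name="city">
<option value="one">One town</option>
<option value="two">Another town</option>
<option value="three">Yet Another town</option>
</select>
</form>';

libxml_use_internal_errors(true);
$load = simplexml_load_string($content);

$options = array();
if ($load !== false) { // simplexml_load_string() returns false on malformed XML
    foreach ($load->xpath('//select/option') as $option) {
        // (string) $option is the element text; $option['value'] is the attribute.
        $options[(string) $option] = (string) $option['value'];
    }
}

print_r($options);
```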


0

If the HTML is really consistent, then a simple regex will do:

 preg_match_all('/<option\s+value="([^">]+)">([^<]+)/i', ...

However it's often simpler and more reliable to use phpQuery or QueryPath.

 $options = qp($html)->find("select[name=city]")->find("option");
 foreach ($options as $o) {
      $result[ $o->attr("value") ] = $o->text();
 }
