I'm using regular expressions to extract data from a website, but I have run into a problem.

This is part of the original HTML that I want to parse. From each link I want to extract two things: the text after "descuentos-" in the URL, and the city name, which is the link text of the <a> element.

<div id="cities2_2">
  <a href = "http://website.com/descuentos-espana/">Badajoz</a>
  <a href = "http://website.com/descuentos-espana/">Badalona</a>
  <a href = "http://website.com/descuentos-barcelona/">Barcelona</a>
  <a href = "http://website.com/descuentos-bilbao/">Bilbao</a>
  <a href = "http://website.com/descuentos-espana/">Burgos</a>
</div>
</div>
<div class="capa_cities" onmouseover="act_formato(3, 2);"
     onmouseout="desact_formato(3, 2);">
<h2 id="title_city3_2">C</h2>
<div id="cities3_2">
  <a href = "http://website.com/descuentos-espana/">Cáceres</a>
  <a href = "http://website.com/descuentos-cadiz/">Cádiz</a>
  <a href = "http://website.com/descuentos-espana/">Cartagena</a>
  <a href = "http://website.com/descuentos-espana/">Castellón</a>
  <a href = "http://website.com/descuentos-espana/">Ceuta</a>
  <a href = "http://website.com/descuentos-espana/">Ciudad Real</a>
  <a href = "http://website.com/descuentos-cordoba/">Córdoba</a>
  <a href = "http://website.com/descuentos-espana/">Cuenca</a>

I could look for <a href = "http://website.com/descuentos-(.*)">, but other links elsewhere on the page match that pattern too. So I now have this pattern:

#<div id="cities[0-9]+_2">(<a href = "http://website.com/descuentos-(.*?)/">(.*?)</a>)*#

I'd like it to be recursive. That is, for each "<a href = "http://website.com/descuentos-(.*)/">(.*)</a>" found inside the div, capture the two small patterns inside it.

Is there a way to achieve this in a single regex, or do I have to reprocess the result with preg_match_all?

  • Did you consider using an html parser? Commented Sep 24, 2013 at 9:12
  • Don't use regex to parse HTML. Use an HTML parser. Commented Sep 24, 2013 at 9:15
  • For the other, simpler chunks, which are most of what I'll be extracting, I thought it would be overkill. For this one in particular it makes sense, but I'd prefer to use the same approach everywhere. Commented Sep 24, 2013 at 9:15
  • What would you like to search for, after you get the matches? Commented Sep 24, 2013 at 9:19
  • @mavrosxristoforos I don't understand what you are asking for. I need the small patterns only. Commented Sep 24, 2013 at 9:24

1 Answer

Option 1, the quick way: yes, use preg_match_all().

// Group 1 captures the text after "descuentos-", group 2 the city (the link text)
preg_match_all('#<a href = "http://website\.com/descuentos-(.*?)/">(.*?)</a>#', $str, $matches);

echo "<pre>";
print_r($matches);
echo "</pre>";

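returns (as seen in a browser, where the full <a> matches stored in [0] render as links, so only their text is visible):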

Array
(
    [0] => Array
        (
            [0] => Badajoz
            [1] => Badalona
            [2] => Barcelona
            [3] => Bilbao
            [4] => Burgos
            [5] => Cáceres
            [6] => Cádiz
            [7] => Cartagena
            [8] => Castellón
            [9] => Ceuta
            [10] => Ciudad Real
            [11] => Córdoba
            [12] => Cuenca
        )

    [1] => Array
        (
            [0] => espana
            [1] => espana
            [2] => barcelona
            [3] => bilbao
            [4] => espana
            [5] => espana
            [6] => cadiz
            [7] => espana
            [8] => espana
            [9] => espana
            [10] => espana
            [11] => cordoba
            [12] => espana
        )

    [2] => Array
        (
            [0] => Badajoz
            [1] => Badalona
            [2] => Barcelona
            [3] => Bilbao
            [4] => Burgos
            [5] => Cáceres
            [6] => Cádiz
            [7] => Cartagena
            [8] => Castellón
            [9] => Ceuta
            [10] => Ciudad Real
            [11] => Córdoba
            [12] => Cuenca
        )

)

Time elapsed: 0.000104904174805 
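
If the plain <a href> pattern also matches links elsewhere on the page, one possible way to scope it is a two-pass approach: first capture each <div id="citiesN_2"> block, then run preg_match_all() only inside those blocks. The following is a minimal sketch, assuming the city blocks contain no nested <div>; the function name extractCities() and the sample $html are made up for illustration.

```php
<?php
// Hypothetical helper: returns a list of [slug, city] pairs found inside
// <div id="citiesN_2"> blocks only. Assumes those blocks contain no nested <div>.
function extractCities(string $html): array
{
    $result = [];

    // Pass 1: grab the content of every cities block ("s" lets "." span newlines).
    preg_match_all('#<div id="cities[0-9]+_2">(.*?)</div>#s', $html, $blocks);

    foreach ($blocks[1] as $block) {
        // Pass 2: extract the slug and the link text from each anchor in the block.
        preg_match_all(
            '#<a href = "http://website\.com/descuentos-([^/"]+)/">([^<]*)</a>#',
            $block,
            $links,
            PREG_SET_ORDER
        );
        foreach ($links as $link) {
            $result[] = [$link[1], $link[2]];
        }
    }

    return $result;
}

// Sample input: one link outside the cities block, two inside it.
$html = <<<HTML
<a href = "http://website.com/descuentos-otros/">Not a city</a>
<div id="cities2_2">
  <a href = "http://website.com/descuentos-espana/">Badajoz</a>
  <a href = "http://website.com/descuentos-barcelona/">Barcelona</a>
</div>
HTML;

print_r(extractCities($html));
```

PREG_SET_ORDER groups the captures per match, so each $link holds the slug and the city of one anchor together, which avoids pairing up two parallel arrays afterwards.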

Option 2: DOM parser ($str is your text):

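$dom = new DOMDocument();
// The fragment is not a complete document; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($str);

$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');

    // Skip links that do not point to a "descuentos-" page
    if (!preg_match('#descuentos-([^/]+)/#', $href, $match)) {
        continue;
    }
    echo $href." ### "; // prints the href
    echo $link->nodeValue." - ".$match[1]."<br/>";
}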

Output (send a UTF-8 charset header so the accented characters display correctly):

http://website.com/descuentos-espana/ ### Badajoz - espana
http://website.com/descuentos-espana/ ### Badalona - espana
http://website.com/descuentos-barcelona/ ### Barcelona - barcelona
http://website.com/descuentos-bilbao/ ### Bilbao - bilbao
http://website.com/descuentos-espana/ ### Burgos - espana
http://website.com/descuentos-espana/ ### Cáceres - espana
http://website.com/descuentos-cadiz/ ### Cádiz - cadiz
http://website.com/descuentos-espana/ ### Cartagena - espana
http://website.com/descuentos-espana/ ### Castellón - espana
http://website.com/descuentos-espana/ ### Ceuta - espana
http://website.com/descuentos-espana/ ### Ciudad Real - espana
http://website.com/descuentos-cordoba/ ### Córdoba - cordoba
http://website.com/descuentos-espana/ ### Cuenca - espana
Time elapsed: 0.000319004058838 

2 Comments

As for the regex, I cannot use the plain "a href" form because other links on the page match it that I don't need. That's why I have to search for the parent tag and do a bit of recursion. As for the parser, I'll give it a try.
Then that is one more reason to use the DOM parser. If all those tags are inside an element with a specific class or id, you can target that element easily with getElementById(), or with a DOMXPath query for a class (DOMDocument has no getElementsByClassName()), then get its children and loop through them, ignoring the parts of the HTML you don't need.
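
As a rough illustration of that idea, the sketch below uses DOMXPath to select only the links inside <div> elements whose id starts with "cities". It is one possible way to do it, not the only one: the function name citiesFromDom() and the sample $html are made up for illustration, and character-encoding handling is left aside.

```php
<?php
// Hypothetical helper: collects [slug, city] pairs from links inside
// <div> elements whose id attribute starts with "cities".
function citiesFromDom(string $html): array
{
    $dom = new DOMDocument();
    // Collect parser warnings about imperfect markup instead of printing them.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $result = [];

    // Only anchors that are descendants of a matching div are visited.
    foreach ($xpath->query('//div[starts-with(@id, "cities")]//a') as $link) {
        if (preg_match('#descuentos-([^/]+)/#', $link->getAttribute('href'), $m)) {
            $result[] = [$m[1], trim($link->nodeValue)];
        }
    }

    return $result;
}

// Sample input: one link outside the cities div, one inside it.
$html = '<a href = "http://website.com/descuentos-otros/">Other</a>'
      . '<div id="cities2_2">'
      . '<a href = "http://website.com/descuentos-bilbao/">Bilbao</a>'
      . '</div>';

print_r(citiesFromDom($html));
```

Because the XPath expression already restricts the search to the city blocks, the regex is only needed to pull the slug out of each href.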
