Search for multiple patterns inside a pattern

Question

I'm using regular expressions to extract data from a website, and I've run into a problem.

This is part of the original HTML that I want to parse. I want to extract the text after "descuentos-" in each URL, and the city name inside each <a> element.

<div id="cities2_2">
  <a href = "http://website.com/descuentos-espana/">Badajoz</a>
  <a href = "http://website.com/descuentos-espana/">Badalona</a>
  <a href = "http://website.com/descuentos-barcelona/">Barcelona</a>
  <a href = "http://website.com/descuentos-bilbao/">Bilbao</a>
  <a href = "http://website.com/descuentos-espana/">Burgos</a>
</div>
</div>
<div class="capa_cities" onmouseover="act_formato(3, 2);"
     onmouseout="desact_formato(3, 2);">
<h2 id="title_city3_2">C</h2>
<div id="cities3_2">
  <a href = "http://website.com/descuentos-espana/">Cáceres</a>
  <a href = "http://website.com/descuentos-cadiz/">Cádiz</a>
  <a href = "http://website.com/descuentos-espana/">Cartagena</a>
  <a href = "http://website.com/descuentos-espana/">Castellón</a>
  <a href = "http://website.com/descuentos-espana/">Ceuta</a>
  <a href = "http://website.com/descuentos-espana/">Ciudad Real</a>
  <a href = "http://website.com/descuentos-cordoba/">Córdoba</a>
  <a href = "http://website.com/descuentos-espana/">Cuenca</a>

I could look for <a href = "http://website.com/descuentos-(.*)">, but other links elsewhere on the site match that pattern too. So I now have this pattern:

#<div id="cities[0-9]+_2">(<a href = "http://website.com/descuentos-(.*?)/">(.*?)</a>)*#

I'd like it to work recursively. I mean: for each "<a href = "http://website.com/descuentos-(.*)/">(.*)</a>" found, capture the two small patterns inside.

Is there a way to achieve this in regex, or do I have to reprocess it through preg_match_all?

For the other, simpler chunks, which are most of what I'll be extracting, I thought it would be overkill. For this one in particular it makes sense, but I'd prefer to use the same approach everywhere.
– markmb, Commented Sep 24, 2013 at 9:15
What would you like to search for, after you get the matches?
– mavrosxristoforos, Commented Sep 24, 2013 at 9:19
@mavrosxristoforos I don't understand what you are asking. I only need the two small patterns.
– markmb, Commented Sep 24, 2013 at 9:24

aleation · Accepted Answer · 2013-09-24 09:44:48Z


Option 1 (quick way): yes, use preg_match_all()

preg_match_all('#<a href = "http://website.com/descuentos-(.*?)/">(.*?)</a>#', $str, $matches);

echo "<pre>";
print_r($matches);
echo "</pre>";

returns:

Array
(
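    [0] => Array
        (
            [0] => <a href = "http://website.com/descuentos-espana/">Badajoz</a>
            [1] => <a href = "http://website.com/descuentos-espana/">Badalona</a>
            [2] => <a href = "http://website.com/descuentos-barcelona/">Barcelona</a>
            [3] => <a href = "http://website.com/descuentos-bilbao/">Bilbao</a>
            [4] => <a href = "http://website.com/descuentos-espana/">Burgos</a>
            [5] => <a href = "http://website.com/descuentos-espana/">Cáceres</a>
            [6] => <a href = "http://website.com/descuentos-cadiz/">Cádiz</a>
            [7] => <a href = "http://website.com/descuentos-espana/">Cartagena</a>
            [8] => <a href = "http://website.com/descuentos-espana/">Castellón</a>
            [9] => <a href = "http://website.com/descuentos-espana/">Ceuta</a>
            [10] => <a href = "http://website.com/descuentos-espana/">Ciudad Real</a>
            [11] => <a href = "http://website.com/descuentos-cordoba/">Córdoba</a>
            [12] => <a href = "http://website.com/descuentos-espana/">Cuenca</a>
        )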

    [1] => Array
        (
            [0] => espana
            [1] => espana
            [2] => barcelona
            [3] => bilbao
            [4] => espana
            [5] => espana
            [6] => cadiz
            [7] => espana
            [8] => espana
            [9] => espana
            [10] => espana
            [11] => cordoba
            [12] => espana
        )

    [2] => Array
        (
            [0] => Badajoz
            [1] => Badalona
            [2] => Barcelona
            [3] => Bilbao
            [4] => Burgos
            [5] => Cáceres
            [6] => Cádiz
            [7] => Cartagena
            [8] => Castellón
            [9] => Ceuta
            [10] => Ciudad Real
            [11] => Córdoba
            [12] => Cuenca
        )

)

Time elapsed: 0.000104904174805
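
A note on the pattern from the question: a repeated capture group such as (...)* keeps only its last iteration, so a single pattern cannot hand back every link inside the div. One way around that is two passes: first grab each cities block, then run preg_match_all() inside it. The following is only a minimal sketch, assuming the cities blocks contain no nested <div>. The helper name extractCities and the shortened sample input are made up for illustration.

```php
<?php
// Shortened sample input based on the question's HTML (the first link is a
// made-up example of an unwanted match outside the cities blocks).
$str = <<<HTML
<a href = "http://website.com/descuentos-madrid/">Not a city link</a>
<div id="cities2_2">
  <a href = "http://website.com/descuentos-espana/">Badajoz</a>
  <a href = "http://website.com/descuentos-barcelona/">Barcelona</a>
</div>
<div id="cities3_2">
  <a href = "http://website.com/descuentos-cadiz/">Cádiz</a>
</div>
HTML;

/**
 * Hypothetical helper: returns [slug, city] pairs found inside cities divs.
 */
function extractCities(string $html): array
{
    $pairs = [];

    // Pass 1: capture the body of each <div id="citiesN_2"> block.
    // The "s" flag lets "." span newlines; assumes no nested <div> inside.
    preg_match_all('#<div id="cities[0-9]+_2">(.*?)</div>#s', $html, $blocks);

    foreach ($blocks[1] as $block) {
        // Pass 2: PREG_SET_ORDER groups both captures per matched link.
        preg_match_all(
            '#<a href = "http://website\.com/descuentos-([^/"]+)/">([^<]*)</a>#',
            $block,
            $links,
            PREG_SET_ORDER
        );
        foreach ($links as $link) {
            $pairs[] = [$link[1], $link[2]];
        }
    }

    return $pairs;
}

foreach (extractCities($str) as [$slug, $city]) {
    echo $city . ' - ' . $slug . PHP_EOL;
}
```

With PREG_SET_ORDER each element of $links holds one link's full match plus its two captures, which avoids stitching the parallel arrays of the default output back together.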

Option 2: DOM parser ($str is your text):

$dom = new DOMDocument();
$dom->loadHTML($str);

// Collect every <a> element in the document
$links = $dom->getElementsByTagName('a');

foreach ($links as $link) {
    $href = $link->getAttribute('href');

    echo $href . " ### "; // prints the href
    // Capture the part of the URL between "descuentos-" and the trailing slash
    preg_match('#descuentos-(.*)/#', $href, $match);
    echo $link->nodeValue . " - " . $match[1] . "<br/>";
}

Output (add the UTF-8 headers to see the accented characters correctly):

http://website.com/descuentos-espana/ ### Badajoz - espana
http://website.com/descuentos-espana/ ### Badalona - espana
http://website.com/descuentos-barcelona/ ### Barcelona - barcelona
http://website.com/descuentos-bilbao/ ### Bilbao - bilbao
http://website.com/descuentos-espana/ ### Burgos - espana
http://website.com/descuentos-espana/ ### CÃ¡ceres - espana
http://website.com/descuentos-cadiz/ ### CÃ¡diz - cadiz
http://website.com/descuentos-espana/ ### Cartagena - espana
http://website.com/descuentos-espana/ ### CastellÃ³n - espana
http://website.com/descuentos-espana/ ### Ceuta - espana
http://website.com/descuentos-espana/ ### Ciudad Real - espana
http://website.com/descuentos-cordoba/ ### CÃ³rdoba - cordoba
http://website.com/descuentos-espana/ ### Cuenca - espana
Time elapsed: 0.000319004058838
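
If the accents still come out mangled as in the output above, one possible cause is that loadHTML() was not told the input encoding and fell back to a single-byte default. A common workaround is to declare the charset in the markup before parsing. This is only a sketch, assuming $str is UTF-8. The sample string and the $cities variable are illustrative, not part of the original code.

```php
<?php
// Illustrative UTF-8 input; in practice $str would be the downloaded page.
$str = '<div id="cities3_2">'
     . '<a href = "http://website.com/descuentos-cadiz/">Cádiz</a>'
     . '</div>';

$dom = new DOMDocument();

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Prepend a charset declaration so the parser reads the bytes as UTF-8
// (assumption: the input really is UTF-8).
$dom->loadHTML(
    '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $str
);
libxml_clear_errors();

$cities = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $cities[] = $link->nodeValue;
}

echo implode(', ', $cities) . PHP_EOL;
```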

edited Sep 24, 2013 at 9:44

answered Sep 24, 2013 at 9:38


2 Comments

markmb Over a year ago

As for the regex, I can't use the plain "a href" form because it also matches other links that I don't need; that's why I have to search for the parent tag and do a bit of recursion. As for the parser, I'll give it a try.

aleation Over a year ago

Then that's one more reason to use the DOM parser. If all those tags are inside an element with a specific class or id, you can easily target that element with getElementById(), or with an XPath query when matching by class, then get its children and loop through them, ignoring the pieces of HTML you don't want from the document.
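
A rough sketch of that idea, using DOMXPath to select only the links that sit directly inside a div whose id starts with "cities". The XPath expression, the sample input and the $result variable are assumptions for illustration; the real page may need a different selector.

```php
<?php
// Shortened sample input; the first link stands in for the unwanted matches.
$str = <<<HTML
<a href = "http://website.com/descuentos-madrid/">Unrelated link</a>
<div id="cities2_2">
  <a href = "http://website.com/descuentos-bilbao/">Bilbao</a>
  <a href = "http://website.com/descuentos-espana/">Burgos</a>
</div>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // tolerate imperfect markup
$dom->loadHTML($str);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = [];

// Only <a> elements that are direct children of a div whose id starts
// with "cities"; links elsewhere in the document are skipped.
foreach ($xpath->query('//div[starts-with(@id, "cities")]/a') as $link) {
    $href = $link->getAttribute('href');
    // Capture the URL part between "descuentos-" and the next slash.
    if (preg_match('#descuentos-([^/]+)/#', $href, $match)) {
        $result[$link->nodeValue] = $match[1];
    }
}

print_r($result);
```

Keying $result by city name works here because each city appears once; a list of pairs would be safer if names can repeat.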
