Extract text between multilevel repetitive XML tags using PHP

Question

I am trying to extract text between multilevel XML tags.
This is the data file:
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
. . .
</TranslationStack>
</eSearchResult>
I just want to extract the ten IDs between the <Id></Id> tags enclosed inside <IdList></IdList>. My regex gets me just the first value out of the ten:
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids)
The XML data is stored in the $temp_str variable, and I am trying to get the values into $pids. Any other suggestions for how to go about this?

Can you add the php code that has the regex?

– mrk, Commented Jun 3, 2014 at 18:43
@mrk added in the post above.

– Vignesh, Commented Jun 3, 2014 at 18:45

mrk · Accepted Answer · 2014-06-04 00:18:20Z

1

Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches the digits within an <Id> tag. The trickiest part (I think) is the foreach loop, where I iterate over $out[1]. This is because, according to the manual page linked above, PREG_PATTERN_ORDER:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

<?php
// Capture the digits inside each <Id> element; \s* tolerates surrounding whitespace.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
   "<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
   $out, PREG_PATTERN_ORDER);
// $out[1] holds the text captured by the first parenthesized group.
foreach ($out[1] as $o) {
    echo $o, "\n";
}
?>
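
To make the ordering quoted above concrete, here is a minimal sketch comparing PREG_PATTERN_ORDER with PREG_SET_ORDER. The two-element sample string and the variable names $byPattern and $bySet are assumptions chosen for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical two-element sample, not the full PubMed response.
$xml = '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>';
$pattern = '/<Id>\s*(\d+)\s*<\/Id>/';

// PREG_PATTERN_ORDER (the default): results are grouped by subpattern.
// $byPattern[0] holds the full matches, $byPattern[1] the captured digits.
preg_match_all($pattern, $xml, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: results are grouped by match instead.
// $bySet[0] is [full match, captured digits] for the first <Id>.
preg_match_all($pattern, $xml, $bySet, PREG_SET_ORDER);

print_r($byPattern[1]);
print_r($bySet[0]);
```

With PREG_PATTERN_ORDER the IDs can be read straight from index 1, which is why the loop in the answer iterates over $out[1].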

edited Jun 4, 2014 at 0:18

answered Jun 3, 2014 at 19:01

mrk

3 Comments

Vignesh Over a year ago

If I want to search for some other tag that contains alphanumeric, multi-line sentences, how would I tweak this regex?

mrk Over a year ago

Assuming WebEnv looks like this: <WebEnv> NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995 NCID_2_222222 </WebEnv> (with the values on separate lines), then the regex '/<WebEnv>\s*([0-9A-Za-z\.\_\n]+)\s*<\/WebEnv>/s' will do the trick. The character class includes \n, which is what lets the match span lines; the trailing '/s' ("single line") modifier only changes how an unescaped . behaves, so it matters if you switch to a pattern such as (.+?). Read more about regex modifiers here: php.net/manual/en/reference.pcre.pattern.modifiers.php

ridgerunner Over a year ago

No need for the U modifier - it's just slowing down the match and confusing things. That, and it's never needed. In other words: never use the U ungreedy flag!
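
For the multi-line case raised in the comments above, one possible sketch uses a lazy quantifier together with the s modifier instead of the U flag. The sample string is a shortened, hypothetical WebEnv value, and $webEnv is an assumed variable name.

```php
<?php
// Hypothetical multi-line sample modeled on the WebEnv element in the question.
$xml = "<WebEnv>\nNCID_1_457044331\nNCID_2_222222\n</WebEnv>";

// (.*?) is lazy, so it stops at the first closing tag; no U flag is involved.
// The s modifier (PCRE_DOTALL) lets . match newlines as well.
if (preg_match('~<WebEnv>\s*(.*?)\s*</WebEnv>~s', $xml, $m)) {
    $webEnv = $m[1];
    echo $webEnv, "\n";
}
```

The \s* on either side of the group trims the whitespace next to the tags, so the captured text starts and ends on a non-whitespace character.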

S. Dixon · Accepted Answer · 2014-06-03 19:04:25Z

1

You should use PHP's XPath capabilities, as explained here:

http://www.w3schools.com/php/func_simplexml_xpath.asp

Example:

<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>

XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.

edited Jun 3, 2014 at 19:04

answered Jun 3, 2014 at 18:46

S. Dixon

2 Comments

Vignesh Over a year ago

The data I have is in a file pointer. It's retrieved from an online search on the PubMed website. All my XML data is in a variable. Is there a way I can use a variable with the XPath functionality? @stephen

S. Dixon Over a year ago

@Vignesh as a disclaimer, I don't have PHP set up on this machine, so I won't be able to fully test these code snippets. According to the documentation for simplexml_load_file, you should be able to pass in a URL to the XML file. Please look here: php.net/manual/en/function.simplexml-load-file.php
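
If the XML is already held in a variable, as the comment above describes, a minimal sketch could use simplexml_load_string instead of simplexml_load_file. The variable name $xmlString and the shortened sample document are assumptions for illustration.

```php
<?php
// $xmlString stands in for the variable holding the downloaded response (assumed name).
$xmlString = '<eSearchResult><Count>2</Count>'
    . '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>'
    . '</eSearchResult>';

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    exit("Could not parse the XML\n");
}

// xpath() returns SimpleXMLElement objects; cast each one to get its text.
$ids = [];
foreach ($xml->xpath('IdList/Id') as $node) {
    $ids[] = (string) $node;
}
print_r($ids);
```

The XPath expression is the same one used in the answer; only the loading step changes.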

alpha bravo · Accepted Answer · 2014-06-03 18:49:21Z

0

Use this pattern with the g (global) option: (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\>

answered Jun 3, 2014 at 18:49

alpha bravo

1 Comment

Vignesh Over a year ago

$pattern = "(?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\>"; preg_match_all($pattern, $string, $matches); This gives me an "unexpected ?" error.
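
A likely cause of the error above is that the pattern string has no delimiters, which PHP's preg functions require. The sketch below wraps the same pattern in ~ delimiters (an arbitrary choice) and drops the g option, since preg_match_all already finds every match. The sample string is a shortened, hypothetical version of the question's data.

```php
<?php
// Shortened sample in the same shape as the question's data.
$string = '<eSearchResult><IdList> <Id>24887359</Id> <Id>24884828</Id> </IdList></eSearchResult>';

// ~ ... ~ are the delimiters; single quotes keep the backslashes intact.
// \G anchors each later match to where the previous one ended,
// so only a run of <Id> elements directly following <IdList> is collected.
$pattern = '~(?:<IdList>|\G)\s*<Id>(\d+)</Id>~';
preg_match_all($pattern, $string, $matches);
print_r($matches[1]);
```

The captured IDs end up in $matches[1], following the default PREG_PATTERN_ORDER layout.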

ThW · Accepted Answer · 2014-06-03 21:38:39Z

0

Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath to fetch parts of an XML DOM.

If you want any Id element in the first IdList of the eSearchResult:

/eSearchResult/IdList[1]/Id

As you can see, XPath "knows" about the actual structure of an XML document. PCRE does not.

You need to create an XPath object for a DOM document:

$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);

$result = [];
// Collect the trimmed text of every Id in the first IdList.
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
  $result[] = trim($id->nodeValue);
}
var_dump($result);
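
To illustrate the point about structure, here is a small sketch contrasting a plain regex with the XPath expression from this answer. The <Other> element and its Id are invented for the example and do not appear in the PubMed data.

```php
<?php
// Hypothetical document: the <Other> element is invented for illustration.
$xmlString = '<eSearchResult>'
    . '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>'
    . '<Other><Id>999</Id></Other>'
    . '</eSearchResult>';

// The regex sees only text, so it also picks up the <Id> inside <Other>.
preg_match_all('~<Id>(\d+)</Id>~', $xmlString, $m);
$regexIds = $m[1];

// The XPath expression follows the element hierarchy and skips it.
$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);
$xpathIds = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
    $xpathIds[] = trim($id->nodeValue);
}

var_dump($regexIds, $xpathIds);
```

The regex result contains three values, while the XPath result contains only the two that sit inside IdList.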

answered Jun 3, 2014 at 21:38

ThW
