I am trying to extract text from nested XML tags.
This is the data file:
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
.
.
.
</TranslationStack>
</eSearchResult>
I just want to extract the ten IDs in the <Id></Id> tags enclosed inside <IdList></IdList>.
The regex below gets me only the first of the ten values.
preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids);
The XML data is stored in the $temp_str variable, and I am trying to get the values into $pids.
Are there any other suggestions for going about this?
-
Can you add the PHP code that has the regex? – mrk, Jun 3, 2014 at 18:43
-
@mrk added in the post above. – Vignesh, Jun 3, 2014 at 18:45
4 Answers
Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches the digits within an <Id> tag. The trickiest part, I think, is the foreach loop, which iterates over $out[1]. That is because, according to the manual page above, PREG_PATTERN_ORDER:
Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.
<?php
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
"<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
$out,PREG_PATTERN_ORDER);
foreach ($out[1] as $o){
echo $o;
echo "\n";
}
?>
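To make the layout of the match array concrete, here is a minimal sketch on a shortened, hypothetical input (three of the Ids from the question). It contrasts the default PREG_PATTERN_ORDER with PREG_SET_ORDER. Reading the wrong index is one plausible way to end up with only the first value, though that is a guess about the original problem. The variable names are illustrative only.

```php
<?php
// Hypothetical, shortened input: three of the Ids from the question.
$xml = "<IdList>\n<Id>24887359</Id>\n<Id>24884828</Id>\n<Id>24884718</Id>\n</IdList>";

// PREG_PATTERN_ORDER (the default): results are grouped by subpattern.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/', $xml, $byPattern, PREG_PATTERN_ORDER);
// $byPattern[0] holds the full matches, $byPattern[1] the captured digits.
$ids = $byPattern[1];

// PREG_SET_ORDER: results are grouped by match instead.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/', $xml, $bySet, PREG_SET_ORDER);
// $bySet[0] is [full match, captured digits] for the first <Id> only.
$firstId = $bySet[0][1];

echo implode(",", $ids), "\n"; // 24887359,24884828,24884718
echo $firstId, "\n";           // 24887359
```

With PREG_PATTERN_ORDER, looping over index 1 yields every captured Id; with PREG_SET_ORDER, index 0 is only the first match.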
3 Comments
<WebEnv> NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995 NCID_2_222222 </WebEnv>, then the regex, '/<WebEnv>\s*([0-9A-Za-z\.\_\n]+)\s*<\/WebEnv>/s' will do the trick. The key is the ending '/s' modifier to switch to "single line" mode. Read more about regex modifiers here: php.net/manual/en/reference.pcre.pattern.modifiers.php
U modifier - it's just slowing down the match and confusing things. That, and it's never needed. In other words: never use the U ungreedy flag!
You should use PHP's XPath capabilities, as explained here:
http://www.w3schools.com/php/func_simplexml_xpath.asp
Example:
<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?>
XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
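As a rough illustration of such conditional queries, the sketch below applies two XPath predicates through SimpleXML. It assumes a shortened copy of the question's data loaded from a string rather than from searchdata.xml, and the threshold 24884500 is an arbitrary value chosen for the example.

```php
<?php
// Shortened, hypothetical copy of the data from the question.
$xmlString = <<<XML
<eSearchResult>
  <Count>7117</Count>
  <IdList>
    <Id>24887359</Id>
    <Id>24884828</Id>
    <Id>24884718</Id>
    <Id>24884479</Id>
  </IdList>
</eSearchResult>
XML;

$xml = simplexml_load_string($xmlString);

// Positional predicate: only the first two Id elements of IdList.
$firstTwo = array_map('strval', $xml->xpath('IdList/Id[position() <= 2]'));

// Value predicate: Ids numerically greater than an arbitrary threshold.
$aboveThreshold = array_map('strval', $xml->xpath('IdList/Id[. > 24884500]'));

print_r($firstTwo);       // 24887359, 24884828
print_r($aboveThreshold); // 24887359, 24884828, 24884718
```

xpath() returns SimpleXMLElement objects, so the array_map('strval', ...) call is there only to turn them into plain strings.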
Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath for fetching parts of an XML DOM.
If you want any Id element in the first IdList of the eSearchResult:
/eSearchResult/IdList[1]/Id
As you can see, XPath "knows" about the actual structure of an XML document. PCRE does not.
You need to create an XPath object for a DOM document:
$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);
$result = [];
// Collect the trimmed text content of every matching Id node.
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
$result[] = trim($id->nodeValue);
}
var_dump($result);
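DOMXPath::evaluate() can also return scalar values when the expression itself yields a number or a string, which may save a loop. The following is a minimal sketch under the assumption of a shortened, hypothetical document with three Ids; the variable names are illustrative.

```php
<?php
// Shortened, hypothetical copy of the data from the question.
$xmlString = '<eSearchResult><Count>7117</Count><IdList>'
    . '<Id>24887359</Id><Id>24884828</Id><Id>24884718</Id>'
    . '</IdList></eSearchResult>';

$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);

// count() yields a number (a PHP float), not a node list.
$idCount = $xpath->evaluate('count(/eSearchResult/IdList[1]/Id)');

// string() yields the text content of the first matching node.
$total = $xpath->evaluate('string(/eSearchResult/Count)');

var_dump($idCount); // float(3)
var_dump($total);   // string(4) "7117"
```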