0

I am trying to extract text between multilevel XML tags.
This is the data file:
<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
. . .
</TranslationStack>
</eSearchResult>
I just want to extract the ten IDs between the <Id></Id> tags enclosed inside <IdList></IdList>. My regex gets me only the first value out of the ten:

preg_match_all('~<Id>(.+?)<\/Id>~', $temp_str, $pids);

The XML data is stored in the $temp_str variable, and I am trying to get the values into $pids. Are there any other suggestions for going about this?

2
  • Can you add the PHP code that has the regex? Commented Jun 3, 2014 at 18:43
  • @mrk added in the post above. Commented Jun 3, 2014 at 18:45

4 Answers

1

Using preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php), I've included a regex that matches the digits within an <Id> tag. The trickiest part, I think, is the foreach loop, where I iterate over $out[1]. This is because, as the page linked above says, PREG_PATTERN_ORDER:

Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

<?php
// Capture the digits inside each <Id> element; \s* tolerates surrounding whitespace.
preg_match_all('/<Id>\s*(\d+)\s*<\/Id>/',
   "<eSearchResult>
<Count>7117</Count>
<RetMax>10</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>
NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995
</WebEnv>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
<Id>24884479</Id>
<Id>24882343</Id>
<Id>24879340</Id>
<Id>24871662</Id>
<Id>24870721</Id>
<Id>24864115</Id>
<Id>24863809</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet>
<Term>BRCA1[tiab]</Term>
</TranslationStack>
</eSearchResult>",
   $out, PREG_PATTERN_ORDER);
// $out[1] holds the text captured by the first parenthesized group, one entry per match.
foreach ($out[1] as $o) {
    echo $o;
    echo "\n";
}
?>
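
To make the ordering described above concrete, here is a minimal sketch that runs the same kind of pattern under both result orderings. The shortened `$xml` string and the variable names `$byPattern` and `$bySet` are made up for illustration; only `preg_match_all` and its two ordering flags come from the PHP manual.

```php
<?php
// Hypothetical, shortened sample with two Id elements.
$xml = '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>';

// PREG_PATTERN_ORDER (the default): one array per group.
// $byPattern[0] holds the full matches, $byPattern[1] the captured digits.
preg_match_all('~<Id>(\d+)</Id>~', $xml, $byPattern, PREG_PATTERN_ORDER);

// PREG_SET_ORDER: one array per match, each holding that match's groups.
// $bySet[0] is the first match, $bySet[1] the second.
preg_match_all('~<Id>(\d+)</Id>~', $xml, $bySet, PREG_SET_ORDER);

print_r($byPattern);
print_r($bySet);
```

With the default ordering, all captured IDs sit in index 1 of the result, so reading only `$byPattern[0]` or `$byPattern[1][0]` would show a single value even though every match was found.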

3 Comments

If I want to search for some other tag that contains alphanumeric, multi-line sentences, how would I tweak this regex?
Assuming WebEnv looks like this: <WebEnv> NCID_1_457044331_130.14.22.215_9001_1401819380_1399850995 NCID_2_222222 </WebEnv>, then the regex '/<WebEnv>\s*([0-9A-Za-z\.\_\n]+)\s*<\/WebEnv>/s' will do the trick, provided the two values are separated by a newline. What lets the match span lines is the \n inside the character class; the trailing s modifier ("single line" mode) only changes what . matches, so it matters only if you use . in the pattern. Read more about regex modifiers here: php.net/manual/en/reference.pcre.pattern.modifiers.php
No need for the U modifier; it's just slowing down the match and confusing things, and it's never needed. In other words: never use the U (ungreedy) flag!
1

You should use PHP's XPath capabilities, as explained here:

http://www.w3schools.com/php/func_simplexml_xpath.asp

Example:

<?php
$xml = simplexml_load_file("searchdata.xml");
$result = $xml->xpath("IdList/Id");
print_r($result);
?> 

XPath is flexible, can be used conditionally, and is supported in a wide variety of other languages as well. It is also more readable and easier to write than regex, as you can construct conditional queries without using lookaheads.
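
If the XML is already held in a variable rather than a file, `simplexml_load_string` can stand in for `simplexml_load_file`. The sketch below assumes a trimmed-down copy of the question's data in a hypothetical `$xmlString` variable, and adds one conditional query (a positional predicate) to illustrate the point about conditions; it is an illustration, not a tested drop-in for the PubMed response.

```php
<?php
// Hypothetical stand-in for XML text already held in a variable,
// trimmed down from the sample in the question.
$xmlString = <<<XML
<eSearchResult>
<Count>7117</Count>
<IdList>
<Id>24887359</Id>
<Id>24884828</Id>
<Id>24884718</Id>
</IdList>
</eSearchResult>
XML;

// simplexml_load_string() parses XML text held in memory;
// it returns false if the text is not well-formed.
$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    exit("Could not parse the XML\n");
}

// The path is relative to the root element, so this selects every Id under IdList.
$ids = array_map('strval', $xml->xpath('IdList/Id'));

// A predicate makes the query conditional: only the first two Id elements.
$firstTwo = array_map('strval', $xml->xpath('IdList/Id[position() <= 2]'));

print_r($ids);
print_r($firstTwo);
```

`xpath()` returns SimpleXMLElement objects, which is why the results are converted to plain strings with `strval` before use.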

2 Comments

The data I have comes from a file pointer. It's retrieved from an online search on the PubMed website, and all my XML data is in a variable. Is there a way I can use a variable with the XPath functionality? @stephen
@Vignesh As a disclaimer, I don't have PHP set up on this machine, so I won't be able to fully test these code snippets. According to the documentation for simplexml_load_file, you should be able to pass in a URL to the XML file. Please look here: php.net/manual/en/function.simplexml-load-file.php
0

Use this pattern with the g option: (?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\>
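
In PHP, that pattern could be tried roughly as follows. This is a sketch under two assumptions: the pattern is wrapped in delimiters (`~` here), which PHP's preg functions require, and `preg_match_all` takes the place of the g option, since it already returns every match. The sample `$string`, including the stray `<Id>` under a made-up `<Other>` element, is hypothetical.

```php
<?php
// Hypothetical sample: the <Id> inside <Other> sits outside <IdList>.
$string = '<Other><Id>111</Id></Other> <IdList> <Id>24887359</Id> <Id>24884828</Id> </IdList>';

// \G anchors each match where the previous one ended, so an <Id> is only
// accepted if it directly follows <IdList> or another accepted <Id>.
// The escaping of < and > from the original pattern is dropped; it is not needed.
$pattern = '~(?:<IdList>|\G)\s*<Id>(\d+)</Id>~';

preg_match_all($pattern, $string, $matches);

// $matches[1] holds the captured digits of each accepted <Id>.
print_r($matches[1]);
```

One caveat: `\G` also matches at the very start of the subject, so a string that begins directly with `<Id>` would be accepted even without a preceding `<IdList>`.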

1 Comment

$pattern = "(?:\<IdList\>|\G)\s*\<Id\>(\d+)\<\/Id\>"; preg_match_all($pattern, $string, $matches); This gives me an "unexpected ?" error.
0

Do not use PCRE to parse XML. There are CSS selectors and, even better, XPath for fetching parts of an XML DOM.

If you want any Id element in the first IdList of the eSearchResult:

/eSearchResult/IdList[1]/Id

As you can see, XPath "knows" about the actual structure of an XML document. PCRE does not.

You need to create an XPath object for a DOM document:

$dom = new DOMDocument();
$dom->loadXml($xmlString);
$xpath = new DOMXpath($dom);

$result = [];
// Collect the trimmed text of every Id element in the first IdList.
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
  $result[] = trim($id->nodeValue);
}
var_dump($result);
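
To illustrate the difference in structure awareness, the sketch below runs a plain regex and the XPath expression above against the same text. The `<Related>` element and its `<Id>` are invented for the example and are not part of the question's data.

```php
<?php
// Hypothetical document: the <Id> under <Related> is not part of IdList.
$xmlString = '<eSearchResult>'
    . '<Related><Id>111</Id></Related>'
    . '<IdList><Id>24887359</Id><Id>24884828</Id></IdList>'
    . '</eSearchResult>';

// The regex sees only text, so it collects every <Id>, wherever it sits.
preg_match_all('~<Id>(\d+)</Id>~', $xmlString, $m);
$byRegex = $m[1];

// XPath walks the parsed tree, so only Id children of the first IdList are selected.
$dom = new DOMDocument();
$dom->loadXML($xmlString);
$xpath = new DOMXPath($dom);

$byXpath = [];
foreach ($xpath->evaluate('/eSearchResult/IdList[1]/Id') as $id) {
    $byXpath[] = trim($id->nodeValue);
}

print_r($byRegex);
print_r($byXpath);
```

Here the regex result includes the unrelated 111, while the XPath result contains only the two IDs inside IdList.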
