2

I am trying to extract the list of ingredients from a xml FDA label in their database. Somehow I can't get the getNodeSet function to produce the appropriate list of nodes.

Question: How do I fix the code to be able to extract the values of the ingredient names.

This is the code that doesn't work, which breaks at the getNodeSet function, giving a 0 length list. I tried the [graciously proposed] solution on this answer (flatten xml), and many others with no luck.

library(XML)
URL <- "http://www.accessdata.fda.gov/spl/data/e9b9a189-d9e3-42e3-b41c-846a94ebeb37/e9b9a189-d9e3-42e3-b41c-846a94ebeb37.xml"
xmlDocu <- xmlParse(URL)
xmlIngredients <- getNodeSet(xmlDocu, "//ingredient/ingredientSubstance/name")
Ingredients <- xmlSApply(xmlIngredients, function(x) xmlSApply(x, xmlValue))
dfIngredients <- data.frame(t(Ingredients),row.names=NULL)

This is what a portion interest from the deeply nested xml file looks like:

              <ingredient classCode="ACTIB">
                   <quantity>
                      <numerator value="160" unit="mg"/>
                      <denominator value="5" unit="mL"/>
                   </quantity>
                   <ingredientSubstance>
                      <code code="362O9ITL9D" codeSystem="2.16.840.1.113883.4.9"/>
                      <name>Acetaminophen</name>
                      <activeMoiety>
                         <activeMoiety>
                            <code code="362O9ITL9D" codeSystem="2.16.840.1.113883.4.9"/>
                            <name>acetaminophen</name>
                         </activeMoiety>
                      </activeMoiety>
                   </ingredientSubstance>
                </ingredient>
                <ingredient classCode="IACT">
                   <ingredientSubstance>
                      <code code="3QPI1U3FV8" codeSystem="2.16.840.1.113883.4.9"/>
                      <name>BUTYLPARABEN</name>
                   </ingredientSubstance>
                </ingredient>

1 Answer 1

3

I believe your problem is related to the namespaces defined in the xml file. The are a couple ways of dealing with this problem. What I prefer is to use xml2 package and strip out the namespaces and then parse the file:

library(xml2)

URL <- "http://www.accessdata.fda.gov/spl/data/e9b9a189-d9e3-42e3-b41c-846a94ebeb37/e9b9a189-d9e3-42e3-b41c-846a94ebeb37.xml"
xmlDocu <- read_xml(URL)
#find namespace
ns<-xml_ns(xmlDocu)

#I find it easier to strip the name space and use accordingly
xml_ns_strip(xmlDocu)
xml_find_all(xmlDocu, "//ingredient")
xml_text(xml_find_all(xmlDocu, "//ingredient/ingredientSubstance/name"))

I find the syntax of the rvest and xml2 easier to use then the XML package.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.