I am working with an XML file with the structure below, and I am trying to display each unique item as a row in a data frame. I know I can retrieve each child element using the xpathApply function, but notice that //channel//item//category[@domain='tag'] matches a different number of nodes for each item. How can I put these categories all in one cell, separated by commas? Would you loop over each child element?

Here's a test.xml

test.xml <- "<channel>
        <item>
        <title>Article Name 1</title>
        <creator>User1</creator>
        <post_id>1000</post_id>
        <category domain='tag' nicename='red'>Red</category>
        <category domain='store' nicename='clothes'>Clothes</category>
        </item>     
        <item>
        <title>Article Name 2</title>
        <creator>User3</creator>
        <post_id>232</post_id>
        <category domain='tag' nicename='blue'>Blue</category>
        <category domain='tag' nicename='green'>Green</category>
        <category domain='tag' nicename='yellow'>Yellow</category>
        <category domain='store' nicename='clothes'>Other</category>
        </item> 
        <item>
        <title>Article Name 3</title>
        <creator>User4</creator>
        <post_id>4532</post_id>
        <category domain='tag' nicename='red'>Red</category>
        <category domain='tag' nicename='blue'>Blue</category>
        <category domain='store' nicename='clothes'>Food</category>
        </item>         
    </channel>"

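library(XML)
# asText = TRUE tells xmlParse that test.xml holds XML content, not a file name
xml <- xmlParse(test.xml, asText = TRUE)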

The end goal should look like this:

title      creator  tag          store
Article 1  User 1   Red          Clothes
Article 2  User 3   Blue, Green  Other
Article 3  User 4   Red, Blue    Food

1 Answer

Here is a solution using the xml2 package. It is straightforward: read the "item" parent nodes and parse out the title and creator. Then use lapply to process each parent node, collapsing its multiple child nodes into a single value. Finally, merge everything together.

library(xml2)
library(dplyr)

# read the document and find the parent nodes
page <- read_xml(test.xml)
items <- page %>% xml_find_all("item")

# get title and creator (assuming one per parent)
title <- items %>% xml_find_first("title") %>% xml_text()
creator <- items %>% xml_find_first("creator") %>% xml_text()

# find the multiple tag and store nodes per parent
# and collapse the multiples into one value
dfs <- lapply(items, function(node) {
  tag <- node %>% xml_find_all(xpath = './/category[@domain="tag"]') %>% xml_text()
  tag <- paste(tag, collapse = ", ")

  store <- node %>% xml_find_all(xpath = './/category[@domain="store"]') %>% xml_text()
  store <- paste(store, collapse = ", ")

  data.frame(tag, store)
})

# combine everything into one data frame
finalanswer <- data.frame(title, creator, bind_rows(dfs))
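
If you would rather not depend on dplyr, the same idea can probably be written with xml2 and base R alone. The following is only a minimal sketch: the helper name collapse_text and the shortened sample_xml string are made up for illustration, and it assumes every item has exactly one title and one creator.

```r
library(xml2)

# Shortened, hypothetical sample with the same structure as test.xml
sample_xml <- "<channel>
  <item>
    <title>Article Name 1</title>
    <creator>User1</creator>
    <category domain='tag' nicename='red'>Red</category>
    <category domain='store' nicename='clothes'>Clothes</category>
  </item>
  <item>
    <title>Article Name 2</title>
    <creator>User3</creator>
    <category domain='tag' nicename='blue'>Blue</category>
    <category domain='tag' nicename='green'>Green</category>
    <category domain='store' nicename='clothes'>Other</category>
  </item>
</channel>"

# Hypothetical helper: join the text of every node under `node`
# that matches `xpath` into one comma-separated string
collapse_text <- function(node, xpath) {
  paste(xml_text(xml_find_all(node, xpath)), collapse = ", ")
}

items <- xml_find_all(read_xml(sample_xml), "//item")

# One row per item; vapply guarantees one string per item
result <- data.frame(
  title   = xml_text(xml_find_first(items, "./title")),
  creator = xml_text(xml_find_first(items, "./creator")),
  tag     = vapply(items, collapse_text, character(1),
                   xpath = "./category[@domain='tag']"),
  store   = vapply(items, collapse_text, character(1),
                   xpath = "./category[@domain='store']"),
  stringsAsFactors = FALSE
)

print(result)
```

Because each column is built as a plain character vector, there is no list of one-row data frames to bind afterwards. An item with no matching category simply gets an empty string in that cell.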