1

I'm trying to convert a XML with a complex structure of nodes into a data frame using R. This is a brief example of the XML file:

<products>
    <product>
        <id>1</id>
        <data>
            <data_value>
                <number>12345</number>
                <city>London</city>
            </data_value>
        </data>
        <attributes>
            <p_attribute>
                <name>Name_1</name>
                <value>Value_1</value>
            </p_attribute>
            <p_attribute>
                <name>Name_2</name>
                <value>Value_2</value>
            </p_attribute>
        </attributes>
    </product>
    <product>
        <id>2</id>
        <data>
            <data_value>
                <number>98765</number>
                <city>London</city>
            </data_value>
        </data>
        <attributes>
            <p_attribute>
                <name>Name_9</name>
                <value>Value_9</value>
            </p_attribute>
            <p_attribute>
                <name>Name_8</name>
                <value>Value_8</value>
            </p_attribute>
        </attributes>
    </product>
</products>

When I try to convert this file to a data frame, I use the following code (XML library)

library(XML)
doc=xmlParse("file.xml")
xmldf=xmlToDataFrame(nodes = getNodeSet(doc, "//product"))

And after that, the final result is this data frame that you can see bellow:

  id        data                 attributes
1  1 12345London Name_1Value_1Name_2Value_2
2  2 98765London Name_9Value_9Name_8Value_8

How can I get a different data frame, eliminating the complex structure of the XML file to get a result like this?

  id number   city name.1 value.1 name.2 value.2
1  1  12345 London Name_1 Value_1 Name_2 Vlaue_2
2  2  98765 London Name_9 Value_9 Name_8 Value_8

2 Answers 2

6

I'm less familiar with the XML package but have used the xml2 package more. It fits in with the tidyverse packages, so it works well with a purrr-based approach I'll use here. For each <product> node, I'm calling a function that extracts all its child id, number, city, name, and value nodes and pulls out their text. I did it by product because I wanted to get a small data frame for each in order to make sure all the IDs stay together with the name & value nodes, allowing those to have different lengths. Finally, map_dfr binds the list of data frames row-wise.

library(tidyr)
library(purrr)
library(xml2)


products <- read_xml("text.xml") %>%
  xml_find_all("//product")

prod_df <- map_dfr(products, function(p_node) {
  list(".//id", ".//number", ".//city", ".//name", ".//value") %>%
    set_names(stringr::str_extract, "\\w+") %>%
    map(~xml_find_all(p_node, .)) %>%
    map(xml_text) %>%
    as_tibble()
})

prod_df
#> # A tibble: 4 x 5
#>   id    number city   name   value  
#>   <chr> <chr>  <chr>  <chr>  <chr>  
#> 1 1     12345  London Name_1 Value_1
#> 2 1     12345  London Name_2 Value_2
#> 3 2     98765  London Name_9 Value_9
#> 4 2     98765  London Name_8 Value_8

Personally, I'd recommend working in this format, especially since you might have differing numbers of name-value pairs for different products. But if you really need a wide format, you can mark off an observation number for each product's children, then reshape.

prod_df %>%
  dplyr::group_by(id, number, city) %>%
  dplyr::mutate(obs = dplyr::row_number()) %>%
  pivot_wider(names_from = obs, values_from = c(name, value), names_sep = ".")
#> # A tibble: 2 x 7
#> # Groups:   id, number, city [2]
#>   id    number city   name.1 name.2 value.1 value.2
#>   <chr> <chr>  <chr>  <chr>  <chr>  <chr>   <chr>  
#> 1 1     12345  London Name_1 Name_2 Value_1 Value_2
#> 2 2     98765  London Name_9 Name_8 Value_9 Value_8
Sign up to request clarification or add additional context in comments.

Comments

0

Hi JCMendes you can use the tidyverse to solve that

This is unfortunately not very extensible I would also recommend that you work with long data

 x <- xmldf %>% 
    mutate(number = data %>% str_extract("[:digit:]{1,}"),
           city = data %>% str_extract("[:alpha:]{1,}"),
           characterss = str_split(attributes,"(?=[[:upper:]])"),
           name = characterss %>% map(keep,str_detect,"Name"),
           value= characterss %>% map(keep,str_detect,"Value")) %>%
    select(-attributes,-data,-characterss) %>%
    unnest(name) %>%
    unnest(value) %>% 
    group_by(id, number, city) %>% 
    dplyr::mutate(obs = dplyr::row_number()) %>%
    pivot_wider(names_from = obs, values_from = c(name, value), names_sep = ".")

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.