Turning xml file into dataframe

Question

I'm trying to extract data from an xml file. I'm extracting the nodes separately with the following code:

entity_uin <- xml_text(xml_find_all(xml, ".//Entity/EntityUin"))
entity_name <- xml_text(xml_find_all(xml, ".//Entity/EntityName"))
entity_zip_code <- xml_text(xml_find_all(xml, ".//Entity/EntityZipCode"))

This way I'm getting three character vectors. Then, I'm trying to create a tibble from these character vectors with the following code:

xml <- tibble(entity_uin, entity_name, entity_zip_code)

Unfortunately, this doesn't work because the three character vectors have unequal lengths. Can anyone suggest a solution?

You should iterate over each <Entity> node, then extract its children inside that loop. This guarantees each row is for exactly one <Entity>.
– Omprakash S, Commented Jun 17 at 7:32
Please consider editing your question to make it reproducible for others, e.g. include a small (yet valid) XML document in a separate code block and the complete code needed to reproduce your exact issue. Right now we don't really know whether xml is a document from read_xml() or a node(set) from xml_find_all() / xml_find_first(), whether some subnodes are missing in some Entity nodes, or whether you are dealing with some kind of deeper structure (e.g. nested entities) or something else.
– margusl, Commented Jun 17 at 8:51

margusl · Accepted Answer · 2025-06-17 19:51:13Z

Assuming(!) that some Entity nodes in your document are incomplete and the error is raised because some of your column vectors are shorter than others, you could first get the set of parent nodes with xml_find_all() and then extract the details from those with xml_find_first().

The output of xml_find_first() is always the same size as its input, with missing matches filled with NAs, so the resulting vectors are aligned and can be passed to tibble():

library(xml2)
example_xml <- 
'<?xml version="1.0" encoding="UTF-8"?>
<Entities>
  <Entity>
    <EntityUin>123456</EntityUin>
    <EntityName>ABC Corp</EntityName>
    <EntityZipCode>10001</EntityZipCode>
  </Entity>
  <Entity>
    <EntityUin>789012</EntityUin>
    <EntityName>XYZ Inc</EntityName>
    <!-- Missing EntityZipCode -->
  </Entity>
  <Entity>
    <EntityUin>345678</EntityUin>
    <EntityName>Sample LLC</EntityName>
    <EntityZipCode>90210</EntityZipCode>
  </Entity>
</Entities>'

entities <- 
  read_xml(example_xml) |>
  xml_find_all("/Entities/Entity")

tibble::tibble(
  entity_uin      = xml_find_first(entities, "./EntityUin")     |> xml_text(),
  entity_name     = xml_find_first(entities, "./EntityName")    |> xml_text(), 
  entity_zip_code = xml_find_first(entities, "./EntityZipCode") |> xml_text()
)
#> # A tibble: 3 × 3
#>   entity_uin entity_name entity_zip_code
#>   <chr>      <chr>       <chr>          
#> 1 123456     ABC Corp    10001          
#> 2 789012     XYZ Inc     <NA>           
#> 3 345678     Sample LLC  90210
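
To see why the original approach produces vectors of different lengths, here is a minimal sketch that compares the two search strategies. The two-entity document and the variable names (doc, n_uin, n_zip, zip) are made up for illustration and are not taken from the question's data.

```r
library(xml2)

# Hypothetical document: the second Entity has no EntityZipCode.
doc <- read_xml(
  "<Entities>
     <Entity><EntityUin>1</EntityUin><EntityZipCode>10001</EntityZipCode></Entity>
     <Entity><EntityUin>2</EntityUin></Entity>
   </Entities>"
)

# Document-wide search returns one result per match, so lengths can differ.
n_uin <- length(xml_find_all(doc, ".//Entity/EntityUin"))
n_zip <- length(xml_find_all(doc, ".//Entity/EntityZipCode"))
c(n_uin, n_zip)
#> [1] 2 1

# Per-entity search returns one result per input node, NA where nothing matches.
entities <- xml_find_all(doc, ".//Entity")
zip <- xml_text(xml_find_first(entities, "./EntityZipCode"))
zip
#> [1] "10001" NA
```

Because zip has one element per Entity, it lines up with any other column extracted the same way, which is what tibble() needs.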
