I'm trying to extract data from an XML file. I'm extracting the nodes separately with the following code:

entity_uin <- xml_text(xml_find_all(xml, ".//Entity/EntityUin"))
entity_name <- xml_text(xml_find_all(xml, ".//Entity/EntityName"))
entity_zip_code <- xml_text(xml_find_all(xml, ".//Entity/EntityZipCode"))

This way I'm getting three character vectors. Then, I'm trying to create a tibble from these character vectors with the following code:

xml <- tibble(entity_uin, entity_name, entity_zip_code)

Unfortunately, this doesn't work because the three character vectors have unequal lengths. Can anyone suggest a solution?

  • You should iterate over each <Entity> node, then extract its children inside that loop. This guarantees each row is for exactly one <Entity>. Commented Jun 17 at 7:32
  • Please consider editing your question to make this reproducible for others, e.g. include a small (yet valid) XML document in a separate code block and include complete code to reproduce your exact issue. Right now we don't really know whether xml is a document from read_xml() or a node(set) from xml_find_all() / xml_find_first(), whether some subnodes are missing in some Entity nodes, or whether you are dealing with some kind of deeper structure (e.g. nested entities) or something else. Commented Jun 17 at 8:51
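The per-Entity iteration suggested in the first comment could look roughly like the sketch below. It is only an illustration under assumptions: the two-entity document is made up (the second entity deliberately lacks EntityZipCode), and base R's lapply(), data.frame() and rbind() are just one way to write the loop.

```r
library(xml2)

# Hypothetical document for illustration; the second <Entity> has no zip code.
doc <- read_xml(
  "<Entities>
     <Entity>
       <EntityUin>1</EntityUin>
       <EntityName>A</EntityName>
       <EntityZipCode>10001</EntityZipCode>
     </Entity>
     <Entity>
       <EntityUin>2</EntityUin>
       <EntityName>B</EntityName>
     </Entity>
   </Entities>"
)

entities <- xml_find_all(doc, ".//Entity")

# Build one single-row data frame per <Entity>. For a missing child,
# xml_find_first() returns a missing node and xml_text() turns it into NA.
rows <- lapply(entities, function(node) {
  data.frame(
    entity_uin      = xml_text(xml_find_first(node, "./EntityUin")),
    entity_name     = xml_text(xml_find_first(node, "./EntityName")),
    entity_zip_code = xml_text(xml_find_first(node, "./EntityZipCode")),
    stringsAsFactors = FALSE
  )
})

# Stack the per-entity rows into one data frame.
result <- do.call(rbind, rows)
result
```

Because every iteration looks only at one <Entity>, each row stays aligned with its own node, at the cost of being slower than a vectorised call on large documents.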

1 Answer

Assuming(!) that some Entity nodes in your document are incomplete and that the error is raised because some of your column vectors are shorter than others, you could first get the set of parent nodes with xml_find_all() and then extract the details from each of them with xml_find_first().

The output of xml_find_first() is always the same size as its input, and missing matches are filled with NAs, so the resulting vectors are aligned and can be passed to tibble():

library(xml2)
example_xml <- 
'<?xml version="1.0" encoding="UTF-8"?>
<Entities>
  <Entity>
    <EntityUin>123456</EntityUin>
    <EntityName>ABC Corp</EntityName>
    <EntityZipCode>10001</EntityZipCode>
  </Entity>
  <Entity>
    <EntityUin>789012</EntityUin>
    <EntityName>XYZ Inc</EntityName>
    <!-- Missing EntityZipCode -->
  </Entity>
  <Entity>
    <EntityUin>345678</EntityUin>
    <EntityName>Sample LLC</EntityName>
    <EntityZipCode>90210</EntityZipCode>
  </Entity>
</Entities>'

entities <- 
  read_xml(example_xml) |>
  xml_find_all("/Entities/Entity")

tibble::tibble(
  entity_uin      = xml_find_first(entities, "./EntityUin")     |> xml_text(),
  entity_name     = xml_find_first(entities, "./EntityName")    |> xml_text(), 
  entity_zip_code = xml_find_first(entities, "./EntityZipCode") |> xml_text()
)
#> # A tibble: 3 × 3
#>   entity_uin entity_name entity_zip_code
#>   <chr>      <chr>       <chr>          
#> 1 123456     ABC Corp    10001          
#> 2 789012     XYZ Inc     <NA>           
#> 3 345678     Sample LLC  90210
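
To see why the original approach produces unequal lengths, the two lookups can be compared on a tiny document. This is a minimal sketch with made-up data (two entities, the second without EntityZipCode), not the asker's actual file:

```r
library(xml2)

# Hypothetical document for illustration; the second <Entity> has no zip code.
doc <- read_xml(
  "<Entities>
     <Entity><EntityUin>1</EntityUin><EntityZipCode>10001</EntityZipCode></Entity>
     <Entity><EntityUin>2</EntityUin></Entity>
   </Entities>"
)

# Document-wide search: only nodes that exist are returned,
# so a missing child silently shortens the vector.
uin_all <- xml_text(xml_find_all(doc, ".//Entity/EntityUin"))
zip_all <- xml_text(xml_find_all(doc, ".//Entity/EntityZipCode"))
length(uin_all)   # 2
length(zip_all)   # 1

# Per-parent search: one result per <Entity>, NA where the child is absent.
entities  <- xml_find_all(doc, ".//Entity")
zip_first <- xml_text(xml_find_first(entities, "./EntityZipCode"))
length(zip_first) # 2
zip_first
```

If the lengths already match in the document-wide search, missing child nodes are not the cause and the structure of the file (for example nested entities) would need a closer look.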