
I am pretty new to R. Yesterday I scraped a website that requires a login; the page is in XML format, like this:

<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420">
        <name>Math</name>
      </department>
      <department id="421">
        <name>Statistics</name>
      </department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412">
        <name>English</name>
      </department>
    </teacher>
  </teacherList>
</result> 

I have just converted the XML to a list:

library(XML)
library(rvest)
library(plyr)
library(dplyr)
library(httr)
library(pipeR)
library(xml2)

url.address <- "http://xxxxxxxxxxxxxxxxx"
session <- html_session(url.address)
form <- html_form(read_html(url.address))[[1]]
filled_form <- set_values(form,
                          "userid" = "id",
                          "Password" = "password")
s <- submit_form(session, filled_form)
z <- read_xml(s$response)
z1 <- as_list(z)
z2 <- z1$teacherList

Now I need to extract the data from the list and turn it into a data frame. Note that some teachers belong to two departments, while others belong to only one. Part of the list z2 looks like this:

z2[[1]]

$name
$name[[1]]
[1] "Mary"


$department
$department$name
$department$name[[1]]
[1] "Math"


attr(,"id")
[1] "420"

$department
$department$name
$department$name[[1]]
[1] "statistics"


attr(,"id")
[1] "421"

attr(,"id")
[1] "D95"

When I extracted them one by one, it took too long:

attr(z2[[1]],"id")

"D95"

z2[[1]][[1]][[1]] 

"Mary"

z2[[1]][[2]][[1]][[1]] 

"Math"

attr(z2[[1]][[2]], "id") 

"420"

z2[[1]][[3]][[1]][[1]] 

"statistics"

attr(z2[[1]][[3]], "id")

"421"

attr(z2[[2]],"id")

"D73"

z2[[2]][[1]][[1]] 

"Adam"

z2[[2]][[2]][[1]][[1]]

"English"

attr(z2[[2]][[2]],"id")

"412"

So I tried to write a loop:

for (x in 1:2){
  for (y in 2:3){
  a <- attr(z2[[x]],"id")
  b <- z2[[x]][[1]][[1]]
  d <- z2[[x]][[y]][[1]][[1]]
  e <- attr(z2[[x]][[y]],"id")
  g <- cbind(print(a),print(b),print(d),print(e))
  }}

but it doesn't work at all, since some of the teachers belong to only one department. The result I expected:

[image: expected result table]

Any advice would be appreciated!

dput(head(z2, 10))

structure(list(teacher = structure(list(name = list("Mary"), 
    department = structure(list(name = list("Math")), .Names = "name", id = "420"), 
    department = structure(list(name = list("statistics")), .Names = "name", id = "421")), .Names = c("name", 
"department", "department"), id = "D95"), teacher = structure(list(
    name = list("Adam"), department = structure(list(name = list(
        "English")), .Names = "name", id = "412")), .Names = c("name", 
"department"), id = "D73"), teacher = structure(list(name = list(
    "Kevin"), department = structure(list(name = list("Chinese")), .Names = "name", id = "201")), .Names = c("name", 
"department"), id = "D101"), teacher = structure(list(name = list(
    "Nana"), department = structure(list(name = list("Science")), .Names = "name", id = "205")), .Names = c("name", 
"department"), id = "D58"), teacher = structure(list(name = list(
    "Nelson"), department = structure(list(name = list("Music")), .Names = "name", id = "370")), .Names = c("name", 
"department"), id = "D14"), teacher = structure(list(name = list(
    "Esther"), department = structure(list(name = list("Medicine")), .Names = "name", id = "361")), .Names = c("name", 
"department"), id = "D28"), teacher = structure(list(name = list(
    "Mia"), department = structure(list(name = list("Chemistry")), .Names = "name", id = "326")), .Names = c("name", 
"department"), id = "D17"), teacher = structure(list(name = list(
    "Jack"), department = structure(list(name = list("German")), .Names = "name", id = "306")), .Names = c("name", 
"department"), id = "D80"), teacher = structure(list(name = list(
    "Tom"), department = structure(list(name = list("French")), .Names = "name", id = "360")), .Names = c("name", 
"department"), id = "D53"), teacher = structure(list(name = list(
    "Allen"), department = structure(list(name = list("Spanish")), .Names = "name", id = "322")), .Names = c("name", 
"department"), id = "D18")), .Names = c("teacher", "teacher", 
"teacher", "teacher", "teacher", "teacher", "teacher", "teacher", "teacher", 
"teacher"))
  • It will not be possible to help unless you provide a reproducible example of your data. Try dput(head(z2, 10)) and paste the result into your question. Commented Sep 6, 2017 at 15:42
  • @lmo sorry! Just added :) Commented Sep 6, 2017 at 16:28
  • Please do not paste images of code. And please read how to make a great reproducible example. Commented Sep 6, 2017 at 16:30
  • @lmo Just uploaded it now. I am sorry that I hadn't figured out how to post the output, so I uploaded an image. Sorry for the inconvenience. Commented Sep 6, 2017 at 17:01
  • @C8H10N4O2 Hi! I am very sorry, I just started using it two days ago. I know this should not be an excuse. I will try to figure out how to do it asap. Commented Sep 6, 2017 at 17:02

2 Answers


This was a bit crazy to construct, but I think it more or less conforms with the desired output posted in a previous version of the post. I had to use sapply within the lapply function to pull out the second ID variable.

do.call(rbind,             # rbind list of data.frames output by lapply
        lapply(unname(z2), # loop through list, first drop outer names
               function(x) { # begin lapply function
                 temp <- unlist(x) # unlist inner elements to a vector
                 data.frame(name=temp[names(temp) == "name"], # subset on names
                            dept=temp[names(temp) == "department.name"], # subset on dept
                            id=attr(x, "id"), # extract one id
                            id2=unlist(sapply(x, attr, "id")), # extract other id
                            row.names=NULL) # end data.frame function, drop row.names
                            })) # end lapply function, lapply, and do.call

this returns

     name       dept   id id2
1    Mary       Math  D95 420
2    Mary statistics  D95 421
3    Adam    English  D73 412
4   Kevin    Chinese D101 201
5    Nana    Science  D58 205
6  Nelson      Music  D14 370
7  Esther   Medicine  D28 361
8     Mia  Chemistry  D17 326
9    Jack     German  D80 306
10    Tom     French  D53 360
11  Allen    Spanish  D18 322
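To see why subsetting by name works, it can help to inspect the intermediate values for a single teacher. The sketch below rebuilds a two-teacher stand-in for z2 (an assumption, written by hand to mirror the dput output in the question) and looks at what unlist and sapply return for the first element.

```r
# Hand-built stand-in for z2, mirroring the first two teachers in the dput() output.
z2 <- list(
  teacher = structure(
    list(name = list("Mary"),
         department = structure(list(name = list("Math")), id = "420"),
         department = structure(list(name = list("statistics")), id = "421")),
    id = "D95"),
  teacher = structure(
    list(name = list("Adam"),
         department = structure(list(name = list("English")), id = "412")),
    id = "D73")
)

x <- z2[[1]]

# unlist() flattens the nested list and builds names from the nesting path,
# so the teacher name and the department names can be told apart by name.
temp <- unlist(x)
print(names(temp))

# sapply() asks every child for its "id" attribute; the name element has none
# (NULL), so unlist() keeps only the department ids.
ids <- unlist(sapply(x, attr, "id"))
print(ids)
```

Because the teacher name has length 1 and the department columns have one entry per department, data.frame recycles the name and the teacher id across the department rows.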

The structure of the second list differs in a number of ways from the initial example. First: one nest is removed. That is, the depth of the new list is one less than that of the initial example. It would be as if you provided z2[[1]] for the initial list. Second, the second example is missing what I called id initially (values such as D95 and D101).

With a bit of manipulation of the original code, I got this to work with

lapply(list(z3), # wrap z3 in a list to restore the expected list depth
       function(x) { # begin lapply function
           temp <- unlist(x) # unlist inner elements to a vector
           data.frame(name=temp[names(temp) == "name"], # subset on names
                      dept=temp[names(temp) == "department.name"], # subset on dept
                      # id=attr(x, "id"), # extract one id
                      id2=unlist(sapply(x, attr, "id")), # extract other id
                      row.names=NULL) # end data.frame function, drop row.names
       })

The changes to the code address the differences I mentioned above: z2 is replaced by list(z3) as the first argument to lapply, which constructs the needed list depth. Also, the line id=attr(x, "id"), in the inner function has been commented out, as id does not exist in this data.
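As a rough illustration of the depth difference, here is a minimal sketch. The z3 below is hypothetical: it is only my guess at the shape described above (a single teacher element with no teacher-level id), not the actual data from the question. The helper name to_df and the stringsAsFactors = FALSE argument are also my additions.

```r
# Hypothetical z3: shaped like one element of z2, without a teacher-level id.
z3 <- list(
  name = list("Mary"),
  department = structure(list(name = list("Math")), id = "420"),
  department = structure(list(name = list("statistics")), id = "421")
)

# Same inner function as above, minus the teacher id column.
to_df <- function(x) {
  temp <- unlist(x)
  data.frame(name = temp[names(temp) == "name"],
             dept = temp[names(temp) == "department.name"],
             id2  = unlist(sapply(x, attr, "id")),
             row.names = NULL,
             stringsAsFactors = FALSE) # keep columns as character
}

# list(z3) adds one level, so lapply sees the whole teacher as a single element
# instead of looping over its name and department children.
res <- lapply(list(z3), to_df)
print(res[[1]])
```

If several such elements exist, collecting them in one list and wrapping the lapply call in do.call(rbind, ...) would give a single data frame, as in the first version.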


10 Comments

this is very neat! Thank you so much :)
I tried to use the do.call function on another list that has less structure, but it failed with an error like "Error in data.frame(names(temp) == "name", division = temp[names(temp) == : arguments imply differing number of rows: 1, 0 ". Would you kindly tell me which part goes wrong? Even just a hint would be great :) I just listed it above
@Ching I modified the code to work with your second example. It is important to understand your underlying data structure when working on a problem. Here, you should have seen that the id variable was missing from this data when you printed a small example on your screen.
Hi! It didn't work when I used list(z3), but when I do it with unname(z3) it works again. That's pretty strange. I did notice the difference in structure.
Wait! I see why it didn't work: the second structure I listed above was the wrong one (which I didn't notice earlier). Thank you so much! You really saved me. This structure really drives me crazy :p

XML is generally really easy to deal with in R.

Use library(XML) and library(plyr) to avoid having to write loops:

Step one is to read in the XML

I saved your sample XML as a .xml file called Demo.xml. You can also pass xmlParse a URL.

rawXML <- xmlParse("Demo.xml")

Then convert XML to list:

xmlList <- xmlToList(rawXML)

Then convert list to data frame with plyr

df1 <- ldply(xmlList, data.frame)

This is the general process. If you provide sample data, we can refine it to match your specific use case.

Here's the resulting summary output. Is this what you're looking for?

 str(df1)
'data.frame':   4 obs. of  12 variables:
 $ .id                        : chr  "code" "note" "teacherList" ".attrs"
 $ X..i..                     : Factor w/ 2 levels "1","success": 1 2 NA 2
 $ teacher.name               : Factor w/ 1 level "Mary": NA NA 1 NA
 $ teacher.department.name    : Factor w/ 1 level "Math": NA NA 1 NA
 $ teacher.department..attrs  : Factor w/ 1 level "420": NA NA 1 NA
 $ teacher.department.name.1  : Factor w/ 1 level "Statistics": NA NA 1 NA
 $ teacher.department..attrs.1: Factor w/ 1 level "421": NA NA 1 NA
 $ teacher..attrs             : Factor w/ 1 level "D95": NA NA 1 NA
 $ teacher.name.1             : Factor w/ 1 level "Adam": NA NA 1 NA
 $ teacher.department.name.2  : Factor w/ 1 level "English": NA NA 1 NA
 $ teacher.department..attrs.2: Factor w/ 1 level "412": NA NA 1 NA
 $ teacher..attrs.1           : Factor w/ 1 level "D73": NA NA 1 NA
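The ldply result above is wide, with one column per nested field. If one row per teacher and department pair is wanted instead, a possible sketch with the same XML package is to query the parsed document with XPath. The XML string below is the sample from the question, and the names xml_txt, rows and df_long are illustrative. It assumes every teacher has at least one department, since data.frame would fail on a teacher with none.

```r
library(XML)

# Sample XML from the question, inlined so the example is self-contained.
xml_txt <- '<result status="success">
  <code>1</code>
  <note>success</note>
  <teacherList>
    <teacher id="D95">
      <name>Mary</name>
      <department id="420"><name>Math</name></department>
      <department id="421"><name>Statistics</name></department>
    </teacher>
    <teacher id="D73">
      <name>Adam</name>
      <department id="412"><name>English</name></department>
    </teacher>
  </teacherList>
</result>'

doc <- xmlParse(xml_txt, asText = TRUE)
teachers <- getNodeSet(doc, "//teacher")

# Build one small data frame per teacher; the scalar name and id are
# recycled across that teacher's department rows.
rows <- lapply(teachers, function(node) {
  data.frame(
    name = xmlValue(node[["name"]]),
    dept = xpathSApply(node, "./department/name", xmlValue),
    id   = xmlGetAttr(node, "id"),
    id2  = xpathSApply(node, "./department", xmlGetAttr, "id"),
    stringsAsFactors = FALSE
  )
})

df_long <- do.call(rbind, rows)
print(df_long)
```

Working on the parsed document directly avoids the intermediate nested list, so the variable number of departments per teacher is handled by the XPath query rather than by index arithmetic.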

3 Comments

Hi! Actually I scraped a login website which is xml format, so it's much difficult to parse it I guess :) I just added more info now!
@Ching Can you provide a link to the XML sheet?
I think it's pretty close, I will check it asap and let you know :)
