9

Objects can be saved and read back like so:

# Save as file
saveRDS(iris, "mydata.RDS")

# Read back in 
readRDS("mydata.RDS")

But this doesn't seem to work for objects made with xml2::read_html()

Example

library(rvest)
someobject <- read_html("https://stackoverflow.com/")
saveRDS(someobject, "someobject.RDS")

This creates a file, but reading it back in fails:

readRDS("someobject.RDS")
Error in doc_is_html(x$doc) : external pointer is not valid

What's going on and what's the simplest way of saving an html object so that it can be loaded back in with minimal code/fuss?

4 Answers

9

To answer "what's going on": saveRDS is trying to serialize the object being saved. Here, the object someobject is a list with elements someobject$doc and someobject$node. The type of the elements is externalptr (external pointer), which means they reference a C data structure held in memory. When external pointers are serialized, the reference is lost. Hence the error "external pointer is not valid".
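A minimal sketch of the failure described above, assuming xml2 is installed. It parses an inline HTML string instead of a URL so that it runs offline; the document content and the temporary file path are placeholders, not part of the original question.

```r
library(xml2)

# Parse a small inline document (a stand-in for a downloaded page)
doc <- read_html("<html><body><p>Hello</p></body></html>")

# Both list elements are external pointers to C-level structures
typeof(doc$doc)
typeof(doc$node)

# Round-trip the object through saveRDS()/readRDS()
path <- tempfile(fileext = ".RDS")
saveRDS(doc, path)
restored <- readRDS(path)

# The list and its class come back, but the pointers no longer reference
# a live document, so using the restored object is expected to fail.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(xml_text(restored), error = function(e) conditionMessage(e))
result
```

Note that readRDS() itself succeeds here; the error only appears once something (printing, querying) tries to follow the pointer.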

You could serialize someobject using as.character() and pass that to saveRDS:

saveRDS(as.character(someobject), "someobject.RDS")

Then recreate the object using readRDS and read_html:

someobject <- read_html(readRDS("someobject.RDS"))

But it's easier to use write_html() as others suggested.
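As a rough sketch of the write_html() route, assuming xml2 is installed: the helper names save_html_doc() and load_html_doc() are hypothetical wrappers made up for illustration, not functions from xml2, and the inline document and temporary path are placeholders.

```r
library(xml2)

# Hypothetical helper (not part of xml2): write the document out as HTML text
save_html_doc <- function(doc, path) {
  write_html(doc, path)
  invisible(path)
}

# Hypothetical helper (not part of xml2): re-parse the file, which creates
# fresh external pointers that are valid in the current session
load_html_doc <- function(path) {
  read_html(path)
}

doc <- read_html("<html><body><p>Hello</p></body></html>")
path <- tempfile(fileext = ".html")

save_html_doc(doc, path)
reloaded <- load_html_doc(path)

# The reloaded object can be queried as usual
xml_text(xml_find_first(reloaded, "//p"))
```

The file on disk is plain HTML rather than an R-specific format, so it can also be opened in a browser or by other tools.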

There is some discussion in this GitHub issue thread.


3

We can use write_xml and read_html from the xml2 package:

before <- xml2::read_html("https://stackoverflow.com/")
xml2::write_xml(before, "someobject1.xml")
after <- xml2::read_html("someobject1.xml")

However, identical() returns FALSE:

identical(before, after)
#[1] FALSE

but the same query on both of them seems to return the same result:

library(rvest)
before %>%  html_nodes("div")
after %>% html_nodes("div")
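One way to check the round trip without relying on identical() is to compare extracted content instead of the objects themselves. The sketch below assumes xml2 is installed and uses a small inline document and a temporary file as placeholders; the explanation that identical() fails because each parse holds its own external pointers is my reading, not something stated in this answer.

```r
library(xml2)

before <- read_html("<html><body><div>one</div><div>two</div></body></html>")

path <- tempfile(fileext = ".xml")
write_xml(before, path)
after <- read_html(path)

# The two objects wrap different external pointers, so this is not expected
# to be TRUE even when the content matches
identical(before, after)

# Comparing the text of the matched nodes checks the content itself
same_text <- identical(
  xml_text(xml_find_all(before, "//div")),
  xml_text(xml_find_all(after, "//div"))
)
same_text
```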


3

As far as I can tell, the XML and RDS methods are both off from the original by the same number of characters. I did a comparison, and the differences between the original and the reloaded versions appear to be in the body node.

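library(rvest)
library(stringr)  # for str_remove_all() further down

url <-  "https://stackoverflow.com/"
html <- read_html(url)
# Body text of the original, freshly parsed document
html_node(html, "body")  %>% html_text() %>%  unlist() -> OBT
nchar(OBT)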

28879

xml2::write_xml(html, "someobject1.xml")
after1 <- xml2::read_html("someobject1.xml")  # read the XML file back in
html_node(after1, "body")  %>% html_text() %>%  unlist() -> BT1
nchar(BT1)

28893

html   %>% toString %>% saveRDS(., "someobject.RDS")
after2 <- readRDS("someobject.RDS") %>% read_html
html_node(after2, "body")  %>% html_text() %>%  unlist() -> BT2
nchar(BT2)

28893

This shows that the two reloaded objects have the same number of characters as each other, but more than the original. If we remove all "\n" characters from each text, the counts are the same.

BT1 %>% str_remove_all(.,"\n") %>% nchar(.)

27733

BT2 %>% str_remove_all(.,"\n") %>% nchar(.) 

27733

OBT %>% str_remove_all(.,"\n") %>% nchar(.) 

27733

4 Comments

Nice investigation, can you provide the results of the code in the bottom block?
I added the results, and I added the URL because the results depend on it.
Interesting. Looks like write_xml() might add those line breaks. You might be able to do this to see if there's a pattern to where the \n characters are being inserted?
I had a quick look, and it seems to have to do with the layout. For example, at the bottom of the page is a table with four columns: Stack Overflow, Products, Company and Stack Exchange Network. If you click on the bottom item of the fourth column (Others), and then click Technology, every one of the column headers (except the leftmost one) will have an added "\n".
0

Use toString() to convert the xml_document object to a character string before saving, like so:

library(rvest)
someobject <- read_html("https://stackoverflow.com/")

someobject  %>% toString %>% saveRDS(., "someobject.RDS")
newobject <- readRDS("someobject.RDS") %>% read_html

Note that these objects are not perfectly identical (I am not sure why).

identical(someobject, newobject)
# [1] FALSE
