
Consider this simple example:

library(rvest)
library(tidyverse)  # also loads dplyr, purrr and tibble

mytib <- tibble(mylink = c('https://en.wikipedia.org/wiki/List_of_software_bugs',
                           'https://en.wikipedia.org/wiki/Software_bug'))


mytib <- mytib %>% mutate(html.data = map(mylink, ~read_html(.x)))

> mytib
# A tibble: 2 x 2
  mylink                                              html.data 
  <chr>                                               <list>    
1 https://en.wikipedia.org/wiki/List_of_software_bugs <xml_dcmn>
2 https://en.wikipedia.org/wiki/Software_bug          <xml_dcmn>

> mytib$html.data[1]
[[1]]
{html_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta charset="UTF-8">\n<title> ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject mw-editable page-List_of_software_b ...

As you can see, my tibble correctly contains the HTML code of the two different Wikipedia pages whose URLs are stored in the column mylink. The problem is that I am not able to store this hard-won scraping result to disk. A simple write_csv fails:

> mytib %>% write_csv('mydata.csv')
Error in stream_delim_(df, path, ..., bom = bom, quote_escape = quote_escape) : 
  Don't know how to handle vector of type list.

while writing to RDS does not work correctly either:

mytib %>% write_rds('mydata.rds')
test <- read_rds('mydata.rds')

> test$html.data[1]
[[1]]
Error in doc_type(x) : external pointer is not valid

What should I do? In which format should I store my data? Thanks!

2 Answers


The reason for this has been discussed here: the parsed document is stored as an external pointer, which cannot be serialized. As a workaround, you can convert the xml_document to a string in order to save it:

mytib <- mytib %>% mutate(html.data = map(mylink, ~toString(read_html(.x))))
mytib %>% write_rds('mydata.rds')
test <- read_rds('mydata.rds')
test$html.data[[1]]
[1] "<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n<meta charset=\"UTF-8\">\n<title>List of software bugs - Wikipedia</title>\n

You can then recreate an XML document:

test %>% mutate(xmlDoc = map(html.data, ~read_html(.x)))
# A tibble: 2 x 3
  mylink                                              html.data xmlDoc    
  <chr>                                               <list>    <list>    
1 https://en.wikipedia.org/wiki/List_of_software_bugs <chr [1]> <xml_dcmn>
2 https://en.wikipedia.org/wiki/Software_bug          <chr [1]> <xml_dcmn>
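A minimal round-trip sketch of this approach, using an inline HTML string instead of a live Wikipedia page so it runs without network access (the tempfile() path is just for illustration):

```r
library(xml2)    # read_html(); rvest re-exports it
library(readr)   # write_rds() / read_rds()

# Parse a small inline document instead of a live page
doc <- read_html("<html><body><h1>Software bug</h1></body></html>")

# Serialize the document as a character string, not as an external pointer
html_string <- toString(doc)

# Round-trip the string through RDS
path <- tempfile(fileext = ".rds")
write_rds(html_string, path)
restored <- read_rds(path)

# Re-parse the string back into an xml_document and query it
restored_doc <- read_html(restored)
xml_text(xml_find_first(restored_doc, "//h1"))  # "Software bug"
```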

1

Do you really need to store the entire HTML in a CSV? HTML in itself isn't useful; you may want to extract only the relevant parts and store them in a column. For example, extracting the title here:

library(dplyr)     
library(rvest)
library(purrr)

mytib %>% 
  mutate(html.data = map(mylink, read_html), 
         title = map_chr(html.data, ~ .x %>% html_nodes('title') %>% html_text())) %>%
  select(-html.data) %>%
  write.csv('data.csv', row.names = FALSE)
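The same extract-then-save pattern, sketched with tiny inline HTML strings so it runs without network access (the column names here are just examples):

```r
library(dplyr)
library(rvest)
library(purrr)

# Two in-memory pages stand in for the scraped Wikipedia articles
mytib <- tibble::tibble(
  page = c("<html><head><title>Page one</title></head></html>",
           "<html><head><title>Page two</title></head></html>")
)

out <- mytib %>%
  mutate(html.data = map(page, read_html),
         title = map_chr(html.data, ~ .x %>% html_nodes("title") %>% html_text())) %>%
  select(-html.data)   # drop the unserializable list column before saving

out$title  # "Page one" "Page two"
write.csv(out, tempfile(fileext = ".csv"), row.names = FALSE)
```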

1 Comment

Yes, I want to keep everything, in CSV or any other format actually. Thanks!
