How to use read_html to read a character vector of URLs

Question

I am using the rvest package, and this is my code:

library(rvest)
url <- 'https://www.zhihu.com/people/excited-vczh'
webpage <- read_html(url)
profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink') 
profile_data_text <- html_text(profile_data)

This code reads and parses a single URL. What if I have a character vector that stores multiple URLs? How should I pass those URLs to the code above? For instance, urlist is a character vector storing 1000 URLs. How can I change my code to scrape the same content from every URL in urlist?

Ashley Baldry · Accepted Answer · 2019-02-20 12:19:09Z


You could just use lapply to run through each URL to grab the text you need:

library(rvest)
urlist <- rep('https://www.zhihu.com/people/excited-vczh', 100)
profile_data_list <- lapply(urlist, function(x) {
  webpage <- read_html(x)
  profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink') 
  html_text(profile_data)
})
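
If it helps to trace each result back to its source, the same loop can return a list named by URL. This is a minimal sketch, not the only way to do it: the helper name `scrape_all` and its `pause` argument are hypothetical, and the delay between requests is an assumed courtesy to the server rather than something rvest requires.

```r
library(rvest)

# Hypothetical helper: scrape every entry in `urls` and return a list
# named by URL, so each result can be matched to the page it came from.
scrape_all <- function(urls, css = '.Profile-sideColumnItemLink', pause = 0) {
  names(urls) <- urls  # lapply() carries these names over to its result
  lapply(urls, function(x) {
    Sys.sleep(pause)   # optional delay between requests (assumed, in seconds)
    webpage <- read_html(x)
    html_text(html_nodes(webpage, css))
  })
}
```

Calling `profile_data_list <- scrape_all(urlist, pause = 1)` would then give one character vector per URL, retrievable with `profile_data_list[[urlist[1]]]`.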


2 Comments

Anakin Over a year ago

Thank you very much for your help. I changed my code to use lapply, but now I get "Error in open.connection(x, "rb") : HTTP error 404." What could be causing this?

Ashley Baldry Over a year ago

It looks like one of the URLs isn't valid. My first guess would be that it contains a typo. Add print(x) at the start of the function passed to lapply; the last URL printed before the error is the one that fails.
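
With 1000 URLs, a single bad one would otherwise stop the whole loop. One possible way around that, sketched here with base R's `tryCatch`, is to catch the error, report the URL, and return `NA` for that entry. The helper name `safe_scrape` is hypothetical and not part of rvest.

```r
library(rvest)

# Hypothetical helper: same steps as the lapply() body in the answer,
# but a failed request returns NA instead of aborting the loop.
safe_scrape <- function(x, css = '.Profile-sideColumnItemLink') {
  tryCatch({
    webpage <- read_html(x)
    html_text(html_nodes(webpage, css))
  }, error = function(e) {
    # Report which input failed and why, then carry on with the next one.
    message('Failed: ', x, ' (', conditionMessage(e), ')')
    NA_character_
  })
}
```

It would be used as `profile_data_list <- lapply(urlist, safe_scrape)`; entries that come back as a single `NA` mark the URLs worth checking for typos.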
