I am using the rvest package, and my code is below:

library(rvest)
url <- 'https://www.zhihu.com/people/excited-vczh'
webpage <- read_html(url)
profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink') 
profile_data_text <- html_text(profile_data)

This code reads and parses a single URL. What if I have a character vector storing multiple URLs? How should I pass these URLs to the code above? For instance, urlist is a character vector storing 1000 URLs. How can I change my code to scrape the specific content from every URL in urlist?

1 Answer

You could just use lapply to run through each URL to grab the text you need:

library(rvest)
urlist <- rep('https://www.zhihu.com/people/excited-vczh', 100)
profile_data_list <- lapply(urlist, function(x) {
  webpage <- read_html(x)
  profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink') 
  html_text(profile_data)
})

2 Comments

Thank you very much for your help. I changed the code to use lapply, but now I get "Error in open.connection(x, "rb") : HTTP error 404." What could be causing this?
It looks like one of the URLs isn't valid; a 404 means the server could not find that page. My first guess is a typo in that URL. Add print(x) inside the lapply function to see which URL it fails on.
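
If printing each URL is not enough, one option is to wrap the scraping step in tryCatch so that a single bad URL does not stop the whole run. This is only a sketch, not necessarily what the answerer had in mind: scrape_profile and failed_urls are hypothetical names, and the second URL in urlist is a made-up example standing in for a broken entry.

```r
library(rvest)

# Example input; the second URL is a hypothetical placeholder for a bad entry.
urlist <- c('https://www.zhihu.com/people/excited-vczh',
            'https://www.zhihu.com/people/hypothetical-missing-user')

# Hypothetical helper: returns the scraped text for one URL,
# or NULL if the page cannot be read (for example, HTTP error 404).
scrape_profile <- function(x) {
  tryCatch({
    webpage <- read_html(x)
    profile_data <- html_nodes(webpage, '.Profile-sideColumnItemLink')
    html_text(profile_data)
  }, error = function(e) {
    # Report which URL failed and why, then carry on with the next one.
    message("Failed: ", x, " (", conditionMessage(e), ")")
    NULL
  })
}

profile_data_list <- lapply(urlist, scrape_profile)
names(profile_data_list) <- urlist

# URLs whose pages could not be read
failed_urls <- urlist[vapply(profile_data_list, is.null, logical(1))]
```

After the run, failed_urls holds the entries that raised an error, so they can be inspected for typos or retried later.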
