
Using the following packages:

    require(stringr)
    require(RCurl)
    require(XML)

I am able to connect to the desired web page and extract the information I need:

> url="https://www.realtor.com/realestateagents/33415/pg-1" doc =
> getURLContent(url, verbose = TRUE) #gets the doc , verbose = show me
> me what you are doing) doc = htmlParse(doc)
> # name =  getNodeSet(doc,  "//div[@itemprop = 'name']") name = sapply(name, xmlValue)
> # phone =  getNodeSet(doc,  "//div[@itemprop= 'telephone']") phone = sapply(phone, xmlValue)

I generated a list of URLs:

    urlList <- c("https://www.realtor.com/realestateagents/33415/pg-1",
                 "https://www.realtor.com/realestateagents/33415/pg-2")
    urlList <- as.list(urlList)
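
The same list can also be built programmatically; the 1:2 below is just shorthand for however many pages the search actually has:

    urlList <- as.list(paste0("https://www.realtor.com/realestateagents/33415/pg-", 1:2))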

I would like to loop over each URL, capture the same nodes, and place the results in one data frame with columns called Name and Phone.

I tried the following with no success

    Reduce(function(...) merge(..., all = TRUE), 
           lapply(urlList, function(x) {
             data.frame(urlList = x, 
                        d <- htmlParse(getURLContent(x)),
                        d1 <- getNodeSet(d, "//div[@itemprop = 'name']"),
                        name <- sapply(d1, xmlValue))
           })) -> results
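
This fails because the extraction statements are passed as arguments to data.frame() instead of being run first. A minimal sketch of the same XML/RCurl approach, assuming each page yields name and phone node sets of equal length, does the extraction first and builds the data frame last; do.call(rbind, ...) then stacks the per-page frames, so no merge is needed:

    results <- do.call(rbind,
                       lapply(urlList, function(x) {
                         d     <- htmlParse(getURLContent(x))
                         name  <- sapply(getNodeSet(d, "//div[@itemprop = 'name']"), xmlValue)
                         phone <- sapply(getNodeSet(d, "//div[@itemprop = 'telephone']"), xmlValue)
                         data.frame(Name = name, Phone = phone)  # one frame per page
                       }))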

Thank you for all your help

  • "I tried the following with no success ..." is not helpful for us. What errors or undesired results occurred? Commented Feb 13, 2019 at 19:18

1 Answer


I think something like this should work to get you the information you're after.

library(rvest)

zip.codes <- c("33415", "33413")

results <- list()

result.index <- 0

for(zip in zip.codes){

  url <- paste0("https://www.realtor.com/realestateagents/", zip ,"/pg-1" )

  page <- read_html(url)

  # the pager links give the page count; convert to numeric before taking
  # the max so that "10" isn't sorted below "9" as character strings
  max.pages <- max(as.numeric(page %>% 
                                html_nodes(xpath = '//*[@class="page"]') %>% 
                                html_nodes("a") %>% 
                                html_text()), na.rm = TRUE)

  for(i in c(1:max.pages)){
    print(paste("Processing Zip Code", zip, "- Page", i, "of", max.pages))

    result.index <- result.index + 1

    url <- paste0("https://www.realtor.com/realestateagents/", zip,"/pg-", i)

    page <- read_html(url)

    # every agent detail is exposed as a data-* attribute (or the tel: href)
    # on the call button, so grab that node set once and read each attribute
    agent.nodes <- page %>% html_nodes(xpath = '//*[@id="call_inquiry_cta"]')

    df <- data.frame(AgentID      = agent.nodes %>% xml_attr("data-agent-id"),
                     AgentName    = agent.nodes %>% xml_attr("data-agent-name"),
                     AgentAddr    = agent.nodes %>% xml_attr("data-agent-address"),
                     AgentPhone   = sub("tel:", "", agent.nodes %>% xml_attr("href")),
                     PhoneType    = agent.nodes %>% xml_attr("data-agent-num-type"),
                     AgentWebSite = agent.nodes %>% xml_attr("data-agent-web-url"))

    results[[result.index]] <- df
  }
}

df <- do.call(rbind, results)
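
Assuming the requests all succeed, df ends up with one row per agent across every page and zip code; a quick sanity check of the combined frame:

    str(df)                                  # overview: six columns, one row per agent
    head(df[, c("AgentName", "AgentPhone")]) # spot-check a few scraped values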

5 Comments

Never use rbind in a loop. It leads to excessive memory copying. See Patrick Burns' R Inferno: Circle 2 - Growing Objects.
Generally good advice, especially when dealing with objects of varying length. I've tested this script both performing the rbind in the loop, and creating a list of data frames in the loop with the rbind after. In this particular case I actually found having the rbind inside the loop executed slightly faster, and there was no difference in the memory usage during execution of each method.
The OP's needs can extend beyond this; the question may be a sample of a larger project. At small scales the differences are immaterial, but a future reader could take your code and loop through thousands of pages. SO answers should generally aim for best practices.
Thank you. The code works well. Is it possible to nest this inside a loop that looks at multiple zip codes? For example, if I wanted the data from the zips below also in that same data frame: zipcodelist = c("33415", "33413"). Also, can I make the system wait 10 seconds after each extraction? Thank you.
I've edited my answer to work with multiple zip codes, and switched the rbind to outside the loop, as @Parfait recommended. If you really want to add a 10-second delay between requests, just add a line with Sys.sleep(10) after the results[[result.index]] <- df line.
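
Putting the last two comments together, a minimal sketch of the inner page loop with the delay added and the single rbind kept outside (df.all is just an illustrative name; df is built from page exactly as in the answer):

    for(i in c(1:max.pages)){
      url  <- paste0("https://www.realtor.com/realestateagents/", zip, "/pg-", i)
      page <- read_html(url)
      result.index <- result.index + 1
      results[[result.index]] <- df   # df built from `page` as in the answer
      Sys.sleep(10)                   # wait 10 seconds between requests
    }
    df.all <- do.call(rbind, results) # bind once, after the loop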
