
Using the following packages:

    require(stringr)
    require(RCurl)
    require(XML)

I am able to connect to the desired web page and extract the information I need:

> url="https://www.realtor.com/realestateagents/33415/pg-1" doc =
> getURLContent(url, verbose = TRUE) #gets the doc , verbose = show me
> me what you are doing) doc = htmlParse(doc)
> # name =  getNodeSet(doc,  "//div[@itemprop = 'name']") name = sapply(name, xmlValue)
> # phone =  getNodeSet(doc,  "//div[@itemprop= 'telephone']") phone = sapply(phone, xmlValue)

I generated a list of URLs:

    urlList <- c("https://www.realtor.com/realestateagents/33415/pg-1",
                 "https://www.realtor.com/realestateagents/33415/pg-2")
    urlList <- as.list(urlList)
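
The same list can also be built programmatically; the 1:2 below is just shorthand for however many pages the search actually has:

    urlList <- as.list(paste0("https://www.realtor.com/realestateagents/33415/pg-", 1:2))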

I would like to loop over each URL, capture the same nodes, and place the results in one data frame with columns called Name and Phone.

I tried the following with no success

    Reduce(function(...) merge(..., all = TRUE), 
           lapply(urlList, function(x) {
             data.frame(urlList = x, 
                        d <- htmlParse(getURLContent(x)),
                        d1 <- getNodeSet(d, "//div[@itemprop = 'name']"),
                        name <- sapply(d1, xmlValue))
           })) -> results
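
This fails because the extraction statements are passed as arguments to data.frame() instead of being run first. A minimal sketch of the same XML/RCurl approach, assuming each page yields name and phone node sets of equal length, does the extraction first and builds the data frame last; do.call(rbind, ...) then stacks the per-page frames, so no merge is needed:

    results <- do.call(rbind,
                       lapply(urlList, function(x) {
                         d     <- htmlParse(getURLContent(x))
                         name  <- sapply(getNodeSet(d, "//div[@itemprop = 'name']"), xmlValue)
                         phone <- sapply(getNodeSet(d, "//div[@itemprop = 'telephone']"), xmlValue)
                         data.frame(Name = name, Phone = phone)  # one frame per page
                       }))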

Thank you for all your help

  • "I tried the following with no success ..." is not helpful for us. What errors or undesired results occurred? Commented Feb 13, 2019 at 19:18

1 Answer


I think something like this should work to get you the information you're after.

library(rvest)

zip.codes <- c("33415", "33413")

results <- list()

result.index <- 0

for(zip in zip.codes){

  url <- paste0("https://www.realtor.com/realestateagents/", zip ,"/pg-1" )

  page <- read_html(url)

  # the pager links give the page count; convert to numeric before taking
  # the max so that "10" isn't sorted below "9" as character strings
  max.pages <- max(as.numeric(page %>% 
                                html_nodes(xpath = '//*[@class="page"]') %>% 
                                html_nodes("a") %>% 
                                html_text()), na.rm = TRUE)

  for(i in c(1:max.pages)){
    print(paste("Processing Zip Code", zip, "- Page", i, "of", max.pages))

    result.index <- result.index + 1

    url <- paste0("https://www.realtor.com/realestateagents/", zip,"/pg-", i)

    page <- read_html(url)

    # every agent detail is exposed as a data-* attribute (or the tel: href)
    # on the call button, so grab that node set once and read each attribute
    agent.nodes <- page %>% html_nodes(xpath = '//*[@id="call_inquiry_cta"]')

    df <- data.frame(AgentID      = agent.nodes %>% xml_attr("data-agent-id"),
                     AgentName    = agent.nodes %>% xml_attr("data-agent-name"),
                     AgentAddr    = agent.nodes %>% xml_attr("data-agent-address"),
                     AgentPhone   = sub("tel:", "", agent.nodes %>% xml_attr("href")),
                     PhoneType    = agent.nodes %>% xml_attr("data-agent-num-type"),
                     AgentWebSite = agent.nodes %>% xml_attr("data-agent-web-url"))

    results[[result.index]] <- df
  }
}

df <- do.call(rbind, results)
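
Assuming the requests all succeed, df ends up with one row per agent across every page and zip code; a quick sanity check of the combined frame:

    str(df)                                  # overview: six columns, one row per agent
    head(df[, c("AgentName", "AgentPhone")]) # spot-check a few scraped values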

5 Comments

Never use rbind in a loop. It leads to excessive memory copying. See Patrick Burns' R Inferno: Circle 2 - Growing Objects.
Generally good advice, especially when dealing with objects of varying length. I've tested this script both performing the rbind in the loop, and creating a list of data frames in the loop with the rbind after. In this particular case I actually found having the rbind inside the loop executed slightly faster, and there was no difference in the memory usage during execution of each method.
The OP's needs can extend beyond this; the question may be a sample of a larger project. At small scales the differences are immaterial, but a future reader could take your code and loop through thousands of pages. SO answers should generally aim for best practices.
Thank you. The code works well. Is it possible to nest this inside a loop that looks at multiple zip codes? For example, if I wanted the data from the zips below also in that same data frame: zipcodelist = c("33415", "33413"). Also, can I make the system wait 10 seconds after each extraction? Thank you.
I've edited my answer to work with multiple zip codes, and switched the rbind to outside the loop, as @Parfait recommended. If you really want to add a 10-second delay between requests, just add a line with Sys.sleep(10) after the results[[result.index]] <- df line.
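
Putting the last two comments together, a minimal sketch of the inner page loop with the delay added and the single rbind kept outside (df.all is just an illustrative name; df is built from page exactly as in the answer):

    for(i in c(1:max.pages)){
      url  <- paste0("https://www.realtor.com/realestateagents/", zip, "/pg-", i)
      page <- read_html(url)
      result.index <- result.index + 1
      results[[result.index]] <- df   # df built from `page` as in the answer
      Sys.sleep(10)                   # wait 10 seconds between requests
    }
    df.all <- do.call(rbind, results) # bind once, after the loop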
