
Here's my problem: I've generated a list containing a large number of links, and I want to apply a function to that list to scrape some data from all of those links. However, when I run the program it only takes the data from the first link of each element, reprinting that info for the correct number of iterations. Here's all my code so far:

library(tidyverse)
library(rvest)

source_link<-"http://www.ufcstats.com/statistics/fighters?char=a&page=all"
source_link_html<-read_html(source_link)

#This scrapes all the links for the pages of all the fighters
links_vector<-source_link_html%>%
  html_nodes("div ul li a")%>%
  html_attr("href")%>%
  #This seq selects the 26 needed links, i.e. from a-z
  .[1:26]

#Modifies the pulled data so the links become usable and contain all the fighters instead of just some
links_vector_modded<-str_c("http://www.ufcstats.com", links_vector,"&page=all")

fighter_links<-sapply(links_vector_modded, function(links_vector_modded){
  read_html(links_vector_modded[])%>%
  html_nodes("tr td a")%>%
  html_attr("href")%>%
  .[seq(1,length(.),3)]%>%
  na.omit(fighter_links)
})

###Next Portion: Using the above links to further harvest

#Take all the links within an element of fighter_links and run it through the function career_data to scrape all the statistics from said pages.
fighter_profiles_a<-map(fighter_links$`http://www.ufcstats.com/statistics/fighters?char=a&page=all`, function(career_data){
  #Below is where I believe my problem lies
  read_html()%>%
  html_nodes("div ul li")%>%
  html_text() 
})

The issue I'm having is in the last section of code, read_html(). I do not know how to apply each link in the element of the list to that function. Additionally, is there a way to call all of the elements of fighter_links at once instead of doing it one element at a time?

Thank you for any advice and assistance!

  • If you do not need the newest data, you could spare yourself the scraping by taking data from Kaggle (fights and fighters): kaggle.com/rajeevw/ufcdata Commented Nov 7, 2020 at 0:00
  • Thank you, DPH, that's awesome! I'm definitely going to play around with that data. However, I'm doing this just as much for the data's sake as for learning R, so I want to know how to solve this issue. Commented Nov 7, 2020 at 0:33

2 Answers


You can unlist to get all the fighter_links together and pass them to map to extract the relevant text.

library(rvest)
library(purrr)

fighter_profiles_a<-map(unlist(fighter_links), function(career_data){
  read_html(career_data)%>%
    html_nodes("div ul li")%>%
    html_text() 
})

The text captured at fighter_profiles_a might require some additional cleaning.
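For example, one possible cleanup pass could look like the sketch below. This assumes the stringr package is available; the exact patterns to strip depend on what you want to keep, so treat it as a starting point rather than a definitive recipe:

```r
library(stringr)
library(purrr)

fighter_profiles_a_clean <- map(fighter_profiles_a, function(x) {
  x %>%
    str_squish() %>%      # collapse runs of whitespace and newlines
    discard(~ .x == "")   # drop entries that are now empty
})
```

str_squish() trims leading/trailing whitespace and collapses internal runs, which handles the "\n        " padding visible in the raw output.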




The challenge is that fighter_links is a list of vectors. Mapping over the list gives you each element, which is itself a vector of URLs, and you want to get information from each individual URL.

If it's important to retain the structure of fighter_links - meaning, you don't lose which URL belongs to each fighter - you can nest your call to map, like this:

fighter_profiles <- 
  fighter_links %>%
    map(function(url_list) {
      map(url_list,
           function(url) read_html(url) %>% 
             html_nodes("div ul li") %>% 
             html_text() %>%
             str_replace_all(., "\n\\s+\n\\s+", "")) # a little clean up
    })

This produces nested output, which you can use to keep track of which fighter_links entry it came from:

[[1]]
[[1]][[1]]
 [1] "Height:--\n    "         "Weight:155 lbs.\n    "   "Reach:--\n    "          "STANCE:"                
 [5] "DOB:Jul 13, 1978"        "SLpM:0.00\n\n        "   "Str. Acc.:0%\n        "  "SApM:0.00\n        "    
 [9] "Str. Def:0%\n        "   ""                        "TD Avg.:0.00\n        "  "TD Acc.:0%\n        "   
[13] "TD Def.:0%\n        "    "Sub. Avg.:0.0\n        " "Events & Fights"         "Fighters"               
[17] "Stat Leaders"           

[[1]][[2]]
 [1] "Height:--\n    "         "Weight:155 lbs.\n    "   "Reach:--\n    "          "STANCE:"                
 [5] "DOB:Jul 13, 1978"        "SLpM:0.00\n\n        "   "Str. Acc.:0%\n        "  "SApM:0.00\n        "    
 [9] "Str. Def:0%\n        "   ""                        "TD Avg.:0.00\n        "  "TD Acc.:0%\n        "   
[13] "TD Def.:0%\n        "    "Sub. Avg.:0.0\n        " "Events & Fights"         "Fighters"               
[17] "Stat Leaders"           
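If you'd rather track the source of each result programmatically instead of by position, one option is to name the inner results by the URL they came from. This is a sketch: scrape_profile() here is a hypothetical stand-in for the read_html() %>% html_nodes() %>% html_text() pipeline above.

```r
library(purrr)

# scrape_profile() is a placeholder for the scraping pipeline shown above.
# Name each inner result by its source URL, then optionally flatten the
# nesting into a single named list.
fighter_profiles_named <- map(fighter_links, function(url_list) {
  set_names(map(url_list, scrape_profile), url_list)
})
fighter_profiles_flat <- flatten(fighter_profiles_named)
```

After this, fighter_profiles_flat["<some fighter URL>"] retrieves that fighter's scraped text directly.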

Note: You can use map instead of the initial sapply as well, if you like:

path <- "http://www.ufcstats.com/statistics/fighters"
query_str <- paste0("?char=", letters, "&page=all")
urls <- paste0(path, query_str)

get_fighter_link <- function(url) {
  read_html(url[])%>%
    html_nodes("tr td a")%>%
    html_attr("href")%>%
    .[seq(1, length(.), 3)]%>%
    na.omit(fighter_links)
}

fighter_links <- map(urls, get_fighter_link)
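One caveat with this swap: unlike sapply (which names results by input when the inputs are character), map() returns an unnamed list, so the `$`-by-URL access used in the question would no longer work. If you want to keep those names, you can set them up front; a minimal sketch, assuming the urls vector from above:

```r
library(purrr)

# set_names(urls) names each URL with itself, so the result is a list
# keyed by source URL, matching what sapply produced.
fighter_links <- map(set_names(urls), get_fighter_link)
```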

