I've used the XML package successfully to scrape multiple websites, but I'm having trouble creating a data frame from this specific page:

library(XML)

url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"
df1 <- readHTMLTable(url)

> print(df1)
$`NULL`
NULL

$`NULL`
NULL

$`NULL`
             Player Pos         Injury           Game Status
1       Dickson, Ed  TE          thigh              Probable
2      Jensen, Ryan   C           foot              Doubtful
3     Jones, Arthur  DE        illness                   Out
4   McPhee, Pernell  LB           knee              Probable
5     Pitta, Dennis  TE dislocated hip Injured Reserve (DFR)
6  Thompson, Deonte  WR           foot              Doubtful
7 Williams, Brandon  DT            toe              Doubtful

$`NULL`
           Player Pos        Injury Game Status
1  Anderson, C.J.  RB          knee         Out
2   Ayers, Robert  DE      Achilles    Probable
3   Bailey, Champ  CB          foot         Out
4     Clady, Ryan   T      shoulder    Probable
5  Dreessen, Joel  TE          knee         Out
6    Kuper, Chris   G         ankle    Doubtful
7 Osweiler, Brock  QB left shoulder    Probable
8     Welker, Wes  WR         ankle    Probable

$`NULL`

etc

If I try to coerce it I get this error:

> df1 <- data.frame(readHTMLTable(url))
Error in data.frame(`NULL` = NULL, `NULL` = NULL, `NULL` = list(Player = 1:7,  : 
  arguments imply differing number of rows: 0, 7, 8, 6, 9, 1, 11, 4, 12, 5, 21, 3, 2, 15
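For what it's worth, the error happens because data.frame() tries to line up all of its arguments row by row, and the tables on this page have different numbers of rows. A minimal sketch of the same failure (with made-up row counts, not the actual page data):

```r
# Two tables with different row counts, like two teams' injury lists
a <- data.frame(Player = 1:7)
b <- data.frame(Player = 1:8)

# data.frame(a, b)
# Error in data.frame(a, b) : arguments imply differing number of rows: 7, 8
```

So the tables need to be stacked with rbind (same columns, rows appended), not combined side by side with data.frame().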

I'd like all of the injury data (PLAYER, POS, INJURY, GAME STATUS) for all of the teams.

Thanks in advance.

  • You are getting a list of tables because the page contains multiple tables. The first two are probably headings, and some are tables for teams with no injuries, so they do not have the expected four columns... Commented Jul 31, 2014 at 15:14

2 Answers

You just need to remove the NULL elements and the one-column tables listing "No injuries reported", then rbind the rest using do.call:

# Keep only non-NULL tables with the expected four columns
n <- sapply(df1, function(x) !is.null(x) && ncol(x) == 4)
x <- do.call("rbind", df1[n])
rownames(x) <- NULL

2 Comments

Note also this doesn't get you the team names, which are stored in <div> elements outside the tables. Scrape for <div class="wisfb_injuryHeader">
Thanks Chris, the sapply works perfectly. I've never used anything with <div> but I'm sure I can figure it out.
# Packages
require(XML)
require(RCurl)

# URL of interest
url <- "http://www.foxsports.com/nfl/injuries?season=2013&seasonType=1&week=1"

# Parse HTML
doc <- htmlParse(url)

# Tables which are not nulls
df1 <- readHTMLTable(doc)
df.list <- df1[!sapply(df1, is.null)]

# Get table names
table.names <- xpathSApply(doc, "//div[@class='wisfb_injuryHeader']", function(x) gsub("^\\s+|\\s+$", "", xmlValue(x)))

# Assign names
names(df.list) <- table.names


# $`San Diego Chargers`
# Player Pos                         Injury Game Status
# 1    Floyd, Malcom  WR                           knee    Probable
# 2   Ingram, Melvin  LB                  Torn left ACL  Day-to-Day
# 3    Liuget, Corey  DE                       shoulder    Probable
# 4  Patrick, Johnny  CB concussion, not injury related    Probable
# 5     Royal, Eddie  WR              chest, concussion    Probable
# 6  Taylor, Brandon   S                           knee    Probable
# 7      Te'o, Manti  LB                           foot         Out
# 8 Wright, Shareece  CB                          chest    Probable
# #[etc.]
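If you then want everything in a single data frame, as the question asks for, here is a sketch building on df.list above: it drops the one-column "No injuries reported" tables and prepends each team name as a column (keep, inj, and combined are names I've made up):

```r
# Keep only the four-column injury tables
keep <- sapply(df.list, function(x) ncol(x) == 4)
inj <- df.list[keep]

# Prepend each table's team name as a Team column, then stack them
combined <- do.call(rbind, Map(cbind, Team = names(inj), inj))
rownames(combined) <- NULL
```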

EDIT: Just saw that @Spacedman said basically the same thing in one of the comments on the answer by @Chris S.

3 Comments

Hey all, I was running my code this morning and the readHTMLTable(url) from my code and the doc <- htmlParse(url) from @Tony no longer work (I get "Error in names(ans) = header : 'names' attribute [4] must be the same length as the vector [1]"). I'm assuming something changed on the website, but in terms of SO etiquette should I post a new question, or is asking it here cool? Thanks.
@FrankB. I just ran my code above and it worked fine. With respect to SO etiquette, I'm not sure. I'd ask it as a separate question, linking back to the original. I suppose the best place to ask about etiquette is over on meta: meta.stackoverflow.com
This is so odd. I just tried again and the same error. Obviously it used to work for me, and I'm copying/pasting my own code from here.
