Turning a table in HTML into a data frame

Question

I'm trying my hand at scraping tables from Wikipedia and I've reached an impasse. I'm using the squads of the 2014 FIFA World Cup as an example. In this case, I want to extract the list of participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store it as a vector. Here's how far I got:

library(tidyverse)
library(rvest)
library(XML)
library(RCurl)

(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>% 
  html_node(xpath = '//*[@id="toc"]/ul') %>% 
  htmlTreeParse() %>%
  xmlRoot())

This spits out a bunch of HTML code that I won't copy/paste here. Specifically, I want to extract the text of every <span class="toctext"> element, such as "Group A", "Brazil", "Cameroon", etc., and save it as a vector. What function would make this happen?

SymbolixAU · Accepted Answer · 2017-07-27 09:45:30Z

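You can read the text from a node using html_text():

library(rvest)  # provides read_html(), html_node(), html_text() and the %>% pipe

url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
# Select the table-of-contents node by its id and extract its text
toc <- url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="toc"]') %>%
    html_text()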

This gives you a character vector of length one (a single string). You can then split it on the \n character to get one element per entry, and drop the blank elements:

contents <- strsplit(toc, "\n")[[1]]

contents[contents != ""]

# [1] "Contents"                                   "1 Group A"                                  "1.1 Brazil"                                
# [4] "1.2 Cameroon"                               "1.3 Croatia"                                "1.4 Mexico"                                
# [7] "2 Group B"                                  "2.1 Australia"                              "2.2 Chile"                                 
# [10] "2.3 Netherlands"                            "2.4 Spain"                                  "3 Group C"                                 
# [13] "3.1 Colombia"                               "3.2 Greece"                                 "3.3 Ivory Coast"                           
# [16] "3.4 Japan"                                  "4 Group D"                                  "4.1 Costa Rica"                            
# [19] "4.2 England"                                "4.3 Italy"                                  "4.4 Uruguay"                               
# ---
# etc
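
If you only want the entry names without the section numbers, one option is to select the toctext spans directly instead of splitting the whole text. The following is a minimal sketch, not a tested recipe for the live page: toc_html is a hypothetical fragment that imitates the markup described in the question (spans with class "toctext" inside the element with id "toc"), and it assumes the real page still uses that markup. For the real page you would start from read_html(url) instead.

```r
library(rvest)

# Hypothetical fragment imitating the table-of-contents markup;
# replace read_html(toc_html) with read_html(url) for the real page.
toc_html <- '
<div id="toc">
  <ul>
    <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
      <ul>
        <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
        <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
      </ul>
    </li>
  </ul>
</div>'

# html_nodes() (plural) returns every match; html_text() then gives one string per node
entries <- read_html(toc_html) %>%
    html_nodes(xpath = '//*[@id="toc"]//span[@class="toctext"]') %>%
    html_text()

# Drop the "Group X" headings to keep only the countries
countries <- entries[!grepl("^Group ", entries)]
```

Here entries holds the headings and the countries in document order, and countries is the same vector with the group headings filtered out. On the real table of contents you would probably also need to filter out non-country entries such as references or notes sections.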

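Generally, to read tables in an HTML document you can use the html_table() function, which returns the tables it finds as data frames. In this case, however, the table of contents isn't among the results, presumably because it is marked up as a nested list rather than as a <table> element.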

url %>% 
    read_html() %>%
    html_table()
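
For comparison, here is a small sketch of html_table() applied to markup that really is a <table>. The fragment table_html and its contents are made up for illustration and are not taken from the Wikipedia page; the point is only to show the shape of the result.

```r
library(rvest)

# Made-up HTML table for illustration only
table_html <- '
<table>
  <tr><th>Country</th><th>Group</th></tr>
  <tr><td>Brazil</td><td>A</td></tr>
  <tr><td>Spain</td><td>B</td></tr>
</table>'

# Select the single <table> node and convert it to a data frame;
# the <th> cells in the first row become the column names
squads <- read_html(table_html) %>%
    html_node("table") %>%
    html_table()

squads$Country
```

When html_table() is called on a whole document, as in the answer above, it returns a list with one data frame per table, so you would index into that list (for example with [[1]]) to get a single data frame.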
