I'm trying my hand at scraping tables from Wikipedia and I've reached an impasse. I'm using the squads of the 2014 FIFA World Cup as an example. In this case, I want to extract the list of participating countries from the table of contents of the page "2014 FIFA World Cup squads" and store them as a vector. Here's how far I got:

library(tidyverse)
library(rvest)
library(XML)
library(RCurl)

(Countries <- read_html("https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads") %>% 
  html_node(xpath = '//*[@id="toc"]/ul') %>% 
  htmlTreeParse() %>%
  xmlRoot())

This spits out a bunch of HTML code that I won't copy/paste here. I specifically am looking to extract all lines with the tag <span class="toctext"> such as "Group A", "Brazil", "Cameroon", etc. and have them saved as a vector. What function would make this happen?

1 Answer

You can read the text from a node using html_text():

url <- "https://en.wikipedia.org/wiki/2014_FIFA_World_Cup_squads"
toc <- url %>%
    read_html() %>%
    html_node(xpath = '//*[@id="toc"]') %>%
    html_text()

This gives you a single string (a character vector of length one). You can then split it on the \n character to get the entries as a vector, and drop the blank elements:

contents <- strsplit(toc, "\n")[[1]]

contents[contents != ""]

# [1] "Contents"                                   "1 Group A"                                  "1.1 Brazil"                                
# [4] "1.2 Cameroon"                               "1.3 Croatia"                                "1.4 Mexico"                                
# [7] "2 Group B"                                  "2.1 Australia"                              "2.2 Chile"                                 
# [10] "2.3 Netherlands"                            "2.4 Spain"                                  "3 Group C"                                 
# [13] "3.1 Colombia"                               "3.2 Greece"                                 "3.3 Ivory Coast"                           
# [16] "3.4 Japan"                                  "4 Group D"                                  "4.1 Costa Rica"                            
# [19] "4.2 England"                                "4.3 Italy"                                  "4.4 Uruguay"                               
# ---
# etc
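
Since the question asks specifically for the <span class="toctext"> elements, another option is to select those spans directly, which avoids the section numbers and the "Contents" heading. The following is only a minimal sketch: toc_html is a hypothetical, trimmed-down stand-in for the page's table of contents (so it runs without network access), and it assumes the real markup uses the same toctext class mentioned in the question.

```r
library(rvest)

# Hypothetical stand-in for the table of contents on the real page
toc_html <- '
<div id="toc">
  <ul>
    <li><a href="#Group_A"><span class="tocnumber">1</span> <span class="toctext">Group A</span></a>
      <ul>
        <li><a href="#Brazil"><span class="tocnumber">1.1</span> <span class="toctext">Brazil</span></a></li>
        <li><a href="#Cameroon"><span class="tocnumber">1.2</span> <span class="toctext">Cameroon</span></a></li>
      </ul>
    </li>
  </ul>
</div>'

# Select every toctext span inside the toc element and pull out its text
entries <- read_html(toc_html) %>%
  html_nodes(xpath = '//*[@id="toc"]//span[@class="toctext"]') %>%
  html_text()

# Drop the "Group X" headings so that only country names remain
countries <- entries[!grepl("^Group ", entries)]
countries
```

To try this against the live page, replace toc_html with the url from above. The real table of contents may also list sections that are neither groups nor countries, so some further filtering would probably be needed.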

Generally, to read tables in an HTML document you can use the html_table() function, but in this case the table of contents isn't picked up, presumably because it is marked up as a nested list (<ul>) rather than as a <table> element.

url %>% 
    read_html() %>%
    html_table()
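
As a rough illustration of why html_table() skips the table of contents, the sketch below runs it on a small, made-up document (page_html is hypothetical, not the Wikipedia markup) that contains one list and one real table. Only the <table> element comes back.

```r
library(rvest)

# Hypothetical document: a list-based "toc" plus one genuine table
page_html <- '
<html><body>
  <div id="toc"><ul><li>Group A</li><li>Brazil</li></ul></div>
  <table>
    <tr><th>Team</th><th>Group</th></tr>
    <tr><td>Brazil</td><td>A</td></tr>
  </table>
</body></html>'

# html_table() on a whole document returns one data frame per <table>
tables <- read_html(page_html) %>%
  html_table()

length(tables)   # the <ul> inside #toc is not counted
tables[[1]]
```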