Extracting an HTML table and turning it into a tibble or data.frame in R

Question

Using the following code:

library(rvest)
 
read_html("https://gainblers.com/mx/quinielas/progol-revancha/", encoding = "UTF-8")|>
    html_elements(xpath= '//*[@id="content_seccionb"]/div[1]/ul')|>
    html_children()|>
    html_text()|>
    as_tibble()-> gb

I get this:

structure(list(value = c("\r\n\r\n                Partidos\r\n                L\r\n                E\r\n                V\r\n                Pronósticos\r\n            ", 
"\r\n\t\t\t            1   MéxicovsJapón\r\n\t\t\t            2,4239%\r\n\t\t\t\t        3,5027%\r\n\t\t\t\t        2,7234%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            2   USAvsCorea del Sur\r\n\t\t\t            2,2243%\r\n\t\t\t\t        3,2030%\r\n\t\t\t\t        3,4028%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            3   Juárez FCvsPachuca\r\n\t\t\t            3,9023%\r\n\t\t\t\t        3,5626%\r\n\t\t\t\t        1,7851%\r\n\t\t\t\t        V\r\n\t        \t\t", 
"\r\n\t\t\t            4   Tigres UANLvsMonterrey\r\n\t\t\t            1,5359%\r\n\t\t\t\t        3,8823%\r\n\t\t\t\t        5,3117%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            5   Dorados De SinaloavsIrapuato\r\n\t\t\t            2,9631%\r\n\t\t\t\t        3,4027%\r\n\t\t\t\t        2,2541%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            6   Tampico MaderovsTapatio\r\n\t\t\t            1,6058%\r\n\t\t\t\t        3,9024%\r\n\t\t\t\t        5,1318%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            7   Tepatitlan de MorelosvsLeones Negros\r\n\t\t\t            2,0445%\r\n\t\t\t\t        3,3628%\r\n\t\t\t\t        3,4027%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            8   IrlandavsHungría\r\n\t\t\t            2,7536%\r\n\t\t\t\t        3,2630%\r\n\t\t\t\t        2,9034%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            9   GreciavsDinamarca\r\n\t\t\t            2,8435%\r\n\t\t\t\t        3,2630%\r\n\t\t\t\t        2,8135%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            10   Estoril PraiavsSanta Clara\r\n\t\t\t            2,7734%\r\n\t\t\t\t        3,1031%\r\n\t\t\t\t        2,7535%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            11   St. Louis CityvsDallas FC\r\n\t\t\t            2,0049%\r\n\t\t\t\t        4,1524%\r\n\t\t\t\t        3,5528%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            12   Sporting Kansas CityvsAustin FC\r\n\t\t\t            2,4939%\r\n\t\t\t\t        3,7526%\r\n\t\t\t\t        2,6736%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            13   Deportivo CoruñavsSporting Gijón\r\n\t\t\t            2,2044%\r\n\t\t\t\t        3,2530%\r\n\t\t\t\t        3,7026%\r\n\t\t\t\t        LE\r\n\t        \t\t", 
"\r\n\t\t\t            14   BurgosvsLas Palmas\r\n\t\t\t            2,4340%\r\n\t\t\t\t        3,1231%\r\n\t\t\t\t        3,2530%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            15   BurgosvsLas Palmas\r\n\t\t\t            2,4340%\r\n\t\t\t\t        3,1231%\r\n\t\t\t\t        3,2530%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            16   PumasvsToluca\r\n\t\t\t            %\r\n\t\t\t\t        %\r\n\t\t\t\t        %\r\n\t\t\t\t        \r\n\t        \t\t", 
"\r\n\t\t\t            17   Tlaxcala FCvsOaxaca\r\n\t\t\t            1,7554%\r\n\t\t\t\t        3,9024%\r\n\t\t\t\t        4,2522%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            18   AtlantevsCorrecaminos\r\n\t\t\t            1,3171%\r\n\t\t\t\t        5,0518%\r\n\t\t\t\t        9,1010%\r\n\t\t\t\t        L\r\n\t        \t\t", 
"\r\n\t\t\t            19   TurquíavsEspaña\r\n\t\t\t            6,0017%\r\n\t\t\t\t        4,6122%\r\n\t\t\t\t        1,6062%\r\n\t\t\t\t        V\r\n\t        \t\t", 
"\r\n\t\t\t            20   IsraelvsItalia\r\n\t\t\t            8,7011%\r\n\t\t\t\t        4,9120%\r\n\t\t\t\t        1,4369%\r\n\t\t\t\t        V\r\n\t        \t\t", 
"\r\n\t\t\t            21   ZaragozavsValladolid\r\n\t\t\t            2,6537%\r\n\t\t\t\t        3,2530%\r\n\t\t\t\t        2,9033%\r\n\t\t\t\t        LV\r\n\t        \t\t", 
"\r\n\t\t\t            22   AlmeríavsRacing Santander\r\n\t\t\t            1,9051%\r\n\t\t\t\t        3,8425%\r\n\t\t\t\t        4,0024%\r\n\t\t\t\t        L\r\n\t        \t\t"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-23L))

Now I want to turn that output into a readable data.frame, but I don't know where to start because the output looks messy. My only guess is to use separate(). I'm also wondering whether it is possible to extract a tidier version of the HTML table from the site and then make only minor adjustments. Any advice will be much appreciated.

lailaps · Accepted Answer · 2025-09-06 08:14:18Z

What I usually do in these cases is:

1. get each row <li>, and
2. use XPaths to navigate from each <li> element down to the element of interest and map it to a column.

This saves you the trouble of cleaning up merged texts like "2,4239%".

On the page, press F12 and open the Elements tab to inspect the table's structure with the element-picker button.

You can right-click an element within the structure and choose "copy xpath" to get the path for each html_element() call. The copied path is absolute; the part after li[n]/ is the relative path to use. So from each <li> element we go down the XPaths to get to the column data.
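As a minimal sketch of that idea, the snippet below runs relative XPaths against every <li> at once. The HTML is a made-up stand-in for the real page (the id "demo" and the team names are placeholders, not taken from the site), built with rvest's minimal_html(). Note that a row without a matching child, such as a header row, comes back as NA rather than being dropped.

```r
library(rvest)

# Hypothetical list standing in for the real page: one header row, two data rows
page <- minimal_html('
  <ul id="demo">
    <li><div>Header</div></li>
    <li><div><span>1</span><p><span>Alpha</span><span>vs</span><span>Beta</span></p></div></li>
    <li><div><span>2</span><p><span>Gamma</span><span>vs</span><span>Delta</span></p></div></li>
  </ul>')

# One node per row
rows <- html_elements(page, xpath = '//ul[@id="demo"]/li')

# Relative XPaths are evaluated once per <li>, starting at that <li>;
# rows with no match yield NA
home <- rows |> html_element(xpath = "div[1]/p/span[1]") |> html_text(trim = TRUE)
away <- rows |> html_element(xpath = "div[1]/p/span[3]") |> html_text(trim = TRUE)

print(data.frame(home, away))
```

Because the result has one entry per row, the NA in the header row is what makes a later subset(!is.na(...)) filter possible.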

library(rvest)

# get rows 'li' of table to iterate over them
rows <- read_html("https://gainblers.com/mx/quinielas/progol-revancha/", encoding = "UTF-8") |>
  html_element(xpath= '//*[@id="content_seccionb"]/div[1]/ul') |>
  html_nodes("li") 

# helper function to get the text from a node's child found by xpath
from_xpath <- \(x, path) x |> html_element(xpath = path) |> html_text(trim = TRUE)
# @margusl correctly pointed out that from_xpath() is already vectorized and can be applied directly to rows

foo <- data.frame(
      nr =        from_xpath(rows, "div[1]/span"),
      partidos1 = from_xpath(rows, "div[1]/p/span[1]"), 
      partidos2 = from_xpath(rows, "div[1]/p/span[3]"),
      L1 =        from_xpath(rows, "div[2]/span"),
      L2 =        from_xpath(rows, "div[2]/strong"),
      E1 =        from_xpath(rows, "div[3]/span"),
      E2 =        from_xpath(rows, "div[3]/strong"),
      V1 =        from_xpath(rows, "div[4]/span"),
      V2 =        from_xpath(rows, "div[4]/strong"),
      pron1 =     from_xpath(rows, "div[5]/div[1]"),
      pron2 =     from_xpath(rows, "div[5]/div[2]")
    ) |> 
  subset(!is.na(partidos1)) # filter out header row

giving

   nr             partidos1        partidos2   L1  L2   E1  E2   V1  V2 pron1 pron2
2   1                México            Japón 2,42 39% 3,50 27% 2,72 34%     L     V
3   2                   USA    Corea del Sur 2,22 43% 3,20 30% 3,40 28%     L  <NA>
4   3             Juárez FC          Pachuca 3,87 24% 3,55 26% 1,79 51%     V  <NA>
5   4           Tigres UANL        Monterrey 1,53 59% 3,88 23% 5,31 17%     L  <NA>
6   5    Dorados De Sinaloa         Irapuato 2,96 31% 3,40 27% 2,23 42%     L     V
7   6        Tampico Madero          Tapatio 1,60 58% 3,90 24% 5,13 18%     L  <NA>
8   7 Tepatitlan de Morelos    Leones Negros 2,03 46% 3,38 27% 3,40 27%     L  <NA>
9   8               Irlanda          Hungría 2,75 36% 3,26 30% 2,90 34%     L     V
10  9                Grecia        Dinamarca 2,82 35% 3,24 30% 2,84 35%     L     V
11 10         Estoril Praia      Santa Clara 2,77 34% 3,10 31% 2,75 35%     L     V
12 11        St. Louis City        Dallas FC 2,00 49% 4,15 24% 3,55 28%     L  <NA>
13 12  Sporting Kansas City        Austin FC 2,49 39% 3,75 26% 2,67 36%     L     V
14 13      Deportivo Coruña   Sporting Gijón 2,20 44% 3,25 30% 3,70 26%     L     E
15 14                Burgos       Las Palmas 2,43 40% 3,12 31% 3,25 30%     L     V
16 15                Burgos       Las Palmas 2,43 40% 3,12 31% 3,25 30%     L     V
17 16                 Pumas           Toluca        %        %        %  <NA>  <NA>
18 17           Tlaxcala FC           Oaxaca 1,73 54% 4,00 24% 4,25 22%     L  <NA>
19 18               Atlante     Correcaminos 1,31 71% 5,05 18% 9,10 10%     L  <NA>
20 19               Turquía           España 6,00 17% 4,61 22% 1,60 62%     V  <NA>
21 20                Israel           Italia 8,70 11% 4,91 20% 1,43 69%     V  <NA>
22 21              Zaragoza       Valladolid 2,65 37% 3,25 30% 2,90 33%     L     V
23 22               Almería Racing Santander 1,90 51% 3,84 25% 4,00 24%     L  <NA>
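All columns in this frame are still character. If numbers are needed, something along these lines could convert them; this is only a sketch, where to_number() and to_percent() are helper names I made up and the three sample rows are hypothetical values in the same shape as the output above (including an empty row like the one for match 16).

```r
# Hypothetical rows in the same shape as the scraped frame
foo <- data.frame(
  L1 = c("2,42", "", "1,31"),
  L2 = c("39%", "%", "71%")
)

# Decimal comma -> decimal point, then parse; an empty string becomes NA
to_number <- function(x) as.numeric(sub(",", ".", x, fixed = TRUE))

# Drop the percent sign and parse as an integer percentage
to_percent <- function(x) as.integer(sub("%", "", x, fixed = TRUE))

foo$L1 <- to_number(foo$L1)
foo$L2 <- to_percent(foo$L2)

str(foo)
```

The same two helpers could be applied to the E1/E2 and V1/V2 columns.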

Answering your comment-question

I have a doubt: How do you extract xpaths for any variable? I'm asking because for "Partidos" I get "//*[@id="content_seccionb"]/div[1]/ul/li[1]/div[1]" and you got "div[1]/p/span[1]". Is it something you see in the structure of the "rows" object? Sorry for asking but for any answer I receive I try to fully understand in order to learn and help others.

"//*[@id="content_seccionb"]/div[1]/ul/li[1]/div[1]" matches the whole div, so the text of all its children is concatenated, like "MéxicovsJapón". "div[1]/p/span[1]", on the other hand, matches only "México"; see the HTML structure of the second <li> element below. I added the calls to from_xpath(row, "...") to make it clear which XPath corresponds to which DOM element.

<li class="tr quiniela-tr">
    <div class="td flex7 td-event-with-calendar">
        <span class="m-none">1&nbsp;&nbsp;&nbsp; -- from_xpath(row, "div[1]/span")
        </span>
        <p class="event">
            <a class="linkForzoso" href="/es/apuestas/futbol/internacional/amistosos/mexico-japon/">
            <span> 
            México -- from_xpath(row, "div[1]/p/span[1]")
            </span>
            <span class="vs">
            vs -- I skipped this one because it's just "vs"
            </span>
            <span>
            Japón -- from_xpath(row, "div[1]/p/span[3]")
            </span>
            </a>
        </p>
    </div>
    <div class="td flex2 f-row">
      <span class="cuotita in-event no-link">2,40</span> -- from_xpath(row, "div[2]/span")
      <strong class="counter">39%</strong></div> -- from_xpath(row, "div[2]/strong")
    <div class="td flex2 f-row">
      <span class="cuotita in-event no-link">3,50</span> -- from_xpath(row, "div[3]/span")
      <strong class="counter">27%</strong> -- from_xpath(row, "div[3]/strong")
    </div>
    <div class="td flex2 f-row">
      <span class="cuotita in-event no-link">2,80</span>  -- from_xpath(row, "div[4]/span")
      <strong class="counter">34%</strong> -- from_xpath(row, "div[4]/strong")
    </div>
    <div class="td flex2 f-row">
        <div class="grupo-casilla">L</div> -- from_xpath(row, "div[5]/div[1]")
        <div class="grupo-casilla">V</div> -- from_xpath(row, "div[5]/div[2]")
    </div>
</li>
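To see the difference on a small scale, here is a sketch with a hypothetical one-row snippet (placeholder team names, built with minimal_html()): the text of the whole div is the text of all its descendants glued together, while the deeper path returns a single span.

```r
library(rvest)

# Hypothetical single row; the team names are placeholders
row <- minimal_html('
  <ul><li>
    <div><p><span>Alpha</span><span>vs</span><span>Beta</span></p></div>
  </li></ul>') |>
  html_element(xpath = "//li")

# Text of the whole div: all descendant text nodes concatenated
whole_div <- row |> html_element(xpath = "div[1]") |> html_text(trim = TRUE)

# Text of one specific span inside the div
first_span <- row |> html_element(xpath = "div[1]/p/span[1]") |> html_text(trim = TRUE)

print(c(whole_div = whole_div, first_span = first_span))
```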

Thanks for your answer. I have a doubt: How do you extract xpaths for any variable? I'm asking because for "Partidos" I get "//*[@id="content_seccionb"]/div[1]/ul/li[1]/div[1]" and you got "div[1]/p/span[1]". Is it something you see in the structure of the "rows" object? Sorry for asking but for any answer I receive I try to fully understand in order to learn and help others.
"//*[@id="content_seccionb"]/div[1]/ul/li[1]/div[1]" matches the whole div, so the text will be concatenated together like "MéxicovsJapón". "div[1]/p/span[1]" on the other hand matches "México". See my updated answer :)
Wow. I have no words. You definitely have a vast understanding of web scraping. How can I pay you for your time and patience? Thanks again!
html_element() (and thus your from_xpath()) is already vectorized, so you could skip iteration and go directly for data.frame(nr = from_xpath(rows, "div[1]/span"), ..., pron2 = from_xpath(rows, "div[5]/div[2]")) and still end up with an identical frame. Here it saves just a few tens of ms, which might not really matter in an interactive session targeting a single page and/or small table(s), but it's surprisingly easy to end up in a secs vs ms situation. There's a map() vs vectorized rvest benchmark for creating a 270x4 frame at the end of stackoverflow.com/a/78974301/646761
You are right, it's good practice to make use of the vectorization. I updated my answer!

G. Grothendieck · Accepted Answer · 2025-09-05 22:31:00Z

The question did not specify the desired output, so I made some assumptions; you can modify the code as needed, since it is mostly a series of string substitutions. Extract the only column of gb, value; trim whitespace off both ends; replace 2 or more consecutive whitespace characters with a semicolon; remove the percent signs, commas and leading digits; insert a comma and space between the first lower case letter that is immediately followed by an upper case letter; and then read it in.

gb |>
  _$value |>
  trimws() |>
  gsub("\\s{2,}", ";", x = _) |>
  gsub("[%,]|^\\d+", "", x = _) |>
  sub("([[:lower:]])([[:upper:]])", "\\1, \\2", x = _) |>
  read.csv2(text = _, fill = TRUE)

giving

                                    Partidos     L     E     V Pronósticos
1                            Méxicovs, Japón 24239 35027 27234          LV
2                       USAvs, Corea del Sur 22243 32030 34028           L
3                       Juárez FCvs, Pachuca 38724 35526 17951           V
4                   Tigres UANLvs, Monterrey 15359 38823 53117           L
5             Dorados De Sinaloavs, Irapuato 29631 34027 22342          LV
6                  Tampico Maderovs, Tapatio 16058 39024 51318           L
7     Tepatitlan de Morelosvs, Leones Negros 20346 33827 34027           L
8                         Irlandavs, Hungría 27536 32630 29034          LV
9                        Greciavs, Dinamarca 28235 32430 28435          LV
10              Estoril Praiavs, Santa Clara 27734 31031 27535          LV
11               St. Louis Cityvs, Dallas FC 20049 41524 35528           L
12         Sporting Kansas Cityvs, Austin FC 24939 37526 26736          LV
13        Deportivo Coruñavs, Sporting Gijón 22044 32530 37026          LE
14                      Burgosvs, Las Palmas 24340 31231 32530          LV
15                      Burgosvs, Las Palmas 24340 31231 32530          LV
16                           Pumasvs, Toluca    NA    NA    NA            
17                     Tlaxcala FCvs, Oaxaca 17354 40024 42522           L
18                   Atlantevs, Correcaminos 13171 50518 91010           L
19                         Turquíavs, España 60017 46122 16062           V
20                          Israelvs, Italia 87011 49120 14369           V
21                    Zaragozavs, Valladolid 26537 32530 29033          LV
22               Almeríavs, Racing Santander 19051 38425 40024           L

We can directly modify the presented workflow by doing

... |>
  rvest::html_text(trim=TRUE) |>
  gsub('[\t\r\n]+', ';', x=_) |>
  sub('^\\d+\\s*;?\\s*', '', x=_) |>
  textConnection() |>
  read.csv2(fill=TRUE)

However, splitting the columns needs more work. With the result of the scraping approach above stored in gt:

cbind(
  Partidos=gt$Partidos,
  gt[c('L', 'E', 'V')] |>
    lapply(\(j) sub('\\,', '\\.', j) |>
      strcapture(x=_, pattern='([0-9].[0-9]{2})([0-9]{2}%)', proto=data.frame(numeric(), character()))) |>
    do.call(what='cbind') |>
    setNames(paste0(rep(c('L', 'E', 'V'), each=2), 1:2)),
  Pronósticos=gt$Pronósticos
)
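For readers unfamiliar with strcapture(), a small sketch of what it does to the fused cells may help. The three sample cells are hypothetical values in the style of the question's output, and the pattern assumes the odds are always printed with exactly two decimals, which is an assumption on my part rather than something the site guarantees.

```r
# Hypothetical fused cells: odds immediately followed by a percentage
cells <- c("2,4239%", "9,1010%", "%")

# Group 1: odds with a decimal comma and two decimals; group 2: the percentage digits
pattern <- "^([0-9]+,[0-9]{2})([0-9]+)%$"

# proto fixes the names and types of the captured columns
proto <- data.frame(odds = character(), pct = integer())

# Cells that do not match the pattern (here "%") give a row of NA
parts <- strcapture(pattern, cells, proto)

# Convert the decimal comma afterwards so the odds become numeric
parts$odds <- as.numeric(sub(",", ".", parts$odds, fixed = TRUE))

print(parts)
```

Capturing the odds as character first and converting afterwards keeps the pattern readable and avoids relying on how a comma would be parsed.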
