R Web Scraping - Data is Incomplete (Yahoo Finance)

Question

I am using the following code. It successfully targets the correct URL and node text. However, the returned data is incomplete: some fields (such as Previous Close and Open) come back blank or fail to download.

library(rvest)
library(httr)
library(xml2)

ticker <- "IVV"
url <- paste0("https://finance.yahoo.com/quote/",ticker, "/")
browser_ua <- "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"
head <- c("Accept" = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language" = "en-US,en;q=0.9")
    
html_page <- session(
    url,
    user_agent(browser_ua),
    add_headers(head))
    
node_txt <-"yf-1b7pzha"   # old node was "yf-tx3nkj"                                                  
  
temp <- html_page %>% 
    read_html() %>%
    xml_find_all("//li[contains(@class, node_txt)]/span/text()")

The code uses the URL https://finance.yahoo.com/quote/IVV/ as its target. What code change is required to grab all values in the first table under the graph?

Note that scraping data like this is likely a violation of Yahoo Finance's terms of service; see section 2.d.ix.
– jpsmith, Commented Jul 30 at 18:37
There is an R package, yahoofinancer. On ToS: "You should refer to Yahoo!'s terms of use for details on your rights to use the actual data downloaded. Remember - the Yahoo! finance API is intended for personal use only."
– lailaps, Commented Jul 30 at 19:05
It's "write". And yes, many of us do. I am getting a 401 error: the call stop_for_status(res) gives Unauthorized (HTTP 401). So @jpsmith and @TimG are probably right (pun not intended).
– Rui Barradas, Commented Jul 30 at 19:15
You can use the getQuote() function in package quantmod if you specify the second argument, what, based on what the helper function yahooQF() helps you select. (There are two vectors here: one for the JSON request fields, one for the returned columns.) The corresponding Python package also allows you to select fields.
– Ada Lovelace, Commented Jul 30 at 19:45

Ada Lovelace · Accepted Answer · 2025-07-30 22:16:10Z

Expanding on my earlier comment, here is a (partial) answer relying on the quantmod package and its handling of the request. Yahoo! actually supports a range of fields, and the Python package yfinance has slightly better documentation.

Here we first select (interactively !) the fields we want:

> library(quantmod) # CRAN package used here
> qf <- yahooQF()   # launches a GUI-based selector
> qf                # this corresponds to the selection I made
[[1]]
 [1] "symbol"                      "shortName"                  
 [3] "ask"                         "bid"                        
 [5] "regularMarketPrice"          "regularMarketChange"        
 [7] "regularMarketOpen"           "regularMarketDayHigh"       
 [9] "regularMarketDayLow"         "regularMarketVolume"        
[11] "regularMarketChangePercent"  "regularMarketPreviousClose" 
[13] "fiftyTwoWeekLow"             "fiftyTwoWeekHigh"           
[15] "ytdReturn"                   "trailingPE"                 
[17] "trailingAnnualDividendYield" "netAssets"                  
[19] "netExpenseRatio"            

[[2]]
 [1] "Symbol"            "Name"              "Ask"              
 [4] "Bid"               "Last"              "Change"           
 [7] "Open"              "High"              "Low"              
[10] "Volume"            "% Change"          "P. Close"         
[13] "52-week Low"       "52-week High"      "YTD Return"       
[16] "P/E Ratio"         "Dividend Yield"    "Net Assets"       
[19] "Net Expense Ratio"

attr(,"class")
[1] "quoteFormat"
>

Next we use this selection to download data for IVV:

> getQuote("IVV", what=qf)
             Trade Time Symbol                     Name    Ask    Bid
IVV 2025-07-30 15:52:42    IVV iShares Core S&P 500 ETF 634.96 635.02
      Last   Change  Open    High    Low  Volume  % Change P. Close
IVV 636.16 -2.24005 639.1 640.735 634.59 2909287 -0.350885    638.4
    52-week Low 52-week High YTD Return P/E Ratio Dividend Yield
IVV         484       641.74    6.18625   27.3658     0.00890977
     Net Assets Net Expense Ratio
IVV 6.22809e+11              0.03
>

This function is vectorized, so given a selection of fields such as qf here, you could also retrieve multiple quotes at once.
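As a rough sketch of the vectorized call (not part of the original session): the quoteFormat object below is a three-field subset of the dput(qf) output in this answer, built by hand so that no GUI selection is needed. The helper name get_open_close and the second ticker are made up for illustration. The call needs network access to Yahoo!, so it can fail like any other getQuote() request.

```r
# A hand-built subset of the field selection, in the same two-vector layout
# as the quoteFormat object printed by dput(qf): request fields first,
# then the matching column labels.
qf_small <- structure(
  list(c("symbol", "regularMarketOpen", "regularMarketPreviousClose"),
       c("Symbol", "Open", "P. Close")),
  class = "quoteFormat"
)

# Hypothetical wrapper: one request for several tickers at once.
# Requires the quantmod package and network access when it is called.
get_open_close <- function(tickers) {
  quantmod::getQuote(tickers, what = qf_small)
}

# Example call (not run here): get_open_close(c("IVV", "SPY"))
```

If it works as expected, the result should have one row per ticker, with the same kind of columns as the single-ticker output above.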

PS: For completeness, a reusable display of the qf variable I used:

> dput(qf)
structure(list(c("symbol", "shortName", "ask", "bid", "regularMarketPrice", 
"regularMarketChange", "regularMarketOpen", "regularMarketDayHigh", 
"regularMarketDayLow", "regularMarketVolume", "regularMarketChangePercent", 
"regularMarketPreviousClose", "fiftyTwoWeekLow", "fiftyTwoWeekHigh", 
"ytdReturn", "trailingPE", "trailingAnnualDividendYield", "netAssets", 
"netExpenseRatio"), c("Symbol", "Name", "Ask", "Bid", "Last", 
"Change", "Open", "High", "Low", "Volume", "% Change", "P. Close", 
"52-week Low", "52-week High", "YTD Return", "P/E Ratio", "Dividend Yield", 
"Net Assets", "Net Expense Ratio")), class = "quoteFormat")
>

This is nice! But in Germany I get an error: Unable to obtain yahoo crumb. If this is being called from a GDPR country, Yahoo requires GDPR consent, which cannot be scripted.
Oh, sorry to hear that; that is rather painful. Yahoo! has been hiding more and more content and access behind such shenanigans. The issue tickets for packages like quantmod have some context in existing discussions.

lailaps · Accepted Answer · 2025-07-31 16:43:22Z

Ada's quantmod answer is to be preferred, but you asked:

What code change is required to grab all values in the first table under the graph?

A lot. You want all <li> items in the <ul> below the div[@data-testid="quote-statistics"]. For a cleaner approach, you could target the 12 <fin-streamer> elements

<fin-streamer data-symbol="IVV" data-value="4,559,917" data-trend="none" active="" data-dfield="longFmt" data-field="regularMarketVolume" class="yf-1b7pzha">4,559,917</fin-streamer>

and manually address the 3 remaining fields and pull their attributes. In my code, I use selenider + chromote and then just pull the raw text out of the 15 <li> elements and apply some string splitting.

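library(chromote)
library(selenider)
# Start a visible (non-headless) Chrome session driven by chromote
session <- selenider_session("chromote", options = chromote_options(headless = FALSE))
open_url("https://finance.yahoo.com/quote/IVV/")
# Click the consent dialog's reject button if it is shown; ignore errors otherwise
try(s("button[name='reject']") |> elem_click(), silent = TRUE)
# Raw text of every <li> in the quote-statistics block
tab <- do.call(rbind,lapply(ss(xpath = '//div[@data-testid="quote-statistics"]//ul//li'), \(x) elem_text(x)))[,1]
# Spell out labels that contain digits, so the first digit on each line starts the value
res <- sub("52 Week", "FiftyTwo Week", tab)
res <- sub("5Y Monthly", "Five Y Monthly", res)
# Label: the leading run of non-digit characters
name <- sub("^(\\D+).*", "\\1", res) |> trimws()
# Value: everything after the label, from the first digit onward
value <- sapply(regmatches(res, regexpr("^(.+?)\\s+(?=[0-9])", res, perl = TRUE), invert = TRUE), \(x) trimws(paste(x[-1], collapse = "")))
si <- data.frame(name = name, value = value)
close_session()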

                     name           value
1          Previous Close          638.40
2                    Open          639.10
3                     Bid    640.02 x 300
4                     Ask    640.65 x 400
5             Day's Range 634.59 - 640.73
6     FiftyTwo Week Range 484.00 - 641.74
7                  Volume       4,559,917
8             Avg. Volume       5,854,059
9              Net Assets         622.81B
10                    NAV          638.12
11         PE Ratio (TTM)           27.42
12                  Yield           1.29%
13 YTD Daily Total Return           9.12%
14  Beta (Five Y Monthly)            1.00
15    Expense Ratio (net)           0.03%
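The <fin-streamer> route mentioned earlier can be sketched offline against the single element quoted above. This is only an illustration: the helper name streamer_fields is made up, the input is that one snippet and not a live page, and on the real site the elements would still have to come from a rendered page (for example via the browser session used here).

```r
library(xml2)

# The <fin-streamer> element quoted in this answer, used as offline input.
snippet <- '<fin-streamer data-symbol="IVV" data-value="4,559,917" data-trend="none" active="" data-dfield="longFmt" data-field="regularMarketVolume" class="yf-1b7pzha">4,559,917</fin-streamer>'

# Hypothetical helper: one row per <fin-streamer> element in a parsed document,
# taking the field name and value from the element's attributes.
streamer_fields <- function(doc) {
  nodes <- xml_find_all(doc, "//fin-streamer")
  data.frame(
    field = xml_attr(nodes, "data-field"),
    value = xml_attr(nodes, "data-value"),
    stringsAsFactors = FALSE
  )
}

fields <- streamer_fields(read_html(snippet))
# Strip thousands separators before converting the value to a number.
fields$numeric <- as.numeric(gsub(",", "", fields$value))
```

Reading the attributes avoids the string splitting needed for the raw <li> text, but it only covers fields that are exposed as <fin-streamer> elements.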

Disclaimer: I very much do not recommend this; your IP could be temporarily banned.

A much better approach would be to use the yahoofinancer package (I used kable because it makes the result clearer):

library(yahoofinancer)
library(lubridate)  # provides today(), used below
s <- Ticker$new('IVV')
yf <- s$get_history(start = today(), interval = '1d')
fields <- c('regular_market_price', 'fifty_two_week_high', 'fifty_two_week_low',
            'regular_market_volume', 'exchange_name', 'full_exchange_name',
            'previous_close', 'currency', 'exchange_timezone_name', 'symbol')
yf[fields] <- lapply(fields, \(x) s[[x]])

date	volume	high	low	open	close	adj_close	regular_market_price	fifty_two_week_high	fifty_two_week_low	regular_market_volume	exchange_name	full_exchange_name	previous_close	currency	exchange_timezone_name	symbol
2025-07-30 20:00:00	4559917	640.735	634.59	639.1	637.51	637.51	637.51	641.74	484	4559917	PCX	NYSEArca	638.4	USD	America/New_York	IVV

Or, using the code base from {yahoofinancer}, you can write your own function that retrieves the quote data directly from the API:

# Based on yahoofinancer's get_meta function, we can write our own version
# that retrieves the full API response as a list
library(jsonlite)
library(httr)

# Source: https://github.com/rsquaredacademy/yahoofinancer/blob/dbad4b14f355ee925650f95d380d0eae52f821ab/R/ticker.R#L433
get_yahoo_symbol_info <- function(sym) {
  url <- paste0("https://query2.finance.yahoo.com/v8/finance/chart/", sym)
  jsonlite::fromJSON(httr::content(httr::GET(url), "text", encoding = "UTF-8"), 
                     simplifyVector = FALSE)$chart$result[[1]]
}

ivv_info <- get_yahoo_symbol_info("IVV")
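A possible way to turn that list into a one-row table, sketched under assumptions: it presumes the returned list has a meta element with camelCase entries such as regularMarketPrice, which is not confirmed here. The helper name pick_meta is made up, and the mock list only imitates the assumed shape so the sketch runs offline; its values are copied from the yahoofinancer output shown above.

```r
# Hypothetical helper: pull selected entries out of the `meta` element of the
# list returned by get_yahoo_symbol_info(). Entries that are absent become NA.
pick_meta <- function(info, fields) {
  vals <- lapply(fields, function(f) {
    v <- info$meta[[f]]
    if (is.null(v)) NA else v
  })
  setNames(vals, fields)
}

# Mock response with the assumed shape (field names are an assumption).
mock_info <- list(meta = list(symbol = "IVV", currency = "USD",
                              regularMarketPrice = 637.51,
                              fiftyTwoWeekHigh = 641.74,
                              fiftyTwoWeekLow = 484))

# previousClose is deliberately missing from the mock, so it comes back as NA.
quote_row <- as.data.frame(
  pick_meta(mock_info, c("symbol", "regularMarketPrice", "previousClose"))
)
```

With a live response, pick_meta(ivv_info, ...) would be called the same way. Inspecting names(ivv_info$meta) first shows which fields are actually present.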
