
I scrape information with rvest and store it in a data frame. All information on the various institutions and their context characteristics is stored in one string. It looks similar to JSON, but it isn't. I followed another Stack Overflow post but was not successful. I think string manipulation should do the job. In the end, "title", "street", "number", etc. should be variables and each institution should be a row. Thank you very much.

library('tidyverse')
library('rvest')
library('stringr')
library('stringi')
library('jsonlite')

rubyhash <- "https://www.blutspenden.de/blutspendedienste/#" %>%
  read_html() %>% 
  html_nodes("body") %>% 
  html_nodes("script:first-of-type") %>%  
  html_text() %>% 
  as_tibble() %>% 
  slice(1)

substr(rubyhash$value,1,150)
"\n        var instituionsmap_data = '[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\""

rubyhash$json <- str_replace(rubyhash$value, "var instituionsmap_data =", "")
rubyhash$json <- trimws(rubyhash$json)

substr(rubyhash$json,1,150)
"'[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\"Heidelberg\",\"phone\":\"06221 89466960"

fromJSON(rubyhash$json)
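For context on why this call errors: after the substitution above, the payload is still wrapped in the JavaScript single quotes (note the leading `'` in the `substr` output), and the original script text ends with `';`. A minimal sketch of the repair, assuming `rubyhash$json` as built above:

```r
library(jsonlite)
library(stringr)

# Strip the surrounding JavaScript quote characters so the payload
# becomes plain JSON, then parse it into a data frame in one step
json <- rubyhash$json %>%
  str_remove("^'") %>%       # leading single quote
  str_remove("';?\\s*$")     # trailing quote, semicolon, whitespace
institutions <- fromJSON(json)
```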

3 Answers


The data you are trying to parse is a JavaScript array of JSON objects, each one containing the equivalent of a data frame row. As well as removing the JavaScript variable assignment at the start, you need to split the array up into its component JSON strings before parsing:

rubyhash$value %>%
  str_replace("var instituionsmap_data = '\\[\\{", "") %>%
  str_replace("\\}\\]';\n", '') %>% # Removes the javascript chars at the end
  strsplit('\\},\\{') %>% # Split into component json strings
  getElement(1) %>%
  sapply(function(x) paste0('{', x, '}'), USE.NAMES = FALSE) %>%
  lapply(function(x) as.data.frame(fromJSON(x))) %>%
  bind_rows() %>%
  as_tibble()
#> # A tibble: 195 x 14
#>    title street number zip   city  phone fax   email~1 email url   rekon~2   uid
#>    <chr> <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <int> <int>
#>  1 Plas~ Hans-~ "2A"   69115 Heid~ 0622~ ""    "info(~ java~ http~      48   567
#>  2 Plas~ Kamps~ "88 -~ 44137 Dort~ 0231~ ""    "info-~ java~ http~      16   568
#>  3 Plas~ Roteb~ "25"   70178 Stut~ 0711~ ""    "stutt~ java~ http~      16   571
#>  4 Plas~ K1 2   ""     68159 Mann~ 6211~ ""    ""      java~ http~     112   575
#>  5 DRK-~ Fried~ ""     68167 Mann~ 0621~ ""    ""      java~ www.~      49   359
#>  6 DRK-~ Gunze~ "35"   76530 Bade~ 0722~ ""    ""      java~ www.~      33   387
#>  7 DRK ~ Helmh~ ""     89081 Ulm   0731~ ""    ""      java~ www.~      49   389
#>  8 Blut~ Im Ne~ "305"  69120 Heid~ 0622~ ""    ""      java~ http~      49   400
#>  9 Blut~ Otfri~ ""     72076 Tübi~ 0707~ ""    "bluts~ java~ www.~      49   402
#> 10 Blut~ Diako~ ""     74523 Schw~ 0791~ ""    ""      java~ www.~      32   403
#> # ... with 185 more rows, 2 more variables: lat <chr>, lon <chr>, and
#> #   abbreviated variable names 1: email_display, 2: rekonvaleszentenplasma

Created on 2022-09-01 with reprex v2.0.2


4 Comments

This approach uses more of the available data from the string. But it loses 5 observations for some reason. I'm trying to figure out which ones and why. It should be n = 200.
@Marco those are simply the observations in the JSON you found. The approach itself doesn't drop them.
@Marco Also, with this method you are getting false information. Observation 6 is missing a phone number on the website. I am not sure where the number in the data frame came from.
There seems to be more information in the HTML than is displayed. Thus n = 6 has a phone number 0671 2530 that does not appear in the browser view. I have to check whether this is outdated information. It would be great to capture even the hidden info from the background.

I propose this solution with simpler code:

library(tidyverse)
library(rvest)
library(httr2)

page <- "https://www.blutspenden.de/blutspendedienste/" %>%
  request() %>%
  req_perform() %>%
  resp_body_html()

tibble(
  title = page %>%
    html_elements(".institutions__title") %>%
    html_text2(),
  location = page %>%
    html_elements(".institutions__location") %>%
    html_text2(),
  address = page %>%
    html_elements(".institutions__address") %>%
    html_text2(),
  phone = page %>%
    html_elements(".institutions__item") %>%
    map_chr(. %>%
              html_element(".institutions__phone") %>%
              html_text2),
  position = page %>%  
    html_elements(".institutions__item") %>% 
    map_chr(. %>% 
              html_element(".institutions__position") %>% 
              html_text2)
)
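To make the missing-node handling explicit, the per-item `map_chr` pattern above can be factored into a small helper (the name `field_of` is mine, not from the answer); `html_element` returns a missing node when the class is absent from an item, so `html_text2` yields `NA` for those rows:

```r
items <- page %>% html_elements(".institutions__item")

# Hypothetical helper: extract one child element's text per item,
# returning NA where that class is absent in the item
field_of <- function(items, css) {
  map_chr(items, ~ .x %>% html_element(css) %>% html_text2())
}

tibble(
  title    = field_of(items, ".institutions__title"),
  phone    = field_of(items, ".institutions__phone"),    # NA when missing
  position = field_of(items, ".institutions__position")  # NA when missing
)
```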

# A tibble: 200 x 5
   title                                         locat~1 address phone posit~2
   <chr>                                         <chr>   <chr>   <chr> <chr>  
 1 Plasma Service Europe Aachen                  Aachen  "Alter~ 0241~ https:~
 2 Octapharma Plasma Aachen                      Aachen  "Peter~ 0241~ https:~
 3 Blutspendedienst der Uniklinik RWTH Aachen    Aachen  "Pauwe~ 0241~ www.uk~
 4 Haema Plasmaspendezentrum Augsburg            Augsbu~ "Phili~ 0821~ https:~
#>  5 Institut für Transfusionsmedizin und Hämosta~ Augsbu~ "Steng~ 0821~ https:~
 6 DRK Blutspendedienst NSTOB Bad Fallingbostel  Bad Fa~ "Konra~ NA    https:~
 7 DRK-Blutspendedienst Bad Kreuznach            Bad Kr~ "Burgw~ 0671~ https:~
 8 Blutspendedienst OWL - Bad Oeynhausen HDZ NRW Bad Oe~ "Georg~ 0573~ https:~
 9 DRK-Blutspendedienst Bad Salzuflen            Bad Sa~ "Heldm~ NA    https:~
10 DRK-Blutspendedienst Baden-Baden              Baden-~ "Gunze~ 0722~ www.bl~
# ... with 190 more rows, and abbreviated variable names 1: location,
#   2: position
# i Use `print(n = ...)` to see more rows

5 Comments

Using a cleaner approach when scraping looks optimal. But I don't find all the other characteristics. I am also interested in .institutions__position, but it says that this information is only available for 199 rows, so I cannot extend this approach directly.
@Marco what else are you missing? I have added positions in the code. Since some items are missing data, such as phone number and position, you can solve this by using map within institutions__item.
Ah okay, .institutions__position returns the URL for some reason. I also need the long and lat coordinates and the zip code. Best
@Marco Then my method falls short, unfortunately.
What are those .institutions__ snippets? div classes? And for some reason the website owner did not define a class for every piece of information?

I would go with a simpler regex pattern to extract the JavaScript array; deserializing with jsonlite then produces a data frame as output.

library(rvest)
library(jsonlite)
library(stringr)
library(dplyr)

s <- read_html('https://www.blutspenden.de/blutspendedienste/#') %>% toString()

data <- jsonlite::parse_json(str_match(s, "var instituionsmap_data = '(.*)'")[, 2], simplifyVector = TRUE)

Test check:

filter(data, uid == 363)

The tel number here is present both in the page source and in the rendered view once you expand the relevant section on the webpage.
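If the coordinates are needed as numbers (they come through as character columns, as the `lat <chr>, lon <chr>` note in the first answer's output shows), a quick post-processing step on the `data` frame above:

```r
library(dplyr)

# Convert the character coordinate columns to numeric for mapping/plotting
data <- data %>%
  mutate(lat = as.numeric(lat),
         lon = as.numeric(lon))
```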

