
I scrape information with rvest and store it in a data frame. All information on the various institutions and their context characteristics is stored in one string. It looks similar to JSON, but it isn't. I followed another Stack Overflow post but was not successful. I think string manipulation should do the job. In the end, "title", "street", "number", etc. should be variables and each institution should be a row. Thank you very much.

library('tidyverse')
library('rvest')
library('stringr')
library('stringi')
library('jsonlite')

rubyhash <- "https://www.blutspenden.de/blutspendedienste/#" %>%
  read_html() %>% 
  html_nodes("body") %>% 
  html_nodes("script:first-of-type") %>%  
  html_text() %>% 
  as_tibble() %>% 
  slice(1)

substr(rubyhash$value,1,150)
"\n        var instituionsmap_data = '[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\""

rubyhash$json <- str_replace(rubyhash$value, "var instituionsmap_data =", "")
rubyhash$json <- trimws(rubyhash$json)

substr(rubyhash$json,1,150)
"'[{\"title\":\"Plasmazentrum Heidelberg\",\"street\":\"Hans-B\\u00f6ckler-Stra\\u00dfe\",\"number\":\"2A\",\"zip\":\"69115\",\"city\":\"Heidelberg\",\"phone\":\"06221 89466960"

fromJSON(rubyhash$json)
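For context on why this call errors: after the substitution above, the payload is still wrapped in the JavaScript single quotes (note the leading `'` in the `substr` output), and the original script text ends with `';`. A minimal sketch of the repair, assuming `rubyhash$json` as built above:

```r
library(jsonlite)
library(stringr)

# Strip the surrounding JavaScript quote characters so the payload
# becomes plain JSON, then parse it into a data frame in one step
json <- rubyhash$json %>%
  str_remove("^'") %>%       # leading single quote
  str_remove("';?\\s*$")     # trailing quote, semicolon, whitespace
institutions <- fromJSON(json)
```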

3 Answers


The data you are trying to parse is a JavaScript array of JSON objects, each one containing the equivalent of a data frame row. As well as removing the JavaScript variable assignment at the start, you need to split the array up into its component JSON strings before parsing:

rubyhash$value %>%
  str_replace("var instituionsmap_data = '\\[\\{", "") %>%
  str_replace("\\}\\]';\n", '') %>% # Removes the javascript chars at the end
  strsplit('\\},\\{') %>% # Split into component json strings
  getElement(1) %>%
  sapply(function(x) paste0('{', x, '}'), USE.NAMES = FALSE) %>%
  lapply(function(x) as.data.frame(fromJSON(x))) %>%
  bind_rows() %>%
  as_tibble()
#> # A tibble: 195 x 14
#>    title street number zip   city  phone fax   email~1 email url   rekon~2   uid
#>    <chr> <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>   <chr> <chr>   <int> <int>
#>  1 Plas~ Hans-~ "2A"   69115 Heid~ 0622~ ""    "info(~ java~ http~      48   567
#>  2 Plas~ Kamps~ "88 -~ 44137 Dort~ 0231~ ""    "info-~ java~ http~      16   568
#>  3 Plas~ Roteb~ "25"   70178 Stut~ 0711~ ""    "stutt~ java~ http~      16   571
#>  4 Plas~ K1 2   ""     68159 Mann~ 6211~ ""    ""      java~ http~     112   575
#>  5 DRK-~ Fried~ ""     68167 Mann~ 0621~ ""    ""      java~ www.~      49   359
#>  6 DRK-~ Gunze~ "35"   76530 Bade~ 0722~ ""    ""      java~ www.~      33   387
#>  7 DRK ~ Helmh~ ""     89081 Ulm   0731~ ""    ""      java~ www.~      49   389
#>  8 Blut~ Im Ne~ "305"  69120 Heid~ 0622~ ""    ""      java~ http~      49   400
#>  9 Blut~ Otfri~ ""     72076 Tübi~ 0707~ ""    "bluts~ java~ www.~      49   402
#> 10 Blut~ Diako~ ""     74523 Schw~ 0791~ ""    ""      java~ www.~      32   403
#> # ... with 185 more rows, 2 more variables: lat <chr>, lon <chr>, and
#> #   abbreviated variable names 1: email_display, 2: rekonvaleszentenplasma

Created on 2022-09-01 with reprex v2.0.2


4 Comments

This approach uses more of the available data from the string. But it loses 5 observations for some reason. I'm trying to figure out which ones and why. It should be n = 200.
@Marco those are simply the observations in the JSON you found. The approach itself doesn't drop them.
@Marco Also, with this method you are getting false information. Observation 6 is missing a phone number on the website. I am not sure where the number in the data frame came from.
There seems to be more information in the HTML than is displayed. Thus n = 6 has a phone number 0671 2530 that does not appear in the browser view. I have to check whether this is outdated information. It would be great to capture even the hidden info from the background.

I propose this solution with simpler code:

library(tidyverse)
library(rvest)
library(httr2)

page <- "https://www.blutspenden.de/blutspendedienste/" %>%
  request() %>%
  req_perform() %>%
  resp_body_html()

tibble(
  title = page %>%
    html_elements(".institutions__title") %>%
    html_text2(),
  location = page %>%
    html_elements(".institutions__location") %>%
    html_text2(),
  address = page %>%
    html_elements(".institutions__address") %>%
    html_text2(),
  phone = page %>%
    html_elements(".institutions__item") %>%
    map_chr(. %>%
              html_element(".institutions__phone") %>%
              html_text2),
  position = page %>%  
    html_elements(".institutions__item") %>% 
    map_chr(. %>% 
              html_element(".institutions__position") %>% 
              html_text2)
)
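To make the missing-node handling explicit, the per-item `map_chr` pattern above can be factored into a small helper (the name `field_of` is mine, not from the answer); `html_element` returns a missing node when the class is absent from an item, so `html_text2` yields `NA` for those rows:

```r
items <- page %>% html_elements(".institutions__item")

# Hypothetical helper: extract one child element's text per item,
# returning NA where that class is absent in the item
field_of <- function(items, css) {
  map_chr(items, ~ .x %>% html_element(css) %>% html_text2())
}

tibble(
  title    = field_of(items, ".institutions__title"),
  phone    = field_of(items, ".institutions__phone"),    # NA when missing
  position = field_of(items, ".institutions__position")  # NA when missing
)
```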

# A tibble: 200 x 5
   title                                         locat~1 address phone posit~2
   <chr>                                         <chr>   <chr>   <chr> <chr>  
 1 Plasma Service Europe Aachen                  Aachen  "Alter~ 0241~ https:~
 2 Octapharma Plasma Aachen                      Aachen  "Peter~ 0241~ https:~
 3 Blutspendedienst der Uniklinik RWTH Aachen    Aachen  "Pauwe~ 0241~ www.uk~
 4 Haema Plasmaspendezentrum Augsburg            Augsbu~ "Phili~ 0821~ https:~
#>  5 Institut für Transfusionsmedizin und Hämosta~ Augsbu~ "Steng~ 0821~ https:~
 6 DRK Blutspendedienst NSTOB Bad Fallingbostel  Bad Fa~ "Konra~ NA    https:~
 7 DRK-Blutspendedienst Bad Kreuznach            Bad Kr~ "Burgw~ 0671~ https:~
 8 Blutspendedienst OWL - Bad Oeynhausen HDZ NRW Bad Oe~ "Georg~ 0573~ https:~
 9 DRK-Blutspendedienst Bad Salzuflen            Bad Sa~ "Heldm~ NA    https:~
10 DRK-Blutspendedienst Baden-Baden              Baden-~ "Gunze~ 0722~ www.bl~
# ... with 190 more rows, and abbreviated variable names 1: location,
#   2: position
# i Use `print(n = ...)` to see more rows

5 Comments

Using a cleaner approach when scraping looks optimal. But I don't find all the other characteristics. I am also interested in .institutions__position, but it says that this information is only available for 199 rows, so I cannot extend this approach directly.
@Marco what else are you missing? I have added positions in the code. Since some items are missing data, such as phone number and position, you can solve this by using map within institutions__item.
Ah okay, .institutions__position returns the URL for some reason. I also need the long and lat coordinates and the zip code. Best
@Marco Then my method falls short, unfortunately.
What are those .institutions__ snippets? div classes? And for some reason the website owner did not define a class for every piece of information?

I would go with a simpler regex pattern to extract the JavaScript array; deserializing with jsonlite then produces a data frame as output.

library(rvest)
library(jsonlite)
library(stringr)
library(dplyr)

s <- read_html('https://www.blutspenden.de/blutspendedienste/#') %>% toString()

data <- jsonlite::parse_json(str_match(s, "var instituionsmap_data = '(.*)'")[, 2], simplifyVector = TRUE)

Test check:

filter(data, uid == 363)

The tel number here is present both in the page source and in the rendered view once you expand the relevant section on the webpage.
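If the coordinates are needed as numbers (they come through as character columns, as the `lat <chr>, lon <chr>` note in the first answer's output shows), a quick post-processing step on the `data` frame above:

```r
library(dplyr)

# Convert the character coordinate columns to numeric for mapping/plotting
data <- data %>%
  mutate(lat = as.numeric(lat),
         lon = as.numeric(lon))
```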

