I encountered a strange problem while web scraping using rvest.
I scraped the following name: "Abdichter/in EFZ", which at first looked normal. However, when I wrote it to a CSV file, hyphens ("-") appeared between the letters. In Excel, the word looked like this: Ab-dich-ter/in EFZ.
So I did a str_split(x, "") and found that the string actually looked like this:
c("A", "b", "", "d", "i", "c", "h", "", "t", "e", "r", "/", "i", "n", " ", "E", "F", "Z")
I tried to remove the empty strings from the string, but did not manage to. I tried:
my_string <- str_split(my_string , "")
and then
paste0(my_string[my_string != ""])
but this did not help.
Therefore, I wonder:
- How did the empty strings get into that string, and
- how do I get them out again?
Edit: This is the webpage.
And here is how I got the string:
library(rvest)
read_html("https://berufskunde.com/ausbildungsberufe/ausbildung-abdichter.html", encoding = "UTF-8") %>%
html_nodes(".section") %>%
html_nodes(".text-rot") %>%
html_text()
x[nzchar(x)]
The "" shown in the split output is not actually an empty string; it is a different (invisible) character. You may need
v1[trimws(v1) != ""]
Here 'v1' is the split character vector.
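A likely explanation, offered as an assumption rather than something confirmed from the page: the invisible characters are soft hyphens (U+00AD), which browsers hide but Excel may render as "-". The sketch below rebuilds a hypothetical copy of the scraped value with soft hyphens inserted at the two positions where the split showed "", inspects its code points, and strips the character with base R. The variable names `x` and `clean` are illustrative only.

```r
# Hypothetical reconstruction of the scraped value, ASSUMING the invisible
# characters are soft hyphens (U+00AD); the real page may use something else.
x <- "Ab\u00addich\u00adter/in EFZ"

# Inspect the code points: a soft hyphen appears as 173 (0xAD).
code_points <- utf8ToInt(x)
print(code_points)

# Remove every soft hyphen; fixed = TRUE treats the pattern literally.
clean <- gsub("\u00ad", "", x, fixed = TRUE)
print(clean)
```

If `utf8ToInt()` on the real string shows a code point other than 173, substitute the matching `\u` escape in the `gsub()` call.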