
I encountered a strange problem while web scraping using rvest.

I scraped the following name: "Ab­dich­ter/in EFZ" which at first looked normal. However, when I wrote the file to a csv I found "-" between the letters. In Excel, the word looked like this: Ab-­dich-ter/in EFZ.

So I did a str_split(x, "") and found that the string actually looked like this:

c("A", "b", "­", "d", "i", "c", "h", "­", "t", "e", "r", "/", "i", "n", " ", "E", "F", "Z")

I tried to remove the empty strings from the string, but did not succeed. I tried:

my_string <- str_split(my_string , "")

and then

paste0(my_string[my_string != ""])

but this did not help.

Therefore, I wonder:

  1. How did the empty strings get into that string, and
  2. how do I get them out again?

Edit: This is the webpage.

And here is how I got the string:

library(rvest)

read_html("https://berufskunde.com/ausbildungsberufe/ausbildung-abdichter.html", encoding = "UTF-8") %>% 
  html_nodes(".section") %>% 
  html_nodes(".text-rot") %>% 
  html_text()
  • Try with x[nzchar(x)] Commented Jul 17, 2019 at 13:34
  • @akrun, thanks. But it does not work. Commented Jul 17, 2019 at 13:36
  • I think your "" is a different character. You may need v1[trimws(v1) != "­"]. Here 'v1' is the split character vector. Commented Jul 17, 2019 at 13:38
  • One possible issue could be "" compared to " " (a space in between the two quotes). In many cases I need to use " ". Commented Jul 17, 2019 at 13:41
  • No, it is not a space character. Commented Jul 17, 2019 at 13:42

1 Answer

The string you’re observing is not the empty string but a SOFT HYPHEN (U+00AD) character. Web pages commonly embed it as a hyphenation hint, which is most likely how it ended up in the scraped text. It is supposed to be displayed only when a word is broken across lines, but some editors don’t cope with it correctly, which is probably why it shows up when you inspect the CSV.

At any rate you probably want to remove it from your string:

str = gsub('\U00AD', '', str)
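
As a minimal, self-contained sketch of the same idea: the example string below is rebuilt by hand with explicit \u00AD escapes (an assumption standing in for the scraped value), and the variable names x and clean are only illustrative. It shows that the soft hyphen is a real character that counts towards the string length, and that removing it leaves the plain word.

```r
# Rebuild the example with explicit escapes so the soft hyphens are
# visible in the source. Use \u with exactly four hex digits here:
# "\U00ADd" would swallow the following "d" as a fifth hex digit.
x <- "Ab\u00ADdich\u00ADter/in EFZ"

# The soft hyphen is a real character, so it counts towards the length.
n_before <- nchar(x)

# fixed = TRUE matches the pattern as a literal string, not as a regex.
clean <- gsub("\u00AD", "", x, fixed = TRUE)

n_after <- nchar(clean)

print(c(before = n_before, after = n_after))
print(identical(clean, "Abdichter/in EFZ"))
```

This also explains why comparing against "" (or using nzchar) removes nothing: each soft hyphen is a one-character string, not an empty one.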

1 Comment

@Roccer If you know about the behaviour of this soft hyphen character, your description made it likely that this was the case. And since you posted a reproducible example (excellent!) it was easy to verify. For reference, it may also help to inspect the byte values of a string via charToRaw, but that only helps if you know what to look for.
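
A short sketch of that kind of inspection, using a hand-built fragment rather than the scraped value (the names x, cp and bytes are illustrative). Looking at code points via utf8ToInt is often easier to read than raw bytes; charToRaw is shown alongside it, with enc2utf8 applied first on the assumption that the session might not be running in a UTF-8 locale.

```r
# Hand-built fragment containing one soft hyphen; enc2utf8() makes sure
# the bytes inspected below are UTF-8 regardless of the session locale.
x <- enc2utf8("Ab\u00ADdich")

# One integer code point per character; U+00AD is 173 in decimal.
cp <- utf8ToInt(x)
print(cp)

# Positions of soft hyphens within the string.
print(which(cp == 0xAD))

# Raw bytes: in UTF-8, U+00AD is encoded as the two bytes c2 ad.
bytes <- charToRaw(x)
print(bytes)
```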
