I have some character strings that I'm extracting from an HTML page. It turns out these strings contain some hidden characters or control characters (?).

How can I convert this string so that it only contains the visible characters?

Take for example the term "Besucherüberblick" and its raw representation:

charToRaw("Besucherüberblick")
 [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

However, from my html, I'm getting:

[1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

So there are these three weird thingies at the beginning.

I could probably use trial and error to manually remove these from my raw vector and then convert it back to character, but a) I don't know in advance which strings the HTML will give me and b) I'm looking for an automated solution.

Maybe there's some stringr/stringi solution to it?

2 Answers

17

Those first three bytes (e2 80 8c) are the UTF-8 encoding of the zero width non-joiner Unicode character (U+200C). You can remove it, along with other invisible formatting characters, using the \p{Format} regular expression class, which should contain the invisible formatting indicators (see other groups here). You can view the ~160 characters in that class here.
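One way to confirm which code point those three bytes encode is to decode them in base R. This is only a minimal sketch, and the variable names are illustrative:

```r
# The three unexpected leading bytes from the question
bytes <- as.raw(c(0xe2, 0x80, 0x8c))

# utf8ToInt() decodes a single UTF-8 string into its Unicode code point(s)
code_point <- utf8ToInt(rawToChar(bytes))

# Format the result in the conventional U+XXXX notation
label <- sprintf("U+%04X", code_point)
label
# [1] "U+200C"
```

U+200C is the zero width non-joiner, which belongs to the Format (Cf) general category.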

x <- rawToChar(as.raw(c(226, 128, 140, 66, 101, 115, 117, 99, 104, 101, 114, 195, 188, 
      98, 101, 114, 98, 108, 105, 99, 107)))
x
# [1] "‌Besucherüberblick"
charToRaw(x)
#  [1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b


y <- stringr::str_remove_all(x, "[\\p{Format}]") 
y
# [1] "Besucherüberblick"
charToRaw(y)
#  [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

Another good choice might be \p{Other} if you also want to remove control characters, unassigned values, etc. It matches all of the following categories: \p{Control} (an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F, which includes things like tabs and newline characters), \p{Format} (invisible formatting indicators), \p{Private_Use} (any code point reserved for private use), \p{Surrogate} (one half of a surrogate pair in UTF-16 encoding), and \p{Unassigned} (any code point to which no character has been assigned).
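To make the difference concrete, here is a small sketch comparing the two classes on a made-up string that contains a zero width non-joiner plus a newline and a tab. It uses base R's gsub() with perl = TRUE and the short PCRE names (Cf for Format, C for Other) so that it runs without extra packages; the stringr patterns above should behave the same way, but that is an assumption rather than something tested here.

```r
# Hypothetical input: ZWNJ (U+200C), an umlaut, then a newline and a tab
x2 <- "\u200cBesucher\u00fcberblick\n\tNext"

# Format only: drops the ZWNJ, keeps the newline and the tab
format_only <- gsub("\\p{Cf}", "", x2, perl = TRUE)

# Other: additionally drops control characters such as \n and \t
all_other <- gsub("\\p{C}", "", x2, perl = TRUE)

nchar(x2)           # 24
nchar(format_only)  # 23
nchar(all_other)    # 21
```

Note that removing the newline and tab also glues "Besucherüberblick" and "Next" together, which may or may not be what you want.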

3 Comments

This is an excellent answer but I would question the part about using \p{Other}, or at least point out it will remove new lines which may technically be control characters but aren't necessarily the type of character you might want to remove when cleaning text.
If you are scraping HTML I think it makes a lot of sense to remove newlines. This German word seems to translate to "Visitor overview", which sounds like a section header or something. I really doubt you would want newlines in that.
If you know with reasonable certainty what kind of content to expect, you can also use a whitelist instead of a blacklist, e.g. replacing [^\w\d .:_-] with an empty string removes anything not on the whitelist. When scraping, collapsing all whitespace characters into a single space is also a good measure: replace("\s+"," ")
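A minimal base R sketch of the whitelist idea from the last comment, assuming letters, digits, spaces and a few punctuation marks are all you expect. It swaps \w and \d for \p{L} and \p{N}, because with perl = TRUE those shorthands do not cover full Unicode by default and could drop the ü. The sample string and variable names are made up.

```r
# Hypothetical scraped text: ZWNJ, umlaut, and messy whitespace
scraped <- "\u200cBesucher\u00fcberblick \n\t 2024"

# Step 1: collapse every run of whitespace into a single space
collapsed <- gsub("\\s+", " ", scraped, perl = TRUE)

# Step 2: drop anything that is not a letter, a digit, or one of " .:_-"
cleaned <- gsub("[^\\p{L}\\p{N} .:_-]", "", collapsed, perl = TRUE)

# Only the visible text remains, with one space before the number
nchar(cleaned)
# [1] 22
```

Collapsing whitespace first matters here: the whitelist would otherwise delete the newline and tab but keep the two surrounding spaces.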
8

You can remove format characters in base R using the Cf PCRE2 general category property.

gsub("\\p{Cf}+", "", x, perl = TRUE)
# [1] "Besucherüberblick"

This returns the same result as the stringr approach (which uses ICU rather than PCRE):

identical(
    gsub("\\p{Cf}+", "", x, perl = TRUE),
    stringr::str_remove_all(x, "[\\p{Format}]")
)
# [1] TRUE

The PCRE2 docs list all character groups you could use. In this case, the relevant ones are:

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

You probably don't want to just use C as it covers all categories listed here, and control characters include \r and \n (carriage return and new line). However, depending on the nature of your data, you might want to expand the pattern to include unassigned, private use or surrogate characters.
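As a rough sketch of such an expanded pattern, the following combines Cf, Co and Cn in one character class while leaving control characters alone. The input string is invented for illustration; Cs is left out on the assumption that the input is valid UTF-8, where surrogate code points cannot occur.

```r
# Hypothetical input: ZWNJ (Cf), a private use character U+E000 (Co),
# and a trailing carriage return + newline (Cc)
messy <- "\u200cBesucher\ue000\u00fcberblick\r\n"

# Remove format, private use and unassigned characters, but not Cc
tidy <- gsub("[\\p{Cf}\\p{Co}\\p{Cn}]+", "", messy, perl = TRUE)

# The line ending survives because Cc is not part of the pattern
endsWith(tidy, "\r\n")
# [1] TRUE
nchar(tidy)
# [1] 19
```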

1 Comment

Thanks for this additional info. In my current use case, using the p{Format} (or Cf) thingy does work, but good to know that in different use cases I might need to enhance it with other groups.
