convert string with hidden characters

Question

I have some character strings that I'm scraping from an HTML page. It turns out these strings contain some hidden or control characters (?).

How can I convert this string so that it only contains the visible characters?

Take for example the term "Besucherüberblick" and its raw representation:

charToRaw("Besucherüberblick")
 [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

However, from my html, I'm getting:

[1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b

So there are these three weird thingies at the beginning.

I could probably trial and error and manually remove these from my raw vector and then convert it back to character, but a) I don't know in advance which strings the html will give me and b) I'm looking for an automated solution.

Maybe there's some stringr/stringi solution to it?

MrFlick · Accepted Answer · 2025-09-16 14:02:02Z

17

Those first three bytes (e2 80 8c) are the UTF-8 encoding of the zero width non-joiner Unicode character (U+200C). You can remove it, along with other invisible formatting characters, with the \p{Format} regular expression class, which contains the invisible formatting indicators (see other groups here). You can view the ~160 characters in that class here.

x <- rawToChar(as.raw(c(226, 128, 140, 66, 101, 115, 117, 99, 104, 101, 114, 195, 188, 
      98, 101, 114, 98, 108, 105, 99, 107)))
x
# [1] "‌Besucherüberblick"
charToRaw(x)
#  [1] e2 80 8c 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b


y <- stringr::str_remove_all(x, "[\\p{Format}]") 
y
# [1] "Besucherüberblick"
charToRaw(y)
#  [1] 42 65 73 75 63 68 65 72 c3 bc 62 65 72 62 6c 69 63 6b
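If you don't know in advance which hidden characters the HTML contains, it can help to list their code points before removing anything. Here is a minimal base R sketch, assuming the string is valid UTF-8; the variable names (`chars`, `codes`, `hidden`) are only illustrative, and the string is rebuilt with `\u` escapes so the block runs on its own:

```r
# Same string as above, written with Unicode escapes so it is marked as UTF-8
x <- "\u200cBesucher\u00fcberblick"

# Split into single characters and format each code point as U+XXXX
chars <- strsplit(x, "")[[1]]
codes <- sprintf("U+%04X", utf8ToInt(x))

# Flag the characters that fall into the Format (Cf) category
hidden <- grepl("\\p{Cf}", chars, perl = TRUE)

codes[hidden]
# [1] "U+200C"
```

U+200C is the zero width non-joiner, which matches the three leading bytes e2 80 8c.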

Another good choice might be \p{Other} if you also want to strip control characters, unassigned values, etc. It matches all of the following categories: \p{Control} (an ASCII or Latin-1 control character: 0x00–0x1F and 0x7F–0x9F, which includes things like tabs and newline characters), \p{Format} (invisible formatting indicators), \p{Private_Use} (any code point reserved for private use), \p{Surrogate} (one half of a surrogate pair in UTF-16 encoding) and \p{Unassigned} (any code point to which no character has been assigned).
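To see the practical difference between the two classes, here is a small sketch in base R using the short PCRE names `Cf` (Format) and `C` (Other). The sample string is made up for illustration, and I'm assuming the stringr patterns with the long names behave the same way:

```r
# Hypothetical scraped text: a zero width non-joiner plus a newline and a tab
s <- "\u200cBesucher\u00fcberblick\n\tZeile 2"

# Format only: drops the invisible character, keeps the newline and tab
only_format <- gsub("\\p{Cf}", "", s, perl = TRUE)

# Other: additionally drops control characters such as newline and tab
all_other <- gsub("\\p{C}", "", s, perl = TRUE)

only_format
# [1] "Besucherüberblick\n\tZeile 2"
all_other
# [1] "BesucherüberblickZeile 2"
```

Note that with the broader class the two lines run together, so you may want to normalise whitespace first if line breaks carry meaning.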

edited Sep 16 at 14:02

answered Sep 15 at 19:25

MrFlick


3 Comments

SamR Sep 16 at 13:55

This is an excellent answer but I would question the part about using \p{Other}, or at least point out it will remove new lines which may technically be control characters but aren't necessarily the type of character you might want to remove when cleaning text.

MrFlick Sep 16 at 13:59

If you are scraping HTML I think it makes a lot of sense to remove newlines. This German word seems to translate to "Visitor overview", which sounds like a section header or something. I really doubt you would want new lines in that.

Falco Sep 16 at 15:04

If you know with reasonable certainty what kind of content to expect, you can also go with a whitelist instead of a blacklist, e.g. replacing [^\w\d .:_-] removes anything not on the whitelist. When scraping, collapsing all whitespace characters to a single space is also a good measure: replace("\s+"," ")
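A rough base R sketch of the whitelist idea from this comment, under a few assumptions: the sample string and variable names are made up, and the whitelist uses `\p{L}` and `\p{N}` rather than `\w\d`, because with `perl = TRUE` R's `\w` only matches ASCII by default and would drop the "ü":

```r
# Hypothetical scraped text with a hidden character and messy whitespace
s <- "\u200cBesucher\u00fcberblick:\n\t  Seite 1"

# Collapse every run of whitespace to a single space first,
# so the words on different lines do not get glued together
collapsed <- gsub("\\s+", " ", s, perl = TRUE)

# Keep only letters, digits, space and . : _ - (everything else is removed)
cleaned <- gsub("[^\\p{L}\\p{N} .:_-]", "", collapsed, perl = TRUE)

cleaned
# [1] "Besucherüberblick: Seite 1"
```

The whitelist has to be adapted to the content you actually expect; anything not listed, including punctuation such as commas or quotes, is silently dropped.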

SamR · Accepted Answer · 2025-09-16 13:52:32Z

8

You can remove format characters in base R using the Cf PCRE2 general category property.

gsub("\\p{Cf}+", "", x, perl = TRUE)
# [1] "Besucherüberblick"

This returns the same result as the stringr approach (which uses ICU rather than PCRE):

identical(
    gsub("\\p{Cf}+", "", x, perl = TRUE),
    stringr::str_remove_all(x, "[\\p{Format}]")
)
# [1] TRUE

The PCRE2 docs list all character groups you could use. In this case, the relevant ones are:

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

You probably don't want to just use C as it covers all categories listed here, and control characters include \r and \n (carriage return and new line). However, depending on the nature of your data, you might want to expand the pattern to include unassigned, private use or surrogate characters.
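As a sketch of what expanding the pattern could look like, the character class below combines Cf with Co and Cn while leaving control characters (and therefore line breaks) alone. The sample string is invented for illustration, and which categories you include depends on your data:

```r
# Hypothetical input: zero width non-joiner, a private use character (U+E000)
# and a newline that should survive the cleaning
s <- "\u200cZeile 1\ue000\nZeile 2"

# Remove Format, Private use and Unassigned characters, but not Control
cleaned <- gsub("[\\p{Cf}\\p{Co}\\p{Cn}]+", "", s, perl = TRUE)

cleaned
# [1] "Zeile 1\nZeile 2"
```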

edited Sep 16 at 13:52

answered Sep 16 at 8:07

SamR


1 Comment

deschen Sep 16 at 9:22

Thanks for this additional info. In my current use case, using the \p{Format} (or Cf) thingy does work, but it's good to know that in different use cases I might need to extend it with other groups.
