remove (non-breaking) space character in string

Question

This question seems to make it easy to remove space characters in a string in R. However, when I load the following table, I'm not able to remove a space between two numbers (e.g. 11 846.4):

require(XML)
require(RCurl)
require(data.table)

link2fetch = 'https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Landwirtschaft-Forstwirtschaft-Fischerei/Feldfruechte-Gruenland/Tabellen/ackerland-hauptnutzungsarten-kulturarten.html'

theurl = getURL(link2fetch, .opts = list(ssl.verifypeer = FALSE) ) # important!
area_cult10 = readHTMLTable(theurl, stringsAsFactors = FALSE)
area_cult10 = rbindlist(area_cult10)
    
test = sub(',', '.', area_cult10$V5) # change , to . 
test = gsub('(.+)\\s([A-Z]{1})*', '\\1', test) # remove LETTERS
gsub('\\s', '', test[1]) # remove white space?

Why can't I remove the space in test[1]? Can this be something other than a space character? Maybe the answer is really easy and I'm overlooking something. Thanks for any advice!

OK, after knitting to HTML I discovered that it's not a space but a non-breaking space. It looks like &nbsp; in the HTML and can be matched with \u00A0. Tricky!
– andschar, Commented May 2, 2017 at 9:35
I have tried your code and got [1] "11846.4" - no whitespace there.
– Wiktor Stribiżew, Commented May 2, 2017 at 9:39
Strange. After restarting R and running the code I still get this space: [1] "11 846.4". However, I can remove it with the above-mentioned \u00A0. Maybe differing package versions?
– andschar, Commented May 2, 2017 at 9:47
You know, it got removed when I just ran your code. When I started to check whether I could improve the regex, it stopped removing the space. I confirm: creating test as you showed, the whitespace disappears. If I use test1 <- gsub("[\\sA-Za-z]+", "", area_cult10$V5) to remove all whitespace and letters, the whitespace remains. And gsub("[[:space:]A-Za-z]+", "", area_cult10$V5) works.
– Wiktor Stribiżew, Commented May 2, 2017 at 9:49
Try sub(",", ".", gsub("[[:space:]A-Za-z]+|\\W+$", "", area_cult10$V5), fixed=TRUE)
– Wiktor Stribiżew, Commented May 2, 2017 at 9:54
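
The diagnosis reached in the comments above can be checked directly by looking at the code points of the string. The following is a minimal sketch on a hand-built string, not the scraped table: the value of x is an assumption chosen to mimic test[1]. An ordinary space has code point 32, a non-breaking space has code point 160 (U+00A0).

```r
# Hypothetical stand-in for test[1]; "\u00a0" is a non-breaking space
x <- enc2utf8("11\u00a0846.4")

# Code points of each character: 32 would be an ordinary space, 160 is U+00A0
codes <- utf8ToInt(x)
print(codes)

# TRUE if the string contains a non-breaking space
has_nbsp <- grepl("\u00a0", x, fixed = TRUE)
print(has_nbsp)

# Remove it by naming the code point explicitly, as suggested in the comments
cleaned <- gsub("\u00a0", "", x, fixed = TRUE)
print(cleaned)
```

If 160 shows up among the code points, a pattern that only covers ASCII whitespace will not match it, which would explain the behaviour described in the question.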

Wiktor Stribiżew · Accepted Answer · 2017-05-02 10:58:26Z

6

You may shorten the creation of test to just two steps with a single PCRE regex (note the perl=TRUE argument):

test = sub(",", ".", gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", area_cult10$V5, perl=TRUE), fixed=TRUE)

Result:

 [1] "11846.4" "6529.2"  "3282.7"  "616.0"   "1621.8"  "125.7"   "14.2"   
 [8] "401.6"   "455.5"   "11.7"    "160.4"   "79.1"    "37.6"    "29.6"   
[15] ""        "13.9"    "554.1"   "236.7"   "312.8"   "4.6"     "136.9"  
[22] "1374.4"  "1332.3"  "1281.8"  "3.7"     "5.0"     "18.4"    "23.4"   
[29] "42.0"    "2746.2"  "106.6"   "2100.4"  "267.8"   "258.4"   "13.1"   
[36] "23.5"    "11.6"    "310.2"

The gsub regex is worth special attention:

(*UCP) - the PCRE verb that makes the pattern Unicode-aware, so that \s also matches Unicode whitespace such as the non-breaking space (U+00A0)
[\\s\\p{L}]+ - matches 1+ whitespace or letter characters
| - or (an alternation operator)
\\W+$ - 1+ non-word chars at the end of the string.

Then, sub(",", ".", x, fixed=TRUE) replaces the first , with a . as literal strings. fixed=TRUE saves time because no regex has to be compiled.
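
To try the two steps without fetching the web page, here is a small sketch on a made-up vector. The input x is an assumption that imitates the V5 column (non-breaking space as thousands separator, decimal comma, trailing letter flag); the pattern and the sub() call are the ones from this answer.

```r
# Hypothetical input imitating area_cult10$V5
x <- c("11\u00a0846,4 A", "6\u00a0529,2 A", "616,0")

# Step 1: drop Unicode whitespace and letters, plus trailing non-word characters
no_ws <- gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", x, perl = TRUE)

# Step 2: turn the decimal comma into a dot (literal match, no regex involved)
result <- sub(",", ".", no_ws, fixed = TRUE)
print(result)

# The cleaned strings can then be converted to numbers
values <- as.numeric(result)
print(values)
```

Keeping the two steps in separate variables makes it easier to see which step fails if the output still contains a space.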

edited May 2, 2017 at 10:58

answered May 2, 2017 at 9:56

Wiktor Stribiżew


7 Comments

andschar Over a year ago

Thanks for the detailed explanations! However with [[:space:]] I still don't get rid of the non-breaking space. I have to use test = sub(",", ".", gsub("\u00A0|[[:space:][:alpha:]]+|\\W+$", "", area_cult10$V5), fixed=TRUE) to make it work. It's still puzzling why it works for you..

Wiktor Stribiżew Over a year ago

@andrasz: Hm, I have 2 ideas how to solve it in another way, but no idea as to why it fails in different cases. Try also with gsub using "(*UCP)[\\s\\p{L}]+|\\W+$" pattern while passing perl=TRUE argument. Are you on Linux?

andschar Over a year ago

Yes, on Linux Mint 18 based on Ubuntu 14.04. Does that help?

Wiktor Stribiżew Over a year ago

YES - see x <- c("11 846.4 A", "6 529.2 A", "3 282.7 A"); gsub("(*UCP)\\s+", "", x, perl=TRUE).

Wiktor Stribiżew Over a year ago

Yes, you may enumerate all the Unicode whitespace code points, and use something like [ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] (note the escape sequences are compatible with JavaScript, this is taken from MDN site), but when you use \s with the (*UCP) verb, it will match all Unicode whitespace. No need to worry about it next time.
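
For comparison, the enumeration approach from the last comment might look like this in R. This is only a sketch: the variable names and the sample vector are assumptions, and the character class is the list quoted above written with R string escapes, so R inserts the actual characters into the pattern before the regex engine sees it.

```r
# Explicit class of Unicode whitespace code points, adapted from the comment above
ws_class <- "[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]+"

# Hypothetical inputs: non-breaking space, narrow no-break space, ordinary space
x <- c("11\u00a0846.4", "6\u202f529.2", "3 282.7")

cleaned <- gsub(ws_class, "", x, perl = TRUE)
print(cleaned)
```

The (*UCP) variant is shorter and does not need to be updated by hand, which is the point made in the comment; the explicit class is mainly useful when only specific code points should be removed.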


Wael · Accepted Answer · 2023-12-21 13:38:06Z

1

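With the stringr package, the [:space:] class also matches the non-breaking space. Note that str_replace() replaces only the first match; use str_replace_all() to remove every occurrence.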

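# Build the test string; code point 160 is the non-breaking space
s <- intToUtf8(c(104, 101, 108, 160, 108, 111))
s
#> [1] "hel lo"
# An ordinary space in the pattern does not match the non-breaking space
stringr::str_replace(s, " ", "")
#> [1] "hel lo"
# [:space:] does match it
stringr::str_replace(s, "[:space:]", "")
#> [1] "hello"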

Created on 2023-12-21 with reprex v2.0.2

answered Dec 21, 2023 at 13:38

Wael
