
This question makes removing space characters from a string in R look easy. However, when I load the following table, I'm not able to remove the space between two digits (e.g. 11 846.4):

require(XML)
require(RCurl)
require(data.table)

link2fetch = 'https://www.destatis.de/DE/Themen/Branchen-Unternehmen/Landwirtschaft-Forstwirtschaft-Fischerei/Feldfruechte-Gruenland/Tabellen/ackerland-hauptnutzungsarten-kulturarten.html'

theurl = getURL(link2fetch, .opts = list(ssl.verifypeer = FALSE) ) # important!
area_cult10 = readHTMLTable(theurl, stringsAsFactors = FALSE)
area_cult10 = rbindlist(area_cult10)
    
test = sub(',', '.', area_cult10$V5) # change , to . 
test = gsub('(.+)\\s([A-Z]{1})*', '\\1', test) # remove LETTERS
gsub('\\s', '', test[1]) # remove white space?

Why can't I remove the space in test[1]? Could it be something other than a space character? Maybe the answer is really easy and I'm overlooking something. Thanks for any advice!

  • OK, after knitting to HTML I've discovered that it's not a space but a non-breaking space. It appears as &nbsp; in the HTML and can be matched with \u00A0. Tricky! Commented May 2, 2017 at 9:35
  • I have tried your code and got [1] "11846.4" - no whitespace there. Commented May 2, 2017 at 9:39
  • Strange. After restarting R and running the code I still get this space: [1] "11 846.4". However, I can remove it with the above-mentioned \u00A0. Maybe differing package versions? Commented May 2, 2017 at 9:47
  • You know, it got removed when I just ran your code. When I started to check whether I could improve the regex, it stopped removing the space. I confirm: creating test as you showed, the whitespace disappears. If I use test1 <- gsub("[\\sA-Za-z]+", "", area_cult10$V5) to remove all whitespace and letters, the whitespace remains. And gsub("[[:space:]A-Za-z]+", "", area_cult10$V5) works. Commented May 2, 2017 at 9:49
  • Try sub(",", ".", gsub("[[:space:]A-Za-z]+|\\W+$", "", area_cult10$V5), fixed=TRUE) Commented May 2, 2017 at 9:54
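The comments above identify the character as a non-breaking space (U+00A0) rather than an ordinary space. A minimal sketch of how one might confirm that in base R follows. The sample string is hypothetical and stands in for test[1]; it is not taken from the downloaded table.

```r
# Hypothetical sample value: "11", a non-breaking space (U+00A0), then "846,4".
x <- "11\u00a0846,4"

# utf8ToInt() returns the Unicode code points; 160 (0xA0) is the non-breaking
# space, whereas an ordinary space would show up as 32.
code_points <- utf8ToInt(x)
print(code_points)

has_nbsp <- 160L %in% code_points

# Remove the non-breaking space as a literal string (no regex involved).
cleaned <- gsub("\u00a0", "", x, fixed = TRUE)
print(cleaned)
```

Inspecting code points this way sidesteps the question of what \\s matches on a given platform, which the comments suggest may differ between setups.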

2 Answers


You can shorten the creation of test to two steps using a single PCRE regex (note the perl=TRUE argument):

test = sub(",", ".", gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", area_cult10$V5, perl=TRUE), fixed=TRUE)

Result:

 [1] "11846.4" "6529.2"  "3282.7"  "616.0"   "1621.8"  "125.7"   "14.2"   
 [8] "401.6"   "455.5"   "11.7"    "160.4"   "79.1"    "37.6"    "29.6"   
[15] ""        "13.9"    "554.1"   "236.7"   "312.8"   "4.6"     "136.9"  
[22] "1374.4"  "1332.3"  "1281.8"  "3.7"     "5.0"     "18.4"    "23.4"   
[29] "42.0"    "2746.2"  "106.6"   "2100.4"  "267.8"   "258.4"   "13.1"   
[36] "23.5"    "11.6"    "310.2"  

The gsub regex is worth special attention:

  • (*UCP) - the PCRE verb that makes the pattern Unicode-aware, so that \\s also matches Unicode whitespace such as the non-breaking space
  • [\\s\\p{L}]+ - matches 1+ whitespace or letter characters
  • | - or (an alternation operator)
  • \\W+$ - 1+ non-word chars at the end of the string.

Then sub(",", ".", x, fixed=TRUE) replaces the first , with a . and treats both as literal strings; fixed=TRUE also avoids the cost of compiling a regex.
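To see the two steps in isolation, here is a small sketch that applies the same pattern to a few hypothetical values imitating the table column (non-breaking space as thousands separator, decimal comma, trailing footnote letter). The names raw, no_space, cleaned and values are illustrative only.

```r
# Hypothetical sample values; these are not read from the destatis table.
raw <- c("11\u00a0846,4 A", "616,0", "1\u00a0621,8")

# Step 1: drop Unicode whitespace and letters, plus trailing non-word characters.
no_space <- gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", raw, perl = TRUE)

# Step 2: turn the decimal comma into a decimal point (literal replacement).
cleaned <- sub(",", ".", no_space, fixed = TRUE)

# Convert to numbers once the strings are clean.
values <- as.numeric(cleaned)
print(values)
```

This assumes R's PCRE build has Unicode property support, which \\p{L} and (*UCP) both rely on.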


7 Comments

Thanks for the detailed explanations! However with [[:space:]] I still don't get rid of the non-breaking space. I have to use test = sub(",", ".", gsub("\u00A0|[[:space:][:alpha:]]+|\\W+$", "", area_cult10$V5), fixed=TRUE) to make it work. It's still puzzling why it works for you..
@andrasz: Hm, I have 2 ideas how to solve it in another way, but no idea as to why it fails in different cases. Try also with gsub using "(*UCP)[\\s\\p{L}]+|\\W+$" pattern while passing perl=TRUE argument. Are you on Linux?
Yes, on Linux Mint 18 based on Ubuntu 14.04. Does that help?
Yes, you may enumerate all the Unicode whitespace code points, and use something like [ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] (note the escape sequences are compatible with JavaScript, this is taken from MDN site), but when you use \s with the (*UCP) verb, it will match all Unicode whitespace. No need to worry about it next time.
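As a rough comparison of the two approaches in the last comment, the sketch below applies the enumerated class and the (*UCP) form to a hypothetical sample string. The escapes are written as R string escapes, and perl=TRUE is used for both calls; that choice is an assumption made here for consistency, not something the comment prescribes.

```r
# Explicit whitespace class from the comment above, using R's string escapes.
ws_class <- "[ \f\n\r\t\v\u00a0\u1680\u180e\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]"

# Hypothetical sample: digits separated by a non-breaking space, a thin space
# (U+2009), an ideographic space (U+3000) and a plain space.
x <- "1\u00a02\u20093\u30004 5"

# Remove every character listed in the explicit class.
explicit <- gsub(ws_class, "", x, perl = TRUE)

# The shorter Unicode-aware form recommended in the answer.
ucp <- gsub("(*UCP)\\s", "", x, perl = TRUE)

print(explicit)
print(ucp)
```

The enumerated class makes the set of removed characters explicit, while the (*UCP) form leaves that decision to PCRE's definition of whitespace.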

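With the stringr package, the [:space:] regex works fine. Note that str_replace() replaces only the first match; use str_replace_all() to remove every occurrence.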

str=paste0(sapply(c(104,101,108,160,108,111), FUN=function(x) intToUtf8(x)),collapse = "")
str
#> [1] "hel lo"
stringr::str_replace(str," ","")
#> [1] "hel lo"
stringr::str_replace(str,"[:space:]","")
#> [1] "hello"

Created on 2023-12-21 with reprex v2.0.2
