3

Is there a way to selectively remove spaces in a string, in bash? e.g.

hello world你好 世界!
hello world你 好 世 界!
hello world 你 好 世 界!
你 好 世 界 hello world

and output:

hello world你好世界!
你好世界hello world

Note that I want to preserve spaces between English words (or simply between letters of the English alphabet), but remove all the others.

I understand that Python's re module is probably good for this, but I would prefer a bash command if possible.

1
  • 1
    According to this table, the Kanji should be matched in bash by [[ $var =~ [[:unicode:]] ]], and based on this, you could build up an iterative solution. However, I found that, in my bash at least, this match does not work (although I have set LANG to a Unicode locale), and I don't know why. Maybe you could factor this out into a separate question on Stack Overflow, i.e. how to do a regex match in bash on characters with a Unicode code point above 255. Commented Oct 21, 2021 at 6:55

2 Answers

4

You can use sed:

echo hello world你好 世界! | sed -E "s/([^a-zA-Z]) ([^a-zA-Z])/\1\2/g"
  • ([^a-zA-Z]) ([^a-zA-Z]) is a regular expression matching a single space between two characters that are not ASCII Latin letters (inside the brackets, ^ negates the set). The preceding and following characters are captured in groups 1 and 2.
  • \1\2 is the replacement string: the two captured groups, without the space in between.

Output:

hello world你好世界!

Note: to also handle spaces at the start and end of the line, your expression should be:

(^|[^a-zA-Z]) ([^a-zA-Z]|$)
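
As a minimal sketch, the anchored expression can be dropped into the same sed command as before. The wrapper name trim_non_latin_spaces is hypothetical and only there to make the example easy to call; note that a leading or trailing space is removed only when its neighbour is not an ASCII letter.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper name, used only for this illustration.
trim_non_latin_spaces() {
  # (^|[^a-zA-Z]) also matches at the start of the line,
  # ([^a-zA-Z]|$) also matches at the end of the line.
  printf '%s\n' "$1" | sed -E 's/(^|[^a-zA-Z]) ([^a-zA-Z]|$)/\1\2/g'
}

trim_non_latin_spaces ' 你好 世界 '   # prints: 你好世界
```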

Edit: One thing I didn't take into account is that this kind of expression consumes the characters before and after the space, so they cannot take part in the next match. As a result, in the case 你 好 世 界 hello world, some spaces still remained. You then have to use a regex engine that supports lookarounds:

echo " 你 好 世 界 hello world, !"  | perl -pe "s/(?<=^|[^[:ascii:]]) | (?=[^[:ascii:]]|$)//g"

Output:

你好世界hello world, !

To also remove the space between Latin characters and kanji, I split the expression into two alternatives. I also replaced the condition on Latin letters with one on ASCII characters, which should give more appropriate matches.
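
If you would rather stay with sed, one possible workaround for the consumed-character problem is to repeat the substitution until nothing changes, using a label and the t (branch on successful substitution) command. This is only a sketch: squeeze_spaces is a hypothetical helper name, and because it keeps the original [^a-zA-Z] classes it still leaves the space between 界 and hello that the lookaround version removes. The separate -e arguments are an assumption made so that the label is parsed the same way by GNU and BSD sed.

```shell
#!/usr/bin/env bash
# Hypothetical helper name; not part of the original answer.
squeeze_spaces() {
  # :a defines a label; ta jumps back to it whenever the s command
  # made a substitution, so overlapping matches are handled one by one.
  printf '%s\n' "$1" |
    sed -E -e ':a' -e 's/([^a-zA-Z]) ([^a-zA-Z])/\1\2/' -e 'ta'
}

squeeze_spaces 'hello world你 好 世 界!'   # prints: hello world你好世界!
```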


15 Comments

sed looks like a solid way, but this is not yet handling other cases like "hello world你 好 世 界!". It gives me "hello world你好 世界" on OSX (BSD sed).
@galactica Try g instead of p?
Still seeing an extra space in "你好 世界". Yeah, it's strange that g doesn't seem to work in this case.
@Shawn is right! Depending on how you plan to use it, you probably don't need the p, and the g should replace all occurrences (edited my answer).
@Tranbi: This does not work on strings such as xxx: ,yyyy. With your solution, the spaces would be removed here. I think a correct approach would be to distinguish between Unicode characters below and above 255.
1

A Perl solution using Unicode properties (in particular, whether or not a character is in the Latin script):

$ perl -CSD -lpe 's/^\s+//; # Remove leading spaces
                  s/\s+$//; # Remove trailing spaces
                  # Remove spaces between two non-latin characters.
                  s/(\P{scx=Latin})\s++(?=\P{scx=Latin})/$1/g; 
                  # Remove spaces between a leading latin and trailing non-latin
                  s/(\p{scx=Latin})\s++(?=\P{scx=Latin})/$1/g;
                  # Remove spaces between a leading non-latin and trailing latin
                  s/(\P{scx=Latin})\s++(?=\p{scx=Latin})/$1/g;' input.txt
hello world你好世界!
hello world你好世界!
hello world你好世界!
你好世界hello world

It does a bunch of substitutions for the different cases where you want to remove spaces instead of trying to use a single regular expression to match every possibility.

2 Comments

Depending on your actual text, using ASCII vs non-ASCII might work better, as things like punctuation characters aren't in the Latin script (they're in Common). If so, use [[:ascii:]] and [^[:ascii:]] instead of \p{scx=Latin} and \P{scx=Latin}.
Haven't touched Perl's regex for a while, but this is clearly structured, thanks!
