bash how to selectively remove space in a string

Question

Is there a way to selectively remove spaces in a string, in bash? e.g.

hello world你好 世界！
hello world你 好 世 界！
hello world 你 好 世 界！
你 好 世 界 hello world

and output:

hello world你好世界！
你好世界hello world

Notice that I want to preserve spaces between English words (or simply between letters of the English alphabet), but not the others.

I understand Python's re module is probably good for this, but I prefer a bash command if possible.

According to this table, the Kanji should be matched in bash by [[ $var =~ [[:unicode:]] ]], and based on this, you could build up an iterative solution. However, I found that in my bash at least, this match does not work (although I have set LANG to a Unicode locale). I don't know why. Maybe you could factor this out into a separate question on Stack Overflow, i.e. how to do a regex match in bash on characters with a Unicode code point above 255.
– user1934428, Commented Oct 21, 2021 at 6:55
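
The iterative approach mentioned in this comment can be sketched in pure bash without any Unicode character class, by testing for ASCII letters instead. This is only a sketch under assumptions: the function name keep_latin_spaces is made up for illustration, and the rule "keep a space only when both neighbours are ASCII letters" is one reading of the question, not something the asker confirmed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption): copy the input one character
# at a time and keep a space only when it sits between two ASCII letters.
keep_latin_spaces() {
  local in=$1 out='' ch prev next i
  for ((i = 0; i < ${#in}; i++)); do
    ch=${in:i:1}
    if [[ $ch == ' ' ]]; then
      prev=${out: -1}        # last character already kept (empty at the start)
      next=${in:i+1:1}       # character following the space (empty at the end)
      # Drop the space unless both neighbours are ASCII letters.
      [[ $prev == [a-zA-Z] && $next == [a-zA-Z] ]] || continue
    fi
    out+=$ch
  done
  printf '%s\n' "$out"
}

keep_latin_spaces '你 好 世 界 hello world'   # prints: 你好世界hello world
```

Because the test is "is this an ASCII letter" rather than "is this a Kanji", punctuation and digits count as non-letters here, so spaces next to them are removed as well.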

Tranbi · Accepted Answer · 2021-10-21 07:18:41Z

4

You can use sed:

echo hello world你好 世界！ | sed -E "s/([^a-zA-Z]) ([^a-zA-Z])/\1\2/g"

([^a-zA-Z]) ([^a-zA-Z]) is a regular expression matching a space between two non-Latin characters (inside the brackets, ^ negates the set). The preceding and following characters are captured in groups (#1 and #2).
\1\2 is the replacement string (the two groups only, without the space in between).

Output:

hello world你好世界！

Note: to replace starting and trailing whitespaces, your expression should be:

(^|[^a-zA-Z]) ([^a-zA-Z]|$)
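
As a quick illustration of the note above, the anchored expression can be dropped into the same sed command. A minimal sketch, assuming a sed that supports -E; the function name trim_cjk_edges is made up for this example.

```shell
# Hypothetical wrapper (name is an assumption) around the anchored pattern.
# ^ and $ stand in for the missing neighbour at the start and end of the line.
trim_cjk_edges() {
  sed -E 's/(^|[^a-zA-Z]) ([^a-zA-Z]|$)/\1\2/g'
}

printf '%s\n' ' 你好 世界 ' | trim_cjk_edges   # prints: 你好世界
```

A leading or trailing space next to a Latin letter, as in ' hello ', is left alone, because neither side of the pattern accepts a letter.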

Edit: One thing I didn't take into account is that this kind of expression consumes the characters before and after the whitespace, so a character used by one match cannot start the next one. In the case 你好世界 hello world, a space still remained. You then have to use a regex engine that supports lookarounds:

echo " 你 好 世 界 hello world, !"  | perl -pe "s/(?<=^|[^[:ascii:]]) | (?=[^[:ascii:]]|$)//g"

Output:

你好世界hello world, !

To remove the space between Latin characters and kanji, I split the expression into two alternatives. I also replaced the condition on Latin characters with one on ASCII, which should give more appropriate matches.
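
If Perl is not an option, the "consumed character" problem with the first sed expression can also be worked around by repeating the substitution until nothing changes. This is a sketch of that alternative technique, not part of the original answer; the function name strip_cjk_spaces and the label name again are arbitrary choices.

```shell
# Hypothetical helper (name is an assumption): loop inside sed so that a
# character consumed by one match can still anchor the next one.
strip_cjk_spaces() {
  sed -E -e ':again' \
         -e 's/([^a-zA-Z]) ([^a-zA-Z])/\1\2/' \
         -e 't again'
}

printf '%s\n' 'hello world你 好 世 界！' | strip_cjk_spaces   # prints: hello world你好世界！
```

The t command jumps back to the label whenever the substitution succeeded, so the pattern is retried on the already shortened line. This only fixes the overlap issue: a space between a non-Latin and a Latin character (as in 界 hello) is still kept, because the pattern requires non-Latin characters on both sides.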

edited Oct 21, 2021 at 7:18

answered Oct 21, 2021 at 5:56

Tranbi


15 Comments

galactica Over a year ago

sed looks like a solid way, but this is not yet handling other cases like "hello world你好世界！". It gives me "hello world你好世界" on OSX (BSD sed).

Shawn Over a year ago

@galactica Try g instead of p?

galactica Over a year ago

still seeing an extra space between "你好世界", yeah, it's strange that g doesn't seem to work in this case.

Tranbi Over a year ago

@Shawn is right! depending on how you plan to use it, you probably don't need the p. and the g should replace all occurrences (edited my answer)

user1934428 Over a year ago

@Tranbi : This does not work on strings such as xxx: ,yyyy. According to your solution, the spaces would be removed here. I think a correct approach would be to distinguish between unicode characters below and above 255.


Shawn · Accepted Answer · 2021-10-21 10:29:54Z

1

A perl solution using Unicode properties (in particular, whether a character is or isn't in the Latin script):

$ perl -CSD -lpe 's/^\s+//; # Remove leading spaces
                  s/\s+$//; # Remove trailing spaces
                  # Remove spaces between two non-latin characters.
                  s/(\P{scx=Latin})\s++(?=\P{scx=Latin})/$1/g; 
                  # Remove spaces between a leading latin and trailing non-latin
                  s/(\p{scx=Latin})\s++(?=\P{scx=Latin})/$1/g;
                  # Remove spaces between a leading non-latin and trailing latin
                  s/(\P{scx=Latin})\s++(?=\p{scx=Latin})/$1/g;' input.txt
hello world你好世界！
hello world你好世界！
hello world你好世界！
你好世界hello world

It does a bunch of substitutions for the different cases where you want to remove spaces instead of trying to use a single regular expression to match every possibility.

answered Oct 21, 2021 at 10:29

Shawn


2 Comments

Shawn Over a year ago

Depending on your actual text, using ASCII vs non-ASCII might work better, as things like punctuation characters aren't in the Latin script (they're in Common). If so, use [[:ascii:]] and [^[:ascii:]] instead of \p{scx=Latin} and \P{scx=Latin}.

galactica Over a year ago

haven't touched perl's regex for a while but this is clearly structured out there, thanks!
