48

I have a string:

"foo (2 spaces) bar (3 spaces) baaar (6 spaces) fooo"

How do I remove the repeated spaces in it so that there is no more than one space between any two words?

  • You know, this kind of question is easily answered by reviewing all the String methods. I highly recommend getting familiar with the documentation for the String, Array, and Enumerable methods. Commented Feb 5, 2011 at 15:41
  • In case you don't know where to start, visit http://ruby-doc.org/ and then click on the Core API link and then click on the String class in the top middle column. Commented Feb 5, 2011 at 15:51
  • In the OP's defense, removing the spaces can be accomplished in several ways, not all of which are intuitive, especially when you look at the benchmark results. Commented Dec 30, 2011 at 19:19

7 Answers

107

String#squeeze has an optional parameter to specify characters to squeeze.

irb> "asd  asd asd   asd".squeeze(" ")
=> "asd asd asd asd"

Warning: calling it without a parameter will 'squeeze' ALL repeated characters, not only spaces:

irb> 'aaa     bbbb     cccc 0000123'.squeeze
=> "a b c 0123"
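
As a quick check on what squeeze(" ") does and does not touch, here is a small sketch. The sample string is invented for illustration; it adds leading spaces, tabs, and a newline to show that only runs of the listed characters are collapsed.

```ruby
# Invented sample: leading spaces, a double tab, and a newline.
s = "  foo   bar\t\tbaz  \n  qux"

# Only runs of the space character are collapsed. Tabs and newlines are kept,
# and the leading run of spaces becomes a single leading space.
p s.squeeze(" ")     # => " foo bar\t\tbaz \n qux"

# Several characters can be listed; runs of each one are collapsed separately.
p s.squeeze(" \t")   # => " foo bar\tbaz \n qux"
```

If leading or trailing spaces should go as well, a strip afterwards takes care of that.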

51
>> str = "foo  bar   bar      baaar"
=> "foo  bar   bar      baaar"
>> str.split.join(" ")
=> "foo bar bar baaar"

5 Comments

+1 For an amusing way to do it, but -1 for an inefficient suggestion compared to other, more appropriate alternatives.
@zetetic. Thanks. This further proves that split/join is neither merely amusing nor less efficient than regex substitution (as I have always known).
@kurumi it is amusing and inefficient unless you have small strings to work on. For the articles I'm working on, squeeze ' ' is an order of magnitude faster.
This method also removes leading and trailing spaces which might not be intended
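
To illustrate the last comment, here is a minimal sketch comparing the two approaches on a padded string. The sample string is made up for the example.

```ruby
# Invented sample with leading and trailing spaces.
padded = "  foo  bar   baaar  "

# split with no argument discards leading and trailing whitespace,
# so the padding is gone after the join.
p padded.split.join(" ")       # => "foo bar baaar"

# squeeze(" ") only collapses runs, so one space survives at each end.
p padded.squeeze(" ")          # => " foo bar baaar "

# Add strip when the padding should be removed as well.
p padded.squeeze(" ").strip    # => "foo bar baaar"
```

Note also that split with no argument splits on any whitespace, so tabs and newlines end up replaced by single spaces after the join.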
29

Updated benchmark from @zetetic's answer:

require 'benchmark'
include Benchmark

string = "foo  bar   bar      baaar"
n = 1_000_000
bm(12) do |x|
  x.report("gsub      ")   { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze(' ')") { n.times { string.squeeze(' ') } }
  x.report("split/join")   { n.times { string.split.join(" ") } }
end

Which results in these values when run on my desktop after running it twice:

ruby test.rb; ruby test.rb
                  user     system      total        real
gsub          6.060000   0.000000   6.060000 (  6.061435)
squeeze(' ')  4.200000   0.010000   4.210000 (  4.201619)
split/join    3.620000   0.000000   3.620000 (  3.614499)
                  user     system      total        real
gsub          6.020000   0.000000   6.020000 (  6.023391)
squeeze(' ')  4.150000   0.010000   4.160000 (  4.153204)
split/join    3.590000   0.000000   3.590000 (  3.587590)

The issue is that squeeze with no argument removes any repeated character, which produces a different output string and doesn't meet the OP's need. squeeze(' ') does meet the need, but runs more slowly.

string.squeeze
 => "fo bar bar bar"

I wondered how split.join could be faster, and it didn't seem like that would hold up with large strings, so I adjusted the benchmark to see what effect long strings would have:

require 'benchmark'
include Benchmark

string = (["foo  bar   bar      baaar"] * 10_000).join
puts "String length: #{ string.length } characters"
n = 100
bm(12) do |x|
  x.report("gsub      ")   { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze(' ')") { n.times { string.squeeze(' ') } }
  x.report("split/join")   { n.times { string.split.join(" ") } }
end

ruby test.rb ; ruby test.rb

String length: 250000 characters
                  user     system      total        real
gsub          2.570000   0.010000   2.580000 (  2.576149)
squeeze(' ')  0.140000   0.000000   0.140000 (  0.150298)
split/join    1.400000   0.010000   1.410000 (  1.396078)

String length: 250000 characters
                  user     system      total        real
gsub          2.570000   0.010000   2.580000 (  2.573802)
squeeze(' ')  0.140000   0.000000   0.140000 (  0.150384)
split/join    1.400000   0.010000   1.410000 (  1.397748)

So, long lines do make a big difference.
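
Before comparing timings it is worth confirming that the three approaches return the same string for this input. The sketch below does that and then takes one rough timing per approach with Benchmark.realtime. The hash of lambdas and the iteration count of 10 are choices made for this example, not part of the benchmark above, and the timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

string = (["foo  bar   bar      baaar"] * 10_000).join

# Lambdas keyed by label so the same code can check output and measure time.
approaches = {
  "gsub"         => ->(s) { s.gsub(/\s+/, " ") },
  "squeeze(' ')" => ->(s) { s.squeeze(' ') },
  "split/join"   => ->(s) { s.split.join(" ") }
}

# This input contains only spaces and has no leading or trailing whitespace,
# so all three results should be the same string.
outputs = approaches.map { |_label, fn| fn.call(string) }
puts "identical output: #{outputs.uniq.size == 1}"

# One wall-clock measurement per approach.
approaches.each do |label, fn|
  seconds = Benchmark.realtime { 10.times { fn.call(string) } }
  puts format("%-13s %.4f s", label, seconds)
end
```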


If you do use gsub then gsub(/\s{2,}/, ' ') is slightly faster.

Not really. Here's a version of the benchmark to test just that assertion:

require 'benchmark'
include Benchmark

string = "foo  bar   bar      baaar"
puts string.gsub(/\s+/, " ")
puts string.gsub(/\s{2,}/, ' ')
puts string.gsub(/\s\s+/, " ")

string = (["foo  bar   bar      baaar"] * 10_000).join
puts "String length: #{ string.length } characters"
n = 100
bm(18) do |x|
  x.report("gsub")               { n.times { string.gsub(/\s+/, " ") } }
  x.report('gsub/\s{2,}/, "")')  { n.times { string.gsub(/\s{2,}/, ' ') } }
  x.report("gsub2")              { n.times { string.gsub(/\s\s+/, " ") } }
end
# >> foo bar bar baaar
# >> foo bar bar baaar
# >> foo bar bar baaar
# >> String length: 250000 characters
# >>                          user     system      total        real
# >> gsub                 1.380000   0.010000   1.390000 (  1.381276)
# >> gsub/\s{2,}/, "")    1.590000   0.000000   1.590000 (  1.609292)
# >> gsub2                1.050000   0.010000   1.060000 (  1.051005)

If you want speed from gsub, use the gsub2 variant, gsub(/\s\s+/, " "). squeeze(' ') will still run circles around any gsub implementation though.
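
The three patterns print the same result for the benchmark string, but they are not interchangeable in general. A minimal sketch with an invented sample string that contains a single tab and a single newline:

```ruby
# Invented sample: single tab, double space, single newline.
s = "foo\tbar  baz\nqux"

# /\s+/ matches runs of length one too, so a lone tab or newline becomes a space.
p s.gsub(/\s+/, " ")      # => "foo bar baz qux"

# /\s{2,}/ and /\s\s+/ only match runs of two or more whitespace characters,
# so the lone tab and newline are left alone.
p s.gsub(/\s{2,}/, " ")   # => "foo\tbar baz\nqux"
p s.gsub(/\s\s+/, " ")    # => "foo\tbar baz\nqux"
```

Which behavior is wanted depends on whether single tabs and newlines should be normalized to spaces.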

6 Comments

@zetetic, I think Benchmark is an essential tool. I can't count how many times I've assumed something would be the fastest way to do a particular task, and had benchmark prove me wrong. I'd never have considered split/join to be fastest, though I've used it in apps for this purpose.
@zetetic, check out the added test results.
My inference is that if you avoided the interpolation in join(" ") by using join(' '), it should be even (immeasurably?) faster.
Nope. We've tested that before, and it makes no difference. Strings, whether defined using double-quotes or single-quotes, are defined as the code is initially parsed by the interpreter at startup, not on the fly. The only time it could make a difference is if there are values being interpolated into the string at run-time.
While it might seem so, benchmarks don't bear that out. Looking for "2 or more" takes longer than "1 or more". See the added benchmark.
28

Important note: this answer is for Ruby on Rails, not plain Ruby (ActiveSupport is part of Rails; Facets is a separate gem).

To complement the other answers, note that both ActiveSupport [1] and Facets [2] provide String#squish ([update] caveat: it also removes newlines within the string):

>> "foo  bar   bar      baaar".squish
=> "foo bar bar baaar"
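
Outside Rails, similar behavior can be approximated in plain Ruby. The helper below is hypothetical (squish_like is a name made up for this sketch, not part of Ruby, ActiveSupport, or Facets), and the library implementations may differ in detail, for example in how they treat Unicode whitespace.

```ruby
# Hypothetical helper: collapse every whitespace run (spaces, tabs, newlines)
# to one space, then drop leading and trailing whitespace.
def squish_like(str)
  str.gsub(/\s+/, " ").strip
end

p squish_like("  foo  bar \n\t baaar  ")   # => "foo bar baaar"
```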

[1]: http://www.rubydoc.info/docs/rails/2.3.8/ActiveSupport/CoreExtensions/String/Filters#squish-instance_method
[2]: http://www.rubydoc.info/github/rubyworks/facets/String%3Asquish

1 Comment

Holy cow. I literally threw up my arms in amazement. I just replaced this: str.tr("\r","").tr("\n", "").tr("\t", "").squeeze(" ") with this: str.squish
9

Use a regular expression to match repeated whitespace (\s+) and replace it with a single space.

"foo    bar  foobar".gsub(/\s+/, ' ')
=> "foo bar foobar"

Note that \s matches any whitespace character, including tabs and newlines. If you only want to replace spaces, use / +/ instead of /\s+/.

"foo    bar  \nfoobar".gsub(/ +/, ' ')
=> "foo bar \nfoobar"

5

Which method performs better?

$ ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]

$ cat squeeze.rb 
require 'benchmark'
include Benchmark

string = "foo  bar   bar      baaar"
n = 1_000_000
bm(6) do |x|
  x.report("gsub      ") { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze   ") { n.times { string.squeeze } }
  x.report("split/join") { n.times { string.split.join(" ") } }
end

$ ruby squeeze.rb 
            user     system      total        real
gsub        4.970000   0.020000   4.990000 (  5.624229)
squeeze     0.600000   0.000000   0.600000 (  0.677733)
split/join  2.950000   0.020000   2.970000 (  3.243022)

1 Comment

This benchmark is not quite correct. string.squeeze => "fo bar bar bar", because it strips any repeated character. Changing to string.squeeze(' ') results in times that put it solidly between gsub and split.join(' '), with the last being the fastest. See my answer for the updated benchmark code.
3

Just use gsub and regexp. For example:

str = "foo  bar   bar      baaar"
str.gsub(/\s+/, " ")

This returns a new string; to modify str in place, use gsub! instead.
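
One thing to watch with the bang variants: gsub! and squeeze! return nil when they change nothing, so chaining further calls onto their result is risky. A minimal sketch, with dup used only so the example also works where string literals are frozen:

```ruby
str = "foo  bar   bar      baaar".dup
result = str.gsub!(/\s+/, " ")
p str                  # => "foo bar bar baaar" (the receiver itself was changed)
p result.equal?(str)   # => true (the same object is returned after a substitution)

no_ws = "foobar".dup
p no_ws.gsub!(/\s+/, " ")   # => nil (nothing matched, so nothing was substituted)
p no_ws.squeeze!(" ")       # => nil (no run of spaces to collapse)
```

Calling the bang method as a statement of its own and then using the variable avoids the nil problem.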

BTW, regular expressions are very useful, and there are plenty of resources on the internet. For testing your own regexps, try rubular.com, for example.
