I have a string:
"foo (2 spaces) bar (3 spaces) baaar (6 spaces) fooo"
How do I remove repetitious spaces in it so there should be no more than one space between any two words?
String#squeeze has an optional parameter to specify characters to squeeze.
irb> "asd  asd   asd    asd".squeeze(" ")
=> "asd asd asd asd"
Warning: calling it without a parameter will squeeze ALL repeated characters, not only spaces:
irb> 'aaa bbbb cccc 0000123'.squeeze
=> "a b c 0123"
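To make the difference between the two call forms concrete, here is a minimal sketch. The sample string is made up for illustration and contains both runs of spaces and runs of other characters:

```ruby
# Illustrative input: runs of spaces plus runs of letters and digits.
text = "aaa  bbbb   cccc    0000123"

# With an argument, only runs of the listed characters are collapsed.
spaces_only = text.squeeze(" ")

# Without an argument, every run of identical characters is collapsed.
everything = text.squeeze

puts spaces_only   # aaa bbbb cccc 0000123
puts everything    # a b c 0123
```

So for the question as asked, squeeze(" ") is the form you want.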
>> str = "foo  bar   bar      baaar"
=> "foo  bar   bar      baaar"
>> str.split.join(" ")
=> "foo bar bar baaar"
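One thing worth knowing about split.join(" ") is that it is not exactly the same operation as squeeze(" "). A small sketch with a made-up input that has leading and trailing spaces, newlines and a tab:

```ruby
# Illustrative input with leading/trailing spaces, newlines and a tab.
text = "  foo  bar\n\nbaz\t qux  "

# split with no argument splits on any whitespace and drops empty fields,
# so leading/trailing whitespace, tabs and newlines all disappear.
joined = text.split.join(" ")

# squeeze(" ") only collapses runs of the space character.
squeezed = text.squeeze(" ")

p joined    # "foo bar baz qux"
p squeezed  # " foo bar\n\nbaz\t qux "
```

Whether that difference matters depends on whether your input can contain tabs or newlines that you want to keep.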
squeeze ' ' is an order of magnitude faster.
Updated benchmark from @zetetic's answer:
require 'benchmark'
include Benchmark
string = "foo bar bar baaar"
n = 1_000_000
bm(12) do |x|
  x.report("gsub ") { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze(' ')") { n.times { string.squeeze(' ') } }
  x.report("split/join") { n.times { string.split.join(" ") } }
end
Which results in these values when run on my desktop after running it twice:
ruby test.rb; ruby test.rb
                  user     system      total        real
gsub          6.060000   0.000000   6.060000 (  6.061435)
squeeze(' ')  4.200000   0.010000   4.210000 (  4.201619)
split/join    3.620000   0.000000   3.620000 (  3.614499)
                  user     system      total        real
gsub          6.020000   0.000000   6.020000 (  6.023391)
squeeze(' ')  4.150000   0.010000   4.160000 (  4.153204)
split/join    3.590000   0.000000   3.590000 (  3.587590)
The issue is that squeeze without an argument removes any repeated character, which results in a different output string and doesn't meet the OP's need. squeeze(' ') does meet the need, but it is slower than the bare call.
string.squeeze
=> "fo bar bar bar"
I was thinking about how the split.join could be faster and it didn't seem like that would hold up in large strings, so I adjusted the benchmark to see what effect long strings would have:
require 'benchmark'
include Benchmark
string = (["foo bar bar baaar"] * 10_000).join
puts "String length: #{ string.length } characters"
n = 100
bm(12) do |x|
  x.report("gsub ") { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze(' ')") { n.times { string.squeeze(' ') } }
  x.report("split/join") { n.times { string.split.join(" ") } }
end
ruby test.rb ; ruby test.rb
String length: 250000 characters
                  user     system      total        real
gsub          2.570000   0.010000   2.580000 (  2.576149)
squeeze(' ')  0.140000   0.000000   0.140000 (  0.150298)
split/join    1.400000   0.010000   1.410000 (  1.396078)
String length: 250000 characters
                  user     system      total        real
gsub          2.570000   0.010000   2.580000 (  2.573802)
squeeze(' ')  0.140000   0.000000   0.140000 (  0.150384)
split/join    1.400000   0.010000   1.410000 (  1.397748)
So, long lines do make a big difference.
If you do use gsub, then gsub(/\s{2,}/, ' ') is slightly faster.
Not really. Here's a version of the benchmark to test just that assertion:
require 'benchmark'
include Benchmark
string = "foo bar bar baaar"
puts string.gsub(/\s+/, " ")
puts string.gsub(/\s{2,}/, ' ')
puts string.gsub(/\s\s+/, " ")
string = (["foo bar bar baaar"] * 10_000).join
puts "String length: #{ string.length } characters"
n = 100
bm(18) do |x|
  x.report("gsub") { n.times { string.gsub(/\s+/, " ") } }
  x.report('gsub/\s{2,}/, "")') { n.times { string.gsub(/\s{2,}/, ' ') } }
  x.report("gsub2") { n.times { string.gsub(/\s\s+/, " ") } }
end
# >> foo bar bar baaar
# >> foo bar bar baaar
# >> foo bar bar baaar
# >> String length: 250000 characters
# >>                         user     system      total        real
# >> gsub                1.380000   0.010000   1.390000 (  1.381276)
# >> gsub/\s{2,}/, "")   1.590000   0.000000   1.590000 (  1.609292)
# >> gsub2               1.050000   0.010000   1.060000 (  1.051005)
If you want speed from gsub, use the gsub2 variant (/\s\s+/). squeeze(' ') will still run circles around any gsub implementation, though.
split/join to be fastest, though I've used it in apps for this purpose.
Important note: this is an answer for Ruby on Rails, not plain Ruby
(ActiveSupport ships with Rails; Facets is a separate gem)
To complement the other answers, note that both [ActiveSupport][1] and [Facets][2] provide String#squish ([update] caveat: it also collapses newlines and tabs within the string and strips leading and trailing whitespace):
>> "foo  bar   bar      baaar".squish
=> "foo bar bar baaar"
[1]: http://www.rubydoc.info/docs/rails/2.3.8/ActiveSupport/CoreExtensions/String/Filters#squish-instance_method
[2]: http://www.rubydoc.info/github/rubyworks/facets/String%3Asquish
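If you are on plain Ruby and don't want to pull in either gem, the same effect can be approximated in one line. This is only a sketch: the helper name squish_plain is made up here, and it is meant to mimic the behaviour described above, not to reproduce the library code:

```ruby
# Hypothetical helper, not part of Ruby, ActiveSupport or Facets.
def squish_plain(str)
  # Collapse every run of whitespace (spaces, tabs, newlines) to one space,
  # then trim whitespace from both ends. Returns a new string.
  str.gsub(/[[:space:]]+/, " ").strip
end

p squish_plain("  foo   bar \n\t baaar  ")  # "foo bar baaar"
```

Like squish, this flattens newlines too, so it is not a drop-in replacement for squeeze(" ") on multi-line text.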
str.tr("\r","").tr("\n", "").tr("\t", "").squeeze(" ") with this: str.squish
Use a regular expression to match repeating whitespace (\s+) and replace it with a space.
"foo    bar  foobar".gsub(/\s+/, ' ')
=> "foo bar foobar"
This matches every kind of whitespace. If you only want to replace spaces, use / +/ instead of /\s+/.
"foo    bar \nfoobar".gsub(/ +/, ' ')
=> "foo bar \nfoobar"
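To see the two patterns side by side, here is a short sketch on one made-up input that mixes spaces, a newline and a tab:

```ruby
# Illustrative input: a run of spaces, then space + newline + tab.
text = "foo   bar \n\tfoobar"

# \s+ matches any whitespace, so the newline and tab are collapsed as well.
any_whitespace = text.gsub(/\s+/, " ")

# / +/ matches only the space character; the newline and tab survive.
spaces_only = text.gsub(/ +/, " ")

p any_whitespace  # "foo bar foobar"
p spaces_only     # "foo bar \n\tfoobar"
```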
Which method performs better?
$ ruby -v
ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]
$ cat squeeze.rb
require 'benchmark'
include Benchmark
string = "foo bar bar baaar"
n = 1_000_000
bm(6) do |x|
  x.report("gsub ") { n.times { string.gsub(/\s+/, " ") } }
  x.report("squeeze ") { n.times { string.squeeze } }
  x.report("split/join") { n.times { string.split.join(" ") } }
end
$ ruby squeeze.rb
user system total real
gsub 4.970000 0.020000 4.990000 ( 5.624229)
squeeze 0.600000 0.000000 0.600000 ( 0.677733)
split/join 2.950000 0.020000 2.970000 ( 3.243022)
string.squeeze => "fo bar bar bar", which strips any repeated character. Changing to string.squeeze(' ') results in times that put it solidly between gsub and split.join(' '), with the last being the fastest. See my answer for the updated benchmark code.
Just use gsub and a regexp.
For example:
str = "foo bar bar baaar"
str.gsub(/\s+/, " ")
This returns a new string; to modify str in place, use gsub! instead.
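A quick sketch of the difference between the two, with one gotcha: gsub! returns nil when the pattern never matches, so it is safer not to chain further calls onto its return value. The variable names are just for illustration:

```ruby
# dup keeps the example working even when string literals are frozen.
str = "foo   bar  bar      baaar".dup

copy = str.gsub(/\s+/, " ")      # new string; str still has its extra spaces
changed = str.gsub!(/\s+/, " ")  # rewrites str in place and returns str

# gsub! returns nil when the pattern never matches.
no_match = "foobar".dup.gsub!(/\s+/, " ")

p copy      # "foo bar bar baaar"
p str       # "foo bar bar baaar"
p no_match  # nil
```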
By the way, regular expressions are very useful, and there are plenty of resources on the internet. To test your own regexps, try rubular.com, for example.