Let's suppose your given string were
str = "Mary had fourteen little lambs"
and the desired replacements were given by the following hash (aka hashmap):
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:
"Butch had fourteen wee hippos"
Notice that we do not want 'fourteen' to be replaced with 'threeteen', and that we want any extra whitespace between words (for example, between 'fourteen' and 'little') to be preserved.
First collect the keys of the hash h into an array (or list):
keys = h.keys
#=> ["Mary", "four", "little", "lambs"]
Many languages provide a substitution method or function (often named sub or gsub) that works something like the following:
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch had fourteen wee hippos"
The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.
Each matched word is passed to the "block" do |word| ... end, where it is held by the block variable word. The block computes and returns the string that replaces that match in a copy of the original string. Different languages use different structures and formats to do this, of course.
The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:
if keys.include?("Mary")  # true
  # so replace "Mary" with:
  h[word]  #=> "Butch"
else  # not executed
  # not executed
end
Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:
if keys.include?("had")  # false
  # not executed
else
  # so replace "had" with:
  "had"
  # that is, leave "had" unchanged
end
Similar calculations are made for each word matched by the regular expression.
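To make the word-by-word walk-through concrete, here is a minimal sketch that records every call to the block. The array trace is a hypothetical helper introduced only for this illustration; everything else is taken from the example above.

```ruby
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
keys = h.keys
str = "Mary had fourteen little lambs"

# trace is a hypothetical helper: it records [matched word, replacement] pairs.
trace = []
result = str.gsub(/\w+/) do |word|
  replacement = keys.include?(word) ? h[word] : word
  trace << [word, replacement]
  replacement  # the value of the block replaces the matched word
end

trace.each { |word, replacement| puts "#{word.inspect} -> #{replacement.inspect}" }
puts result
```

Each matched word appears once in trace, in the order it occurs in the string, and str itself is left unmodified because gsub returns a new string.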
We see that punctuation and other non-word characters are not a problem:
str = "Mary, had fourteen little lambs!"
str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "Butch, had fourteen wee hippos!"
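We can also see that gsub does not apply the replacements sequentially. It makes a single pass over the original string, so text produced by one replacement is never itself replaced:
h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
#=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> "bar baz"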
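For contrast, the following sketch compares the single-pass form with two chained gsub calls, one per key. The chained version is not part of the approach described here; it is shown only to illustrate what sequential replacement would do. The variable names single_pass and chained are made up for the example.

```ruby
h = { "foo"=>"bar", "bar"=>"baz" }

# Single pass: each word of the original string is looked up exactly once.
single_pass = "foo bar".gsub(/\w+/) { |word| h.key?(word) ? h[word] : word }

# Chained calls: the second call also sees the "bar" produced by the first.
chained = "foo bar".gsub("foo", "bar").gsub("bar", "baz")

puts single_pass  # "bar baz"
puts chained      # "baz baz"
```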
Note that a linear search of keys is required to evaluate
keys.include?("Mary")
This could be relatively time-consuming if keys has many elements.
In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.
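As a rough sketch of that speed-up in Ruby, the keys can be converted to a Set once, before the call to gsub. The name key_set is an assumption made for this example. Because the words are already the keys of a hash, the second variant skips the separate collection altogether and uses Hash#fetch with a default value.

```ruby
require 'set'

h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
str = "Mary had fourteen little lambs"

# key_set is a hypothetical name; membership tests on a Set are hash-based.
key_set = h.keys.to_set
with_set = str.gsub(/\w+/) { |word| key_set.include?(word) ? h[word] : word }

# Alternative: let the hash do the lookup and supply the fallback itself.
with_fetch = str.gsub(/\w+/) { |word| h.fetch(word, word) }

puts with_set
puts with_fetch
```

Both variants return the same string as the version that searches the array keys.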
An alternative formulation is to write
str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
#=> "Butch had fourteen wee hippos"
where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b)". The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub now only considers the replacement of those four words, the block can be simplified to { |word| h[word] }.
Again, this preserves punctuation and extra spaces.
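One possible way to build that regular expression from h.keys, rather than typing it by hand, is sketched below. It uses Regexp.union, which escapes any regex metacharacters in the keys and joins them with "|". The variable names alternation and pattern are assumptions made for this example.

```ruby
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
str = "Mary, had fourteen little lambs!"

# Join the (escaped) keys into a single alternation string.
alternation = Regexp.union(h.keys).source
# Wrap the alternation in a non-capturing group between word boundaries.
pattern = /\b(?:#{alternation})\b/

puts pattern.source
puts str.gsub(pattern) { |word| h[word] }
```

Because only keys of h can match, every word passed to the block is guaranteed to have a value in h.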
If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:
str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
#=> "Butch had threeteen wee hippos"
Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:
g = Hash.new { |h,k| k }.merge(h)
This creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:
str.gsub(/\w+/, g)
#=> "Butch had fourteen wee hippos"
See the form of String#gsub that takes a hash as its second argument.
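The behaviour of g can be checked directly with a short sketch. The block parameters are renamed to _hash and key here (an editorial choice, so that they do not shadow the outer h); otherwise the construction is the one given above.

```ruby
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
str = "Mary had fourteen little lambs"

# The default block returns the missing key itself; merge copies in h's pairs.
g = Hash.new { |_hash, key| key }.merge(h)

puts g["Mary"]       # an existing key returns its value
puts g["had"]        # a missing key returns the key itself
puts g.key?("had")   # false: the default block does not store anything in g
puts str.gsub(/\w+/, g)
```

Note that the lookup of a missing key leaves g unchanged, so g can be reused across many calls without growing.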
A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,
words = str.split
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
#=> "Butch had fourteen wee hippos"
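This produces similar results, except that any extra spaces between words are collapsed to single spaces, because split discards the original whitespace and join inserts exactly one space.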
Now suppose the string contained punctuation:
str = "Mary, had fourteen little lambs!"
words = str.split
#=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
#=> "Mary, had fourteen wee lambs!"
We could deal with punctuation by writing
words = str.scan(/\w+/)
#=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
#=> ["Butch", "had", "fourteen", "wee", "hippos"]
Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation is lost when the words are rejoined with arr.join(' ').
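The difference between the two approaches can be seen side by side in the sketch below. The test string, with three spaces before "little", is an assumption chosen for this illustration so that both the punctuation and the whitespace behaviour are visible; the names via_scan and via_gsub are likewise made up.

```ruby
h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }
keys = h.keys
# Hypothetical input: punctuation plus three spaces before "little".
str = "Mary, had fourteen   little lambs!"

# scan/map/join: punctuation and the original spacing are discarded.
via_scan = str.scan(/\w+/).map { |word| keys.include?(word) ? h[word] : word }.join(' ')

# gsub: everything that is not a matched word is copied through untouched.
via_gsub = str.gsub(/\w+/) { |word| keys.include?(word) ? h[word] : word }

puts via_scan
puts via_gsub
```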
foo is replaced with baf. How should aaafoooozzz exist after replacement: as aaabafoozzz or aaababafzzz?