Looking for efficient string replacement algorythm

Question

I'm trying to create a string replacer that accepts multilpe replacements.

The ideia is that it would scan the string to find substrings and replace those substrings with another substring.

For example, I should be able to ask it to replace every "foo" for "bar". Doing that is trivial.

The issue starts when I'm trying to add multiple replacements for this function. Because if I ask it to replace "foo" for "bar" and "bar" for "biz", running those replacements in sequence would result in "foo" turning to "biz", and this behavior is unintended.

I tried splitting the string into words and running each replacement function in each word. However that's not bullet proof either because still results in unintended behavior, since you can ask it to replace substrings that are not whole words. Also, I find that very inefficient.

I'm thinking in some way of running each replacer once in the whole string and sort of storing those changes and merging them. However I think I'm overengineering.

Searching on the web gives me trivial results on how to use string.replace with regular expressions, it doesn't solve my problem.

Is this a problem already solved? Is there an algorithm that can be used here for this string manipulation efficiently?

Have you considered overlapping substrings? For example, if your original string in "foof" and you want the replacements ("foo" -> "bar", "oof" -> "baz"), what should the result be? Or can we assume that the matches in the original string are disjoint intervals? — kcsquared
– kcsquared, Commented Mar 21, 2022 at 6:09
How many different substrings are you searching for, roughly? It's simpler to do a standard string search for each different substring, but something like Aho-Corasick for multiple-pattern search is more efficient with many distinct substrings. — kcsquared
– kcsquared, Commented Mar 21, 2022 at 6:25
@kcsquared Good point. For now I assume the matches would be at different indexes. The amount of substitution would be arbitrary, as much as the developer wants — Tramonta
– Tramonta, Commented Mar 21, 2022 at 17:23
Tramonta, if foo is replaced with baf, How should aaafoooozzz exist after replacement: as aaabafoozzz or aaababafzzz? — chux
– chux, Commented Mar 22, 2022 at 1:15

Abhinav Mathur · Accepted Answer · 2022-03-21 06:06:49Z

1

If you modify your string while searching for all occurences of substrings to be replaced, you'll end up modifying incorrect states of the string. An easy way out could be to get a list of all indexes to update first, then iterate over the indexes and make replacements. That way, indexes for "bar" would've been already computed, and won't be affected even if you replace any substring with "bar" later.

Adding a rough Python implementation to give you an idea:

import re
string = "foo bar biz"
replacements = [("foo", "bar"), ("bar", "biz")]
replacement_indexes = []
offset = 0
for item in replacements:
    replacement_indexes.append([m.start() for m in re.finditer(item[0], string)])

temp = list(string)
for i in range(len(replacement_indexes)):
    old, new, indexes = replacements[i][0], replacements[i][1], replacement_indexes[i]
    for index in indexes:
        temp[offset+index:offset+index+len(old)] = list(new)
        offset += len(new)-len(old)

print(''.join(temp)) # "bar biz biz"

edited Mar 21, 2022 at 6:06

answered Mar 21, 2022 at 5:17

Abhinav Mathur

8,2363 gold badges14 silver badges26 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Tramonta Over a year ago

Ooh, that seems cool! Thank you so much, I'll try to implement this. Great Idea! Quick question, did you came up with that on your on or is it an already established solution? I want to learn more

Abhinav Mathur Over a year ago

@Tramonta not sure if this an "established" solution

Enigmativity Over a year ago

@Tramonta - Until you implement and test that this works, I wouldn't have accepted the answer, if I were you.

Abhinav Mathur Over a year ago

@Enigmativity good point, I completed the implementation as a demonstration

Tramonta Over a year ago

@Enigmativity good point. Thanks, i'm still learning how to use Stack Overflow

|

Enigmativity · Accepted Answer · 2022-03-21 06:15:11Z

0

Here's the approach I would take.

I start with my text and the set of replacements:

string text = "alpha foo beta bar delta";

Dictionary<string, string> replacements = new()
{
    { "foo", "bar" },
    { "bar", "biz" },
};

Now I create an array of parts that are either "open" or not. Open parts can have their text replaced.

var parts = new List<(string text, bool open)>
{
    (text: text, open: true)
};

Now I run through each replacement and build a new parts list. If the part is open I can do the replacements, if it's closed just add it in untouched. It's this last bit that prevents double mapping of replacements.

Here's the main logic:

foreach (var replacement in replacements)
{
    var parts2 = new List<(string text, bool open)>();
    foreach (var part in parts)
    {
        if (part.open)
        {
            bool skip = true;
            foreach (var split in part.text.Split(new[] { replacement.Key }, StringSplitOptions.None))
            {
                if (skip)
                {
                    skip = false;
                }
                else
                {
                    parts2.Add((text: replacement.Value, open: false));
                }
                parts2.Add((text: split, open: true));
            }
        }
        else
        {
            parts2.Add(part);
        }
    }
    parts = parts2;
}

That produces the following:

Now it just needs to be joined back up again:

string result = String.Concat(parts.Select(p => p.text));

That gives:

alpha bar beta biz delta

As requested.

answered Mar 21, 2022 at 6:15

Enigmativity

117k12 gold badges101 silver badges184 bronze badges

2 Comments

Tramonta Over a year ago

It would be cool, but as I said, it needs to work even if the replacement is not a whole word but part of the word. I can't have "parts" because there is no set way of dividing the string in parts

Enigmativity Over a year ago

@Tramonta - How does the replacement of part of a word work?

Cary Swoveland · Accepted Answer · 2022-03-21 20:55:59Z

Let's suppose your given string were

str = "Mary had fourteen   little lambs"

and the desired replacements were given by the following hash (aka hashmap):

h = { "Mary"=>"Butch", "four"=>"three", "little"=>"wee", "lambs"=>"hippos" }

indicating that we want to replace "Mary" (wherever it appears in the string, if at all) with "Butch", and so on. We therefore want to return the following string:

"Butch had fourteen   wee hippos"

Notice that we do not want 'fourteen' to be replaced with 'threeteen' and we want the extra spaces between 'fourteen' and 'wee' to be preserved.

First collect the keys of the hash h into an array (or list):

keys = h.keys
  #=> ["Mary", "four", "little", "lambs"]

Most languages have a method or function sub or gsub that works something like the following:

str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  #=> "Butch had fourteen   wee hippos"

The regular expression /\w+/ (r'\w+' in Python, for example) matches one or more word characters, as many as possible (i.e., a greedy match). Word characters are letters, digits and the underscore ('_'). It therefore will sequentially match 'Mary', 'had', 'fourteen', 'little' and 'lambs'.

Each matched word is passed to the "block" do |word| ...end and is held by the variable word. The block calculation then computes and returns the string that is to replace the value of word in a duplicate of the original string. Different languages uses different structures and formats to do this, of course.

The first word passed to the block by gsub is 'Mary'. The following calculation is then performed:

if keys.include?("Mary") # true
  # so replace "Mary" with:
  h[word] #=> "Butch
else # not executed
  # not executed
end

Next, gsub passes the word 'had' to the block and assigns that string to the variable word. The following calculation is then performed:

if keys.include?("had") # false
  # not executed
else
  # so replace "had" with:
  "had"
  # that is, leave "had" unchanged
end

Similar calculations are made for each word matched by the regular expression.

We see that punctuation and other non-word characters is not a problem:

str = "Mary, had fourteen   little lambs!"

str.gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  #=> "Butch, had fourteen   wee hippos!"

We can see that gsub does not perform replacements sequentially:

h = { "foo"=>"bar", "bar"=>"baz" }
keys = h.keys
  #=> ["foo", "bar"]
"foo bar".gsub(/\w+/) do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  #=> "bar baz"

Note that a linear search of keys is required to evaluate

keys.include?("Mary")

This could be relatively time-consuming if keys has many elements.

In most languages this can be sped up by making keys a set (an unordered collection of unique elements). Determining whether a set contains a given element is quite fast, comparable to determining if a hash has a given key.

An alternative formulation is to write

 str.gsub(/\b(?:Mary|four|little|lambs)\b/) { |word| h[word] }
   #=> "Butch had fourteen   wee hippos"

where the regular expression is constructed programmatically from h.keys. This regular expression reads, "match one of the four words indicated, preceded and followed by a word boundary (\b). The trailing word boundary prevents 'four' from matching 'fourteen'. Since gsub is now only considering the replacement of those four words the block can be simplified to { |word| h[word] }.

Again, this preserves punctuation and extra spaces.

If for some reason we wanted to be able to replace parts of words (e.g., to replace 'fourteen' with 'threeteen'), simply remove the word boundaries from the regular expression:

 str.gsub(/Mary|four|little|lambs/) { |word| h[word] }
   #=> "Butch had threeteen   wee hippos"

Naturally, different languages provide variations of this approach. In Ruby, for example, one could write:

g = Hash.new { |h,k| k }.merge(h)

The creates a hash g that has the same key-value pairs as h but has the additional property that if g does not have a key k, g[k] (the value of key k) returns k. That allows us to write simply:

str.gsub(/\w+/, g)
  #=> "Butch had fourteen   wee hippos"

See the second version of String#gsub.

A different approach (which I will show is problematic) is to construct an array (or list) of words from the string, replace those words as appropriate and then rejoin the resulting words to form a string. For example,

words = str.split
  #=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  ["Butch", "had", "fourteen", "wee", "hippos"]
arr.join(' ')
  #=> "Butch had fourteen wee hippos"

This produces similar results except the extra spaces have been removed.

Now suppose the string contained punctuation:

str = "Mary, had fourteen   little lambs!"
words = str.split
  #=> ["Mary,", "had", "fourteen", "little", "lambs!"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  #=> ["Mary,", "had", "fourteen", "wee", "lambs!"]
arr.join(' ')
  #=> "Mary, had fourteen wee lambs!"

We could deal with punctuation by writing

words = str.scan(/\w+/)
  #=> ["Mary", "had", "fourteen", "little", "lambs"]
arr = words.map do |word|
  if keys.include?(word)
    h[word]
  else
    word
  end
end
  #=> ["Butch", "had", "fourteen", "wee", "hippos"]

Here str.scan returns an array of all matches of the regular expression /\w+/ (one or more word characters). The obvious problem is that all punctuation has been lost when arr.join(' ').

Ricardo Silva · Accepted Answer · 2022-03-22 00:37:28Z

0

You can achieve in a simple way, by using regular expressions:

import re

replaces = {'foo' : 'bar', 'alfa' : 'beta', 'bar': 'biz'}

original_string = 'foo bar, alfa foo. bar other.'
expected_string = 'bar biz, beta bar. biz other.'

replaced = re.compile(r'\w+').sub(lambda m: replaces[m.group()] if m.group() in replaces else m.group(), original_string)
assert replaced == expected_string

I haven't checked the performance, but I believe it is probably faster than using "nested for loops".

answered Mar 22, 2022 at 0:37

Ricardo Silva

83411 silver badges13 bronze badges

Collectives™ on Stack Overflow

Looking for efficient string replacement algorythm

4 Answers 4

7 Comments

2 Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

7 Comments

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related