0

I am using string.Replace to replace substring A

func removeIP(text string) string {
    text = strings.Replace(text, "someWord", "**NewWord**", -1)
    return text
}

func removeIPUsingRegex(text string) string {
    var re = regexp.MustCompile(`\b` + "someWord" + `\b`) // I want to replace whole word only
    text = re.ReplaceAllString(text, "**NewWord**")
}

The Issue I am facing here is, I want to replace a whole word only if is not supported by string replace.
As I have to replace for very very huge strings may be in GBs. Regex is very slow compare to string replace.
eg: text: "abcdef defgh /def/ .def/ =def= def xxxy" -> Replace def with DEF
output: "abcdef defgh /DEF/ .DEF/ =DEF= DEF xxxy" //Notice only whole words have been replaced.

Regex bumps the time by almost 100 times (https://medium.com/codezillas/golang-replace-vs-regexp-de4e48482f53). Any Ideas will be much appreciated.

6
  • 2
    The reason regex takes longer is that it's more complex, which is also why it actually works. Simple string replace is too simplistic to handle what you're drying to do. You could write you own replace function, iterating the string character by character, which would likely have performance somewhere between strings.Replace and regexp.ReplaceAllString, but it would also require more effort in development and testing. Commented Feb 28, 2020 at 17:29
  • If your text is several GBs big, and you access it using an io.Reader, you could also look at this question about replacing characters while streaming, it however doesn't answer the "whole word" thing Commented Feb 28, 2020 at 19:01
  • 1
    You can use very fast string search with Index() to find the places where your substring matches. Then check the character before and after the match. If they are not letters then replace this occurence of a substring. This will be just slightly slower than Replace. Commented Feb 28, 2020 at 19:31
  • An addition to my previous comment: to find a position of a substring starting at certain index inside the string use Slice() and then Index(): stackoverflow.com/questions/54090467/… Commented Feb 28, 2020 at 19:44
  • Yea, stream reading is the most likely answer. Go's re2 is already one of the most efficient regex engines but still not good for (really) huge files. Commented Feb 29, 2020 at 0:00

1 Answer 1

1

KMP ALGorithm used
// ReplaceWholeWord ...

 func ReplaceWholeWord(text string, oldWord string, newWord string) string {
        var patternLength = len(oldWord)
        var textLength = len(text) 

        var copyIndex = 0
        var textIndex = 0
        var patternIndex = 0
        var newString strings.Builder
        var lps = computeLPSArray(oldWord)

        for textIndex < textLength {
            if oldWord[patternIndex] == text[textIndex] {
                patternIndex++
                textIndex++
            }
            if patternIndex == patternLength {
                startIndex := textIndex - patternIndex
                endIndex := textIndex - patternIndex + patternLength - 1

                if checkIfWholeWord(text, startIndex, endIndex) {
                    if copyIndex != startIndex {
                        newString.WriteString(text[copyIndex:startIndex])
                    }
                    newString.WriteString(newWord)
                    copyIndex = endIndex + 1
                }

                patternIndex = 0
                textIndex = endIndex + 1

            } else if textIndex < textLength && oldWord[patternIndex] != text[textIndex] {

                if patternIndex != 0 {
                    patternIndex = lps[patternIndex-1]

                } else {
                    textIndex = textIndex + 1
                }

            }
        }
        newString.WriteString(text[copyIndex:])

        return newString.String()
    }

    func computeLPSArray(pattern string) []int {
        var length = 0
        var i = 1
        var patternLength = len(pattern)

        var lps = make([]int, patternLength)

        lps[0] = 0

        for i = 1; i < patternLength; {
            if pattern[i] == pattern[length] {
                length++
                lps[i] = length
                i++

            } else {

                if length != 0 {
                    length = lps[length-1]

                } else {
                    lps[i] = length
                    i++
                }
            }
        }
        return lps
    }

    func checkIfWholeWord(text string, startIndex int, endIndex int) bool {
        startIndex = startIndex - 1
        endIndex = endIndex + 1

        if (startIndex < 0 && endIndex >= len(text)) ||
            (startIndex < 0 && endIndex < len(text) && isNonWord(text[endIndex])) ||
            (startIndex >= 0 && endIndex >= len(text) && isNonWord(text[startIndex])) ||
            (startIndex >= 0 && endIndex < len(text) && isNonWord(text[startIndex]) && isNonWord(text[endIndex])) {
            return true
        }

        return false
    }

    func isNonWord(c byte) bool {
        return !((c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c == '_'))
    }
Sign up to request clarification or add additional context in comments.

1 Comment

This is same as general string replace algorithm. I have used extara spaces but space is not an issue for me

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.