How to remove Unicode characters from byte buffer in Go?

Question

I have a bytes.Buffer type variable which I filled with Unicode characters:

var mbuff bytes.Buffer
unicodeSource := "کیا حال ھے؟"
for _, r := range unicodeSource {
    mbuff.WriteRune(r)
}

Note: I iterate over a Unicode string literal here, but the real source is an endless stream of user-input characters.

Now, I want to remove a Unicode character from any position in the buffer mbuff. The problem is that characters may be of variable byte sizes. So I cannot just pick out the ith byte from mbuff.String() as it might be the beginning, middle, or end of a character. This is my trivial (and horrendous) solution:

// removing Unicode character at position n
var tempString string
currChar := 0
for _, ch := range mbuff.String() { // iterate over Unicode chars
    if currChar != n {              // skip concatenating nth char
        tempString += string(ch)
    }
    currChar++
}
mbuff.Reset()                       // empty buffer
mbuff.WriteString(tempString)       // write new string

This is bad in many ways. For one, I convert the buffer to a string, remove the ith character, and write a new string back into the buffer: too many operations. Second, I use the += operator in the loop to concatenate Unicode characters into a new string. I am using a buffer in the first place precisely to avoid += concatenation, which is slow, as this answer points out.

What is an efficient method to remove the ith Unicode character in a bytes.Buffer?
Also what is an efficient way to insert a Unicode character after i-1 Unicode characters (i.e. in the ith place)?

score 3 · Accepted Answer · 2016-10-10 02:35:35Z

To remove the ith rune from a slice of bytes, loop through the slice counting runes. When the ith rune is found, copy the bytes following the rune down to the position of the ith rune:

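func removeAtBytes(p []byte, i int) []byte {
    j := 0 // count of runes
    k := 0 // index in p of current rune
    for k < len(p) {
        _, n := utf8.DecodeRune(p[k:])
        if i == j {
            // shift the tail down over the rune and stop scanning
            return p[:k+copy(p[k:], p[k+n:])]
        }
        j++
        k += n
    }
    return p
}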

This function modifies the backing array of the argument slice, but it does not allocate memory.

Use this function to remove a rune from a bytes.Buffer:

p := removeAtBytes(mbuff.Bytes(), i)
mbuff.Truncate(len(p)) // backing bytes were updated, adjust length

To remove the ith rune from a string, loop through the string counting runes. When the ith rune is found, create a string by concatenating the segment of the string before the rune with the segment of the string after the rune.

func removeAt(s string, i int) string {
    j := 0  // count of runes
    k := 0  // index in string of current rune
    for k < len(s) {
        _, n := utf8.DecodeRuneInString(s[k:])
        if i == j {
            return s[:k] + s[k+n:]
        }
        j++
        k += n
    }
    return s
}

This function allocates a single string, the result. DecodeRuneInString is a function in the standard library unicode/utf8 package.

edited Oct 10, 2016 at 2:35

answered Oct 7, 2016 at 0:55

user5728991

1 Comment

hazrmard Over a year ago

Accepted because it explicitly shows how to work with buffers. Amd's answer is also worth looking at as it explores variations of my problem.

Caleb · Accepted Answer · 2016-10-07 02:03:31Z

Taking a step back: Go often works with Readers and Writers, so an alternative is the golang.org/x/text/transform package. You create a Transformer, attach it to a Reader, and read from the resulting Reader to get the transformed output. For example, here is a skipper:

func main() {
    src := strings.NewReader("کیا حال ھے؟")
    skipped := transform.NewReader(src, NewSkipper(5))
    var buf bytes.Buffer
    io.Copy(&buf, skipped)
    fmt.Println("RESULT:", buf.String())
}

And here's the implementation:

package main

import (
    "bytes"
    "fmt"
    "io"
    "strings"
    "unicode/utf8"

    "golang.org/x/text/transform"
)

type skipper struct {
    pos int
    cnt int
}

// NewSkipper creates a text transformer which will remove the rune at pos
func NewSkipper(pos int) transform.Transformer {
    return &skipper{pos: pos}
}

func (s *skipper) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    for utf8.FullRune(src) {
        _, sz := utf8.DecodeRune(src)
        // not enough space in the dst
        if len(dst) < sz {
            return nDst, nSrc, transform.ErrShortDst
        }
        if s.pos != s.cnt {
            copy(dst[:sz], src[:sz])
            // track that we stored in dst
            dst = dst[sz:]
            nDst += sz
        }
        // track that we read from src
        src = src[sz:]
        nSrc += sz
        // on to the next rune
        s.cnt++
    }
    if len(src) > 0 && !atEOF {
        return nDst, nSrc, transform.ErrShortSrc
    }
    return nDst, nSrc, nil
}

func (s *skipper) Reset() {
    s.cnt = 0
}

There may be bugs with this code, but hopefully you can see the idea.

The benefit of this approach is it could work on a potentially infinite amount of data without having to store all of it in memory. For example you could transform a file this way.

score 0 · Accepted Answer · 2016-10-07 21:43:34Z

Edit:

Remove the ith rune in the buffer:
A: Shift the runes after the target one position to the left (A is faster than B). Try it on The Go Playground:

func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s
    }
    copy(r[runePosition:], r[runePosition+1:])
    return string(r[:len(r)-1])
}

B: Copy to new buffer, try it on The Go Playground

func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s // avoid allocation
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s // avoid allocation
    }
    t := make([]rune, len(r)-1) // Apply replacements to buffer.
    w := copy(t, r[:runePosition])
    w += copy(t[w:], r[runePosition+1:])
    return string(t[:w])
}

C: Try it on The Go Playground:

package main

import (
    "bytes"
    "fmt"
)

func main() {
    str := "hello"
    fmt.Println(str)
    fmt.Println(removeRuneAt(str, 1))

    buf := bytes.NewBuffer([]byte(str))
    fmt.Println(buf.Bytes())

    buf = bytes.NewBuffer([]byte(removeRuneAt(buf.String(), 1)))
    fmt.Println(buf.Bytes())
}
func removeRuneAt(s string, runePosition int) string {
    if runePosition < 0 {
        return s // avoid allocation
    }
    r := []rune(s)
    if runePosition >= len(r) {
        return s // avoid allocation
    }

    t := make([]rune, len(r)-1) // Apply replacements to buffer.
    w := copy(t, r[0:runePosition])
    w += copy(t[w:], r[runePosition+1:])
    return string(t[0:w])
}

D: Benchmark (2000000 iterations):
A: 745.0426ms
B: 1.0160581s
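The benchmark code itself is not shown, so here is one way such a comparison could be set up. It is only a sketch: removeRuneShift and removeRuneCopy are hypothetical names for variants A and B (both are called removeRuneAt above), and the timings will differ by machine and Go version.

```go
package main

import (
    "fmt"
    "testing"
)

// removeRuneShift is variant A: shift the tail of the rune slice left by one.
func removeRuneShift(s string, pos int) string {
    r := []rune(s)
    if pos < 0 || pos >= len(r) {
        return s
    }
    copy(r[pos:], r[pos+1:])
    return string(r[:len(r)-1])
}

// removeRuneCopy is variant B: copy both halves into a fresh rune slice.
func removeRuneCopy(s string, pos int) string {
    r := []rune(s)
    if pos < 0 || pos >= len(r) {
        return s
    }
    t := make([]rune, len(r)-1)
    w := copy(t, r[:pos])
    copy(t[w:], r[pos+1:])
    return string(t)
}

var sink string // keeps the compiler from discarding the results

func main() {
    const s = "کیا حال ھے؟"
    resA := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sink = removeRuneShift(s, 1)
        }
    })
    resB := testing.Benchmark(func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            sink = removeRuneCopy(s, 1)
        }
    })
    fmt.Println("A (shift):", resA)
    fmt.Println("B (copy): ", resB)
}
```

testing.Benchmark picks the iteration count itself, so the output reports time per operation rather than the total for a fixed 2000000 iterations.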

1- Short answer: to replace all instances of a character (or even a string), pass n = -1:

n := -1
newR := ""
old := "µ"
buf = bytes.NewBuffer([]byte(strings.Replace(buf.String(), old, newR, n)))

2- To replace only the ith instance of a character (or string) in the buffer, you may use:

buf = bytes.NewBuffer([]byte(Replace(buf.String(), oldString, newOrEmptyString, ith)))

See:

// Replace returns a copy of the string s with the ith
// non-overlapping instance of old replaced by new.
func Replace(s, old, new string, ith int) string {
    if len(old) == 0 || old == new || ith < 0 {
        return s // avoid allocation
    }
    i, j := 0, 0
    for ; ith >= 0; ith-- {
        j = strings.Index(s[i:], old)
        if j < 0 {
            return s // avoid allocation
        }
        j += i
        i = j + len(old)
    }
    t := make([]byte, len(s)+(len(new)-len(old))) // Apply replacements to buffer.
    w := copy(t, s[0:j])
    w += copy(t[w:], new)
    w += copy(t[w:], s[j+len(old):])
    return string(t[0:w])
}

Try it on The Go Playground:

package main

import (
    "bytes"
    "fmt"
    "strings"
)

func main() {
    str := `How are you?µ`
    fmt.Println(str)
    fmt.Println(Replace(str, "µ", "", 0))

    buf := bytes.NewBuffer([]byte(str))
    fmt.Println(buf.Bytes())

    buf = bytes.NewBuffer([]byte(Replace(buf.String(), "µ", "", 0)))

    fmt.Println(buf.Bytes())
}
func Replace(s, old, new string, ith int) string {
    if len(old) == 0 || old == new || ith < 0 {
        return s // avoid allocation
    }
    i, j := 0, 0
    for ; ith >= 0; ith-- {
        j = strings.Index(s[i:], old)
        if j < 0 {
            return s // avoid allocation
        }
        j += i
        i = j + len(old)
    }
    t := make([]byte, len(s)+(len(new)-len(old))) // Apply replacements to buffer.
    w := copy(t, s[0:j])
    w += copy(t[w:], new)
    w += copy(t[w:], s[j+len(old):])
    return string(t[0:w])
}

3- If you want to remove all instances of a Unicode character (the old string) from any position in the string, you may use:

strings.Replace(str, old, "", -1)

4- This also works fine for removing from a bytes.Buffer:

strings.Replace(buf.String(), old, newR, -1)

Like so:

buf = bytes.NewBuffer([]byte(strings.Replace(buf.String(), old, newR, -1)))

Here is the complete working code (try it on The Go Playground):

package main

import (
    "bytes"
    "fmt"
    "strings"
)

func main() {
    str := `کیا حال ھے؟` //How are you?
    old := `ک`
    newR := ""
    fmt.Println(strings.Replace(str, old, newR, -1))

    buf := bytes.NewBuffer([]byte(str))
    //  for _, r := range str {
    //      buf.WriteRune(r)
    //  }
    fmt.Println(buf.Bytes())

    bs := []byte(strings.Replace(buf.String(), old, newR, -1))
    buf = bytes.NewBuffer(bs)

    fmt.Println("       ", buf.Bytes())
}

output:

یا حال ھے؟
[218 169 219 140 216 167 32 216 173 216 167 217 132 32 218 190 219 146 216 159]
        [219 140 216 167 32 216 173 216 167 217 132 32 218 190 219 146 216 159]

5- strings.Replace is very efficient, see inside:

// Replace returns a copy of the string s with the first n
// non-overlapping instances of old replaced by new.
// If old is empty, it matches at the beginning of the string
// and after each UTF-8 sequence, yielding up to k+1 replacements
// for a k-rune string.
// If n < 0, there is no limit on the number of replacements.
func Replace(s, old, new string, n int) string {
    if old == new || n == 0 {
        return s // avoid allocation
    }

    // Compute number of replacements.
    if m := Count(s, old); m == 0 {
        return s // avoid allocation
    } else if n < 0 || m < n {
        n = m
    }

    // Apply replacements to buffer.
    t := make([]byte, len(s)+n*(len(new)-len(old)))
    w := 0
    start := 0
    for i := 0; i < n; i++ {
        j := start
        if len(old) == 0 {
            if i > 0 {
                _, wid := utf8.DecodeRuneInString(s[start:])
                j += wid
            }
        } else {
            j += Index(s[start:], old)
        }
        w += copy(t[w:], s[start:j])
        w += copy(t[w:], new)
        start = j + len(old)
    }
    w += copy(t[w:], s[start:])
    return string(t[0:w])
}

Thanks. But I think this will replace all instances of a character and not the character in the ith position in the buffer.
Thanks again, but I think you misunderstood: I want to remove whatever character that is at the ith position in the buffer. Not the ith instance of some character. Essentially, a solution should only need the index in the buffer to remove, nothing else. So I think using replace is inherently unsuitable here because replace asks for what character to substitute, not what position to remove. So if I have a buffer b containing hello, the function removeAt(b,1) should modify b so it contains hllo.
