I'm trying to find out whether a particular file is UTF-8 encoded readable text, by which I mean printable symbols, whitespace, `\n`, `\r\n` and `\t` (think: source code). As speed is of importance, this has to be determined from just the first few dozen bytes of the file.
I've tried my luck with an `io.RuneReader`, implementing the core logic around it. The `io.RuneReader` can then be backed by either a `bufio.Reader` for reading from a file, or by a `strings.Reader` if I want to write some tests without creating dedicated files. The logic itself is simple enough; `unicode.IsGraphic` covers almost everything I need.
Here's the source code (or go playground), followed by my actual questions, which are also contained within the code sample.
```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode"
	"unicode/utf8"
)

func IsReadableTextRunes(r io.RuneReader, max int) (bool, error) {
	for i := 0; i < max; i++ {
		rn, sz, err := r.ReadRune()
		switch {
		// Can I rely on io.RuneReader to always behave like
		// utf8.DecodeRune() with respect to erroneous encodings?
		// I.e. will io.RuneReader::ReadRune() always return
		// (utf8.RuneError, 1, nil) where an invalid encoding is
		// encountered?
		case rn == utf8.RuneError && sz == 1:
			return false, nil
		case unicode.IsGraphic(rn) || rn == '\n' || rn == '\t':
			continue
		case rn == '\r':
			rn2, _, err2 := r.ReadRune()
			if rn2 != '\n' || err2 != nil {
				return false, nil
			}
			i++
		// Do we have any prior information about which particular
		// errors may be returned by io.RuneReader::ReadRune()?
		case err != nil:
			if errors.Is(err, io.EOF) {
				return true, nil
			}
			return false, err
		default:
			return false, nil
		}
	}
	return true, nil
}

func IsUtf8ReadableTextFile(filepath string) (bool, error) {
	f, err := os.Open(filepath)
	if err != nil {
		return false, err
	}
	defer f.Close()
	// Is this buffer size the smallest possible to guarantee
	// that valid UTF-8 read into the buffer is never truncated?
	max := 32
	bufSz := max * utf8.UTFMax
	return IsReadableTextRunes(bufio.NewReaderSize(f, bufSz), max)
}

func main() {
	data := []string{
		"Is this\nthe\r\nreal\tlife?", // valid
		"\ra",                        // invalid: \r outside of \r\n
		"\xffa",                      // invalid: invalid UTF-8
		"\x1b123",                    // invalid: contains non-graphical rune
	}
	for _, d := range data {
		fmt.Printf("%q\n", d)
		r := strings.NewReader(d)
		valid, err := IsReadableTextRunes(r, 12)
		switch {
		case err != nil:
			fmt.Println(" -> error")
		case valid:
			fmt.Println(" -> valid UTF-8 readable text")
		default:
			fmt.Println(" -> invalid UTF-8 readable text")
		}
	}
}
```
- What do you think about this approach in general? I'm a Go novice and would appreciate any kind of feedback, meta or specific. Is this a viable method to quickly scan many, possibly thousands of files? Is it idiomatic Go?
- `io.RuneReader` does not specify how `ReadRune()` handles erroneous encodings. I found out that `strings.Reader.ReadRune()` as well as `bufio.Reader.ReadRune()` use `utf8.DecodeRune()` under the hood, which, upon encountering invalid UTF-8, returns `(RuneError, 1)`. Can I rely on all `io.RuneReader`s to behave this way? My implementation of `IsReadableTextRunes` implicitly depends on it.
- `io.RuneReader` is also unclear about which errors I can expect under which conditions. Can I at least count on an `io.EOF` to signal the end of the data stream? Again, the above implementation assumes so.
- Does `IsUtf8ReadableTextFile` correctly wrap `IsReadableTextRunes`, choosing the smallest possible buffer size?
Thank you.