I'm trying to find out whether a particular file is UTF-8 encoded readable text, by which I mean printable symbols, whitespace, `\n`, `\r\n` and `\t` (think: source code). As speed is of importance, this has to be determined from just the first few dozen bytes of the file.
I've tried my luck with an `io.RuneReader`, implementing the core logic around it. The `io.RuneReader` can then be backed by either a `bufio.Reader` for reading from a file, or by a `strings.Reader` if I want to write some tests without creating dedicated files. The logic itself is simple enough; `unicode.IsGraphic` covers almost everything I need.
Here's the source code (or go playground), followed by my actual questions, which are also contained within the code sample.
```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"os"
	"strings"
	"unicode"
	"unicode/utf8"
)

func IsReadableTextRunes(r io.RuneReader, max int) (bool, error) {
	for i := 0; i < max; i++ {
		rn, sz, err := r.ReadRune()
		switch {
		// Can I rely on io.RuneReader to always behave like
		// utf8.DecodeRune() with respect to erroneous encodings?
		// I.e. will io.RuneReader::ReadRune() always return
		// (utf8.RuneError, 1, nil) where an invalid encoding is
		// encountered?
		case rn == utf8.RuneError && sz == 1:
			return false, nil
		case unicode.IsGraphic(rn) || rn == '\n' || rn == '\t':
			continue
		case rn == '\r':
			rn2, _, err2 := r.ReadRune()
			if rn2 != '\n' || err2 != nil {
				return false, nil
			}
			i++
		// Do we have any prior information about which particular
		// errors may be returned by io.RuneReader::ReadRune()?
		case err != nil:
			if errors.Is(err, io.EOF) {
				return true, nil
			}
			return false, err
		default:
			return false, nil
		}
	}
	return true, nil
}

func IsUtf8ReadableTextFile(filepath string) (bool, error) {
	f, err := os.Open(filepath)
	if err != nil {
		return false, err
	}
	defer f.Close()
	// Is this buffer size the smallest possible to guarantee
	// that valid UTF-8 read into the buffer is never truncated?
	max := 32
	bufSz := max * utf8.UTFMax
	return IsReadableTextRunes(bufio.NewReaderSize(f, bufSz), max)
}

func main() {
	data := []string{
		"Is this\nthe\r\nreal\tlife?", // valid
		"\ra",                        // invalid: \r outside of \r\n
		"\xffa",                      // invalid: invalid UTF-8
		"\x1b123",                    // invalid: contains non-graphical rune
	}
	for _, d := range data {
		fmt.Printf("%q\n", d)
		r := strings.NewReader(d)
		valid, err := IsReadableTextRunes(r, 12)
		switch {
		case err != nil:
			fmt.Println(" -> error")
		case valid:
			fmt.Println(" -> valid UTF-8 readable text")
		default:
			fmt.Println(" -> invalid UTF-8 readable text")
		}
	}
}
```
- What do you think about this approach in general? I'm a Go novice and would appreciate any kind of feedback, meta or specific. Is this a viable method to quickly scan many, possibly thousands of files? Is it idiomatic Go?
- `io.RuneReader` does not specify how `ReadRune()` handles erroneous encodings. I found out that `strings.Reader.ReadRune()` as well as `bufio.Reader.ReadRune()` use `utf8.DecodeRune()` under the hood, which, upon encountering invalid UTF-8, returns `(RuneError, 1)`. Can I rely on all `io.RuneReader`s to behave this way? My implementation of `IsReadableTextRunes` implicitly depends on it.
- `io.RuneReader` is also unclear about which errors I can expect under which conditions. Can I at least count on an `io.EOF` to signal the end of the data stream? Again, the above implementation assumes so.
- Does `IsUtf8ReadableTextFile` correctly wrap `IsReadableTextRunes`, choosing the smallest possible buffer size?
Thank you.