11

I'm using Ruby 2.4 and Rails 5. I have file content in a variable named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know if this is a PDF, a Microsoft Office file, or some type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable

content.encoding

and it would produce

ASCII-8BIT

in the case of binary data. However, I've noticed there are cases where HTML content stored in the variable also returns "ASCII-8BIT" as content.encoding, so checking content.encoding is not a foolproof way to tell whether I have binary data. Does a reliable way exist and, if so, what is it?

8
  • Given your requirements, it seems like you're going to have to do some analysis of the content. I'd pull the top n bytes and check them against the standard ASCII codes. If many of the characters you encounter aren't ASCII, it's likely that your content is binary. A chi-squared test may be a good fit. Why can't you get access to the actual file object? Commented May 3, 2017 at 19:09
  • I'm accessing the content from a database in which there is no additional information about the file. Sometimes there is a file name, but extensions are unreliable for determining file/content type. Commented May 3, 2017 at 20:24
  • Wait, the content of the file is in the DB? Commented May 3, 2017 at 20:30
  • If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. Commented May 3, 2017 at 20:40
  • 1
    @Dave According to the gem's documentation at github.com/blackwinter/ruby-filemagic it can work with a buffer, so you wouldn't need to write anything to a file. Just read the first N bytes into memory and pass it to the gem. Commented May 3, 2017 at 21:34
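The "inspect the first n bytes" idea from the comments can be sketched in plain Ruby. This is only a rough heuristic under stated assumptions: `likely_binary?`, `sample_size`, and `threshold` are hypothetical names and values chosen for illustration, and instead of counting every non-ASCII byte (which would flag ordinary UTF-8 text) it counts NUL bytes and control characters.

```ruby
# Hypothetical helper (not from any library): guesses whether a string
# holds binary data by looking only at its first `sample_size` bytes.
# The default sample size and threshold are arbitrary assumptions.
def likely_binary?(content, sample_size: 512, threshold: 0.3)
  sample = content.byteslice(0, sample_size).to_s
  return false if sample.empty?

  bytes = sample.bytes
  # A NUL byte almost never appears in a text file.
  return true if bytes.include?(0)

  # Count control characters other than tab, LF and CR, plus DEL.
  suspicious = bytes.count do |b|
    (b < 32 && ![9, 10, 13].include?(b)) || b == 127
  end
  suspicious.to_f / bytes.size > threshold
end
```

A heuristic like this can still misjudge real files. A PDF, for instance, often begins with a mostly printable header, so it may be reported as text, which is one reason matching known file signatures tends to be more dependable than byte statistics.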

2 Answers 2

3

If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library, which is standard on Unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns for various file types.

Sample usage for a string buffer (e.g. data read from the database):

require "ruby-filemagic"

content = File.read("/.../sample.pdf") # just an example to get some data

fm = FileMagic.new
fm.buffer(content)    
#=> "PDF document, version 1.4"

For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:

The file(1) library and headers are required:

Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic

Tested to work well under Rails 5.
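To illustrate what "matching against magic patterns" means, here is a minimal pure-Ruby sketch that checks only three well-known leading signatures. It is not how libmagic or the gem is implemented, and it is far less thorough; `SIGNATURES` and `sniff_type` are hypothetical names introduced for this example.

```ruby
# Hypothetical signature table: leading bytes => coarse label.
SIGNATURES = {
  "%PDF-".b                            => :pdf,
  # OLE2 compound file: legacy .doc/.xls/.ppt
  "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b => :ole2,
  # ZIP container: .docx/.xlsx/.odt, but also any plain ZIP archive
  "PK\x03\x04".b                       => :zip
}.freeze

# Returns a coarse label for the content, or :unknown if no
# signature matches the first bytes of the string.
def sniff_type(content)
  head = content.byteslice(0, 8).to_s.b
  SIGNATURES.each do |magic, label|
    return label if head.start_with?(magic)
  end
  :unknown
end
```

Because the labels are symbols you define yourself, the result is stable across platforms, at the cost of recognising very few formats. Telling a .docx from an .odt (both ZIP containers) would require looking inside the archive, which this sketch does not attempt.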


8 Comments

Hmmm, I'm still getting a build error when I try and install this gem -- "checking for -lgnurx... no, *** ERROR: missing required library to compile this module". I will have to research that and then come back and try your suggestion.
What system are you trying this on? If you get stuck, can you post the full log with the error messages?
I hadn't run "brew install libmagic" per your suggestion. Running that does allow everything to install. One question I couldn't figure out from the docs -- does "buffer" always print out file types in a consistent way? That is, do Excel docs always output "Microsoft Excel" and PDF docs always print out the word "PDF"?
Good! Regarding your question, there is of course no absolute certainty, but I'd expect the output to be very consistent. The file utility and the associated magic library have been around for many years, and there is no reason for the authors to change their behavior. Take a look at the sources for all the format variants that the library currently recognizes.
Heya, I started a bounty on this one only because I can see no consistency in the way file types are printed out by this gem. I'm getting too much variation to feel comfortable using the solution.
0

If you're on a Unix machine, you can use the file command:

file titi.pdf

You could then do something like:

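require 'open3'

cmd = 'file -'  # "-" makes file read the data from standard input
# popen3 yields stdin, stdout, stderr and a wait thread, in that order
Open3.popen3(cmd) do |stdin, stdout, _stderr, _wait_thr|
  stdin.write(content)
  stdin.close
  puts "file type is: " + stdout.read
end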

1 Comment

My production environment is Ubuntu Linux, but my local environment is Mac OS X.
