Is there a way to check if a Ruby variable contains binary data?

Question

I'm using Ruby 2.4 and Rails 5. I have file content in a variable named "content". The content could contain data from things like a PDF file, a Word file, or an HTML file. Is there any way to tell if the variable contains binary data? Ultimately, I would like to know if this is a PDF, a Microsoft Office file, or some type of OpenOffice file. This answer -- Rails: possible to check if a string is binary? -- suggests that I can check the encoding of the variable

content.encoding

and it would produce

ASCII-8BIT

in the case of binary data. However, I've noticed there are cases where HTML content stored in the variable also returns "ASCII-8BIT" as the content.encoding, so checking "content.encoding" is not a foolproof way to tell whether I have binary data. Does a reliable way exist, and if so, what is it?

Given your requirements, it seems like you're going to have to do some analysis of the content. I'd pull the top n bytes and check them against the standard ASCII codes. If many of the characters you encounter aren't ASCII, it's likely that your content is binary. A chi-squared test may be a good fit. Why can't you get access to the actual file object?
– Brennan, Commented May 3, 2017 at 19:09
I'm accessing the content from a database in which there is no additional information about the file. Sometimes there is a file name, but extensions are unreliable for determining file/content type. — Dave
– Dave, Commented May 3, 2017 at 20:24
If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably.
– Matouš Borák, Commented May 3, 2017 at 20:40
@Dave According to the gem's documentation at github.com/blackwinter/ruby-filemagic it can work with a buffer, so you wouldn't need to write anything to a file. Just read the first N bytes into memory and pass it to the gem. — Brian
– Brian, Commented May 3, 2017 at 21:34
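
The byte-sampling idea from the comments above can be sketched roughly as follows. This is only an illustration: the method name `likely_binary?`, the 512-byte sample size, and the 0.3 threshold are assumptions rather than anything from the thread, and a simple ratio stands in for the chi-squared test that was mentioned. Note that it only guesses "text or not"; text with many non-ASCII characters can be misjudged as binary, and a file whose first bytes are mostly ASCII can be misjudged as text.

```ruby
# Heuristic sketch: guess whether a string holds binary data by sampling
# its first bytes. sample_size and threshold are illustrative assumptions.
def likely_binary?(content, sample_size: 512, threshold: 0.3)
  bytes = content.byteslice(0, sample_size).to_s.bytes
  return false if bytes.empty?

  # A NUL byte almost never appears in plain text.
  return true if bytes.include?(0)

  # Count bytes outside printable ASCII, allowing tab, LF, and CR.
  non_text = bytes.count do |b|
    b >= 127 || (b < 32 && ![9, 10, 13].include?(b))
  end
  non_text.to_f / bytes.size > threshold
end
```

Usage would be along the lines of `likely_binary?(content)`, with the threshold tuned against real samples from the database.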

Matouš Borák · Accepted Answer · 2017-05-04 05:03:52Z

3

If your real question is not about binary data per se but about determining the file type of the data, I'd recommend having a look at the ruby-filemagic gem, which will give you this information much more reliably. The gem is a simple wrapper around the libmagic library, which is standard on Unix-like systems. The library works by scanning the content of a file and matching it against a set of known "magic" patterns for various file types.

Sample usage for a string buffer (e.g. data read from the database):

require "ruby-filemagic"

content = File.read("/.../sample.pdf") # just an example to get some data

fm = FileMagic.new
fm.buffer(content)    
#=> "PDF document, version 1.4"

For the gem to work (and compile) you need the file utility as well as the magic library with headers installed on your system. Quoting from the readme:

The file(1) library and headers are required:

Debian/Ubuntu:: +libmagic-dev+
Fedora/SuSE:: +file-devel+
Gentoo:: +sys-libs/libmagic+
OS X:: brew install libmagic

Tested to work well under Rails 5.
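
To illustrate what matching against "magic" patterns means, here is a heavily simplified sketch that needs no gem. It is not how libmagic is implemented; the `SIGNATURES` table, the `sniff_signature` name, and the category symbols are made up for this example, and the real magic database covers far more formats. It only recognizes three common leading byte sequences: the PDF header, the OLE2 container used by legacy Office files, and the ZIP container used by OOXML and OpenDocument files.

```ruby
# Simplified illustration of signature matching (the idea behind libmagic).
# The table and the category names are hypothetical, for this example only.
SIGNATURES = {
  "%PDF-".b                            => :pdf,
  "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b => :ole2_container, # legacy .doc/.xls/.ppt
  "PK\x03\x04".b                       => :zip_container   # .docx/.xlsx/.odt, plain zips
}.freeze

def sniff_signature(content)
  # Compare raw bytes, regardless of the encoding the string claims to have.
  head = content.byteslice(0, 8).to_s.b
  SIGNATURES.each do |magic, kind|
    return kind if head.start_with?(magic)
  end
  :unknown
end
```

A ZIP signature alone cannot distinguish a Word document from an OpenOffice one or from an ordinary archive, which is one reason to prefer libmagic over a hand-rolled table.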

edited May 4, 2017 at 5:03

answered May 4, 2017 at 4:27

Matouš Borák


8 Comments

Dave Over a year ago

Hmmm, I'm still getting a build error when I try and install this gem -- "checking for -lgnurx... no, *** ERROR: missing required library to compile this module". I will have to research that and then come back and try your suggestion.

Matouš Borák Over a year ago

What system are you trying this on? If you get stuck, can you post the full log with the error messages?

Dave Over a year ago

I hadn't run "brew install libmagic" per your suggestion. Running that does allow everything to install. One question I couldn't figure out from the docs -- does "buffer" always print out file types in a consistent way? That is, do Excel docs always output "Microsoft Excel" and PDF docs always print out the word "PDF"?

Matouš Borák Over a year ago

Good! Regarding your question, there is of course no absolute certainty, but I'd expect the output to be very consistent. The file utility and the associated magic library have been around for many years, and there is no reason for the authors to change their behavior. Take a look at the sources for all the format variants that the library currently recognizes.

Dave Over a year ago

Heya, I started a bounty on this one only because I can see no consistency in the way file types are printed out by this gem. I'm getting too much variation to feel comfortable using the solution.


Xavier Nicollet · Accepted Answer · 2017-05-23 07:41:20Z

0

If you're on a Unix machine, you can use the file command:

file titi.pdf

You could then do something like:

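require 'open3'

# "file -" makes file(1) read the data from standard input.
cmd = 'file -'
# popen3 yields stdin, stdout, stderr, and the wait thread, in that order.
Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
  stdin.write(content)
  stdin.close
  puts "file type is: " + stdout.read
end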
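
A slightly more compact variant, sketched under the assumption that the installed `file` binary supports the `--brief` and `--mime-type` options (common on Linux and macOS, but worth checking on your system), uses `Open3.capture2` and asks for a MIME type rather than a free-form description. The method name `mime_type_via_file` is made up for this example. MIME types tend to be easier to compare programmatically than descriptive strings.

```ruby
require 'open3'

# Sketch: pipe the content to file(1) and return the MIME type it reports.
# Assumes a `file` binary on PATH that accepts --brief and --mime-type.
def mime_type_via_file(content)
  output, status = Open3.capture2(
    'file', '--brief', '--mime-type', '-',
    stdin_data: content, # sent to the command's standard input
    binmode: true        # avoid any encoding conversion of the bytes
  )
  status.success? ? output.strip : nil
end
```

Calling `mime_type_via_file(content)` returns a string when the command succeeds and `nil` otherwise; if `file` is not installed, `Open3.capture2` raises `Errno::ENOENT`.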

answered May 23, 2017 at 7:41

Xavier Nicollet


1 Comment

Dave Over a year ago

My production environment is Ubuntu Linux, but my local environment is Mac OS X.
