Searching Binary Data in Ruby

Question

Using only pure Ruby (or justifiably commonplace gems), is there an efficient way to search a large binary document for a specific string of bytes?

Deeper context: the MPEG-4 container format is a 4-byte indexed serialised data structure. Without having to parse the structure fully (I can assume it is valid), I want to pull out specific tags.

For those of you who haven't come across this 'dmap' serialization before, it works something like this:

<4-byte length><4-byte tag><4-byte length><4-byte type definition><8 bytes of something I can't remember><data>

E.g., this defines the 'tvsh' (or TV Show) tag as being 'Futurama':

00 00 00 20  ... 
74 76 73 68  tvsh
00 00 00 18  ....
64 61 74 61  data
00 00 00 01  ....
00 00 00 00  ....
46 75 74 75  Futu
72 61 6D 61  rama
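
To make the layout concrete, here is a minimal sketch that rebuilds the 32 bytes from the dump above and splits them with String#unpack. The variable names are illustrative only, and the 8 bytes whose meaning the question leaves open are kept as an opaque string.

```ruby
# The 32 bytes from the dump, written as hex and packed into a binary string.
hex = %w[00000020 74767368 00000018 64617461
         00000001 00000000 46757475 72616D61].join
atom = [hex].pack("H*")

# N  = 32-bit big-endian unsigned integer
# a4 = 4 raw bytes, a8 = 8 raw bytes, a* = everything that is left
size, tag, inner_size, inner_tag, unknown, value =
  atom.unpack("N a4 N a4 a8 a*")

puts "#{tag}: #{value} (outer size #{size}, inner size #{inner_size})"
```

Both length fields count their own header bytes: the outer 0x20 covers all 32 bytes, and the inner 0x18 covers the 24 bytes starting at the second length field.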

The exact structure isn't really important; I'd like to write a method which can pull out the show name when I give it 'tvsh', or tell me that it's season 2 if I give it 'tvsn'.

My first plan would be to use Regular Expressions, but I get the (unjustified) feeling that this would be slow.

Let me know your thoughts! Thanks in advance

When opening a file in Ruby you can give it the flag b for binary, such as File.open("test.mpg", "rb"). Would that be helpful?
– Jesse Jashinsky, Commented Aug 6, 2010 at 17:05
It's an MPEG4 video, so potentially it could be 8GB, so far too large to fit into memory!
– JP., Commented Aug 7, 2010 at 19:35

Jan Goyvaerts · Accepted Answer · 2010-08-07 11:55:06Z

3

In Ruby you can use the /n flag when creating your regex to tell Ruby that your input is 8-bit data.

You could use /(.{4})tvsh(.{4})data(.{8})([\x20-\x7F]+)/n to match 4 bytes, tvsh, 4 bytes, data, 8 bytes, and any number of ASCII characters. I don't see any reason why this regex would be significantly slower to execute than hand-coding a similar search. If you don't care about the 4-byte and 8-byte blocks, /tvsh.{4}data.{8}([\x20-\x7F]+)/n should be nearly as fast as a literal text search for tvsh.
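
A minimal sketch of the shorter regex in use, run against a made-up buffer: the atom from the question with a few unrelated bytes on either side. One deliberate change from the regex above: the m flag is added so that . also matches the byte 0x0A, which could otherwise make the match fail whenever a length field happens to contain that byte. The constant name and sample data are illustrative only.

```ruby
# Hypothetical sample buffer: the question's atom surrounded by other bytes.
atom = [%w[00000020 74767368 00000018 64617461
           00000001 00000000 46757475 72616D61].join].pack("H*")
data = "\x00\x01junk".b + atom + "\x00\x00\x00\x08free".b

# n: treat the pattern as 8-bit data; m: let . match 0x0A as well.
TVSH = /tvsh.{4}data.{8}([\x20-\x7F]+)/nm

show = data[TVSH, 1]   # first capture group, or nil if nothing matched
puts show
```

The capture stops at the first byte outside 0x20-0x7F, so this relies on the value being followed by a non-printable byte, as it is here.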

answered Aug 7, 2010 at 11:55

Jan Goyvaerts


2 Comments

davidgyoung Over a year ago

If I try this in my binary data I get "invalid byte sequence in UTF-8 (ArgumentError)"

Victor Over a year ago

Works if the string is ASCII-8BIT. The default is UTF-8. You can change it with String#force_encoding, for instance: bindata.force_encoding("ASCII-8BIT").

Mladen Jablanović · Accepted Answer · 2010-08-06 19:12:41Z

1

If I understand your description correctly, the whole file consists of a number of such "blocks" with a fixed structure?

In that case, I suggest scanning them one by one, and skipping the ones that are not of interest to you. So each step should do the following:

1. Read 8 bytes (using IO#readbytes or a similar method).
2. From the header you just read, extract the size (first 4 bytes) and the tag (next 4 bytes).
   a. If the tag is the one you need, skip the following 16 bytes and read size-24 bytes.
   b. If the tag is not of interest, skip the following size-8 bytes (the size includes the 8 header bytes already read).
3. Repeat.

For skipping bytes, you can use IO#seek.
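
The loop described above could be sketched roughly as follows. This is an illustration under assumptions, not a full parser: find_tag is a hypothetical name, IO#read is used for the reads, only one level of sibling blocks is walked, and every block is assumed to follow the layout in the question. Unwanted blocks are skipped with size - 8, since the size field in the question's example counts the 8 header bytes that have already been read.

```ruby
require "stringio"

# Walks sibling blocks in io and returns the data of the first block whose
# tag equals wanted, or nil if there is none. Hypothetical helper.
def find_tag(io, wanted)
  until io.eof?
    header = io.read(8)
    break if header.nil? || header.bytesize < 8
    size, tag = header.unpack("Na4")   # big-endian length, 4-byte tag
    break if size < 8                  # treated as malformed in this sketch

    if tag == wanted
      return nil if size < 24          # too small to hold the inner header
      io.seek(16, IO::SEEK_CUR)        # inner length, 'data', 8 more bytes
      return io.read(size - 24)
    end
    io.seek(size - 8, IO::SEEK_CUR)    # jump over the rest of this block
  end
  nil
end

# Made-up input: a 12-byte 'free' block followed by the 'tvsh' block.
inner  = [24, "data"].pack("Na4") + ("\x00" * 3 + "\x01" + "\x00" * 4) + "Futurama"
sample = [12, "free"].pack("Na4") + "\x00" * 4 + [32, "tvsh"].pack("Na4") + inner

show = find_tag(StringIO.new(sample.b), "tvsh")
puts show
```

With a real file, the same method would take a handle opened in binary mode, for example File.open("test.mpg", "rb") { |f| find_tag(f, "tvsh") }. Because only the bytes of interest are read and everything else is skipped with seek, the file never has to fit in memory.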

edited Aug 6, 2010 at 19:12

answered Aug 6, 2010 at 19:06

Mladen Jablanović


3 Comments

BaroqueBobcat Over a year ago

One annoying aspect of the format is that atoms (blocks) can be nested.

BaroqueBobcat Over a year ago

There is this library, haven't used it though: github.com/arbarlow/ruby-mp4info

JP. Over a year ago

This has to be the right way forward, this way I don't have to scan through all the atoms I don't actually need (seems most mov files have the metadata at the end of the file). Just gotta figure out which atoms to push into! Oh, and figure out why some don't fit the pattern…

rogerdpack · Accepted Answer · 2010-08-06 23:37:45Z

0

Theoretically you can use regexes against any arbitrary data, including binary strings. HTH.

answered Aug 6, 2010 at 23:37

rogerdpack


1 Comment

Jan Goyvaerts Over a year ago

Nothing theoretical about it, as long as the regex engine you're using has an 8-bit mode where one byte equals one character.
