1

Using only pure Ruby (or justifiably commonplace gems), is there an efficient way to search a large binary document for a specific string of bytes?


Deeper context: the MPEG-4 container format is a serialised data structure indexed by 4-byte tags. Without having to parse the structure fully (I can assume it is valid), I want to pull out specific tags.

For those of you who haven't come across this 'dmap' serialization before, it works something like this:

<4-byte length><4-byte tag><4-byte length><4-byte type definition><8 bytes of something I can't remember><data>

E.g., this defines the 'tvsh' (or TV Show) tag as being 'Futurama':

00 00 00 20  ... 
74 76 73 68  tvsh
00 00 00 18  ....
64 61 74 61  data
00 00 00 01  ....
00 00 00 00  ....
46 75 74 75  Futu
72 61 6D 61  rama

The exact structure isn't really important. I'd like to write a method which can pull out the show name when I give it 'tvsh', or tell me that it's season 2 when I give it 'tvsn'.

My first plan would be to use Regular Expressions, but I get the (unjustified) feeling that this would be slow.

Let me know your thoughts! Thanks in advance

4
  • When opening a file in Ruby you can give it the flag b for binary, such as File.open("test.mpg", "rb"). Would that be helpful? Commented Aug 6, 2010 at 17:05
  • Is your "large binary document" too big to fit in RAM? Commented Aug 6, 2010 at 18:13
  • FYI, binary mode is only needed on Windows. Commented Aug 6, 2010 at 18:31
  • It's an MPEG4 video, so potentially it could be 8GB, so far too large to fit into memory! Commented Aug 7, 2010 at 19:35
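
Since the file may not fit in memory, a plain byte search can be done in fixed-size chunks. What follows is only a minimal sketch: `find_bytes` and its `chunk_size` parameter are hypothetical names, not part of any library. It keeps the last `needle.bytesize - 1` bytes of each chunk so that a match straddling a chunk boundary is still found.

```ruby
# Hypothetical helper: returns the byte offset of the first occurrence of
# `needle` at or after the current position of `io`, or nil if it is absent.
# Only about one chunk is held in memory at a time.
def find_bytes(io, needle, chunk_size: 1 << 20)
  needle  = needle.b                 # compare raw bytes, not characters
  overlap = needle.bytesize - 1      # bytes to carry over between chunks
  buffer  = "".b
  base    = io.pos                   # file offset of buffer's first byte

  while (chunk = io.read(chunk_size))
    buffer << chunk
    index = buffer.index(needle)     # byte index, since buffer is binary
    return base + index if index

    # Keep only the tail that could still begin a match.
    keep   = [overlap, buffer.bytesize].min
    base  += buffer.bytesize - keep
    buffer = buffer.byteslice(buffer.bytesize - keep, keep)
  end
  nil
end
```

With a real file this would be called as something like `File.open("test.mpg", "rb") { |f| find_bytes(f, "tvsh") }`. It still reads every byte before the match, so it is a fallback for when the structure of the file cannot be used to skip ahead.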

3 Answers

3

In Ruby you can use the /n flag when creating your regex to tell Ruby that your input is 8-bit data.

You could use /(.{4})tvsh(.{4})data(.{8})([\x20-\x7F]+)/n to match 4 bytes, tvsh, 4 bytes, data, 8 bytes, and one or more ASCII characters. I don't see any reason why this regex would be significantly slower to execute than hand-coding a similar search. If you don't care about the 4-byte and 8-byte blocks, /tvsh.{4}data.{8}([\x20-\x7F]+)/n should be nearly as fast as a literal text search for tvsh. Note that . does not match a newline byte (0x0A) unless you add the /m flag, so use /mn if the length fields may contain that byte.
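
As a rough illustration of the regex approach, here is a sketch that runs the second pattern against the bytes from the question. `tag_value` and `TVSH_SAMPLE` are hypothetical names made up for this example. The /m flag is added so that . also matches a 0x0A byte inside the length fields, and the input is assumed to be small enough to hold in a string.

```ruby
# The 32 bytes from the question's hex dump, built as a binary string.
TVSH_SAMPLE = [%w[00000020 74767368 00000018 64617461
                  00000001 00000000 46757475 72616d61].join].pack("H*")

# Hypothetical helper: returns the printable payload that follows `tag`,
# or nil when the tag is not present in `bytes`.
def tag_value(bytes, tag)
  # /n treats the pattern as binary; /m lets "." match the newline byte too.
  pattern = /#{Regexp.escape(tag)}.{4}data.{8}([\x20-\x7F]+)/mn
  match = pattern.match(bytes.b)   # String#b returns an ASCII-8BIT copy
  match && match[1]
end

puts tag_value(TVSH_SAMPLE, "tvsh")
```

This only suits text tags such as 'tvsh'. A numeric tag such as 'tvsn' probably stores its value as raw bytes rather than printable characters, in which case the character class would not match it.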


2 Comments

If I try this in my binary data I get "invalid byte sequence in UTF-8 (ArgumentError)"
Works if the string is ASCII-8BIT. The default is UTF-8. You can change it with String#force_encoding, for instance: bindata.force_encoding("ASCII-8BIT").
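
A small sketch of the two usual ways to end up with an ASCII-8BIT string, assuming a reasonably recent Ruby. `as_binary` and `read_head` are hypothetical helper names used only for illustration.

```ruby
# Hypothetical helper: returns a copy of `str` labelled as raw bytes.
# force_encoding only changes the label; the bytes themselves are untouched.
def as_binary(str)
  str.dup.force_encoding(Encoding::ASCII_8BIT)
end

# Hypothetical helper: reads the first `count` bytes of a file opened in
# binary mode ("rb"), so the result is already ASCII-8BIT.
def read_head(path, count)
  File.open(path, "rb") { |f| f.read(count) }
end

raw = as_binary("\xFFtvsh")      # 0xFF alone is not valid UTF-8
puts raw.valid_encoding?         # true: every byte sequence is valid as binary
puts raw =~ /tvsh/n              # 1
```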
1

If I understand your description correctly, the whole file consists of a number of such "blocks" with a fixed structure?

In that case, I suggest scanning them one by one and skipping the ones that are not of interest to you. Each step should do the following:

  1. Read 8 bytes (using IO#read, IO#readbytes, or a similar method)
  2. From the read header, extract the size (first 4 bytes), and the tag (second 4)
    1. If the tag is the one you need, skip the following 16 bytes and read size-24 bytes.
    2. If the tag is not of interest, skip the following size-8 bytes (the size includes the 8-byte header you have just read).
  3. Repeat.

For skipping bytes, you can use IO#seek.
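
The steps above could be sketched roughly as follows. `read_tag` is a hypothetical name, and the sketch assumes a flat sequence of blocks whose size field counts the 8-byte header, as in the question's example (0x20 = 32 bytes in total). Nested blocks are not handled.

```ruby
# Hypothetical helper: walks sibling blocks from the current position of
# `io` and returns the payload of the first block tagged `wanted`, or nil.
def read_tag(io, wanted)
  while (header = io.read(8))
    break if header.bytesize < 8           # truncated header
    size, tag = header.unpack("Na4")       # big-endian length, 4-byte tag
    break if size < 8                      # special or invalid size: give up

    if tag == wanted.b && size >= 24
      io.seek(16, IO::SEEK_CUR)            # inner length, "data", 8 more bytes
      return io.read(size - 24)            # what remains is the payload
    end

    io.seek(size - 8, IO::SEEK_CUR)        # jump over the rest of this block
  end
  nil
end
```

Because uninteresting blocks are skipped with a seek instead of being read, the cost depends on the number of blocks and not on the size of the file.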

3 Comments

One annoying aspect of the format is that atoms(blocks) can be nested.
There is this library, haven't used it though: github.com/arbarlow/ruby-mp4info
This has to be the right way forward: this way I don't have to scan through all the atoms I don't actually need (it seems most mov files have the metadata at the end of the file). Just gotta figure out which atoms to push into! Oh, and figure out why some don't fit the pattern…
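
For nested atoms, one option is to follow a path of tags and only descend into the containers on that path. The sketch below is not a full parser, and `find_atom` and `tag_text` are hypothetical names. It assumes the metadata lives under moov > udta > meta > ilst, as is common in iTunes-style files, and that the 'meta' atom carries 4 extra bytes after its header, which may be one reason some atoms don't fit the pattern. It also assumes a size of 1 means a 64-bit size follows and a size of 0 means the atom runs to the end of its parent.

```ruby
# Hypothetical helper: follows `path` through nested atoms located between
# the current position of `io` and `limit`. Returns [payload_start, atom_end]
# for the last atom on the path, or nil if the path cannot be followed.
def find_atom(io, path, limit)
  name, *rest = path
  while io.pos + 8 <= limit
    start = io.pos
    size, tag = io.read(8).unpack("Na4")
    size = io.read(8).unpack1("Q>") if size == 1   # assumed 64-bit size form
    size = limit - start if size == 0              # assumed "to the end" form
    return nil if size < 8

    if tag == name.b
      return [io.pos, start + size] if rest.empty?
      io.seek(4, IO::SEEK_CUR) if tag == "meta"    # assumed extra 4 bytes
      return find_atom(io, rest, start + size)
    end
    io.seek(start + size)                          # skip this atom entirely
  end
  nil
end

# Hypothetical helper: raw value bytes for a metadata tag such as "tvsh".
def tag_text(io, file_size, tag)
  io.seek(0)
  first, last = find_atom(io, ["moov", "udta", "meta", "ilst", tag, "data"], file_size)
  return nil if first.nil? || last - first < 8
  io.seek(first + 8)                               # skip the 8 bytes before the value
  io.read(last - first - 8)
end
```

With a file opened in "rb" mode, `tag_text(f, f.size, "tvsh")` would return the show name if the layout matches these assumptions. A numeric tag such as 'tvsn' would come back as raw bytes that still need unpacking.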
0

Theoretically you can use regexes against any arbitrary data, including binary strings. HTH.

1 Comment

Nothing theoretical about it, as long as the regex engine you're using has an 8-bit mode where one byte equals one character.
