37

I have a sequence of bytes that I have to search for in a set of Binary files using Java.

Example: I'm searching for the byte sequence DEADBEEF (in hex) in a Binary file. How would I go about doing this in Java? Is there a built-in method, like String.contains() for Binary files?

5 Answers 5

56

No, there is no built-in method to do that. But, directly copied from HERE (with two fixes applied to the original code):

/**
 * Knuth-Morris-Pratt Algorithm for Pattern Matching
 */
class KMPMatch {
    /**
     * Finds the first occurrence of the pattern in the text.
     */
    public static int indexOf(byte[] data, byte[] pattern) {
        if (data.length == 0) return -1;

        int[] failure = computeFailure(pattern);    
        int j = 0;

        for (int i = 0; i < data.length; i++) {
            while (j > 0 && pattern[j] != data[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == data[i]) { j++; }
            if (j == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    /**
     * Computes the failure function using a boot-strapping process,
     * where the pattern is matched against itself.
     */
    private static int[] computeFailure(byte[] pattern) {
        int[] failure = new int[pattern.length];

        int j = 0;
        for (int i = 1; i < pattern.length; i++) {
            while (j > 0 && pattern[j] != pattern[i]) {
                j = failure[j - 1];
            }
            if (pattern[j] == pattern[i]) {
                j++;
            }
            failure[i] = j;
        }

        return failure;
    }
}
Sign up to request clarification or add additional context in comments.

1 Comment

Very little optimization: you don't need to compute the failure function of the pattern if the data.length is zero ==> you can move the data.length zero checking to the first line of the function.
4
private int bytesIndexOf(byte[] source, byte[] search, int fromIndex) {
    boolean find = false;
    int i;
    for (i = fromIndex; i < (source.length - search.length); i++) {
        if (source[i] == search[0]) {
            find = true;
            for (int j = 0; j < search.length; j++) {
                if (source[i + j] != search[j]) {
                    find = false;
                }
            }
        }
        if (find) {
            break;
        }
    }
    if (!find) {
        return -1;
    }
    return i;
}

1 Comment

'<Source.length-Search.length' should be '<=' instead
3

You can find sequence of bytes from giga-bytes order file using bigdoc.

Lib and Example here on Github at: https://github.com/riversun/bigdoc

package org.example;

import java.io.File;
import java.util.List;

import org.riversun.bigdoc.bin.BigFileSearcher;

public class Example {

    public static void main(String[] args) throws Exception {

        byte[] searchBytes = "hello world.".getBytes("UTF-8");

        File file = new File("/var/tmp/yourBigfile.bin");

        BigFileSearcher searcher = new BigFileSearcher();

        List<Long> findList = searcher.searchBigFile(file, searchBytes);

        System.out.println("positions = " + findList);
    }
}

If you want to search it on memory,check this. Examples here on Github at: https://github.com/riversun/finbin

 import java.util.List;

 import org.riversun.finbin.BigBinarySearcher;

 public class Example {

     public static void main(String[] args) throws Exception {

         BigBinarySearcher bbs = new BigBinarySearcher();

         byte[] iamBigSrcBytes = "Hello world.It's a small world.".getBytes("utf-8");

         byte[] searchBytes = "world".getBytes("utf-8");

         List<Integer> indexList = bbs.searchBytes(iamBigSrcBytes, searchBytes);

         System.out.println("indexList=" + indexList);
     }
 }

Returns all the matched positions in the array of bytes

It also can withstand a large array of bytes:)

Comments

1

For those who prefer libraries, there is an implementation (source below) of the Knuth-Morris-Pratt algorithm in Twitter's Elephant-Bird open source library (Apache license).

You can find the library on Github at: https://github.com/twitter/elephant-bird

package com.twitter.elephantbird.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/**
 * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm.
 * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
 */
public class StreamSearcher {

  protected byte[] pattern_;
  protected int[] borders_;

  // An upper bound on pattern length for searching. Throws exception on longer patterns
  public static final int MAX_PATTERN_LENGTH = 1024;

  public StreamSearcher(byte[] pattern) {
    setPattern(pattern);
  }

  /**
   * Sets a new pattern for this StreamSearcher to use.
   * @param pattern
   *          the pattern the StreamSearcher will look for in future calls to search(...)
   */
  public void setPattern(byte[] pattern) {
    if (pattern.length > MAX_PATTERN_LENGTH) {
      throw new IllegalArgumentException("The maximum pattern length is " + MAX_PATTERN_LENGTH);
    }

    pattern_ = Arrays.copyOf(pattern, pattern.length);
    borders_ = new int[pattern_.length + 1];
    preProcess();
  }

  /**
   * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note
   * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the
   * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have
   * another reasonable default, i.e. leave the stream unchanged.
   *
   * @return bytes consumed if found, -1 otherwise.
   * @throws IOException
   */
  public long search(InputStream stream) throws IOException {
    long bytesRead = 0;

    int b;
    int j = 0;

    while ((b = stream.read()) != -1) {
      bytesRead++;

      while (j >= 0 && (byte)b != pattern_[j]) {
        j = borders_[j];
      }
      // Move to the next character in the pattern.
      ++j;

      // If we've matched up to the full pattern length, we found it.  Return,
      // which will automatically save our position in the InputStream at the point immediately
      // following the pattern match.
      if (j == pattern_.length) {
        return bytesRead;
      }
    }

    // No dice, Note that the stream is now completely consumed.
    return -1;
  }

  /**
   * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally
   * and aids in implementation of the Knuth-Moore-Pratt string search.
   * <p>
   * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm.
   */
  protected void preProcess() {
    int i = 0;
    int j = -1;
    borders_[i] = j;
    while (i < pattern_.length) {
      while (j >= 0 && pattern_[i] != pattern_[j]) {
        j = borders_[j];
      }
      borders_[++i] = ++j;
    }
  }
}

1 Comment

Where to stand the 1024 byte limitation for the pattern as stated by the unused MAX_PATTERN_LENGTH member ?
0

If the approved answer is too complex for your task then try simple code below. It uses native Java libs only

/**
 * Search byte array for the byte pattern
 *
 * @param array to look into
 * @param pattern to search
 * @return index of first occurrence of the pattern, -1 otherwise
 */
public static int searchBytePattern(byte[] array, byte[] pattern) {
    if (pattern.length > array.length) {
        return -1;
    }
    for (int i = 0; i <= array.length - pattern.length; i++) {
        if (Arrays.compare(array, i, i + pattern.length, pattern, 0, pattern.length) == 0) {
            return i;
        }
    }
    return -1;
}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.