0

Given 2 strings, I want to find the first match of at least four characters.

This is the code I currently have to do it. It works correctly, but I'm thinking there may be a better way to do this. Are there any screaming inefficiencies or bad practices in what I'm doing? Are there common libraries, like Apache Commons, that I should be taking advantage of but I'm not?

Don't worry about the Gene class - it just contains the String in question. Also - GeneMatch() signifies no match exists, whereas the GeneMatch constructor with arguments signifies a match has been found.

Constants.MIN_MATCH == 4, in this case.

public static GeneMatch findMatch(Gene g0, Gene g1) {

    String g0DNA = g0.getDNA();
    String g1DNA = g1.getDNA();

    if (g0DNA.equals("") || g1DNA.equals("")) { //there won't be a match if one is empty
        return new GeneMatch();
    }

    int g0Left = -1;
    int g0Right = -1;
    int g1Left = -1;
    int g1Right = -1;

    String window;

    for (int inx = 0; inx <= g0DNA.length() - Constants.MIN_MATCH; inx++) {
        window = g0DNA.substring(inx, inx + Constants.MIN_MATCH);

        if (g1DNA.indexOf(window) != -1) {

            g0Left = inx;
            g0Right = inx + Constants.MIN_MATCH;

            g1Left = g1DNA.indexOf(window);
            g1Right = g1Left + Constants.MIN_MATCH;

            /* grow the match to the right
             * while the two right indices are less than the lengths of their respective strings, and the 
             * characters at the indices match, increment each index
             */
            while (g0Right < g0DNA.length() && g1Right < g1DNA.length() && g0DNA.charAt(g0Right) == g1DNA.charAt(g1Right)) {
                g0Right++;
                g1Right++;
            }
            break; //we've already found a match, no need to continue sliding the window
        }
    }

    //now that the indices are found, convert to Genes
    if (g0Left == -1 || g0Right == -1 || g1Left == -1 || g1Right == -1) { //no match found
        return new GeneMatch();
    }

    Gene gL0 = new Gene(g0DNA.substring(0, g0Left));
    Gene gL1 = new Gene(g1DNA.substring(0, g1Left));

    Gene g0match = new Gene(g0DNA.substring(g0Left, g0Right));
    Gene g1match = new Gene(g1DNA.substring(g1Left, g1Right));

    Gene gR0 = new Gene(g0DNA.substring(g0Right));
    Gene gR1 = new Gene(g1DNA.substring(g1Right));

    //sanity check
    assert g0DNA.equals(gL0.getDNA() + g0match.getDNA() + gR0.getDNA()) : "g0 didn't add up";
    assert g1DNA.equals(gL1.getDNA() + g1match.getDNA() + gR1.getDNA()) : "g1 didn't add up";

    return new GeneMatch(gL0, gR0, g0match, g1match, gL1, gR1);

}
1
  • I'm scared as to how you're going to (re)use this.. Are you sure you don't want to use any of the readily available software designed to align two sequences? en.wikipedia.org/wiki/Sequence_alignment_software Commented Oct 16, 2009 at 7:23

3 Answers 3

2

Current approach

  1. Double g1DNA.indexOf(window) call - first call result can be stored and reused later;
  2. Unnecessary string objects construction during window = g0DNA.substring(inx, inx + Constants.MIN_MATCH);
  3. Unnecessary gL0, gL1, gR0, gR1 construction in case assertion is off;
  4. if (g0DNA.equals("") || g1DNA.equals("")) check can be improved in order to check that the strings has at least four symbols each;
  5. It always better to call equals() on constant, i.e. use "".equals(arg). It allows to avoid possible NPE if arg is null. It doesn't have much impact here, just a good coding policy to apply;
  6. There is String.isEmpty() method that can be used to replace "".equals(arg);
  7. No null check is performed for the DNA strings;

Improvements

  1. It's better to loop the shortest string, i.e. you should check dna1 and dna2 length and perform outer loop against the one with shorter length. That allows to minimize iterations number;
  2. You can avoid creating new string objects and operate in terms of characters. Moreover, you can modify the algorithm in order to work with any java.lang.CharSequence implementation;
  3. You can remember unmatched sequences, i.e. keep set of char sequences that were checked and proved to be unmatched in order to minimize the time of outer loop iteration. For example you iterate over the string that contains many 'b' chars. You check that the second string doesn't contain that char during first 'b' processing. You can remember that and stop subsequent 'b' processings eagerly;
  4. When you use String.indexOf() the search is performed from start of the string. That may be problem if the string to be search on is rather long. It may be worth to create a characters index for it. I.e. before finding the match you can iterate all target string characters and build mappings like 'character' -> 'set of indexes of their occurrence within the string'. That allows to perform the loop body check much faster in case of long strings;

General consideration There is no 'the one best algorithm' because 'the best' selection depends on input data profile and algorithm usage policy. I.e. if the algorithm is executed rarely and its performance impact is insignificant there is no point in spending a lot of time to its optimization and much better to write a simple code that is easy to maintain. If input strings are rather short there is no point in building characters index etc. In general just try to avoid preliminary optimization whenever possible and carefully consider all input data during choosing resulting algorithm if you really have a bottleneck there.

Sign up to request clarification or add additional context in comments.

7 Comments

thanks for the detailed response. Can you specify what you mean by point (4.)? Some of the strings I'm dealing with do get rather long, and this algorithm does get called many times.
and for (3.), what do you think would be the fastest algorithm to store unmatched sequences in? HashSet? LinkedList? ArrayList?
and by "algorithm", I mean "data structure". oops.
Aboit the forth point - when you call String.indexOf(), the string characters are consequently iterated from the zero index in order to find the match. Consider the example when dna2 string starts with 'b' char and dna1 string is a very long string where first 'b' char occurence is located rather far from the start. If you just use String.indexOf() you perform lot of unnecessary comparisons then.
About (3) - let's consider what do we want to achieve. I see the following targets - the data structure should be able to answer if particular char sequence is contained in it and response time should be as less as possible; stored data should be as lightweight as possible; we want to avoid creating new objects whenever possible. So, I see that we want to have a HashSet with objects of our custom class. That class should serve as a wrapper on a char sequence, i.e. we want to store target char sequence and offset and length to use with it.
|
1

Looks quite okay to me. Just two minor things:

  1. reuse the result of g1DNA.indexOf(window) instead of calling it twice (g1Left = g1DNA.indexOf(window);)

  2. you don't have to check all 4 vars for being == -1 as you all set them at once anyway.

Comments

0

Looks good to me. One might go ahead and micro optimize in terms of assignments, but this is the job of the JIT compiler. If you feel the algorithm is too slow, try to profile it.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.