5

I want to find out if a string that is comma separated contains only the same values:

test,asd,123,test
test,test,test

Here the 2nd string contains only the word "test". I'd like to identify these strings.

As I want to iterate over 100GB of data, performance matters a lot.

What is the fastest way to determine whether a string contains only one value, repeated?

public static boolean stringHasOneValue(String string) {
   String value = null;
   for (String split : string.split(",")) {
      if (value == null) {
         value = split;
      } else {
         if (!value.equals(split)) return false;
      }
   }
   return true;
}
  • The split will end up being a significant bottleneck due to memory allocations if your input is 100GB (especially in JRE7 onwards). Better stick with indexOf. You may not even want to use Strings at all, but instead read from an input stream or mapped memory through NIO. Commented Nov 11, 2015 at 16:11
  • Is it possible that these entries don't fit into memory? For example, could there be two values, 50 gigs each? Commented Nov 11, 2015 at 16:33

3 Answers

12

No need to split the string at all, in fact no need for any string manipulation.

  • Find the first word (indexOf comma).
  • Check that the remaining string length is an exact multiple of the length of that word plus the separating comma (i.e. (length + 1) % (foundLength + 1) == 0, where length is the length of the whole string).
  • Loop through the remainder of the string checking the found word against each portion of the string. Just keep two indexes into the same string and move them both through it. Make sure you check the commas too (i.e. bob,bob,bob matches; bob,bobabob does not).
  • As assylias pointed out there is no need to reset the pointers, just let them run through the String and compare the 1st with 2nd, 2nd with 3rd, etc.

Example loop; startPos must point to the first character after the first comma:

for (int i=startPos;i<str.length();i++) {
   if (str.charAt(i) != str.charAt(i-startPos)) {
      return false;
   }
}
return true;

You won't be able to do it much faster than this given the format the incoming data arrives in, but you can do it with a single linear scan. The length check eliminates many mismatched cases immediately, so it is a simple optimization. The loop above also relies on it to reject a truncated final value such as bob,bo.
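
The steps above can be put together roughly as follows. This is a minimal sketch, assuming each record is already available as a String; the class name SingleValueCheck and the method name hasOneValue are made up for illustration, and treating a string with no comma as a match is an assumption.

```java
final class SingleValueCheck {

    // Returns true if every comma-separated value in line equals the first one.
    static boolean hasOneValue(String line) {
        int firstComma = line.indexOf(',');
        if (firstComma < 0) {
            return true; // assumption: a single value with no comma counts as a match
        }
        int period = firstComma + 1; // length of the first word plus its comma
        // k words of length w joined by commas occupy k * (w + 1) - 1 characters.
        if ((line.length() + 1) % period != 0) {
            return false;
        }
        // Compare each character with the one a full period earlier.
        // This checks the commas as well as the words.
        for (int i = period; i < line.length(); i++) {
            if (line.charAt(i) != line.charAt(i - period)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasOneValue("test,asd,123,test")); // false
        System.out.println(hasOneValue("test,test,test"));    // true
        System.out.println(hasOneValue("bob,bobabob"));       // false
    }
}
```

Here startPos from the loop above is called period, since it is both the offset of the second word and the distance between the two indexes.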

5 Comments

In the third step, you mean to read using indexes right? Since you now know the size of the expected word. As @bill.cn said using the split method is overkill.
@RafaelSaraiva Yep, I've just finished editing my answer to clarify that :)
No need for resetting in step 3 - you can simply compare 2nd occurrence with 3rd occurrence etc.
Actually, if we are scanning a text file, to know a length we would have to read it whole. I am afraid that this optimisation for checking length, although clever, will not work in this case.
@JaroslawPawlak That depends how the incoming data is arriving. If it's a continuous stream then you could well be right. If it's already coming in as Strings in separate lines though (which the question implies it is) then we have the length available.
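
For the continuous-stream case discussed in these comments, where the length is not known up front, a variant that buffers only the first value might look like the sketch below. It is an illustration rather than part of the answer: the names StreamingCheck and lineHasOneValue are hypothetical, and it assumes records are terminated by '\n' and that the first value fits in memory.

```java
import java.io.IOException;
import java.io.Reader;

final class StreamingCheck {

    // Reads one record (up to '\n' or end of input) and reports whether all
    // of its comma-separated values are equal. Only the first value is buffered.
    static boolean lineHasOneValue(Reader in) throws IOException {
        StringBuilder first = new StringBuilder();
        int c;
        // Buffer the first value.
        while ((c = in.read()) != -1 && c != '\n' && c != ',') {
            first.append((char) c);
        }
        if (c != ',') {
            return true; // assumption: a record with a single value counts as a match
        }
        boolean same = true;
        int pos = 0; // position inside the value currently being compared
        while ((c = in.read()) != -1 && c != '\n') {
            if (!same) {
                continue; // keep consuming so the reader ends up at the next record
            }
            if (c == ',') {
                same = pos == first.length(); // the value just finished must be complete
                pos = 0;
            } else {
                same = pos < first.length() && first.charAt(pos) == c;
                pos++;
            }
        }
        // The last value must be complete as well.
        return same && pos == first.length();
    }
}
```

Wrapping the underlying stream in a BufferedReader before calling this is advisable, since the sketch reads one character at a time.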
1

Calling split might be expensive, especially on 100 GB of data.

Consider something like the following (not thoroughly tested, so the index values may need a bit of tweaking, but you will get the idea):

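public static boolean stringHasOneValue(String string) {

        String separator = ",";
        int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
        if (firstSeparator < 0) {
            return true; // no separator, so the string holds a single value
        }
        String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
        int lengthOfIncrement = firstValue.length() + 1; // the value plus one to account for the comma

        // n equal values joined by commas occupy n * lengthOfIncrement - 1 characters
        if ((string.length() + 1) % lengthOfIncrement != 0) {
            return false;
        }

        for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
            String currentValue = string.substring(i, i + firstValue.length());
            // the character before each value must be the separator
            if (string.charAt(i - 1) != ',' || !firstValue.equals(currentValue)) {
                return false;
            }
        }

        return true;
    }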

The complexity is O(n), assuming Java's implementation of substring is efficient. If it is not, you can write your own comparison that reads the required number of characters directly from the String.
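
One way to avoid substring altogether is String.regionMatches, which compares a region of the string in place without creating new String objects. The following is a sketch of that idea rather than a tested drop-in replacement; the class name RegionCheck is hypothetical, and treating a string with no comma as a match is an assumption.

```java
final class RegionCheck {

    static boolean stringHasOneValue(String string) {
        int firstSeparator = string.indexOf(',');
        if (firstSeparator < 0) {
            return true; // assumption: a single value with no comma counts as a match
        }
        int valueLength = firstSeparator;       // length of the first value
        int lengthOfIncrement = valueLength + 1; // the value plus its comma

        // n equal values joined by commas occupy n * lengthOfIncrement - 1 characters.
        if ((string.length() + 1) % lengthOfIncrement != 0) {
            return false;
        }

        for (int i = lengthOfIncrement; i < string.length(); i += lengthOfIncrement) {
            // The character before each value must be a comma, and the value
            // itself must equal the first value (compared in place).
            if (string.charAt(i - 1) != ','
                    || !string.regionMatches(i, string, 0, valueLength)) {
                return false;
            }
        }
        return true;
    }
}
```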

0

Just for fun, a one-line version:

(@Tim's answer is more efficient)

System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1));
