Comparing strings with different newlines

Question

Recently, I was burdened with the task of finding a bug. It turned out the problem was that strings from different systems contained different newlines. Two strings with different newlines (but the same "text") do not compare as equal. E.g. "new\nline" (Unix flavor) and "new\r\nline" (Windows flavor) are not equal.

Since the code will be dealing with both types of newlines, I wrote a method to test for equality independent of newline type. The code treats "\n", "\r", "\r\n" and "\n\r" the same (even though "\n\r" isn't really used as a newline).

Now that the code is done, I would like your opinion on it. What do you think of the variable and method names (I know I could have chosen better names)? Is there a way to optimize the code or make it more readable?

public class StringUtils {
    public static final char LF = '\n';
    public static final char CR = '\r';

    public static boolean equalsIgnoreNewlineTwirks(String str, String other){
        if (str == null || other == null){
            return false;
        }
        if (str == other){
            return true;
        }

        char[] s1 = str.toCharArray();
        char[] s2 = other.toCharArray();
        int index1 = 0, index2 = 0;
        while (true){
            boolean oob1 = index1 >= s1.length, oob2 = index2 >= s2.length;
            if (oob1 | oob2){
                return oob1 & oob2;
            }

            char ch1 = s1[index1], ch2 = s2[index2];
            if (ch1 != ch2){
                if (ch1 != LF && ch1 != CR) return false;
                if (ch2 != LF && ch2 != CR) return false;

                if (index1 + 1 < s1.length && isCRAndLF(s1[index1], s1[index1 + 1])){
                    index1++;
                }
                if (index2 + 1 < s2.length && isCRAndLF(s2[index2], s2[index2 + 1])){
                    index2++;
                }
            }

            index1++; index2++;
        }
    }

    private static boolean isCRAndLF(char ch1, char ch2){
        return (ch1 == CR && ch2 == LF) || (ch1 == LF && ch2 == CR);
    }
}

What is the meaning of oob1? (It signals whether the index is at the end of the string, but I'm curious how that relates to the letters "oob".) Edit: probably out-of-bytes. Right?
– industryworker3595112, Commented Aug 31, 2016 at 12:28
@industryworker3595112 out-of-bounds, likely. (Speedrunner jargon makes me personally immediately associate OOB with Out-Of-Bounds)
– CAD97, Commented Aug 31, 2016 at 13:40
@industryworker3595112 it means "out of bounds". I came to think of it when I remembered Java's IndexOutOfBoundsException.
– Sirac, Commented Aug 31, 2016 at 17:10

zeluisping · Accepted Answer · 2019-07-11 14:46:26Z

Alternative to what follows

An alternative to everything that follows is removing any carriage return (\r) from the strings when you get them as input into the program.

Some reading on the iteration and carriage return

This question over at Stack Overflow will give you some insight into iteration over a string including speeds.

This one explains what's behind the carriage return and why we have both \r\n and just \n.

Onto your code (What was already mentioned)

As Roland Illig has already mentioned in his answer, the function and parameter names should change; I'll use the following:

public static boolean equalsIgnoringNewlineStyle(String a, String b)

And the two constants are now private.

Do read CAD97's answer as it also gives you some new points, of which I will mention only one: if both are null, the result should be true, in conformance with java.util.Objects::equals. All his other points remain relevant. Thus:

public static boolean equalsIgnoringNewlineStyle(String a, String b) {
    if (a == b) {
        return true;
    }

    if (a == null || b == null) {
        return false;
    }

    // ...

toCharArray()

toCharArray() returns a new copy of the string's characters, which will slow down your function, especially if you intend to call it often. Instead, we'll index into the strings with charAt, reading each needed character only once and storing it in a char.

Variable Declaration

This might be more of a style choice, but declaring each variable on its own line tends to look cleaner, and the end result is the same, with no effect on performance. Thus I'd change the index declarations to this:

int index_a = 0;
int index_b = 0;

`while (true)`

while (true) is generally considered bad practice; use a while or do...while with a real condition, depending on your needs. In this case we'd stick with a while, since our exit condition is checked right at the start of the loop, so we move that condition into the loop header:

while (index_a < a.length() && index_b < b.length()) {
    // ...
}

return index_a == a.length() && index_b == b.length();

Notice the return: it tells us right away that anything that can change the result of the function call to false will be inside the while loop (either through other conditions that return false or just through changes to index_a and index_b).

The return result is whether or not we went through the whole of both strings, if not then they are of different lengths.

`ch1` and `ch2`

Here I have renamed ch1 and ch2 to first and second respectively. The more distinct names make it easier to spot which is which and help avoid errors where one writes ch1 or ch2 when the other was meant; this kind of typo is usually hard to find, too. Note that the same could happen with my a and b, but it is less likely.

char first = a.charAt(index_a);
char second = b.charAt(index_b);

Reduce Indentations

This is a pretty simple change that makes the code easier to read: it reduces the amount of indentation, makes the code more vertical, and reduces brace nesting (deep nesting makes code harder to read). So instead of this:

if (first != second){
    // ...
}

we'll have this:

if (first == second) {
    ++index_a;
    ++index_b;
    continue;
}

We have to increment the indices since we are not going to reach the end of the loop which is where we do it.

Merging `if`s

You have two sequential conditions both of which have a common result: returning false. This:

if (ch1 != LF && ch1 != CR) return false;
if (ch2 != LF && ch2 != CR) return false;

becomes this:

if ((first != LF && first != CR) ||
    (second != LF && second != CR)) {
    // the characters differ and at least one is neither 'LF' nor 'CR'
    return false;
}

Removing `isCRAndLF`

I removed isCRAndLF as it is a simple function whose logic can be written inline, which also removes the function call (in case the compiler does not inline it). Even if the compiler did inline it, writing the check on-site reduces reading complexity.

Applying the changes to this:

if (index1 + 1 < s1.length && isCRAndLF(s1[index1], s1[index1 + 1])){
    index1++;
}
if (index2 + 1 < s2.length && isCRAndLF(s2[index2], s2[index2 + 1])){
    index2++;
}

we get this:

if (index_a + 1 < a.length()) {
    char other = a.charAt(index_a + 1);

    // 'first' here is either \n or \r (checked before)
    // other != first ::= not { \n\n , \r\r }
    if (other != first && (other == LF || other == CR)) {
        ++index_a;
    }
}

if (index_b + 1 < b.length()) {
    char other = b.charAt(index_b + 1);

    // same as above, but for 'second'
    if (other != second && (other == LF || other == CR)) {
        ++index_b;
    }
}

Now explaining the condition further: earlier in the function we checked whether first and second were each either LF or CR, and we only continue to this point if that was true. This means we have the following (same for second):

first ::= LF | CR

So now we check other != first, which means:

first == LF && other != LF || first == CR && other != CR

This means they have to be different, so neither \n\n nor \r\r is matched. We then check:

other == LF || other == CR

This makes sure that other is either LF or CR: we have not checked it yet, and it could be any other character that has nothing to do with what we want. Together with other != first, this constrains the possible pairs to \n\r and \r\n by making sure other is the "opposite" of first.

Full code (untested)

public class StringUtils {
    static final char LF = '\n';
    static final char CR = '\r';

    public static boolean equalsIgnoringNewlineStyle(String a, String b){
        if (a == b) { // if both null return true
            return true;
        }

        if (a == null || b == null) {
            return false;
        }

        // toCharArray is slow (creates new copy of the whole string)
        // we'll use indexing instead (it's faster)

        // cleaner variable declaration (does not affect performance)
        int index_a = 0;
        int index_b = 0;

        // while (true) is bad practice; moved the loop condition to the right place
        while (index_a < a.length() && index_b < b.length()) {
            char first = a.charAt(index_a);
            char second = b.charAt(index_b);

            if (first == second) {
                // decrease amount of indentation
                ++index_a;
                ++index_b;
                continue;
            }

            if ((first != LF && first != CR) ||
                (second != LF && second != CR)) {
                // at least one of the characters is not a new line
                return false;
            }

            if (index_a + 1 < a.length()) {
                char other = a.charAt(index_a + 1);

                // 'first' here is either \n or \r (checked before)
                // other != first ::= not { \n\n , \r\r }
                if (other != first && (other == LF || other == CR)) {
                    ++index_a;
                }
            }

            if (index_b + 1 < b.length()) {
                char other = b.charAt(index_b + 1);

                // same as above, but for 'second'
                if (other != second && (other == LF || other == CR)) {
                    ++index_b;
                }
            }

            ++index_a;
            ++index_b;
        }

        return index_a == a.length() && index_b == b.length();
    }
}

Doesn't this return true if one string is a substring of the other? E.g. app and apple
– CAD97, Commented Aug 31, 2016 at 13:28
@CAD97 Hmm, you are right; I'm guessing this is what happens when we write code without sleeping. I will probably be able to come back and fix it in two to three hours; thank you for noticing!
– zeluisping, Commented Aug 31, 2016 at 22:55
@CAD97 I'm at work on my break and I've edited the code. Changing the final return true into a check that we went through the whole of both strings should be enough; I don't know if I am missing more test cases. If you wouldn't mind reviewing, I'd be thankful! Once again, thank you!
– zeluisping, Commented Sep 1, 2016 at 0:45
Thank you for all your suggestions and for putting them all in the code. I have but one slight improvement to make. In Java you cannot index strings (pretty stupid, I know); you have to use String.charAt(int), so a[index_a] becomes a.charAt(index_a). This method always checks whether the index is OOB, so I might benchmark this against the toCharArray() method.
– Sirac, Commented Sep 1, 2016 at 10:35
@Sirac I have fixed the string indexing, thank you; I should have checked before! I have also added a simpler and faster method in case you don't go with the removal of carriage returns upon receiving the strings. I did not compile it, though.
– zeluisping, Commented Sep 1, 2016 at 19:21
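
One more test case worth trying against the full code above: "a\r\nb" versus "a\rb". Both strings have '\r' at the same position, so the loop advances both indices by one and then compares '\n' with 'b', which returns false even though the question wants "\r\n" and "\r" treated alike. The following is only a sketch of one way to handle that, not the answer's code: it consumes a whole newline sequence on each side before comparing. The class name NewlineAwareEquals and the helper newlineLength are hypothetical names made up for this illustration.

```java
final class NewlineAwareEquals {
    private static final char LF = '\n';
    private static final char CR = '\r';

    // Hypothetical helper: length of the newline sequence starting at index i.
    // Returns 0 if there is no newline at i, 2 for "\r\n" or "\n\r", otherwise 1.
    private static int newlineLength(String s, int i) {
        if (i >= s.length()) {
            return 0;
        }
        char c = s.charAt(i);
        if (c != LF && c != CR) {
            return 0;
        }
        if (i + 1 < s.length()) {
            char next = s.charAt(i + 1);
            // "\n\n" and "\r\r" are two newlines, so require next != c
            if (next != c && (next == LF || next == CR)) {
                return 2;
            }
        }
        return 1;
    }

    static boolean equalsIgnoringNewlineStyle(String a, String b) {
        if (a == b) {
            return true; // same reference, or both null
        }
        if (a == null || b == null) {
            return false;
        }

        int indexA = 0;
        int indexB = 0;
        while (indexA < a.length() && indexB < b.length()) {
            int newlineA = newlineLength(a, indexA);
            int newlineB = newlineLength(b, indexB);

            if (newlineA > 0 && newlineB > 0) {
                // both sides are at a newline: skip each one entirely
                indexA += newlineA;
                indexB += newlineB;
            } else if (newlineA == 0 && newlineB == 0
                    && a.charAt(indexA) == b.charAt(indexB)) {
                ++indexA;
                ++indexB;
            } else {
                return false;
            }
        }

        // equal only if both strings were consumed completely
        return indexA == a.length() && indexB == b.length();
    }
}
```

Because the newline check runs before the character comparison, it no longer matters whether the two sides happen to start their newline with the same character. Sequences are read greedily from the left, so "\n\r\n" counts as "\n\r" followed by "\n".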

Roland Illig · Accepted Answer · 2016-08-31 00:15:22Z

The code looks quite complicated at first glance, but when reading it, it is straightforward. If you want to make the code shorter, you can just do this:

public static boolean equalsIgnoreNewlineStyle(String s1, String s2) {
    return s1 != null && s2 != null && normalizeLineEnds(s1).equals(normalizeLineEnds(s2));
}

private static String normalizeLineEnds(String s) {
    return s.replace("\r\n", "\n").replace('\r', '\n');
}

Concerning running time and GC stress, your code is probably better. Use a benchmark to see how much better it is.

The word twirks sounded negative to me, therefore I replaced it with style.

Since the two strings have equal rights and are treated the same, none of them should be called the "other one".

You should not make the constants public, since there is no need to do that.

edited Aug 31, 2016 at 0:15

answered Aug 31, 2016 at 0:07

Actually, this is likely as performant as the version that calls toCharArray() on each input string. I'd be surprised if the benchmark showed much difference.
– Toby Speight, Commented Aug 31, 2016 at 14:14
It looks like both variants include some allocations, but the calls to toCharArray can be easily optimized away. String is an immutable data type, and the resulting array is only examined, not modified; it is also only used as a local variable and not passed outside the current method call. Therefore the VM can assume that copying is not necessary in this case. So there might actually be a large difference in favor of the OP's code.
– Roland Illig, Commented Aug 31, 2016 at 16:11
... and this discussion just proves the same old point: if in doubt, benchmark! :-)
– Toby Speight, Commented Aug 31, 2016 at 16:26
Since my code will run very often I might think of normalizing the strings. I will test that method too, thank you.
– Sirac, Commented Sep 1, 2016 at 10:29

Community · Accepted Answer · 2017-04-13 12:40:47Z

You return false given a call StringUtils.equalsIgnoreNewlineTwirks(null, null). This can be changed by putting str == other before str == null || other == null. Given java.util.Objects::equals returns true if given two nulls, I think this is the expected behavior.

Consider normalizing your Strings on receipt as another option. There are many other ways that strings can differ in representation while still being "equivalent" (according to the Unicode specification), such as two more potential line ending characters, LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).

If you only need to deal with \r\n vs \n, consider using the normalization routine Roland Illig provides. Though it is likely more costly to run this a single time to check two strings, if you need to check against multiple strings, it is almost certainly more efficient to normalize the string on receipt, a single time, rather than messing around with normalization-aware equality. Down the line, when you ultimately need to treat ñ (\u00F1) and ñ (\u006E\u0303) as the same, normalizing on receipt will be much easier and more maintainable.

Consider also if you emit any strings: it is much nicer to anyone using your program if it emits a consistent line ending, rather than mixing \n and \r\n. This is, of course, assuming that you do not need to keep the exact string you were provided with for some reason.
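
As a rough sketch of what normalizing on receipt could look like: the class name InputNormalizer and the method normalizeOnReceipt are hypothetical, and the choice of "\n" as the canonical line ending and NFC as the Unicode form are assumptions for illustration. java.text.Normalizer is part of the standard library.

```java
import java.text.Normalizer;

final class InputNormalizer {
    // Hypothetical helper: canonicalize a string once, when it enters the program.
    static String normalizeOnReceipt(String s) {
        // "\r\n" is listed first so it becomes one "\n", not two;
        // a lone "\r", U+2028 and U+2029 are each mapped to "\n" as well.
        String unifiedNewlines = s.replaceAll("\r\n|[\r\u2028\u2029]", "\n");

        // NFC composes sequences such as n + U+0303 into the single code point U+00F1.
        return Normalizer.normalize(unifiedNewlines, Normalizer.Form.NFC);
    }
}
```

After this step, plain String.equals is enough for every later comparison, and anything the program emits uses a single line ending.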

As for your algorithm as written:

You use bit-wise or (|) / and (&) to combine oob1 and oob2. It is more typical to just use logical OR (||) / AND (&&), as these have the typically expected precedence and short-circuiting behavior. In this case, the semantics are the same, but I would still prefer the logical versions; the argument that the bit-wise operations are faster is a moot point, as the compiler will perform that sort of optimization for you. And in any case, you're talking about one or two machine instructions' difference, and if you care that much, I hate to say it, but you shouldn't be writing Java :P
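
To make the short-circuiting difference concrete, here is a small sketch; the class ShortCircuitDemo and its methods are made up for this illustration. With | both operands are always evaluated, while || skips the right operand once the left one is true.

```java
final class ShortCircuitDemo {
    static int evaluations = 0;

    // Counts how often an operand is actually evaluated.
    static boolean check(boolean value) {
        evaluations++;
        return value;
    }

    // Non-short-circuit OR: both operands are evaluated.
    static int countWithBitwiseOr() {
        evaluations = 0;
        boolean ignored = check(true) | check(false);
        return evaluations;
    }

    // Short-circuit OR: the right operand is skipped because the left is true.
    static int countWithLogicalOr() {
        evaluations = 0;
        boolean ignored = check(true) || check(false);
        return evaluations;
    }
}
```

For oob1 and oob2 the operands are plain boolean variables with no side effects, which is why both spellings give the same result there.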

I'd rename ::isCRAndLF to ::isLineSeparator, and have it take only one character. This is more in line with the Character methods (e.g. Character::isAlphabetic). (Strangely enough, the Java API does not seem to offer this method out of the box, though it does offer System::lineSeparator.)
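
A minimal sketch of that single-character helper, assuming only '\n' and '\r' count as line separators as in the question (the enclosing class name LineSeparators is hypothetical):

```java
final class LineSeparators {
    // True if c is one of the two characters the question treats as newline parts.
    static boolean isLineSeparator(char c) {
        return c == '\n' || c == '\r';
    }
}
```

With it, the two early returns in the loop could read: if (!isLineSeparator(ch1) || !isLineSeparator(ch2)) return false;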

And I also agree with Roland Illig's points:

NewlineTwirks -> NewlineStyle

Don't make constants public needlessly

The two strings are equals; neither should be called other

Consider just str1 and str2, as the rest of your variables follow the 1/2 pattern. I've also seen lhs and rhs a good deal, though mostly in dealing with, e.g. lhs + rhs.

Nik Kashi · Accepted Answer · 2021-05-06 12:36:58Z

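Simply normalize both sides using org.apache.commons.lang3.StringUtils.normalizeSpace. You can find documentation here. Note that normalizeSpace also trims the string and collapses every run of whitespace into a single space, so strings that differ only in other whitespace will compare equal as well.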

answered May 6, 2021 at 12:36

Welcome to the Code Review Community. Our goal in this community is to help programmers improve their coding skills by making meaningful observations about the existing code. Alternate solutions are not the goal on this site. Explain why the alternate solution you provide is better than the current implementation.
– pacmaninbw ♦, Commented May 6, 2021 at 15:08
