
I have a case of mixed data in a database, and I am trying to find out whether this problem can be solved. What I have is a partial URL in one of three formats:

/some/path?ugly=häßlich // case 1, Encoding: UTF-8 (plain)
/some/path?ugly=h%C3%A4%C3%9Flich // case 2, Encoding: UTF-8 (URL-encoded)
/some/path?ugly=h%E4%DFlich // case 3, Encoding: ISO-8859-1 (URL-encoded)

What I need in my application is the URL-encoded UTF-8 version:

/some/path?ugly=h%C3%A4%C3%9Flich // Encoding: UTF-8 (URL-encoded)

The strings in the DB are all UTF-8, but the URL encoding may or may not be present and may be in either format.

I have a method a that encodes plain UTF-8 to URL-encoded UTF-8, and a method b that decodes URL-encoded ISO-8859-1 to plain UTF-8, so basically what I plan to do is:

case 1:

String output = a(input);

case 2:

String output = input;

case 3:

String output = a(b(input));

All of these cases work fine if I know which is which, but is there a safe way for me to detect whether such a string is case 2 or case 3? (I can limit the languages used in the parameters to European languages: German, English, French, Dutch, Polish, Russian, Danish, Norwegian, Swedish and Turkish, if that is any help.)

I know the obvious solution would be to clean up the data, but unfortunately I do not create the data myself, and the people who do lack the necessary technical understanding (and there is plenty of legacy data that needs to work).

  • Are only characters (like in your example) and numbers encoded? Commented Jul 10, 2012 at 20:24
  • @s106mo Yes, the application is a redirect to a better search query, and those are alphanumeric by definition. Thanks for the suggestion. Commented Jul 10, 2012 at 21:21

3 Answers

If you can assume that only alphanumerics are encoded, the following would work for:

  • "häßlich"
  • "h%C3%A4%C3%9Flich"
  • "h%E4%DFlich"

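import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Check this first: a plain (not URL-encoded) value contains only letters and digits.
public static boolean isUtf8Encoded(String url) {
    return isAlphaNumeric(url);
}

// True if the value, URL-decoded as UTF-8, contains only letters and digits.
public static boolean isUrlUtf8Encoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "UTF-8"));
}

// True if the value, URL-decoded as ISO-8859-1, contains only letters and digits.
public static boolean isUrlIsoEncoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "ISO-8859-1"));
}

// A leftover '%' or a character produced by decoding with the wrong
// charset is not a letter or digit, so the check fails.
private static boolean isAlphaNumeric(String decoded) {
    for (char c : decoded.toCharArray()) {
        if (!Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}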
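The three checks can be chained to produce the URL-encoded UTF-8 form the question asks for. The following is only a sketch under the same assumption (the value consists solely of letters and digits once decoded); the class name `ParamNormalizer` and the method `normalize` are hypothetical, and it is meant for a single parameter value, not a whole URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original answer.
final class ParamNormalizer {

    /** Returns the URL-encoded UTF-8 form of a single parameter value. */
    static String normalize(String value) {
        String plain;
        if (isAlphaNumeric(value)) {
            plain = value;                        // case 1: not URL-encoded
        } else if (isAlphaNumeric(decode(value, "UTF-8"))) {
            plain = decode(value, "UTF-8");       // case 2: URL-encoded UTF-8
        } else if (isAlphaNumeric(decode(value, "ISO-8859-1"))) {
            plain = decode(value, "ISO-8859-1");  // case 3: URL-encoded ISO-8859-1
        } else {
            throw new IllegalArgumentException("Unrecognized encoding: " + value);
        }
        try {
            return URLEncoder.encode(plain, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);   // UTF-8 is always supported
        }
    }

    private static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);   // both charsets are always supported
        }
    }

    private static boolean isAlphaNumeric(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(normalize("h%E4%DFlich"));
    }
}
```

UTF-8 is tried before ISO-8859-1 on purpose: ISO-8859-1 can decode any byte sequence, so it is the weaker test and works best as the fallback.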

As a workaround, you can decode first and then encode; an unencoded URL is not affected by the decoding step.

String url = "your url";
url = URIUtil.decode(url, "UTF-8");
url = URIUtil.encodeQuery(url, "UTF-8");
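The same decode-then-encode idea can be sketched with JDK classes only. `URIUtil` above appears to be the Apache Commons HttpClient class; the sketch below swaps in `java.net.URLDecoder` and `java.net.URLEncoder` instead, and the class name `RoundTrip` is hypothetical. It operates on a single parameter value, because `URLEncoder` would also escape `/`, `?` and `=`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper illustrating the round trip on one parameter value.
final class RoundTrip {

    /** Decodes the value as UTF-8, then re-encodes it as UTF-8. */
    static String decodeThenEncode(String value) {
        try {
            String decoded = URLDecoder.decode(value, "UTF-8");
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // Cases 1 and 2 from the question end up in the same form.
        System.out.println(decodeThenEncode("h\u00e4\u00dflich"));
        System.out.println(decodeThenEncode("h%C3%A4%C3%9Flich"));
        // Case 3 does not: the ISO-8859-1 bytes are invalid UTF-8 and are
        // replaced by U+FFFD, which is re-encoded as %EF%BF%BD.
        System.out.println(decodeThenEncode("h%E4%DFlich"));
    }
}
```

So the round trip covers cases 1 and 2 from the question, but case 3 still has to be detected separately, since decoding ISO-8859-1 bytes as UTF-8 loses the original characters.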


Thanks for the accepted answer, but it does not work for a full URL, because a URL also contains delimiter characters such as '/', '?' and '&'. This is my solution:

/**
 * Characters allowed in a URL besides letters and digits.
 */
private static final List<Character> VALID_CHARACTERS = Arrays.asList(
        '-', '.', '_', '~', ':', '/', '?', '#', '[', ']', '@', '!',
        '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='
);

/**
 * Check whether decoding was successful.
 * @param url Decoded URL to check
 * @return True if every character is a letter, a digit or a valid URL character.
 */
private static boolean isValid(final String url) {
    for (char c : url.toCharArray()) {
        if (!VALID_CHARACTERS.contains(c) && !Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}

/**
 * Try to decode URL with a specific encoding.
 * @param url URL
 * @param encoding Valid encoding
 * @return Decoded URL, or null if the encoding is not the right one
 * @throws IllegalArgumentException If the encoding is not supported on your system.
 */
private static String _decodeUrl(final String url, final String encoding) {
    try {
        final String decoded = URLDecoder.decode(url, encoding);
        if (isValid(decoded)) {
            return decoded;
        }
    }
    catch (UnsupportedEncodingException ex) {
        throw new IllegalArgumentException("Illegal encoding: " + encoding);
    }
    return null;
}

/**
 * Decode URL with the most popular encodings for URLs.
 * @param url URL
 * @return Decoded URL, or the original one if no encoding yields a valid result.
 */
public static String decodeUrl(final String url) {
    // UTF-8 is tried first: ISO-8859-1 can decode any byte sequence, so trying it
    // first may accept UTF-8 input such as %C3%AA as the two letters "Ãª".
    final String[] mostPopularEncodings = new String[] {"utf-8", "iso-8859-1", "GB2312"};
    return decodeUrl(url, mostPopularEncodings);
}

/**
 * Decode URL by trying the given encodings in order.
 * @param url URL
 * @param encodings Encodings to try
 * @return Decoded URL, or the original one if no encoding yields a valid result.
 */
public static String decodeUrl(final String url, final String... encodings) {
    for (String encoding : encodings) {
        final String decoded = _decodeUrl(url, encoding);
        if (decoded != null) {
            return decoded;
        }
    }
    return url;
}

3 Comments

Nice, but instead of Character objects, a Guava CharMatcher would be way more efficient.
Thanks, but I think it uses isLetterOrDigit internally as well! And what if I don't use Google libs?
No, it doesn't. It is optimized to do lookups using a bit table. And regarding not using Google libs: perhaps you should reconsider. They are some of the best open source libs out there.
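For readers who prefer to stay without Guava, the bit-table idea mentioned in the last comment can be approximated with `java.util.BitSet`. This is only a sketch of the general technique, not Guava's actual implementation; the class `UrlCharTable` and its methods are hypothetical names. Letters and digits are still checked with `Character.isLetterOrDigit`, only the punctuation lookup goes through the table.

```java
import java.util.BitSet;

// Hypothetical helper: membership test for URL punctuation via a bit table.
final class UrlCharTable {

    private static final BitSet ALLOWED = new BitSet(128);

    static {
        // Same punctuation as VALID_CHARACTERS in the answer above.
        for (char c : "-._~:/?#[]@!$&'()*+,;=".toCharArray()) {
            ALLOWED.set(c);
        }
    }

    /** True if c is listed punctuation, a letter or a digit. */
    static boolean isAllowed(char c) {
        return ALLOWED.get(c) || Character.isLetterOrDigit(c);
    }

    /** True if every character of the decoded URL is allowed. */
    static boolean isValid(String url) {
        for (int i = 0; i < url.length(); i++) {
            if (!isAllowed(url.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

The lookup avoids boxing each `char` into a `Character` and scanning a list, which is the inefficiency the first comment points at.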
