Detect encoding of URL in Java

Question

I have a case of mixed Data in a Database, and I am trying to see if this is a problem that can be solved. What I have is a partial URL in one of three formats:

/some/path?ugly=häßlich // case 1, Encoding: UTF-8 (plain)
/some/path?ugly=h%C3%A4%C3%9Flich // case 2, Encoding: UTF-8 (URL-encoded)
/some/path?ugly=h%E4%DFlich // case 3, Encoding: ISO-8859-1 (URL-encoded)

What I need in my application is the URL-encoded UTF-8 version:

/some/path?ugly=h%C3%A4%C3%9Flich // Encoding: UTF-8 (URL-encoded)

The Strings in the DB are all UTF-8, but the URL-encoding may or may not be present and may be of either format.

I have a method a that encodes plain UTF-8 to URL-encoded UTF-8, and I have a method b that decodes URL-encoded ISO-8859-1 to plain UTF-8, so basically what I plan to do is:

case 1:

String output = a(input);

case 2:

String output = input;

case 3:

String output = a(b(input));

All of these cases work fine if I know which is which, but is there a safe way for me to detect whether such a String is case 2 or 3? (I can limit the languages used in the parameters to European languages: German, English, French, Dutch, Polish, Russian, Danish, Norwegian, Swedish and Turkish, if that is any help.)

I know the obvious solution would be to clean up the data, but unfortunately the data is not created by myself, nor do the people who create it have the necessary technical understanding (and there is plenty of legacy data that needs to work).

Are only characters (like in your example) and numbers encoded?
– s106mo, Commented Jul 10, 2012 at 20:24
@s106mo Yes, the application is a redirect to a better search query, and those are alphanumeric by definition. Thanks for the suggestion.
– Sean Patrick Floyd, Commented Jul 10, 2012 at 21:21

s106mo · Accepted Answer · 2012-07-10 20:33:34Z

2

If you can assume that only alphanumerics are encoded, the following would work for:

"häßlich"
"h%C3%A4%C3%9Flich"
"h%E4%DFlich"

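import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Apply the checks in this order; the first one that returns true decides the case.
// The order matters: some UTF-8 sequences also decode to letters under ISO-8859-1
// (for example %C3%AA becomes "Ãª").

// Case 1: plain text, nothing is URL-encoded.
public static boolean isUtf8Encoded(String url) {
    return isAlphaNumeric(url);
}

// Case 2: URL-encoded UTF-8.
public static boolean isUrlUtf8Encoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "UTF-8"));
}

// Case 3: URL-encoded ISO-8859-1.
public static boolean isUrlIsoEncoded(String url)
        throws UnsupportedEncodingException {
    return isAlphaNumeric(URLDecoder.decode(url, "ISO-8859-1"));
}

// A wrong decoding leaves symbols or replacement characters behind,
// which are neither letters nor digits.
private static boolean isAlphaNumeric(String decode) {
    for (char c : decode.toCharArray()) {
        if (!Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}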
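
To get from these checks to the output the question asks for, the three cases can be chained. What follows is only a sketch under the same assumption (the value is purely alphanumeric once decoded). The class name ParamNormalizer and the method toUrlEncodedUtf8 are made up for illustration, and the method expects the parameter value only, not the whole URL.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of the answer above.
final class ParamNormalizer {

    static boolean isAlphaNumeric(String s) {
        for (char c : s.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                return false;
            }
        }
        return true;
    }

    // Wraps the checked exception; every JVM supports both charset names used below.
    private static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the value as URL-encoded UTF-8, trying case 1, 2 and 3 in that order. */
    static String toUrlEncodedUtf8(String value) {
        String[] candidates = {
            value,                       // case 1: already plain
            decode(value, "UTF-8"),      // case 2: URL-encoded UTF-8
            decode(value, "ISO-8859-1")  // case 3: URL-encoded ISO-8859-1
        };
        for (String candidate : candidates) {
            if (isAlphaNumeric(candidate)) {
                try {
                    return URLEncoder.encode(candidate, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException("Cannot detect encoding of: " + value);
    }
}
```

With this sketch, all three sample values from the question should come out as h%C3%A4%C3%9Flich. A value that is not alphanumeric under any of the three readings is rejected rather than guessed at.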


Elsayed · Accepted Answer · 2016-10-12 12:27:46Z

1

As a workaround, you can decode first and then encode. An unencoded URL is not affected by the decoding step.

String url = "your url";
url = URIUtil.decode(url, "UTF-8");
url = URIUtil.encodeQuery(url, "UTF-8");
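
For comparison, here is a sketch of the same decode-then-encode idea using only JDK classes instead of URIUtil. The names RoundTrip and decodeThenEncode are hypothetical. Two caveats apply: URLEncoder performs form encoding and would also escape '/', '?' and '=', so the sketch should be given the parameter value only; and a value that was URL-encoded as ISO-8859-1 (case 3) is not recovered, because its bytes are invalid UTF-8 and get turned into the replacement character U+FFFD.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical JDK-only variant of "decode first, then encode".
final class RoundTrip {

    static String decodeThenEncode(String value) {
        try {
            // Decoding leaves an unencoded value unchanged (as long as it has no '%' or '+').
            String decoded = URLDecoder.decode(value, "UTF-8");
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is supported by every JVM, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }
}
```

So this approach covers cases 1 and 2 from the question, while case 3 still needs a separate detection step.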


1 Comment

Sean Patrick Floyd Over a year ago

I take it you mean URIUtil from Apache HttpComponents

user1079877 · Accepted Answer · 2014-06-24 05:28:12Z

0

Thanks for the accepted answer, but it does not work for a full URL, because a URL also contains reserved characters such as '/', '?' and '='. This is my solution:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.List;

/**
 * List of valid non-alphanumeric characters in a URL.
 */
private static final List<Character> VALID_CHARACTERS = Arrays.asList(
        '-', '.', '_', '~', ':', '/', '?', '#', '[', ']', '@', '!',
        '$', '&', '\'', '(', ')', '*', '+', ',', ';', '='
);

/**
 * Check whether decoding was successful.
 * @param url Decoded URL to check
 * @return True if every character is a letter, a digit or a valid URL character.
 */
private static boolean isWellFormed(final String url) {
    for (char c : url.toCharArray()) {
        if (VALID_CHARACTERS.indexOf(c) == -1 && !Character.isLetterOrDigit(c)) {
            return false;
        }
    }
    return true;
}

/**
 * Try to decode URL with specific encoding.
 * @param url URL
 * @param encoding Valid encoding
 * @return Decoded URL or null if the encoding is not the right one
 * @throws IllegalArgumentException Thrown if the encoding is not supported on your system.
 */
private static String _decodeUrl(final String url, final String encoding) {
    try {
        final String decoded = URLDecoder.decode(url, encoding);
        if (isWellFormed(decoded)) {
            return decoded;
        }
    } catch (UnsupportedEncodingException ex) {
        throw new IllegalArgumentException("Illegal encoding: " + encoding, ex);
    }
    return null;
}

/**
 * Decode URL with most popular encodings for URL.
 * @param url URL
 * @return Decoded URL or original one if none of the encodings fits.
 */
public static String decodeUrl(final String url) {
    final String[] mostPopularEncodings = new String[] {"iso-8859-1", "utf-8", "GB2312"};
    return decodeUrl(url, mostPopularEncodings);
}

/**
 * Decode URL with the given encodings, tried in order.
 * @param url URL
 * @param encoding Encodings to try
 * @return Decoded URL or original one if none of the encodings fits.
 */
public static String decodeUrl(final String url, final String... encoding) {
    for (String e : encoding) {
        final String decoded = _decodeUrl(url, e);
        if (decoded != null) {
            return decoded;
        }
    }
    return url;
}
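
A different technique from the character whitelist above, shown here only as a sketch: turn the %XX escapes into raw bytes, try a strict UTF-8 decode, and fall back to ISO-8859-1 when the bytes are not valid UTF-8. This relies on ISO-8859-1 text with non-ASCII characters rarely forming valid UTF-8 byte sequences, which is a heuristic, not a guarantee. The names StrictUtf8Check, percentDecodeToBytes and decodeUtf8OrLatin1 are hypothetical, and the sketch assumes that a string does not mix plain non-ASCII characters with ISO-8859-1 escapes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, ISO-8859-1 as fallback.
final class StrictUtf8Check {

    /** Turns %XX escapes into raw bytes; every other character is written as UTF-8. */
    static byte[] percentDecodeToBytes(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < s.length()) {
            int hi = -1;
            int lo = -1;
            if (s.charAt(i) == '%' && i + 2 < s.length()) {
                hi = Character.digit(s.charAt(i + 1), 16);
                lo = Character.digit(s.charAt(i + 2), 16);
            }
            if (hi >= 0 && lo >= 0) {
                out.write((hi << 4) | lo);          // one escaped byte
                i += 3;
            } else {
                int cp = s.codePointAt(i);          // unescaped character, kept as is
                byte[] utf8 = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(utf8, 0, utf8.length);
                i += Character.charCount(cp);
            }
        }
        return out.toByteArray();
    }

    /** Decodes as UTF-8 if the bytes are valid UTF-8, otherwise as ISO-8859-1. */
    static String decodeUtf8OrLatin1(String s) {
        byte[] raw = percentDecodeToBytes(s);
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }
}
```

Because the UTF-8 attempt is strict, it never silently accepts invalid input, so it can safely be tried first. Unlike URLDecoder, this sketch leaves '+' untouched.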

edited Jun 24, 2014 at 5:28

answered Jun 24, 2014 at 5:09


3 Comments

Sean Patrick Floyd Over a year ago

Nice, but instead of Character objects, a Guava CharMatcher would be way more efficient

user1079877 Over a year ago

Thanks, but I think it uses isLetterOrDigit internally as well! And what if I don't use Google libs?

Sean Patrick Floyd Over a year ago

No, it doesn't. It is optimized to do lookups using a bit table. And regarding not using Google libs: perhaps you should reconsider. They are some of the best open source libs out there.
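
To illustrate the bit-table idea without any Google library, here is a JDK-only sketch based on java.util.BitSet. It is not Guava's implementation, only the general technique: the punctuation check becomes a single bit lookup instead of a linear search through a List of boxed Character objects. The class and method names are hypothetical.

```java
import java.util.BitSet;

// Hypothetical JDK-only lookup table for the URL punctuation characters.
final class UrlCharTable {

    private static final BitSet ALLOWED = new BitSet(128);

    static {
        // Same characters as in VALID_CHARACTERS from the answer.
        for (char c : "-._~:/?#[]@!$&'()*+,;=".toCharArray()) {
            ALLOWED.set(c);
        }
    }

    /** True for the listed punctuation and for any letter or digit. */
    static boolean isAllowed(char c) {
        return (c < 128 && ALLOWED.get(c)) || Character.isLetterOrDigit(c);
    }

    /** True if every character of the decoded URL is allowed. */
    static boolean allAllowed(String decodedUrl) {
        for (int i = 0; i < decodedUrl.length(); i++) {
            if (!isAllowed(decodedUrl.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Whether the difference is measurable for short URLs is a separate question; the sketch only shows what a table lookup looks like.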
