I am using an HTML parser called HtmlCleaner to parse HTML pages. The problem is that each page can have a different encoding from the others. My question:
Can I change from any character encoding to UTF-8?
Where do you get the HTML page from? If you get it from the servlet request, you can call getReader() on the request and pass that reader to clean(); the reader already decodes with the request's encoding. If you get it from an upload, pass the input stream to clean(). If you get it via an HTTP client, you need to check the charset in the response's Content-Type header, for example with getResponseCharSet().
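If your HTTP client only gives you the raw Content-Type header value, the charset can be pulled out by hand. The following is a minimal sketch using only the JDK; the class name ContentTypeCharset, the method fromHeader, and the choice of fallback are illustrative assumptions, not part of HtmlCleaner or any HTTP client.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the charset parameter from a Content-Type value.
class ContentTypeCharset {
    // Matches charset=<name>, with or without double quotes around the name.
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    static Charset fromHeader(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback; // header present but no charset parameter
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: fall back rather than fail.
            return fallback;
        }
    }
}
```

With the charset in hand, you could wrap the response stream in an InputStreamReader using that charset and pass the reader to clean(), the same way as with the servlet reader.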
Can I change from any character encoding to UTF-8?
Yes, you can express any Unicode character in UTF-8 encoding.
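In Java the conversion is a decode followed by an encode. Here is a minimal sketch, assuming you already know the source charset of the page; the class name Transcoder and the method toUtf8 are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes bytes from a known source charset to UTF-8.
class Transcoder {
    static byte[] toUtf8(byte[] input, String sourceCharset) {
        // Decode the raw bytes into Java's internal Unicode string...
        String text = new String(input, Charset.forName(sourceCharset));
        // ...then encode that string as UTF-8.
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, the single ISO-8859-1 byte 0xE9 ("é") comes out as the two UTF-8 bytes 0xC3 0xA9. Note that this only works if the source charset you pass in is the one the bytes were really written in; a wrong guess produces garbled text, not an error.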
There might be a problem when changing the encoding of HTML pages: if the page contains a "charset" meta tag, for example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
you have to update this tag so it corresponds to the actual encoding.
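One way to update the tag is a regular-expression replacement on the page text after it has been decoded. This is only a sketch: the class name MetaCharsetFixer and the method declareUtf8 are hypothetical, and a regex will not cope with every way a page can declare its charset, so editing the tag through your parser's DOM may be more robust.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrites the charset declared in a <meta> tag to UTF-8.
class MetaCharsetFixer {
    // Group 1 captures everything in the <meta> tag up to the charset name,
    // including an optional opening quote; the name itself is what gets replaced.
    private static final Pattern META_CHARSET = Pattern.compile(
        "(<meta\\b[^>]*\\bcharset\\s*=\\s*[\"']?)[\\w.:-]+",
        Pattern.CASE_INSENSITIVE);

    static String declareUtf8(String html) {
        return META_CHARSET.matcher(html).replaceAll("$1UTF-8");
    }
}
```

Applied to the tag above, this turns charset=iso-8859-1 into charset=UTF-8 and leaves the rest of the tag untouched. Pages without a charset meta tag are returned unchanged.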