
I am using jsoup to parse a page whose declared charset is gb2312: http://vars.sinaapp.com/u/t/jsoup_output_encoding_issue.html

Source code:

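import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch the page; jsoup picks the charset from the response header or the page's meta tag.
String testURL = "http://vars.sinaapp.com/u/t/jsoup_output_encoding_issue.html";
Document doc = Jsoup.connect(testURL).get();
// Print the combined inner HTML of the matched div elements.
System.out.println(doc.select("div").html());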

This gives the following output:

1:? 2:� 3:� 4:—

I want the output to match the page source:

1:· 2:慒 3:啰 4:—

Is there any way to do this?

2 Answers

Try setting doc.outputSettings().escapeMode(EscapeMode.xhtml), or changing the output charset before printing.

See also the (paltry) documentation for EscapeMode.
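To illustrate only the "output charset before printing" half of this suggestion, here is a minimal sketch using the Java standard library alone. It does not touch jsoup; the class name OutputCharsetDemo and the helper render are hypothetical names chosen for this example. It shows that the charset of the PrintStream decides whether non-ASCII characters survive or get replaced with "?".

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class OutputCharsetDemo {

    /**
     * Hypothetical helper: prints text through a PrintStream that encodes
     * with the given charset and returns the bytes that were written.
     */
    static byte[] render(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap System.out so the console receives UTF-8 bytes instead of
        // bytes in the platform default encoding (assumes a UTF-8 console).
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        // The expected text from the question, written with Unicode escapes
        // so the source file's own encoding does not matter.
        utf8Out.println("1:\u00b7 2:\u6152 3:\u5570 4:\u2014");
    }
}
```

If the console or IDE is not configured for UTF-8, the characters can still look wrong even though the bytes written are correct, so this is worth ruling out before changing anything on the jsoup side.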


2 Comments

Thanks for the help. I tried other charsets, EscapeMode.xhtml, and prettyPrint(false); the output is the same.
@Zenofo: bummer =( Consider updating your question with a few things you've tried, this might improve the quality of other answers.

Try encoding as "MS932" or "SHIFT-JIS". This should solve your problem. You can also read the charset of the HTML page and set it explicitly while parsing the file.
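A rough sketch of the "read the charset and set it while parsing" idea, using only the Java standard library. The class CharsetSniffer and its methods are hypothetical names for this example, and the regular expression is a simplification rather than a full HTML or HTTP header parser. The extracted name could then be handed to a parser that accepts one; jsoup's Jsoup.parse(InputStream, String, String) overload takes a charset name as its second argument.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetSniffer {

    // Matches charset=NAME in a Content-Type header value or a <meta> tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name, or the fallback if none is found. */
    static String charsetFrom(String declaration, String fallback) {
        if (declaration == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(declaration);
        return m.find() ? m.group(1) : fallback;
    }

    /** Decodes the raw page bytes with an explicitly chosen charset. */
    static String decode(byte[] body, String charsetName) {
        // Throws an unchecked exception if the JVM does not know the charset.
        return new String(body, Charset.forName(charsetName));
    }
}
```

Keeping detection and decoding separate makes it easy to override the declared name, for example to try a superset encoding when a page's declaration turns out to be too narrow for the characters it actually contains.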
