jsoup output encoding issue

Question

I am using JSoup to parse a gb2312 charset page: http://vars.sinaapp.com/u/t/jsoup_output_encoding_issue.html

source code:

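import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String testURL = "http://vars.sinaapp.com/u/t/jsoup_output_encoding_issue.html";
// Fetch and parse the page; the charset is left for jsoup to detect
Document doc = Jsoup.connect(testURL).get();
System.out.println(
    doc.select("div").html()
);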

this gives the following output:

1:? 2:&#65533; 3:&#65533; 4:&#8212;

I want the output to match the page source:

1:· 2:慒 3:啰 4:&mdash;

Is there any way to do this?

maerics · Accepted Answer · 2012-01-09 18:37:59Z

2

Try setting doc.outputSettings().escapeMode(EscapeMode.xhtml) or changing the output charset with doc.outputSettings().charset(...) before printing.

See also the (paltry) documentation for EscapeMode.
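
To illustrate why the output charset matters, here is a minimal sketch using only the JDK. It assumes, as I understand jsoup's behavior, that characters the output charset cannot encode are written as numeric character references. The class OutputCharsetSketch and the helper escapeFor are hypothetical names for illustration, not part of jsoup, and this is not jsoup's actual implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class OutputCharsetSketch {
    // Hypothetical helper (not a jsoup API): keep every code point the
    // target charset can encode, and write the rest as decimal numeric
    // character references such as &#8212;
    public static String escapeFor(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                out.append(s);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "4:\u2014"; // U+2014 is the em dash from the page
        // US-ASCII cannot encode the em dash, so it is escaped
        System.out.println(escapeFor(text, StandardCharsets.US_ASCII));
        // UTF-8 can encode it, so it is passed through unchanged
        System.out.println(escapeFor(text, StandardCharsets.UTF_8));
    }
}
```

Note that a sketch like this only explains the &#8212; in the output. The &#65533; values are U+FFFD, the Unicode replacement character, which suggests the bytes were already decoded with the wrong charset at parse time, so no output setting can bring the original characters back.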


2 Comments

Koerr Over a year ago

Thanks for the help. I tried other charsets, EscapeMode.xhtml, and prettyPrint(false); the output is the same.

maerics Over a year ago

@Zenofo: bummer =( Consider updating your question with a few things you've tried; this might improve the quality of other answers.

Dipankar chowdury · Accepted Answer · 2018-04-04 10:37:53Z

0

Try encoding as "MS932" or "SHIFT-JIS". This will solve your problem. You can also read the charset of the HTML page and set it when parsing the file.
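
As a rough sketch of why the charset used at parse time matters, the JDK-only example below decodes the same bytes twice: once with the charset they were written in, and once with a different one. The class DecodeCharsetSketch and the helper decode are hypothetical names for illustration. It assumes the JDK in use ships the GB2312 charset, which standard builds normally do. With jsoup itself, the equivalent step would be passing the charset name to Jsoup.parse(InputStream in, String charsetName, String baseUri).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCharsetSketch {
    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    // Malformed input is replaced with U+FFFD rather than raising an error.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Simulate bytes served by a gb2312 page
        byte[] raw = "\u4e2d".getBytes(Charset.forName("GB2312"));

        // Decoding with the page's own charset recovers the character
        String right = decode(raw, "GB2312");
        System.out.println(right.equals("\u4e2d"));

        // Decoding the same bytes as UTF-8 yields replacement characters,
        // which an HTML serializer may then print as &#65533;
        String wrong = decode(raw, StandardCharsets.UTF_8.name());
        System.out.println(wrong.contains("\uFFFD"));
    }
}
```

Once the text has been decoded into U+FFFD, the original characters are lost, so the charset has to be correct when the bytes are first read.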

