2

I'm crawling webpages from different websites, and they use varied encodings. Apart from the more common ones, here is a sample of the encodings I get:

  • Big5
  • TIS-620
  • utf-16le
  • shift_JIS
  • EUC-JP
  • MacCyrillic
  • koi8-r

I can get the Unicode source of each web page by decoding it with the encodings above.

My question is this: I would like to store all the files as UTF-8. If I encode the Unicode source as UTF-8, will that work for all webpages? Does UTF-8 support all Unicode code points?
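For concreteness, the decode-then-re-encode step described above might look like the following minimal sketch. It assumes Python 3 and that the page's encoding is already known; `to_utf8` and the sample strings are hypothetical names and data used only for illustration, and the encoded samples stand in for bytes a crawler would fetch.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw page bytes with their source encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> str (a sequence of Unicode code points)
    return text.encode("utf-8")         # str -> UTF-8 bytes


# Hypothetical sample text for a few of the encodings listed above.
samples = {
    "big5": "中文",
    "shift_jis": "日本語",
    "euc_jp": "日本語",
    "koi8_r": "Привет",
    "utf-16-le": "héllo",
}

for encoding, text in samples.items():
    raw = text.encode(encoding)          # stand-in for bytes fetched from a page
    utf8_bytes = to_utf8(raw, encoding)
    # The text survives the trip through UTF-8 unchanged.
    assert utf8_bytes.decode("utf-8") == text
```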

1
  • The “UTF” part of the name stands for Unicode Transformation Format: any of the “UTF-...” encodings can indeed store all Unicode characters. Commented Aug 7, 2011 at 19:06

2 Answers

4

Yes. UTF-8 is nothing more than a scheme for storing integers in bytes in such a way that smaller integers take fewer bytes. Values less than 128 are stored in one byte, so ASCII is still ASCII. UTF-8 can represent all Unicode code points (strictly, every code point except the surrogate range U+D800 to U+DFFF, which is reserved for UTF-16 and never appears in properly decoded text).
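As a quick illustration of the variable-length scheme, here is a small sketch assuming Python 3. The helper name `utf8_length` is hypothetical and exists only for this example; it reports how many bytes UTF-8 uses for a single character.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses to store one character."""
    return len(ch.encode("utf-8"))


# Smaller code points take fewer bytes; ASCII stays at one byte each.
for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s) in UTF-8")

# The highest code point, U+10FFFF, still fits (in four bytes) and round-trips.
highest = chr(0x10FFFF)
assert highest.encode("utf-8").decode("utf-8") == highest
```

Here `A` (U+0041) takes one byte, `é` (U+00E9) two, `€` (U+20AC) three, and `😀` (U+1F600) four.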


1

Short and sweet: yes!
