I'm crawling web pages from different websites, and they use varied encodings. A sample of the encodings I encounter:
- Big5
- TIS-620
- utf-16le
- shift_JIS
- EUC-JP
- MacCyrillic
- koi8-r
apart from the more common encodings. I can get the Unicode source of each web page by decoding its bytes with the corresponding encoding.
My question is this: I would like to store all the files as UTF-8. If I encode the Unicode source as UTF-8, will it work for all web pages? Does UTF-8 support all Unicode code points?
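
For concreteness, here is a minimal sketch of the round trip described above, assuming Python 3 and that each page's encoding is already known. The helper name `to_utf8` and the sample strings are hypothetical, made up for illustration; they are not taken from the actual crawler.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw page bytes with their source encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> Unicode string
    return text.encode("utf-8")         # Unicode string -> UTF-8 bytes


# Hypothetical sample text for each encoding in the list above,
# keyed by Python's codec names.
samples = {
    "big5": "中文",
    "tis_620": "สวัสดี",
    "utf_16_le": "hello",
    "shift_jis": "日本語",
    "euc_jp": "日本語",
    "mac_cyrillic": "Привет",
    "koi8_r": "Привет",
}

for encoding, text in samples.items():
    raw = text.encode(encoding)          # simulate the bytes a crawler would receive
    utf8_bytes = to_utf8(raw, encoding)
    # Decoding the stored UTF-8 bytes gives back the original text.
    assert utf8_bytes.decode("utf-8") == text
```

For these samples the round trip is lossless, but a handful of test strings obviously doesn't tell me whether that holds for every page.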