How can I check whether a string is UTF-8 encoded in Ruby or Rails?
3 Answers
Check UTF-8 Validity
For most multi-byte encodings it is possible to detect invalid byte sequences programmatically. Because Ruby treats strings as UTF-8 by default, you can check whether a given string is valid UTF-8:
# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"
str.valid_encoding?
# => false
str.scrub('').valid_encoding?
# => true
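If the data arrives as raw bytes (for example, read in binary mode), the same check can be applied regardless of the label the string currently carries. The following is a minimal sketch; the helper name valid_utf8_bytes? is made up for illustration and is not part of Ruby:

```ruby
# Hypothetical helper: true if the raw bytes of +data+ form valid UTF-8,
# regardless of the encoding label the string currently carries.
def valid_utf8_bytes?(data)
  # dup so the caller's string keeps its original encoding label
  data.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

raw = "gr\xC3\xBC\xC3\x9Fe".b   # UTF-8 bytes for "grüße", labelled ASCII-8BIT
valid_utf8_bytes?(raw)          # => true
valid_utf8_bytes?("\xE4bc".b)   # => false (0xE4 must be followed by continuation bytes)
```

Note that force_encoding only changes the label; it never converts or modifies the bytes.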
Convert Encoding
Additionally, if a string is not valid UTF-8 but you know its actual character encoding, you can convert the string to UTF-8.
Example
Sometimes you know that the encoding of an input file is either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which of the two it is and convert to UTF-8 if necessary:
# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
# => "String in CP1252 encoding: äöüß"
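The same "UTF-8 or else CP1252" decision can be wrapped in a small function. This is only a sketch under the assumption stated above (the input is known to be one of the two encodings); the name to_utf8 and the fallback: keyword are invented for this example:

```ruby
# Hypothetical helper: decode raw bytes as UTF-8 if they are valid UTF-8,
# otherwise assume the fallback encoding (an assumption, not a detection).
def to_utf8(bytes, fallback: 'Windows-1252')
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  return candidate if candidate.valid_encoding?

  # Interpret the bytes as the fallback encoding and transcode to UTF-8.
  bytes.encode(Encoding::UTF_8, fallback, invalid: :replace, undef: :replace, replace: '?')
end

to_utf8("\xE4\xF6\xFC\xDF".b)   # => "äöüß" (interpreted as CP1252)
to_utf8("\xC3\xA4".b)           # => "ä"    (already valid UTF-8)
```

For a file, the raw bytes could be obtained with File.binread('input_file') and passed to the helper.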
Notes
It is programmatically possible to detect most multi-byte encodings like UTF-8 (in Ruby, see: #valid_encoding?) with pretty high reliability. After only 16 bytes, the probability of a random byte-sequence being valid UTF-8 is only 0.01%. (Compare this with relying on the UTF-8 BOM)
However, it is NOT easily possible to programmatically detect the (in)validity of single-byte encodings like CP1252 or ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. for detecting whether a String is valid CP1252.
Even though UTF-8 has become increasingly popular as the default encoding on the web, CP1252 and other Latin1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to, but vary slightly from, CP1252 (a.k.a. Windows-1252). Examples: ISO-8859-1, ISO-8859-15.
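The 0.01% figure quoted in the notes can be checked empirically. The sketch below draws random 16-byte strings and counts how many happen to be valid UTF-8; the seed and the sample count are arbitrary choices made for this example, so the measured rate is only an estimate:

```ruby
# Estimate how often 16 random bytes happen to be valid UTF-8.
# The seed and sample count are arbitrary choices for this sketch.
rng     = Random.new(42)
samples = 50_000

hits = samples.times.count do
  rng.bytes(16).force_encoding(Encoding::UTF_8).valid_encoding?
end

rate = hits.fdiv(samples)   # fraction of random samples that are valid UTF-8
```

The rate should come out as a tiny fraction of the samples, which is why a successful valid_encoding? check on real-world text is strong evidence for UTF-8.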
3 Comments
"String in CP1252 encoding: \xE4\xF6\xFC\xDF" I believe the question was how do you check it's in CP1252 encoding.
[...] UTF-8 encoding, or not. You do that by calling str.valid_encoding? on a String str that is in UTF-8-encoding. Does that not get clear from my answer?
[...] CP1252. However, you can pretty reliably (depending on the length of the string) check the invalidity of a string in a multi-byte-encoding such as UTF-8.
There's no definite way to do this, in Ruby nor anywhere else:
str = 'foo' # start with a simple string
# => "foo"
str.encoding
# => #<Encoding:UTF-8> # which is UTF-8 encoded
str.bytes.to_a
# => [102, 111, 111] # as you can see, it consists of three bytes 102, 111 and 111
str.encode!('us-ascii') # now we will recode the string to 7-bit US-ASCII encoding
# => "foo"
str.encoding
# => #<Encoding:US-ASCII>
str.bytes.to_a
# => [102, 111, 111] # see, same three bytes
str.encode!('windows-1251') # let us try a Cyrillic encoding
# => "foo"
str.encoding
# => #<Encoding:Windows-1251>
str.bytes.to_a
# => [102, 111, 111] # see, the same three again!
Of course, you can apply statistical analysis to the text and eliminate the encodings for which the text is not valid, but in theory this is not a solvable problem.
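The elimination step can be sketched as follows. The candidate list and the helper name plausible_encodings are assumptions made for this illustration. Note how several single-byte encodings survive the check and then decode the same bytes into different text, which is exactly why the problem has no definite answer:

```ruby
# Hypothetical helper: names of the candidate encodings in which +bytes+
# contains no invalid byte sequence. The default candidate list is an
# arbitrary choice for this sketch.
def plausible_encodings(bytes, candidates: %w[UTF-8 Windows-1252 ISO-8859-1 Windows-1251])
  candidates.select { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

bytes = "\xE4\xF6\xFC\xDF".b
plausible_encodings(bytes)
# => ["Windows-1252", "ISO-8859-1", "Windows-1251"]

# The surviving candidates disagree about what the bytes mean:
bytes.encode('UTF-8', 'Windows-1252')   # => "äöüß"
bytes.encode('UTF-8', 'Windows-1251')   # => "дцьЯ"
```

Choosing between the survivors needs knowledge from outside the bytes, such as the expected language or a declared charset.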
4 Comments
"Partly valid\xE4 UTF-8 encoding: äöüß".valid_encoding?
valid_encoding? checks whether a string contains invalid byte sequences. It does not say if the (otherwise valid) byte sequence originates from a certain encoding, and I believe that was the question.
[...] UTF-8: After only 16 bytes, the probability of a random byte-sequence being valid UTF-8 is only 0.01%. So, the algorithm of str.valid_encoding? is pretty reliable in checking whether a given string is UTF-8 encoded.
"your string".encoding
# => #<Encoding:UTF-8>
Or, if you want to check it programmatically:
"your string".encoding.name == "UTF-8"
# => true
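Keep in mind that encoding only reports the label attached to the string, not whether the bytes actually conform to it. A possible way to combine both checks is sketched below; the helper name utf8? is invented for this example:

```ruby
# Hypothetical helper: true only if the string is labelled UTF-8
# and its bytes are also valid under that label.
def utf8?(str)
  str.encoding == Encoding::UTF_8 && str.valid_encoding?
end

# A string that is labelled UTF-8 but contains an invalid byte sequence.
broken = "abc\xE4".dup.force_encoding(Encoding::UTF_8)
broken.encoding.name   # => "UTF-8"
utf8?(broken)          # => false
utf8?('abc')           # => true (in a UTF-8 source file)
```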