How to check whether the character is utf-8

Question

How can I check whether a string is UTF-8 encoded, using Ruby or Ruby on Rails?

Do you mean if you already have the String in memory with the correct encoding, or do you mean before you even begin to read the String into memory (say, from a file on disk)?
– d11wtq, Commented Dec 26, 2011 at 12:41

Andreas Rayo Kniep · Accepted Answer · 2016-02-21 02:16:52Z

20

Check UTF-8 Validity

For most multi-byte encodings it is possible to programmatically detect invalid byte sequences. Because Ruby treats strings as UTF-8 by default, you can check whether a string is valid UTF-8:

# encoding: UTF-8
# -------------------------------------------
str = "Partly valid\xE4 UTF-8 encoding: äöüß"

str.valid_encoding?
   # => false

str.scrub('').valid_encoding?
   # => true
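The same check can be applied to data that has not been tagged as UTF-8 yet, for example raw bytes read from a file on disk. The following is a minimal sketch, not part of the original answer: the helper names utf8_file? and check_bytes and the temporary files are assumptions made for illustration.

```ruby
require 'tempfile'

# Hypothetical helper (not from the answer): true if the raw bytes of the
# file at +path+ form a valid UTF-8 byte sequence.
def utf8_file?(path)
  # binread returns the bytes tagged as ASCII-8BIT; force_encoding only
  # changes the tag, and valid_encoding? then inspects the actual bytes.
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: writes +bytes+ to a temporary file and returns the
# result of utf8_file? for that file.
def check_bytes(bytes)
  Tempfile.create('utf8_check') do |f|
    f.binmode
    f.write(bytes)
    f.flush
    utf8_file?(f.path)
  end
end

puts check_bytes("caf\xC3\xA9".b)  # "café" encoded as UTF-8   => true
puts check_bytes("caf\xE9".b)      # "café" encoded as CP1252  => false
```

Reading in binary mode keeps Ruby from attaching any assumed encoding before the check runs.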

Convert Encoding

Additionally, if a string is not valid UTF-8 but you know its actual character encoding, you can convert the string to UTF-8.

Example
Sometimes you end up in a situation in which you know that the encoding of an input file is either UTF-8 or CP1252 (a.k.a. Windows-1252).
Check which encoding it is and convert to UTF-8 (if necessary):

# encoding: UTF-8
# ------------------------------------------------------
test = "String in CP1252 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}

str  = File.read( 'input_file' )

unless str.valid_encoding?
  str.encode!( 'UTF-8', 'CP1252', invalid: :replace, undef: :replace, replace: '?' )
end #unless
   # => "String in CP1252 encoding: äöüß"
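The same idea can be wrapped in a small method that works on a string directly, without the temporary file. This is only a sketch under the same assumption as above (the input is either UTF-8 or CP1252); the name to_utf8 and its fallback parameter are hypothetical.

```ruby
# Hypothetical helper: returns a UTF-8 copy of +raw+, assuming the bytes
# are either UTF-8 already or encoded in +fallback+ (CP1252 by default).
def to_utf8(raw, fallback: 'Windows-1252')
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Interpret the bytes as +fallback+ and transcode them to UTF-8.
  raw.encode(Encoding::UTF_8, fallback,
             invalid: :replace, undef: :replace, replace: '?')
end

puts to_utf8("\xE4\xF6\xFC\xDF".b)                  # CP1252 bytes => äöüß
puts to_utf8("\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F".b)  # UTF-8 bytes  => äöüß
```

Trying UTF-8 first matters: CP1252 bytes rarely form valid UTF-8, whereas UTF-8 bytes are nearly always accepted as CP1252.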

=======
Notes

Most multi-byte encodings such as UTF-8 can be detected programmatically (in Ruby, see String#valid_encoding?) with fairly high reliability. After only 16 bytes, the probability of a random byte sequence being valid UTF-8 is only 0.01%. (Compare this with relying on the UTF-8 BOM.)
However, it is NOT easily possible to programmatically detect the (in)validity of single-byte encodings like CP1252 or ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. it cannot detect whether a String is valid CP1252.
Even though UTF-8 has become increasingly popular as the default encoding on the web, CP1252 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to CP1252 (a.k.a. Windows-1252) but differ slightly. Examples: ISO-8859-1, ISO-8859-15.
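A quick way to see why single-byte encodings cannot be validated this way is to tag every possible byte value and count how many pass #valid_encoding?. This sketch is an illustration added here, not part of the original answer; the variable names are arbitrary.

```ruby
# Each byte value 0..255 as a one-byte string, tagged as ISO-8859-1.
latin1_valid = (0..255).count do |b|
  b.chr.force_encoding('ISO-8859-1').valid_encoding?
end

# The same single bytes tagged as UTF-8: only ASCII bytes stand on their own.
utf8_valid = (0..255).count do |b|
  b.chr.force_encoding('UTF-8').valid_encoding?
end

puts latin1_valid  # => 256
puts utf8_valid    # => 128
```

Since every byte is acceptable as ISO-8859-1, a validity check can never rule that encoding out.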

edited Feb 21, 2016 at 2:16

answered Feb 18, 2016 at 21:08

Andreas Rayo Kniep


3 Comments

Mladen Jablanović Over a year ago

"String in CP1252 encoding: \xE4\xF6\xFC\xDF" I believe the question was how do you check it's in CP1252 encoding.

Andreas Rayo Kniep Over a year ago

I thought the question was whether a given string is in (valid) UTF-8 encoding or not. You check that by calling str.valid_encoding? on a String str that is in UTF-8 encoding. Is that not clear from my answer?

Andreas Rayo Kniep Over a year ago

Programmatically, you can not (or at least not easily and of course not reliably) check the invalidity of a string in a one-byte-encoding such as CP1252. However, you can pretty reliably (depending on the length of the string) check the invalidity of a string in a multi-byte-encoding such as UTF-8.

Mladen Jablanović · Accepted Answer · 2011-12-26 13:41:31Z

13

There's no definite way to do this, in Ruby or anywhere else:

str = 'foo' # start with a simple string
# => "foo" 
str.encoding
# => #<Encoding:UTF-8> # which is UTF-8 encoded
str.bytes.to_a
# => [102, 111, 111] # as you can see, it consists of three bytes 102, 111 and 111
str.encode!('us-ascii') # now we will recode the string to 7-bit US-ASCII encoding
# => "foo" 
str.encoding
# => #<Encoding:US-ASCII> 
str.bytes.to_a
# => [102, 111, 111] # see, same three bytes
str.encode!('windows-1251') # let us try some cyrillic
# => "foo" 
str.encoding
# => #<Encoding:Windows-1251> 
str.bytes.to_a
# => [102, 111, 111] # see, the same three again!

Of course, you can employ some statistical analysis on the text and eliminate encodings for which the text is not valid, but theoretically, this is not a solvable problem.
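The elimination step could look roughly like the sketch below, which keeps only the candidate encodings under which the bytes contain no invalid sequence. The method name plausible_encodings and the candidate list are assumptions chosen for illustration, and the result narrows the options rather than giving a definite answer.

```ruby
# Hypothetical helper: names of the candidate encodings under which
# +bytes+ contains no invalid byte sequence.
def plausible_encodings(bytes, candidates = %w[US-ASCII UTF-8 Windows-1251])
  candidates.select { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

p plausible_encodings('foo'.b)       # => ["US-ASCII", "UTF-8", "Windows-1251"]
p plausible_encodings("\xC3\xA4".b)  # => ["UTF-8", "Windows-1251"]
p plausible_encodings("\xE4".b)      # => ["Windows-1251"]
```

The second call shows the ambiguity: the two bytes are a valid UTF-8 "ä" and an equally valid pair of Cyrillic letters in Windows-1251.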

answered Dec 26, 2011 at 13:41

Mladen Jablanović

4 Comments

the Tin Man Over a year ago

"There's no definite way to do this, in Ruby nor anywhere else", ah, said like an embittered soldier of the Unicode wars. Been there, done that, I feel your pain. :-) I also fell back on statistical analysis of the text, which worked, kinda, most of the time. It's amazing how badly broken HTML, RSS and XML can be when someone is determined to make things work without regard for specs.

Andreas Rayo Kniep Over a year ago

What about String#valid_encoding?? Example: "Partly valid\xE4 UTF-8 encoding: äöüß".valid_encoding?

Mladen Jablanović Over a year ago

valid_encoding? checks whether a string contains invalid byte sequences. It does not say if the (otherwise valid) byte sequence originates from certain encoding, and I believe that was the question.

Andreas Rayo Kniep Over a year ago

Ok, I see. I understand the question differently, though. I understand: "How can I check if a given string is in valid UTF-8 encoding?" You can very reliably determine if a given byte sequence is valid UTF-8: After only 16 bytes, the probability of a random byte sequence being valid UTF-8 is only 0.01%. So str.valid_encoding? is pretty reliable in checking whether a given string is UTF-8 encoded.

sawa · Accepted Answer · 2011-12-26 12:11:49Z

1

"your string".encoding
 # => #<Encoding:UTF-8>

Or, if you want it programmatically,

"your string".encoding.name == "UTF-8"
 # => true
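Note that #encoding reports the tag attached to the String object, which is independent of the bytes it holds. The following sketch (an added illustration with arbitrary variable names) builds a string from CP1252 bytes, tags it as UTF-8, and shows that the tag and the content can disagree.

```ruby
# CP1252 bytes for "äöü", deliberately tagged as UTF-8.
str = [0xE4, 0xF6, 0xFC].pack('C*').force_encoding('UTF-8')

puts str.encoding.name == 'UTF-8'  # => true  (the tag says UTF-8)
puts str.valid_encoding?           # => false (the bytes are not valid UTF-8)

# Changing the tag leaves the bytes untouched.
relabeled = str.dup.force_encoding('Windows-1252')
puts relabeled.encoding.name       # => Windows-1252
puts relabeled.bytes == str.bytes  # => true
```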

answered Dec 26, 2011 at 12:11

sawa

1 Comment

Mladen Jablanović Over a year ago

This merely checks an encoding set on a string object, not actual encoding of its content. There is no guarantee that the actual content is encoded using the same encoding.
