
I am wondering if there's a way to detect non-ASCII characters in Rails.

I have read that Rails does not use Unicode by default, and that characters like Chinese and Japanese have assigned ranges in Unicode. Is there an easy way to detect these characters in Rails, or to specify the range of characters I am expecting?

Is there a plugin for this? Thanks in advance!

  • What exactly counts as a "foreign character"? Is é foreign? How about ñ, µ, or ü? Are you trying to limit people to just (7-bit) ASCII? Rails is quite happy with Unicode (UTF-8, preferably). Commented Aug 26, 2011 at 5:47
  • Yes, I'm trying to block all those characters. How do I use UTF-8? Sorry, noob here. Commented Aug 26, 2011 at 5:51
  • @mr_lu_kim: Which of those do you want to do? Commented Aug 26, 2011 at 6:32
  • Everything should be UTF-8 by default in Rails. Which version of Ruby? Commented Aug 26, 2011 at 6:36
  • 1.9.2. Is it easier to specify which ones I would allow, or to block all those foreign chars like Chinese, Japanese, French, etc.? Commented Aug 26, 2011 at 7:08

2 Answers


All ideographic language encodings use multiple bytes to represent a character, and Ruby 1.9+ is aware of the difference between bytes and characters (Ruby 1.8 isn't).

You can compare the character count to the byte count of the string as a quick-and-dirty detector. It is probably not foolproof, though.

class String
  # True when the string contains at least one multi-byte (non-ASCII) character.
  def multibyte?
    chars.count < bytes.count
  end
end

"可口可樂".multibyte? #=> true
"qwerty".multibyte? #=> false
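Since Ruby 1.9 there is also a built-in predicate, String#ascii_only?, that asks the same question directly:

```ruby
# String#ascii_only? is true only when every character is 7-bit ASCII.
"qwerty".ascii_only?    #=> true
"可口可樂".ascii_only?  #=> false
"Wheré".ascii_only?     #=> false
```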

3 Comments

Thanks! but I used regex to match them, like matching {Han} and others
Regex is more foolproof and precise than this, but this is probably much faster than regex.
Easy, smart solution. To clarify: this distinguishes the 128 US-ASCII characters in Unicode, which need one byte, from everything else, including all foreign alphabets but also things like copyright symbols. (Info here: en.wikipedia.org/wiki/UTF-8 and en.wikipedia.org/wiki/List_of_Unicode_characters)
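The regex approach mentioned in the first comment can be spelled with Unicode script properties, which Ruby 1.9+ regexes support:

```ruby
# \p{Han} matches CJK ideographs; other scripts work the same way
# (\p{Hiragana}, \p{Katakana}, \p{Hangul}, \p{Latin}, ...).
"可口可樂" =~ /\p{Han}/          #=> 0 (first match at index 0)
"pancakes" =~ /\p{Han}/          #=> nil
"こんにちは" =~ /\p{Hiragana}/   #=> 0
```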

This is pretty easy with Ruby 1.9.2: regular expressions are character-based, and 1.9.2 knows the difference between bytes and characters from top to bottom. You're in Rails, so everything should arrive as UTF-8. Happily, UTF-8 and ASCII overlap for the entire ASCII range, so with UTF-8 encoded text you can simply remove everything that isn't between ' ' and '~':

>> "Wheré is µ~pancakes ho元use?".gsub(/[^ -~]/, '')
=> "Wher is ~pancakes house?"

There's really no reason to go to all this trouble, though. Ruby 1.9 works great with Unicode, as do Rails and pretty much everything else. Dealing with non-ASCII text was a nightmare 15 years ago; now it is common and fairly straightforward.


If you do manage to get text data that isn't UTF-8, you have some options. If the encoding is ASCII-8BIT or BINARY, you can probably get away with s.force_encoding('utf-8'). If you end up with something other than UTF-8 or ASCII-8BIT, you can use Iconv to re-encode it.
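A sketch of that logic, assuming the helper name to_utf8 (hypothetical, not from the answer; on Ruby 1.9+ String#encode is the usual replacement for Iconv):

```ruby
# Hypothetical helper: normalize an incoming string to UTF-8.
# force_encoding only relabels the bytes; encode actually transcodes them.
def to_utf8(s)
  if s.encoding == Encoding::ASCII_8BIT        # BINARY is an alias of ASCII-8BIT
    forced = s.dup.force_encoding('UTF-8')     # relabel, then verify the bytes
    raise ArgumentError, 'bytes are not valid UTF-8' unless forced.valid_encoding?
    forced
  else
    s.encode('UTF-8')                          # transcode from a known encoding
  end
end

to_utf8("caf\xC3\xA9".force_encoding('ASCII-8BIT'))  #=> "café"
to_utf8("h\xE9llo".force_encoding('ISO-8859-1'))     #=> "héllo"
```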


2 Comments

Thanks! Also, do you know how to test the filtering of foreign characters in RSpec?
@mr_lu_kim: The same way you'd test any other string manipulation in RSpec. You'd just do various utf8_string.mangle.should == utf8less_string and such.
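For instance, wrapping the answer's gsub in a helper (the name strip_non_ascii is illustrative, not from the answer) gives you something to assert against; plain assertions are shown here, and in RSpec the same check would read roughly expect(strip_non_ascii("Wheré")).to eq("Wher"):

```ruby
# Hypothetical helper wrapping the answer's gsub; the name is illustrative.
def strip_non_ascii(s)
  s.gsub(/[^ -~]/, '')
end

strip_non_ascii("Wheré is µ~pancakes ho元use?")  #=> "Wher is ~pancakes house?"
strip_non_ascii("ascii only")                    #=> "ascii only"
```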
