I'm scraping a few websites and eventually I hit a UTF-8 error that looks like this:

/usr/local/lib/ruby/gems/1.9.1/gems/dm-core-1.2.0/lib/dm-core/support/ext/blank.rb:19:in `=~': invalid byte sequence in UTF-8 (ArgumentError)

Now, I don't care about the websites being 100% accurate. Is there a way I can take the page I get and strip out any problem encodings and then pass it around inside my program?

I'm using ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0] if that matters.

Update:

def self.blank?(value)
  return value.blank? if value.respond_to?(:blank?)
  case value
  when ::NilClass, ::FalseClass
    true
  when ::TrueClass, ::Numeric
    false
  when ::Array, ::Hash
    value.empty?
  when ::String
    value !~ /\S/ ### This is the line 19 that has the issue.
  else
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

When I try to save the following line:

What Happens in The Garage Tin Sign2. � � Newsletter Our monthly newsletter,

It throws the error. The line comes from this page: http://www.stationbay.com/. What is odd is that when I view the page source in my web browser, the funny symbols don't show up.

What do I do next?

  • Can you post the line that does the encoding? Commented Dec 3, 2011 at 15:59
  • Is that what you're asking for? Commented Dec 3, 2011 at 16:11
  • What exactly are you passing as value? That could be the root of the problem. Commented Dec 3, 2011 at 16:13
  • Your example line works fine (with the # encoding: UTF-8 magic comment). Maybe Stack Overflow filters out the invalid chars? Commented Dec 3, 2011 at 19:30

1 Answer

The problem is that your string contains bytes that are not valid UTF-8, but it is tagged as UTF-8 anyway (the encoding seems to have been forced). The following short code demonstrates the issue:

a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding?  # returns false
a =~ /x/           # provokes ArgumentError: invalid byte sequence in UTF-8

The best way to fix this is to apply the proper encoding right from the beginning. If this is not an option, you can use String#encode:

a = "\xff"
a.force_encoding "utf-8"
a.valid_encoding?  # returns false

a.encode!("utf-8", "utf-8", :invalid => :replace)
a.valid_encoding?  # returns true now
a =~ /x/           # works now
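
As a rough sketch of the first option (applying the proper encoding from the beginning): if you know, or can find out, which encoding the page really uses, label the raw bytes with it and then transcode to UTF-8. The helper name `to_utf8` and the ISO-8859-1 default are assumptions made for illustration only; the real charset would have to come from the HTTP headers or the page's meta tag.

```ruby
# Hypothetical helper: `declared` is whatever charset the server or page reports.
# ISO-8859-1 is only an assumed default for this sketch.
def to_utf8(raw, declared = "ISO-8859-1")
  # dup so the caller's string is left alone, then label the bytes
  # with the encoding they were actually written in
  labeled = raw.dup.force_encoding(declared)
  # transcode to UTF-8, replacing anything that cannot be converted
  labeled.encode("UTF-8", :invalid => :replace, :undef => :replace)
end

page = "caf\xE9".force_encoding("ASCII-8BIT")  # raw bytes as received (Latin-1)
text = to_utf8(page)
text.valid_encoding?  # => true
text =~ /\S/          # no ArgumentError
```

Unlike replacing invalid bytes, this keeps the original characters, provided the declared encoding is actually the right one.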

6 Comments

This resolved an issue I was dealing with. Thanks!
@Sean: I can only repeat that this is an ugly hack as it causes loss of information. If you have the option to handle encoding properly from the beginning, that would be the way to go. If not, then you are welcome :)
Understood. I can lose data in this application, so it's not a problem for me.
@Niklas is there a reason this wouldn't work inside a Rails view?
On Ruby 1.9 and 2.0: after a.encode!("utf-8", "utf-8", :invalid => :replace), a.valid_encoding? STILL RETURNS FALSE.
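
Regarding the last comment: as far as I know, Ruby versions before 2.1 skip the conversion when source and target encoding are the same, so the invalid bytes survive. A possible workaround, sketched here under that assumption, is to use String#scrub where it exists (Ruby 2.1+) and otherwise to take a round trip through UTF-16. The helper name `sanitize_utf8` and the "?" replacement are hypothetical choices for this example.

```ruby
# Hypothetical helper: returns a copy of str with invalid bytes replaced.
def sanitize_utf8(str, replacement = "?")
  if str.respond_to?(:scrub)
    # Ruby 2.1+: String#scrub replaces invalid byte sequences directly
    str.scrub(replacement)
  else
    # Older Rubies: a UTF-8 -> UTF-8 encode may do nothing,
    # so transcode to UTF-16BE and back to force a real conversion
    str.encode("UTF-16BE", :invalid => :replace, :undef => :replace,
               :replace => replacement).encode("UTF-8")
  end
end

a = "abc\xff".force_encoding("UTF-8")
b = sanitize_utf8(a)
b.valid_encoding?  # => true
b =~ /x/           # no ArgumentError
```

As with the original answer, this discards information, so it only makes sense when some data loss is acceptable.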