I have UTF-8 text data from Twitter (so it may be very dirty). When I insert it into MySQL (the database character set is utf8), some of the text gets garbled. I would like a way to clean the data before inserting it.
Insert ignore search_tweets set id_str = 'pass1',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。' ;
Insert ignore search_tweets set id_str = 'fail',text = 'RT @youpon_info: Youponです!この度はキャンペーン参加ありがとうございました。たくさんの方々にキャンペーンに参加して頂きました。また次のキャンペーンをすぐに予定しております!もう少' ;
Insert ignore search_tweets set id_str = 'pass2',text = 'また次のキャンペーンをすぐに予定しております!もう少' ;
fail.text = pass1.text + pass2.text. Both pass1 and pass2 go into and come out of MySQL fine, but fail comes out as:
RT @youpon_info: Youponã§ãï¼ãã®åº¦ã¯ãã£ã³ãã¼ã³åå ãããã¨ããããã¾ãããããããã®æ¹ã
I reproduced this with direct MySQL calls, although originally it was all done from Ruby, using DataMapper and direct calls.
I would like to know how to clean the data so that it goes into and comes out of MySQL unchanged. A Ruby solution would be nice if possible, but just knowing how to clean it would be great.
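
To make the kind of cleaning I mean concrete, here is a minimal Ruby sketch. It assumes Ruby 2.1 or later (for String#scrub), and clean_utf8 is a hypothetical helper name, not something from DataMapper or the MySQL driver. It only replaces byte sequences that are not valid UTF-8; it does not address a character set mismatch between the client connection and the database, which would be a separate problem.

```ruby
# Hypothetical helper: return a copy of the input that is valid UTF-8.
# Assumes Ruby 2.1+ for String#scrub.
def clean_utf8(raw, replacement = '?')
  # Work on a copy so the caller's string is not modified,
  # and interpret its bytes as UTF-8.
  text = raw.dup.force_encoding(Encoding::UTF_8)

  # Already valid: return it untouched.
  return text if text.valid_encoding?

  # Replace each malformed byte sequence (for example, a multibyte
  # character cut off by truncation) with the replacement string.
  text.scrub(replacement)
end

tweet = "Youponです!この度は\xE3\x80"  # ends in a truncated 3-byte character
cleaned = clean_utf8(tweet)
puts cleaned.valid_encoding?
```

The cleaned string could then be passed to the insert in place of the raw tweet text. Text that is already valid UTF-8 passes through unchanged, so this should be safe to apply to every tweet.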