I have a list of dictionaries data_dump which contains dictionaries like:

d = {"ids": s_id, "subject": subject}

I'm following the tutorial and trying to do a bulk insert:

connection = Connection(host,port)
db = connection['clusters']
posts = db.posts
posts.insert(data_dump)

Which fails with the following error:

 File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 312, in insert
continue_on_error, self.__uuid_subtype), safe)
bson.errors.InvalidStringData: strings in documents must be valid UTF-8

Please advise. Thanks
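One way to locate the offending value before calling insert is to scan the documents for byte strings that fail to decode as UTF-8. This is a hypothetical helper, not part of pymongo; the sample documents are stand-ins:

```python
def find_invalid_utf8(docs):
    """Return (index, key) pairs for values that are byte strings
    which do not decode as UTF-8."""
    bad = []
    for i, doc in enumerate(docs):
        for key, value in doc.items():
            if isinstance(value, bytes):
                try:
                    value.decode("utf-8")
                except UnicodeDecodeError:
                    bad.append((i, key))
    return bad

data_dump = [
    {"ids": 1, "subject": u"Math"},
    {"ids": 2, "subject": b"\xff\xfeHistory"},  # not valid UTF-8
]
print(find_invalid_utf8(data_dump))  # → [(1, 'subject')]
```

Running this before posts.insert(data_dump) tells you exactly which document and field is raising InvalidStringData.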

  • The exception is pretty clear. Some string in data_dump isn't valid utf8. Where did data_dump come from? Commented Jan 26, 2012 at 0:37
  • Try using the codecs.open function to read the file. So open("file.txt", "r") would become import codecs; codecs.open("file.txt", "r", "utf-8") Commented Jan 26, 2012 at 0:43
  • @MikeSteder how do I ensure that? Sorry about that. I am simply reading them with f = open(filename, "r"). Is there a way to force the encoding? Commented Jan 26, 2012 at 0:44
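The codecs.open suggestion from the comments can be sketched as follows; the file path here is a temporary stand-in for the real data file:

```python
import codecs
import os
import tempfile

# Create a sample UTF-8 file (stand-in for the file being read in the question).
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with codecs.open(path, "w", "utf-8") as f:
    f.write(u"Math\u00e9matiques")

# codecs.open with an encoding yields unicode strings already decoded
# from UTF-8, which is what pymongo expects for document string values.
with codecs.open(path, "r", "utf-8") as f:
    subject = f.read()

print(repr(subject))
```

Plain open(filename, "r") in Python 2 returns raw byte strings, which is why documents built from them can fail the UTF-8 check at insert time.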

2 Answers


Solved: I forced the encoding by 1) stripping the string of symbols etc., and then 2) converting ASCII to UTF-8 via raw.decode('ascii') followed by decoded_string.encode('utf8'). Thanks guys.. :)
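In Python 2 terms, the round-trip described above looks roughly like this; raw is a stand-in for a line read from the file after the symbols were stripped:

```python
# Byte string left after stripping non-ASCII symbols (stand-in value).
raw = b"Math"

# decode('ascii') produces a unicode string; encode('utf8') turns it
# back into UTF-8 bytes, which pymongo accepts without complaint.
decoded_string = raw.decode("ascii")
utf8_encoded = decoded_string.encode("utf8")

print(repr(utf8_encoded))
```

Note this only works because step 1 removed everything outside ASCII; raw.decode('ascii') raises UnicodeDecodeError on any byte above 0x7F.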

0

I couldn't afford to lose the non-UTF-8 characters, so I chose to convert the string to bson.Binary instead.

As per your example,

>>> subject
u'Math'
>>> d = {"ids": s_id, "subject": bson.Binary(str(subject))} # store subject as Binary instead of a string

You can't run full-text searches on a Binary field (text search was a recently added MongoDB feature at the time), but it works well for everything else.
