2

I'm trying to create a news app for a school project that pulls information from the RSS feeds of my local newspapers, in order to combine multiple newspapers into one.

I'm running into problems when I try to insert the collected data into my MySQL database.

When I simply print my data (example: print urlnzz.entries[0].description), there is no problem with German characters such as ü ä ö é à.

When I try to insert the data into the MySQL database, however, I get "UnicodeEncodeError: 'ascii' codec can't encode character..". Strangely, this only happens for .title and .description, not for .category (even though it also contains characters like ü).

I've been looking for an answer for quite some time now. I changed the encoding of the variables with

t = urlbernerz.entries[i].title


print t.encode('utf-8')

I also changed the charset to utf-8 when connecting to the database and even tried Python's try/except, yet nothing seems to work.

I've checked the type of each entry with type(u['entries'].title) and they are all unicode; now I need to encode them in a way that lets me put them into my MySQL database.

The RSS websites state that the feeds are already encoded as utf-8, and even though I explicitly tell Python to encode them as utf-8 as well, it still gives me the error: 'ascii' codec can't encode character u'\xf6'

I've tried many answers on this subject already, such as using str() or chardet, but nothing seems to work. Here's my code:

import MySQLdb
import feedparser
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

db = MySQLdb.connect(host="127.0.0.1", 
                     user="root",
                      passwd="",
                      db="FeedStuff",
                     charset='UTF8')
db.charset="utf8"
cur = db.cursor()




urllistnzz =['international', 'wirtschaft', 'sport']
urllistbernerz =['kultur', 'wissen', 'leben']


for u in range (len(urllistbernerz)):
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/'+urllistbernerz[u]+'/rss.html')
    k = len(urlbernerz['entries'])
    for i in range (k):
        cur.execute("INSERT INTO articles (title, description, date, category, link, source) VALUES (' "+ str(urlbernerz.entries[i].title)+"  ', ' " + str(urlbernerz.entries[i].description)+ " ', ' " + urlbernerz.entries[i].published + " ', ' " + urlbernerz.entries[i].category + " ', ' " + urlbernerz.entries[i].link + " ',' Berner Zeitung')")

for a in range (len(urllistnzz)):
    urlnzz = feedparser.parse('http://www.nzz.ch/'+urllistnzz[a]+'.rss')
    k = len(urlnzz['entries'])
    for i in range (k):
        cur.execute("INSERT INTO articles (title, description, date, category, link, source) VALUES (' "+str(urlnzz.entries[i].title)+" ', ' " + str(urlnzz.entries[i].description)+ " ', ' " + urlnzz.entries[i].published + " ', ' " + urlnzz.entries[i].category + " ', ' " + urlnzz.entries[i].link + " ', 'NZZ')")



db.commit()

cur.close()
db.close()
3
  • Unrelated: don't hardcode the encoding of the outside environment (terminal) inside your script; print Unicode instead: print t Commented Sep 23, 2015 at 22:38
  • Have you tried the use_unicode=True connect() parameter? Again, don't encode; pass Unicode strings and let the db driver encode them using the correct encoding (specified via the charset parameter earlier). Commented Sep 23, 2015 at 22:39
  • Unrelated: don't use string formatting to insert SQL values; use parametrized queries instead. Commented Oct 20, 2015 at 10:22

3 Answers

1

The text from the RSS feeds may contain characters in other encodings. First, you can try different encodings in nested try/except blocks. Second, you can pass 'ignore' to the encode method, like:

try:
    s = raw_s.encode('utf-8', 'ignore')
except UnicodeEncodeError:
    try:
        s = raw_s.encode('latin-1', 'ignore')
    except UnicodeEncodeError:
        print raw_s

Hope this helps.
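One caveat worth seeing in action: 'ignore' does not convert characters, it drops the ones the target encoding cannot represent. The sketch below is a small illustration under stated assumptions (the sample headline is made up, and the code is written to run on Python 3 as well). It also shows that UTF-8 can represent these characters directly, so for a unicode input the first encode should not normally raise at all.

```python
# Hypothetical sample text, standing in for an RSS title.
headline = u"Gr\u00fc\u00dfe aus Z\u00fcrich"  # "Grüße aus Zürich"

# 'ignore' silently discards every character ASCII cannot represent.
ascii_ignored = headline.encode("ascii", "ignore")

# UTF-8 can represent all of these characters, so nothing is lost.
utf8_encoded = headline.encode("utf-8")
round_trip = utf8_encoded.decode("utf-8")

print(ascii_ignored)           # the umlauts and the sharp s are gone
print(round_trip == headline)  # the UTF-8 round trip is lossless
```

In other words, 'ignore' trades the exception for silent data loss, which is rarely what you want for article text.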


0

The major issue is that you're calling str() on Unicode objects. In Python 2, this makes Python try to encode the Unicode object with the default encoding (normally ASCII), which is not possible for non-ASCII characters.

You should try to keep Unicode objects as Unicode objects for as long as possible in your code and only convert when it's absolutely necessary. Fortunately, the MySQL driver is Unicode compliant, so you can pass it Unicode strings and it will encode them internally. The only thing you need to do is tell the driver to use UTF-8. Feedparser is also Unicode compliant and automatically decodes the RSS feed into Unicode strings (strings without an encoding).
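A minimal sketch of that failure, assuming the default encoding is ASCII (the usual Python 2 setting). The helper name to_ascii is made up for illustration; it spells out explicitly what str() does implicitly on a Python 2 unicode object, so the snippet also runs on Python 3.

```python
title = u"Z\u00fcrich"  # "Zürich", standing in for an RSS entry title


def to_ascii(text):
    """Hypothetical helper: the explicit form of Python 2's str(unicode_obj)."""
    return text.encode("ascii")


try:
    to_ascii(title)
    error_name = None
except UnicodeEncodeError as exc:
    # The same exception type as in the question.
    error_name = type(exc).__name__
    failing_char = exc.object[exc.start:exc.end]

# Encoding with an explicit, capable codec works fine.
utf8_bytes = title.encode("utf-8")

print(error_name)
print(repr(utf8_bytes))
```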

There are also some parts of your code that would benefit from Python's built-in features, such as for item in something:, str.format(), and triple quotes (""") for long pieces of text.
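As a small, self-contained illustration of those two features: the Entry namedtuple below is a hypothetical stand-in for a feedparser entry (real entries also allow attribute access such as entry.title), and the example.com link is made up.

```python
from collections import namedtuple

# Hypothetical stand-in for a feedparser entry.
Entry = namedtuple("Entry", ["title", "link"])

sections = ["kultur", "wissen"]

# Iterate over the list directly instead of using range(len(...)),
# and fill the named placeholder {uri} with str.format().
urls = ["http://www.bernerzeitung.ch/{uri}/rss.html".format(uri=section)
        for section in sections]

# Placeholders can also read attributes of the object passed in.
entry = Entry(title=u"Z\u00fcrich", link="http://example.com/a")
summary = u"{e.title} -> {e.link}".format(e=entry)

print(urls)
print(summary)
```

Here uri is just the name of the placeholder inside the braces; .format(uri=section) supplies its value.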

Pulling this all together looks like:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import MySQLdb
import feedparser

db = MySQLdb.connect(host="127.0.0.1",
                     user="root",
                     passwd="",
                     db="FeedStuff",
                     charset='UTF8')

urllistnzz =['international', 'wirtschaft', 'sport']
urllistbernerz =['kultur', 'wissen', 'leben']

cur = db.cursor()

for uri in urllistbernerz:
    urlbernerz = feedparser.parse('http://www.bernerzeitung.ch/{uri}/rss.html'.format(uri=uri))

    for entry in urlbernerz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                        link, source) VALUES ("{e.title}", "{e.description}",
                        "{e.published}", "{e.category}", "{e.link}", "Berner Zeitung")
                        """.format(e=entry)

        cur.execute(insert_sql)

for uri in urllistnzz:
    urlnzz = feedparser.parse('http://www.nzz.ch/{uri}.rss'.format(uri=uri))

    for entry in urlnzz.entries:
        insert_sql = u"""INSERT INTO articles (title, description, date, category,
                        link, source) VALUES ("{e.title}", "{e.description}",
                        "{e.published}", "{e.category}", "{e.link}", "NZZ")
                        """.format(e=entry)

        cur.execute(insert_sql)

db.commit()

cur.close()
db.close()
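One caveat about the code above: building the SQL text with .format() can break (or be exploited) when a title or description itself contains quote characters. A commenter on the question suggests parametrized queries instead. The sketch below shows the idea under stated assumptions: the standard-library sqlite3 module stands in for MySQLdb so that it runs without a server, the table has fewer columns than the original, and the entries are made-up sample data. With MySQLdb the placeholders are %s rather than ?, but the values are likewise passed as a separate tuple to cur.execute().

```python
import sqlite3

# Assumption: in-memory sqlite3 replaces the MySQL connection for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE articles "
            "(title TEXT, description TEXT, link TEXT, source TEXT)")

# Hypothetical entries; real ones would come from feedparser.parse(...).entries.
entries = [
    {"title": u'B\u00e4r "Finn" im Tierpark',
     "description": u"It's a test'); DROP TABLE articles; --",
     "link": "http://example.com/1"},
]

# The SQL text contains only placeholders; the driver handles quoting.
insert_sql = ("INSERT INTO articles (title, description, link, source) "
              "VALUES (?, ?, ?, ?)")
for e in entries:
    cur.execute(insert_sql,
                (e["title"], e["description"], e["link"], "Berner Zeitung"))
conn.commit()

rows = cur.execute(
    "SELECT title, description, link, source FROM articles").fetchall()
print(rows)
conn.close()
```

The quotes and the umlaut in the sample data are stored unchanged, because the values never become part of the SQL string.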

7 Comments

This worked! Thanks a lot. I'll have to figure out exactly what you changed with the "uri" and .format(uri=uri), because I need to document both the code and the theoretical background in my school work, so I'll do some research now :)
Hey, I just had to start using this, and it turns out that the solution you gave me doesn't produce any errors anymore, but it also doesn't show me all the articles I want. It also mixes up things such as the link and messes a lot of things up now that I'm starting to use this in further code... Are you sure this is supposed to work?
Yes, this code is supposed to work. You're going to have to be more specific about what isn't working, and make sure it's not because your 3rd party website has changed.
I added some more detail to the question above; it was too big for a comment. Even when I put the counter directly into the code you suggested, the two numbers don't match, and the mixing of information that I get in my db is really strange. Thanks for your time btw :D
When you have this kind of problem, you ought to try to debug your code. Ask yourself "How could I end up with articles attributed to the wrong source?". I quickly found that I had left a typo in my code: a second iteration of for entry in urlbernerz.entries: instead of for entry in urlnzz.entries:. The code above is now fixed. It'd be wise to understand the for x in iterable syntax.
0

Assuming cur.execute() expects a utf-8 encoded string, you need to encode the values as utf-8 explicitly when you pass them to MySQL. Just calling str() will attempt to encode them as ascii, which fails and produces your error:

   cur.execute("INSERT INTO articles (title, description, date, \
   category, link, source) VALUES ('"+ \
   urlnzz.entries[i].title.encode('utf-8') +" ', ' " + \
   urlnzz.entries[i].description.encode('utf-8') + " ', ' " +  \
   urlnzz.entries[i].published + " ', ' " +  \
   urlnzz.entries[i].category + " ', ' " + urlnzz.entries[i].link + " ', 'NZZ')")

Being a unicode object is distinct from being a str in utf-8 encoding. Calling encode('utf-8') on a unicode object produces a utf-8 encoded str (assuming Python 2).
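A short sketch of that distinction, written so it also runs on Python 3, where the two types are called str (text) and bytes: the same single character becomes a different byte sequence depending on the encoding chosen.

```python
text = u"\u00f6"  # "ö": one character of text, with no encoding attached

utf8_encoded = text.encode("utf-8")      # two bytes in UTF-8
latin1_encoded = text.encode("latin-1")  # one byte in Latin-1

# Decoding with the matching codec gives back the original text.
decoded = utf8_encoded.decode("utf-8")

lengths = (len(text), len(utf8_encoded), len(latin1_encoded))
print(repr(utf8_encoded), repr(latin1_encoded), lengths)
```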

1 Comment

This is wrong. You should pass Unicode strings to .execute(). The driver will encode where necessary: stackoverflow.com/a/6203782/1554386
