0

So for the past few days I've been trying to learn Python in App Engine. However, I've been encountering a number of problems with ASCII and UTF encoding. The freshest issue is as follows:

I have the following piece of code of a simplistic chatroom from the book 'Code in the Cloud'

from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import datetime


# START: MainPage
class ChatMessage(object):
    def __init__(self, user, msg):
        self.user = user
        self.message = msg
        self.time = datetime.datetime.now()

    def __str__(self):
        return "%s (%s): %s" % (self.user, self.time, self.message)

Messages = []

class ChatRoomPage(webapp.RequestHandler):
    def get(self):
        self.response.headers["Content-Type"] = "text/html"
        self.response.out.write("""
       <html>
         <head>
           <title>MarkCC's AppEngine Chat Room</title>
         </head>
         <body>
           <h1>Welcome to MarkCC's AppEngine Chat Room</h1>
           <p>(Current time is %s)</p>
       """ % (datetime.datetime.now()))
        # Output the set of chat messages
        global Messages
        for msg in Messages:
            self.response.out.write("<p>%s</p>" % msg)
        self.response.out.write("""
       <form action="" method="post">
       <div><b>Name:</b>
       <textarea name="name" rows="1" cols="20"></textarea></div>
       <p><b>Message</b></p>
       <div><textarea name="message" rows="5" cols="60"></textarea></div>
       <div><input type="submit" value="Send ChatMessage"></input></div>
       </form>
     </body>
   </html>
   """)
    # END: MainPage
    # START: PostHandler
    def post(self):
        chatter = self.request.get("name")
        msg = self.request.get("message")
        global Messages
        Messages.append(ChatMessage(chatter, msg))
        # Now that we've added the message to the chat, we'll redirect
        # to the root page, which will make the user's browser refresh to
        # show the chat including their new message.
        self.redirect('/')
# END: PostHandler




# START: Frame
chatapp = webapp.WSGIApplication([('/', ChatRoomPage)])


def main():
    run_wsgi_app(chatapp)

if __name__ == "__main__":
    main()
# END: Frame

It works OK in English. However, the moment I add some non-ASCII characters, all sorts of problems start.

First of all, for the page to actually display the characters in HTML, I add a meta tag with charset=UTF-8.

Curiously, if you enter non-ASCII letters into the chat, the program processes and displays them with no issues. However, it fails to load if I put any non-ASCII letters into the page layout itself within the script. I figured out that adding an encoding declaration would help, so I added # -*- coding: utf-8 -*-. This was not enough, because of course I had forgotten to save the file in UTF-8 format. Once I did that, the program started running.

That would be the good end to the story, alas....

It doesn't work

Long story short this code:

# -*- coding: utf-8 -*-
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app
import datetime


# START: MainPage
class ChatMessage(object):
    def __init__(self, user, msg):
        self.user = user
        self.message = msg
        self.time = datetime.datetime.now()

    def __str__(self):
        return "%s (%s): %s" % (self.user, self.time, self.message)

Messages = []
class ChatRoomPage(webapp.RequestHandler):
    def get(self):
        self.response.headers["Content-Type"] = "text/html"
        self.response.out.write("""
       <html>
         <head>
           <title>Witaj w pokoju czatu MarkCC w App Engine</title>
           <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
         </head>
         <body>
           <h1>Witaj w pokoju czatu MarkCC w App Engine</h1>
           <p>(Dokladny czas Twojego logowania to: %s)</p>
       """ % (datetime.datetime.now()))
        # Output the set of chat messages
        global Messages
        for msg in Messages:
            self.response.out.write("<p>%s</p>" % msg)
        self.response.out.write("""
       <form action="" method="post">
       <div><b>Twój Nick:</b>
       <textarea name="name" rows="1" cols="20"></textarea></div>
       <p><b>Twoja Wiadomość</b></p>
       <div><textarea name="message" rows="5" cols="60"></textarea></div>
       <div><input type="submit" value="Send ChatMessage"></input></div>
       </form>
     </body>
   </html>
   """)
    # END: MainPage
    # START: PostHandler
    def post(self):
        chatter = self.request.get(u"name")
        msg = self.request.get(u"message")
        global Messages
        Messages.append(ChatMessage(chatter, msg))
        # Now that we've added the message to the chat, we'll redirect
        # to the root page, which will make the user's browser refresh to
        # show the chat including their new message.
        self.redirect('/')
# END: PostHandler




# START: Frame
chatapp = webapp.WSGIApplication([('/', ChatRoomPage)])


def main():
    run_wsgi_app(chatapp)

if __name__ == "__main__":
    main()
# END: Frame

Fails to process anything I write in the chat application when it's running. It loads but the moment I enter my message (even using only standard characters) I receive

File "D:\Python25\lib\StringIO.py", line 270, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 64: ordinal not in range(128)

error message. In other words, if I want to be able to use any characters within the application, I cannot put non-English ones in my interface. Or, the other way round, I can use non-English characters within the app only if I don't encode the file in UTF-8. How can I make it all work together?

7
  • If you've not already come across it, unicode bootcamp: joelonsoftware.com/articles/Unicode.html . This is essential to understanding what's actually going on. Then look at the warning about unicode in the StringIO docs. Commented Aug 21, 2011 at 17:30
  • @Thomas K. I see what you mean, and I understand the need for and use of different encodings. As you can see in the second code example, I accounted for different charsets by adding lines such as # -*- coding: utf-8 -*- and the HTML charset meta tag. What I don't understand is how Python handles it all. Why does Python require me to constantly encode and decode things back and forth myself? How can I accomplish it in this example? I've been toying with various methods, including unicode(s, "utf-8") and .encode("utf-8"), with little success. Yes, I'm very inexperienced. Commented Aug 21, 2011 at 18:23
  • I don't know exactly what's going on with your application, but on lines 21 and 35, try making your strings start with u""", so they are unicode strings. The problem is that you're trying to write out a mixture of encoded strings and unicode. Commented Aug 21, 2011 at 19:21
  • @Thomas K. Thank you for the linked article. It made me think that I was doing something in the wrong order. The line Messages.append(ChatMessage(chatter, msg)) should look like this: Messages.append(ChatMessage(chatter.encode("utf-8"), msg.encode("utf-8"))). I would post this as an answer, but it seems I cannot for at least 3 hours. Commented Aug 21, 2011 at 19:55
  • That will work, but it's better practice to store them as unicode strings and only encode when you're calling self.response.out.write. Commented Aug 21, 2011 at 21:37
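
The "store unicode, encode only when writing" advice from the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions: render_page is a hypothetical helper, and io.BytesIO stands in for self.response.out so the snippet runs outside App Engine.

```python
# -*- coding: utf-8 -*-
import io


def render_page(messages):
    """Build the whole page as a single unicode string (hypothetical helper)."""
    parts = [u"<h1>Pokój czatu</h1>"]
    for user, text in messages:
        parts.append(u"<p>%s: %s</p>" % (user, text))
    return u"\n".join(parts)


# Stand-in for self.response.out; an assumption, not the App Engine object.
out = io.BytesIO()

# Everything stays unicode until this point...
page = render_page([(u"Mathias", u"Wiadomość")])

# ...and is encoded exactly once, at the output boundary.
out.write(page.encode("utf-8"))
body = out.getvalue()
```

Inside the handler, the equivalent step would presumably be a single self.response.out.write(page.encode("utf-8")) call, with every literal in the page written as a u"..." string.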

2 Answers

2

Your strings contain unicode characters, but they're not unicode strings, they're byte strings. You need to prefix each one with u (as in u"foo") in order to make them into unicode strings. If you ensure all your strings are Unicode strings, you should eliminate that error.
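
To make the byte string versus unicode string distinction concrete, here is a small sketch of what most likely happens inside StringIO.getvalue(). The helper join_like_python2 is hypothetical: it only imitates, in code that also runs on current Python, the implicit ASCII decode that Python 2 applies when ''.join() meets a mix of byte strings and unicode strings.

```python
# -*- coding: utf-8 -*-


def join_like_python2(chunks):
    """Imitate Python 2's ''.join() on a mixed list (hypothetical helper).

    If any chunk is unicode, every byte string is first decoded with the
    ASCII codec, which fails on any byte above 0x7f.
    """
    if any(isinstance(c, type(u"")) for c in chunks):
        decoded = [c.decode("ascii") if isinstance(c, bytes) else c
                   for c in chunks]
        return u"".join(decoded)
    return b"".join(chunks)


# What a plain "..." literal holds in a source file saved as UTF-8.
form_bytes = u"<b>Twój Nick:</b>".encode("utf-8")
# Unicode, like the values the fix in this thread calls .encode() on.
message = u"<p>hello</p>"

try:
    join_like_python2([form_bytes, message])
    outcome = "ok"
except UnicodeDecodeError as err:
    outcome = str(err)  # mentions byte 0xc3, the first UTF-8 byte of "ó"

# With every chunk unicode there is nothing to decode, so the join succeeds.
fixed = join_like_python2([u"<b>Twój Nick:</b>", message])
```

This would also explain why the failure appears even for plain English messages: the non-ASCII byte comes from the form markup, not from what the user typed.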

You should also specify the encoding in the Content-Type header rather than a meta tag, like this:

self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'

Note your life would be a lot easier if you used a templating system instead of writing HTML inline with your Python code.
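
As a rough idea of what separating the HTML from the handler could look like, here is a sketch using string.Template from the standard library. This is an assumption chosen to keep the example self-contained, not the templating system the answer has in mind; PAGE and render are hypothetical names.

```python
# -*- coding: utf-8 -*-
from string import Template

# The page skeleton lives in one unicode template instead of being
# scattered across several write() calls.
PAGE = Template(u"""<html>
  <head><title>$title</title></head>
  <body>
    <h1>$title</h1>
    $messages
  </body>
</html>""")


def render(title, messages):
    """Fill the template with unicode values and return unicode."""
    rows = u"\n    ".join(u"<p>%s</p>" % m for m in messages)
    return PAGE.substitute(title=title, messages=rows)


html = render(u"Witaj w pokoju czatu", [u"Zażółć gęślą jaźń"])
payload = html.encode("utf-8")  # encode once, just before writing it out
```

A real templating engine would additionally escape user input, which this sketch does not do.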

3 Comments

Thanks. I will remember your advice.
@Mathias would find his life would be easier if he either used Python 3 or at the very least did a from __future__ import unicode_literals. He.encode("UTF-8") also.encode("UTF-8") needs.encode("UTF-8") to.encode("UTF-8") set.encode("UTF-8") the.encode("UTF-8") stream.encode("UTF-8") output.encode("UTF-8") encoding.encode("UTF-8") to.encode("UTF-8") avoid.encode("UTF-8") all.encode("UTF-8") this.encode("UTF-8") utterly.encode("UTF-8") stupid.encode("UTF-8") crap..encode("UTF-8")
@tchrist That would be good advice, except that he's using App Engine, which doesn't run Python 3.
1

@Thomas K. Thank you for your guidance here. Thanks to you I was able to come up with a solution, even if, as you said, it is a little roundabout, so the credit for the answer should go to you. The following line of code:

Messages.append(ChatMessage(chatter, msg))

Should look like this:

Messages.append(ChatMessage(chatter.encode( "utf-8" ), msg.encode( "utf-8" )))

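Basically, I have to encode the unicode strings coming from the request into UTF-8 byte strings, so that they match the UTF-8 byte strings already in the page.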
