
As far as I know there is a difference between strings and unicode strings in Python. But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

So when I get a text input, I don't need to use unicode()?

I might sound lazy but I am just interested if this is possible...

p.s. I don't know a lot about character encoding so please correct me if I got anything wrong

  • 3
    Yes, simply use Python 3. It doesn't have non-unicode strings. Commented Jul 1, 2016 at 0:58
  • But what if I prefer using Python 2? Commented Jul 1, 2016 at 1:02
  • 1
    @Cosinux. Have you actually used Python 3? If so, what specific problems did you have with it that made you prefer Python 2? Commented Jul 1, 2016 at 1:33
  • What Stefan said. Unicode requires different handling to simple ASCII strings. If you don't like the way Python 2 does it then you should be using Python 3. For that matter, you should be using Python 3 for all new code. The only reason to use Python 2 these days is if you're forced to work on legacy code, or you need to use some obscure library that hasn't been ported. But you should take a look at Pragmatic Unicode by SO veteran Ned Batchelder. Commented Jul 1, 2016 at 1:35
  • @StefanPochmann That's not correct, both Python 2 and 3 have both byte strings and unicode strings, you have 'abc' and u'abc' in Python 2, and b'abc' and 'abc' in Python 3. Commented Jul 1, 2016 at 1:35

3 Answers

3

But is it possible to instruct Python to use unicode strings instead of regular ones whenever a string object is created?

There are two types of strings in Python (on both Python 2 and 3): a bytestring (a sequence of bytes) and a Unicode string (a sequence of Unicode code points).

bytestring = b'abc'
unicode_text = u'abc'

The type of string created by the 'abc' string literal depends on the Python version and on whether from __future__ import unicode_literals is in effect. On Python 2 without the import, the 'abc' literal creates a bytestring; otherwise (Python 2 with the import, or Python 3) it creates a Unicode string.

Add an encoding declaration at the top of your Python source file if you use non-ASCII characters in string literals, e.g.: # -*- coding: utf-8 -*-.

So when I get a text input, I don't need to use unicode()?

If by "text input" you mean that your program receives bytes somehow (from a file, network, from the command-line) then no: you shouldn't rely on Python to convert bytes to Unicode implicitly -- you should do it explicitly as soon as you receive the bytes using unicode_text = bytestring.decode(character_encoding).

And in reverse, keep the text as Unicode inside your program. Convert Unicode strings to bytes as late as possible when it is necessary (e.g., to send the text via the network).

Use bytestrings to work with binary data: an image, compressed content, etc. Use Unicode strings to work with text in Python.
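The decode-early, encode-late pattern described above can be sketched roughly as follows. The function name shout and the choice of UTF-8 are assumptions made for illustration only; in real code the encoding comes from wherever the bytes came from.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical helper: decode on input, work on text, encode on output."""
    text = raw_bytes.decode(encoding)   # bytes -> Unicode, as early as possible
    text = text.upper()                 # all processing happens on Unicode text
    return text.encode(encoding)        # Unicode -> bytes, as late as possible


incoming = b'caf\xc3\xa9'               # UTF-8 bytes for u'caf\xe9'
outgoing = shout(incoming)              # bytes again, ready to be sent or written
```

Upper-casing the raw bytes directly would leave the two-byte sequence for the accented letter untouched, which is one reason to do text processing on Unicode strings only.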

To read Unicode from a file, use io.open() (you have to know the correct character encoding if it is not locale.getpreferredencoding(False)).
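A minimal sketch of reading text with io.open(), assuming the file is known to be UTF-8. The temporary file and its name are made up for the example; the file is written first only so that the snippet is self-contained.

```python
import io
import os
import tempfile

# Hypothetical file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Write two CJK characters as UTF-8 so there is something to read back.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u4f60\u597d')

# Text mode with an explicit encoding yields a Unicode string.
with io.open(path, 'r', encoding='utf-8') as f:
    text = f.read()

# Binary mode yields the raw bytes that are actually stored on disk.
with io.open(path, 'rb') as f:
    raw = f.read()
```

Here text holds two code points while raw holds the six UTF-8 bytes that encode them.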

Which character encoding to use when you receive text over the network may depend on the protocol; e.g., the charset can be specified in the Content-Type HTTP header:

    text = data.decode(response.headers.getparam('charset'))
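The one-liner above relies on the Python 2 headers object. As a rough, version-independent sketch, the charset can also be pulled out of a Content-Type value with the standard email.message.Message class. The helper name charset_from_content_type and the UTF-8 fallback are assumptions for illustration, not part of any protocol.

```python
from email.message import Message


def charset_from_content_type(header_value, default='utf-8'):
    """Hypothetical helper: return the charset parameter, or a fallback."""
    msg = Message()
    msg['Content-Type'] = header_value
    # get_content_charset() lower-cases the value and returns the
    # fallback when the header carries no charset parameter.
    return msg.get_content_charset(default)


data = b'caf\xe9'  # bytes as they might arrive from the network
charset = charset_from_content_type('text/plain; charset=ISO-8859-1')
text = data.decode(charset)
```

Falling back to a default is a judgment call: if the protocol does not state the encoding, any default is a guess.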

You could use universal_newlines=True or io.TextIOWrapper() to get Unicode text from an external process started using the subprocess module. It can be non-trivial to figure out what character encoding should be used on Windows (if you read Russian, see the gory details here: Byte при печати вывода внешней команды, i.e. "Bytes when printing the output of an external command").
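A minimal Python 3 sketch of the io.TextIOWrapper() approach. It assumes the child process is known to emit UTF-8; the child here is just another Python interpreter writing fixed bytes, chosen so the example is self-contained.

```python
import io
import subprocess
import sys

# Child program: writes the UTF-8 bytes for u'h\xe9llo' plus a newline.
# The raw string keeps the escapes intact so the child interprets them.
child_code = r'import sys; sys.stdout.buffer.write(b"h\xc3\xa9llo\n")'

proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stdout=subprocess.PIPE)

# Wrap the binary pipe so that reads return Unicode text.
# The encoding is an assumption about what the child emits.
with io.TextIOWrapper(proc.stdout, encoding='utf-8') as out:
    text = out.read()

proc.wait()
```

With universal_newlines=True instead, Python 3 decodes the output using the locale's preferred encoding, which may or may not match what the child actually writes.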


3

For example (in the Python interactive interpreter; the output differs in a GUI shell):

>>> s = '你好'
>>> s
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> us = u'你好'
>>> us
u'\u4f60\u597d'
>>> print type(s)
<type 'str'>
>>> print type(us)
<type 'unicode'>
>>> len(s)
6
>>> len(us)
2

In short:
First, in Python 2 a str object is a sequence of bytes, while a unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.
Then, len(str) returns the number of bytes, whereas len(unicode) returns the number of code points. To be stored or transmitted, a sequence of code points needs to be represented as a sequence of bytes (values from 0 to 255). The rules for translating a Unicode string into a sequence of bytes are called an encoding.
If you want a bytestring from the user, use raw_input() instead of input().
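To make the relation between code points, bytes and encodings concrete, here is a small sketch that runs on both Python 2 and Python 3. The variable names are arbitrary; the two characters are the ones from the session above, written as escapes so that no source encoding declaration is needed.

```python
text = u'\u4f60\u597d'                    # two code points

utf8_bytes = text.encode('utf-8')         # three bytes per character here
utf16_bytes = text.encode('utf-16-le')    # two bytes per character here

# Lengths count code points for text and bytes for encoded data.
lengths = (len(text), len(utf8_bytes), len(utf16_bytes))

# Decoding with the same encoding recovers the original text.
round_trip = utf8_bytes.decode('utf-8')
```

The same text yields different byte sequences under different encodings, which is why the encoding has to be known in order to decode bytes correctly.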


2

In Python 2.6+ you can use from __future__ import unicode_literals, but that only makes string literals Unicode. All functions that returned byte strings still return byte strings.

Example:

>>> s = 'abc'
>>> type(s)
<type 'str'>
>>> from __future__ import unicode_literals
>>> s = 'abc'
>>> type(s)
<type 'unicode'>

For the behavior you want, use Python 3.
