python os.walk and unicode error

Question

Two questions. 1. Why does

In [21]: for root, dir, file in os.walk(spath):
   ....:     print(root)

print the whole tree but

In [6]: for dirs in os.walk(spath):
   ...:     print(dirs)

choke on this Unicode error?

UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 1477: character maps to <undefined>

[NOTE: this is the TM symbol]

I looked at these answers

Scraping works well until I get this error: 'ascii' codec can't encode character u'\u2122' in position

What's the deal with Python 3.4, Unicode, different languages and Windows?

python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to <undefined>

https://github.com/Drekin/win-unicode-console

https://docs.python.org/3/search.html?q=IncrementalDecoder&check_keywords=yes&area=default

and tried these variations

----> 1 print(dirs, encoding='utf-8')
TypeError: 'encoding' is an invalid keyword argument for this function
In [11]: >>> u'\u2122'.encode('ascii', 'ignore')
Out[11]: b''

print(dirs).encode(‘utf=8’)

all to no effect.

This was done with Python 3.4.3 and Visual Studio Code 1.6.1 on Windows 10. The default settings in Visual Studio Code include:

// The default character set encoding to use when reading and writing files.
"files.encoding": "utf8",

Versions: Python 3.4.3, Visual Studio Code 1.6.1, IPython 3.0.0

UPDATE: I tried this again by running a script from the Sublime Text REPL. Here's what I got:

# -*- coding: utf-8 -*-
import os

spath = 'C:/Users/Semantic/Documents/Align' 

with open('os_walk4_align.txt', 'w') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)

Traceback (most recent call last):
  File "listdir_test1.py", line 8, in <module>
    print(path, dirs, filenames, file=f)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2605' in position 300: character maps to <undefined>

This code is only 217 characters long, so where does ‘position 300’ come from?

I assume you mean 'unicode', not 'unicorn'. I am testing out the new Visual Studio Code on Windows 10, which is why I am using it, and as I said, the default is already set to utf-8. Furthermore, I tried this in Sublime Text, and I am STILL getting Unicode errors, albeit different ones.
– Malik A. Rumi, Commented Oct 17, 2016 at 21:29
Setting the source encoding (#coding:utf8) has nothing to do with the output encoding. As you can see from your error, cp1252 is the output encoding, and it doesn't support the characters being printed to the terminal. The easiest way around this is to write to a file with UTF-8 encoding instead of printing to a display, or to use a Python IDE that supports UTF-8 output. I'm not familiar with Sublime Text, but it probably has a way to adjust the output encoding as well.
– Mark Tolonen, Commented Oct 17, 2016 at 22:13

Mark Tolonen · Accepted Answer · 2016-10-18 01:46:41Z

Here's a test case:

C:\TEST
├───dir1
│       file1™
│
└───dir2
        file2

Here's a script (Python 3.x):

import os

spath = r'c:\test'

for root,dirs,files in os.walk(spath):
    print(root)

for dirs in os.walk(spath):
    print(dirs)

Here's the output, on an IDE that supports UTF-8 (PythonWin, in this case):

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
('c:\\test\\dir1', [], ['file1™'])
('c:\\test\\dir2', [], ['file2'])

Here's the output, on my Windows console, which defaults to cp437:

c:\test
c:\test\dir1
c:\test\dir2
('c:\\test', ['dir1', 'dir2'], [])
Traceback (most recent call last):
  File "C:\test.py", line 9, in <module>
    print(dirs)
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2122' in position 47: character maps to <undefined>

For Question 1, print(root) works because no directory name contains a character that the output encoding doesn't support. print(dirs), however, prints the whole (root, dirs, files) tuple, and one of the file names contains a character that the Windows console encoding can't represent.
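The difference can be reproduced without a console. This is a minimal sketch, assuming cp437 as the output encoding (as on my console) and a hand-built tuple shaped like what os.walk yields for the test tree. It also suggests where a "position" number comes from: it is an index into the text being encoded, not into the script.

```python
# An entry shaped like the ones os.walk yields for the test tree above.
root = 'c:\\test\\dir1'
entry = (root, [], ['file1\u2122'])  # \u2122 is the trademark sign

# print(root) only has to encode the directory name, which is plain ASCII.
root_bytes = root.encode('cp437')

# print(entry) has to encode the text of the whole tuple, file names included.
text = str(entry)
try:
    text.encode('cp437')
    failed_at = None
except UnicodeEncodeError as exc:
    # exc.start is an index into the text being encoded, not into the script.
    failed_at = exc.start

if failed_at is not None:
    # ascii() keeps this message printable on any console.
    print('cannot encode', ascii(text[failed_at]), 'at position', failed_at)
```

The longer the text handed to the encoder, the larger the reported position can be, regardless of how short the script is.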

For Question 2, the first example misspelled utf-8 as utf=8, and the second example didn't declare an encoding for the file the output was written to, so open() used a default (cp1252, according to the traceback) that didn't support the character.
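To see which defaults are in play on a given machine, something like the following should work. It is a sketch rather than a diagnosis of your setup. As far as I know, open() without an encoding argument falls back to what locale.getpreferredencoding(False) reports, and print() uses sys.stdout.encoding. The variable names are mine.

```python
import locale
import sys

# Encoding that open() falls back to when no encoding= argument is given.
default_file_encoding = locale.getpreferredencoding(False)

# Encoding print() uses for the console (may be missing if stdout is replaced).
console_encoding = getattr(sys.stdout, 'encoding', None)

print('open() default :', default_file_encoding)
print('stdout encoding:', console_encoding)

# cp1252 has no mapping for the star character from the second traceback.
try:
    '\u2605'.encode('cp1252')
    star_ok = True
except UnicodeEncodeError:
    star_ok = False
print('cp1252 can encode U+2605:', star_ok)
```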

Try this:

import os

spath = r'c:\test'

with open('os_walk4_align.txt', 'w', encoding='utf8') as f:
    for path, dirs, filenames in os.walk(spath):
        print(path, dirs, filenames, file=f)

Content of os_walk4_align.txt, encoded in UTF-8:

c:\test ['dir1', 'dir2'] []
c:\test\dir1 [] ['file1™']
c:\test\dir2 [] ['file2']
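If you still want to print to the console instead of a file, one possible workaround (not something I tested on your setup) is to escape whatever the console encoding can't represent before printing. console_safe is a hypothetical helper name, and the ASCII fallback is an assumption for streams that report no encoding.

```python
import sys


def console_safe(value, encoding=None):
    """Return str(value) with characters the encoding cannot represent escaped."""
    # Assumption: fall back to ASCII if the stream does not report an encoding.
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Unsupported characters become \uXXXX escapes instead of raising an error.
    return str(value).encode(encoding, errors='backslashreplace').decode(encoding)


# cp437 is passed explicitly here only to mimic the Windows console above.
print(console_safe(('c:\\test\\dir1', [], ['file1\u2122']), encoding='cp437'))
```

Inside the os.walk loop that would be print(console_safe(dirs)). The trademark sign shows up as \u2122 rather than the glyph, but nothing raises.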

Ding! Ding! Ding! We have a winner! My own '=' typo aside, I had the encoding on the print line, when it should have been in the arguments to open(). Your detailed answer helped a great deal. Thanks.

aneroid · Answer · 2016-10-18 00:32:16Z

-1

The console you're outputting to doesn't support non-ASCII by default. You need to use str.encode('utf-8').

That works on strings, not on lists, so print(dirs).encode(‘utf=8’) won't work. Also, it's utf-8, not utf=8.

Print your lists with list comprehension like:

>>> print([s.encode('utf-8') for s in ['a', 'b']])
[b'a', b'b']
>>> print([d.encode('utf-8') for d in dirs])  # to print `dirs`
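On Python 3, which the question uses, str.encode returns bytes, so this prints the byte representation rather than the character itself. A small sketch with a made-up list of names:

```python
names = ['file1\u2122', 'file2']  # \u2122 is the trademark sign

# Each str becomes a bytes object holding its UTF-8 encoding.
encoded = [s.encode('utf-8') for s in names]

# The repr of bytes is pure ASCII, so printing it cannot raise an encode error.
print(encoded)  # [b'file1\xe2\x84\xa2', b'file2']
```

That avoids the exception, but the output shows escaped bytes instead of the TM symbol.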
