-1

I am trying to read data from a .txt file, find the 30 most common words, and print them out. However, whenever I read the file, I get this error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 338: ordinal not in range(128)".

Here is my code:

filename = 'wh_2015_national_security_strategy_obama.txt'
#extracts the year named in the file (characters 3 to 6, after the 'wh_' prefix)
year = filename[3:7]
ecount = 30
#opens the file and reads it
file = open(filename,'r').read()   #THIS IS WHERE THE ERROR IS
#counts the characters, then counts the lines, removes the non-word characters, splits the text into a list and changes it all to lower case.
numchar = len(file)
numlines = file.count('\n')
file = file.replace(",","").replace("'s","").replace("-","").replace(")","")
words = file.lower().split()
dictionary = {}
#this is a dictionary of all the words to not count for the most commonly used. 
dontcount = {"the", "of", "in", "to", "a", "and", "that", "we", "our", "is", "for", "at", "on", "as", "by", "be", "are", "will","this", "with", "or",
             "an", "-", "not", "than", "you", "your", "but","it","a","and", "i", "if","they","these","has","been","about","its","his","no",
             "because","when","would","was", "have", "their","all","should","from","most", "were","such","he", "very","which","may","because","--------",
             "had", "only", "no", "one", "--------", "any", "had", "other", "those", "us", "while",
             "..........", "*", "$", "so", "now","what", "who", "my","can", "who","do","could", "over", "-",
             "...............","................", "during","make","************",
             "......................................................................", "get", "how", "after",
             "..................................................", "...........................", "much", "some",
             "through","though","therefore","since","many", "then", "there", "–", "both", "them", "well", "me", "even", "also", "however"}
for w in words:
    if not w in dontcount:
        if w in dictionary:
            dictionary[w] +=1
        else:
            dictionary[w] = 1
num_words = sum(dictionary[w] for w in dictionary)
#This sorts the dictionary and makes it so that the most popular is at the top.
x = [(dictionary[w],w) for w in dictionary]
x.sort()
x.reverse()
#This prints out the number of characters, lines, and words (not including stop words).
print(str(filename))
print('The file has ',numchar,' number of characters.')
print('The file has ',numlines,' number of lines.')
print('The file has ',num_words,' number of words.')
#This provides the structure for how the most common words should be printed out
i = 1
for count, word in x[:ecount]:
    print("{0}, {1}, {2}".format(i,count,word))
    i+=1
3
  • 1
    Possible duplicate stackoverflow.com/questions/21129020/… & stackoverflow.com/questions/26619801/… Commented May 7, 2016 at 2:03
  • See the post I linked to and the Python 3 docs for open, especially its encoding parameter. For Python 2, the "new" version of open is in io.open. PS: That byte is most likely a nonstandard (Microsoft) right-single-quote, frequently misused as a "curly" apostrophe. Commented May 7, 2016 at 2:15
  • It's none of the above - all those questions and answers deal with Python 2. Not one will help the OP fix the very simple question relating to Python 3's TextIOWrapper throwing an exception, which has to be corrected by selecting the right encoding Commented May 7, 2016 at 11:34

2 Answers

3

In Python 3, when a file is opened in text mode (the default), Python picks the encoding from your environment settings (the locale's preferred encoding) unless you pass one explicitly.

If it can't determine one (or your environment specifically defines ASCII), it falls back to ASCII. This is what has happened in your case.

If the ASCII decoder finds any byte outside the ASCII range (0x00 to 0x7F), it raises an error. In your case, it failed on the byte 0x92. That byte is not valid ASCII, and on its own it is not valid UTF-8 either (in UTF-8 it can only appear as a continuation byte after a multi-byte lead byte). It does make sense in the windows-1252 encoding, however, where it is the "smart" apostrophe, U+2019 RIGHT SINGLE QUOTATION MARK. It could also make sense in other 8-bit code pages, but you'll have to know or work out which one the file actually uses.
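As a rough illustration of the point above, the same byte can be decoded by hand under each codec. The sample bytes and the helper name try_decode are made up for this sketch; they are not taken from the question's file.

```python
# Hypothetical sample: an apostrophe stored as the single byte 0x92.
raw = b"America\x92s"


def try_decode(data, encoding):
    """Return the decoded text, or a short error description if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: " + exc.reason


# ASCII and UTF-8 both reject the byte; windows-1252 maps it to U+2019.
for enc in ("ascii", "utf-8", "windows-1252"):
    print(enc, "->", try_decode(raw, enc))
```

Only the windows-1252 attempt returns text; the other two report why the byte was rejected.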

To make your code read windows-1252 encoded files, you need to change your open() command to:

file = open(filename, 'r', encoding='windows-1252').read()
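A slightly fuller sketch of the same fix, using a with block so the file is closed after reading. It writes a small throwaway file first so that it can run on its own; the sample text and the temporary path are assumptions standing in for the real strategy document.

```python
import os
import tempfile

# Hypothetical stand-in for the real file: one line containing byte 0x92.
sample = b"The nation\x92s security strategy\n"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Naming the encoding explicitly avoids the environment-dependent default.
    with open(path, "r", encoding="windows-1252") as f:
        text = f.read()
finally:
    os.remove(path)

print(text)
```

If the file turns out to use a different 8-bit code page, only the encoding argument needs to change.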

-3

I am learning python, so please take this response with that in mind.

file = open(filename,'r').read() #THIS IS WHERE THE ERROR IS

From what I have learned so far, your read() is combined with the open() call that creates the object. The open() function creates the file handle, and the read() function reads the file into a string. I presume both functions return success/fail or, in open()'s case, the file object reference. I am not sure they can be combined successfully.

From what I have learned so far, this is done in two steps, i.e.:

file = open(filename, 'r')   # creates the file object
myString = file.read()       # reads the entire file into a string

The open() function creates the file object, so it probably returns the object number, or success/fail.

The read, read(n), readline() or readlines() functions are used on the object.

.read() reads the entire file into a single string
.read(n) reads the next n characters into a string
.readline() reads the next line into a string
.readlines() reads the entire file into a list of strings

You can split them up and see if the same result happens ??? just a thought from a newbie :)

1 Comment

Assigning a file-like object to a local variable before reading it does not change the contents of that file, nor how they are converted from bytes to strings, which is what caused the UnicodeDecodeError. See the encoding and errors parameters for open, and also the various read-related methods of the TextIOBase ("text file") it returns.
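To illustrate the errors parameter mentioned in the comment, here is a minimal sketch that keeps the failing ASCII codec but changes how undecodable bytes are handled. The sample bytes, the temporary file, and the helper name read_with are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical sample containing the problematic byte 0x92.
sample = b"The nation\x92s strategy"

with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    path = tmp.name


def read_with(errors):
    """Read the sample file as ASCII using the given error handler."""
    with open(path, "r", encoding="ascii", errors=errors) as f:
        return f.read()


try:
    replaced = read_with("replace")  # bad byte becomes U+FFFD
    ignored = read_with("ignore")    # bad byte is silently dropped
finally:
    os.remove(path)

print(repr(replaced))
print(repr(ignored))
```

Both handlers lose the original character, so they are a fallback for when the real encoding is unknown rather than a substitute for choosing the right one.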
