4

I am trying to read the MovieLens dataset (http://files.grouplens.org/datasets/movielens/ml-100k/) using pandas.

I am using Python 3.4 and following this tutorial: http://www.gregreda.com/2013/10/26/using-pandas-on-the-movielens-dataset/

When I try to read the u.item data using the code mentioned there:

import pandas as pd

# the movies file contains columns indicating the movie's genres
# let's only load the first five columns of the file with usecols
m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5), encoding='UTF-8')

I get the following error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte".

What could be the reason for this error, and how can I fix it?

I tried passing encoding='utf-8' to pd.read_csv(), but unfortunately it made no difference.

The error traceback is:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-4cc01a7faf02> in <module>()
      9 # let's only load the first five columns of the file with usecols
     10 m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
---> 11 movies = pd.read_csv('ml-100k/u.item', sep='|', names=m_cols, usecols=range(5), encoding='UTF-8')

/usr/local/lib/python3.4/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

/usr/local/lib/python3.4/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    258         return parser
    259 
--> 260     return parser.read()
    261 
    262 _parser_defaults = {

/usr/local/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720 
--> 721         ret = self._engine.read(nrows)
    722 
    723         if self.options.get('as_recarray'):

/usr/local/lib/python3.4/site-packages/pandas/io/parsers.py in read(self, nrows)
   1168 
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7784)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8617)()

pandas/parser.pyx in pandas.parser.TextReader._convert_column_data (pandas/parser.c:9928)()

pandas/parser.pyx in pandas.parser.TextReader._convert_tokens (pandas/parser.c:10714)()

pandas/parser.pyx in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:12118)()

pandas/parser.pyx in pandas.parser.TextReader._string_convert (pandas/parser.c:12283)()

pandas/parser.pyx in pandas.parser._string_box_utf8 (pandas/parser.c:17655)()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 3: invalid continuation byte
3
  • try passing param encoding='utf-8' to read_csv Commented Jun 10, 2015 at 9:55
  • Tried it.. I am getting the same error. Commented Jun 10, 2015 at 12:47
  • Please share your code and the exact error traceback. Commented Jun 10, 2015 at 16:18

3 Answers

7

This is a bit of an old one, but I'm going to toss this here in case anyone else runs into it, as I had the same problem in Nov 2018.

I checked the encoding with file -i:

$ file -i u.item
u.item: text/plain; charset=iso-8859-1

Then I passed that encoding to pd.read_csv():

>>> import pandas as pd
>>> # u.item has no header row, so header=None keeps the first movie as data
>>> df = pd.read_csv("u.item", sep="|", encoding="iso-8859-1", header=None)
>>>

Success!
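
For anyone wondering why this works, here is a minimal sketch of what happens at the byte level. The sample string is made up for illustration and is not taken from u.item; the only thing borrowed from the question is the offending byte 0xe9, which is "é" in ISO-8859-1 but is not valid on its own in UTF-8.

```python
# Hypothetical sample text; "\u00e9" is "é", which ISO-8859-1 stores as the single byte 0xe9.
raw = "Caf\u00e9 au lait".encode("iso-8859-1")
print(raw)  # b'Caf\xe9 au lait'

# In UTF-8, 0xe9 starts a three-byte sequence and must be followed by
# continuation bytes. Here the next byte is a space, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)

# ISO-8859-1 maps every byte to exactly one character, so decoding succeeds.
text = raw.decode("iso-8859-1")
print(text)
```

Passing encoding="iso-8859-1" to pd.read_csv() asks pandas to do the second kind of decoding instead of the first.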


2 Comments

Good answer! It would be awesome if pandas could detect this automagically.
Just opening it with iso-8859-1 works.
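
On the wish for automatic detection above: a rough fallback can be sketched by hand, assuming the file is either UTF-8 or ISO-8859-1. The helper name decode_with_fallback and the sample bytes are hypothetical, and this is not real detection: ISO-8859-1 accepts any byte sequence, so it only makes sense as the last candidate.

```python
def decode_with_fallback(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample bytes: the first contains 0xe9, the second is plain ASCII.
text, used = decode_with_fallback(b"Caf\xe9 au lait")
ascii_text, ascii_used = decode_with_fallback(b"plain ascii")
print(used)        # iso-8859-1
print(ascii_used)  # utf-8
```

The encoding name it returns could then be passed as encoding= to pd.read_csv().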
2

I found two possible tricks to solve the problem:

1/ Open the file in a text editor and save it with encoding "UTF-8".

===> For example, in Sublime Text: "File" > "Save with Encoding" > "UTF-8"

2/ Or just read the file with Python 2.

Couldn't find a better solution.
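
The re-save in trick 1/ can also be done without an editor. Below is a minimal sketch, assuming the source file really is ISO-8859-1; the helper name transcode_to_utf8 and the sample file names and contents are made up for illustration.

```python
import tempfile
from pathlib import Path


def transcode_to_utf8(src, dst, src_encoding="iso-8859-1"):
    """Rewrite src as UTF-8 in dst, assuming src is encoded as src_encoding."""
    raw = Path(src).read_bytes()
    # Working on bytes keeps the line endings exactly as they were.
    Path(dst).write_bytes(raw.decode(src_encoding).encode("utf-8"))


# Demo on a throwaway file containing the problematic byte 0xe9.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.item"
    dst = Path(tmp) / "sample.utf8.item"
    src.write_bytes(b"1|Caf\xe9 au lait (1995)\n")
    transcode_to_utf8(src, dst)
    converted = dst.read_text(encoding="utf-8")

print(converted)
```

After converting u.item this way, reading the new file with encoding='UTF-8' should no longer raise the UnicodeDecodeError.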


1

To clarify xb353's answer: on some systems (macOS, for example) the command should be file -I u.item instead. There, -i just returns "regular file".
