
I have been trying to use pandas to analyze some genomics data. When reading a CSV, I get CParserError: Error tokenizing data. C error: out of memory, and I have narrowed it down to the particular line that triggers it: 43452. As shown below, the error doesn't happen until the parser reaches Line 43452.

I have also pasted the relevant lines from less output, with the long sequences truncated; the second column (seq_len) shows the length of each sequence. As you can see, some of the sequences are fairly long, a few million characters each (i.e. bases, in genomics terms). I wonder if the error is caused by too big a value in the CSV. Does pandas impose a limit on the length of a value in a single cell? If so, how big is it? (I sketch a quick check below, after the less output.)

BTW, data.csv.gz is about 9 GB when decompressed and has fewer than 2 million lines. My system has over 100 GB of memory, so I think physical memory is unlikely to be the cause.

Successful read at Line 43451

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43451)

Failed read at Line 43452

In [1]: import pandas as pd
In [2]: df = pd.read_csv('data.csv.gz',
                         compression='gzip', header=None,
                         names=['accession', 'seq_len', 'tax_id', 'seq'],
                         nrows=43452)
---------------------------------------------------------------------------
CParserError                              Traceback (most recent call last)
<ipython-input-1-036af96287f7> in <module>()
----> 1 import pandas as pd; df = pd.read_csv('filtered_gb_concatenated.csv.gz', compression='gzip', header=None, names=['accession', 'seq_len', 'tax_id', 'seq'], nrows=43452)

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473
--> 474         return _read(filepath_or_buffer, kwds)
    475
    476     parser_f.__name__ = name

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    254                                   " together yet.")
    255     elif nrows is not None:
--> 256         return parser.read(nrows)
    257     elif chunksize or iterator:
    258         return parser

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    719                 raise ValueError('skip_footer not supported for iteration')
    720
--> 721         ret = self._engine.read(nrows)
    722
    723         if self.options.get('as_recarray'):

/path/to/venv/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
   1168
   1169         try:
-> 1170             data = self._reader.read(nrows)
   1171         except StopIteration:
   1172             if nrows is None:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:7544)()

pandas/parser.pyx in pandas.parser.TextReader._read_low_memory (pandas/parser.c:7952)()

pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:8401)()

pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8275)()

pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:20691)()

CParserError: Error tokenizing data. C error: out of memory

Lines 43450-43455 of less -N -S output, with the long seq values truncated. The first column is the line number; after it comes the CSV content, comma-separated. The column names are ['accession', 'seq_len', 'tax_id', 'seq'].

43450 FP929055.1,3341681,657313,AAAGAACCTTGATAACTGAACAATAGACAACAACAACCCTTGAAAATTTCTTTAAGAGAA....
43451 FP929058.1,3096657,657310,TTCGCGTGGCGACGTCCTACTCTCACAAAGGGAAACCCTTCACTACAATCGGCGCTAAGA....
43452 FP929059.1,2836123,717961,GTTCCTCATCGTTTTTTAAGCTCTTCTCCGTACCCTCGACTGCCTTCTTTCTCACTGTTC....
43453 FP929060.1,3108859,245012,GGGGTATTCATACATACCCTCAAAACCACACATTGAAACTTCCGTTCTTCCTTCTTCCTC....
43454 FP929061.1,3114788,649756,TAACAACAACAGCAACGGTGTAGCTGATGAAGGAGACATATTTGGATGATGAATACTTAA....
43455 FP929063.1,34221,29290,CCTGTCTATGGGATTTGGCAGCGCAATGCAGGAAAACTACGTCCTAAGTGTGGAGATCGATGC....
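
To check whether one of these rows is malformed, or whether the huge seq field alone is the problem, here is a minimal sketch that parses the suspect region with the standard-library csv module instead of pandas. Note that csv has its own per-field limit (131072 characters by default, via csv.field_size_limit), which has to be raised for these multi-megabase values; the 'rt' mode is Python 3 syntax, so under the Python 2.7 setup above open the file without it:

import csv
import gzip
import sys

csv.field_size_limit(sys.maxsize)  # default limit is 131072 chars per field

with gzip.open('data.csv.gz', 'rt') as f:  # Python 2.7: gzip.open('data.csv.gz')
    for lineno, row in enumerate(csv.reader(f), start=1):
        if 43450 <= lineno <= 43455:
            # accession, declared seq_len, actual length of the seq field
            print(lineno, row[0], row[1], len(row[3]))
        elif lineno > 43455:
            break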
  • To test that the gzip file is not corrupt: gunzip -t file.gz Commented Sep 24, 2015 at 18:39
  • Also, check the md5sum of file.csv.gz against the original source ..... Commented Sep 24, 2015 at 18:54
  • I created the file.csv.gz myself, so there is no original md5sum to compare against. Will try gunzip -t. Commented Sep 24, 2015 at 18:59
  • Review this solution: chrisaycock's answer .... Maybe it can help you. Commented Sep 24, 2015 at 19:07
  • @Jose, gunzip -t finishes successfully, so the gzip integrity is good. Commented Sep 24, 2015 at 19:10

1 Answer


Well, the last line says it all: the parser doesn't have enough memory to tokenize a chunk of data. I'm not sure how reading from the compressed archive works or how much data it loads into memory at once, but it's clear that you will have to control the size of the chunks somehow (see the sketch after the links below). I found a solution here:

pandas-read-csv-out-of-memory

and here:

out-of-memory-error-when-reading-csv-file-in-chunk
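
Something like this, as a rough sketch: it assumes you have decompressed the archive to data.csv first, and the chunk size of 1000 rows is only an arbitrary starting point to tune:

import pandas as pd

# Read a bounded number of rows at a time; each chunk arrives as its own
# DataFrame, and concat stitches them together at the end.
chunks = pd.read_csv('data.csv', header=None,
                     names=['accession', 'seq_len', 'tax_id', 'seq'],
                     chunksize=1000)  # rows per chunk -- tune this
df = pd.concat(chunks, ignore_index=True)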

Please also try reading the decompressed file line by line and see if it works.
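
For example, a quick pass like this (again on the decompressed data.csv) would confirm the data itself is well formed outside pandas:

# Scan the decompressed file line by line, outside pandas, and flag any
# row that does not split into the expected four fields.
with open('data.csv') as f:
    for lineno, line in enumerate(f, start=1):
        if line.count(',') != 3:  # 4 columns -> 3 commas (seq has no commas)
            print(lineno, line[:80])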


2 Comments

The two links are essentially the same answer. Reading line by line finishes fine, but that's not what I want; I do want to read it into a DataFrame.
@zyxue I understand, but I suggested reading line by line to check whether it works that way. The error seems to be thrown when the parser tries to split a line into fields. My advice is to juggle the following params: engine, nrows, chunksize. Here's the doc: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
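
For instance, the pure-Python engine sidesteps the C tokenizer that raises this error; it is much slower, but worth trying as a probe on the failing region. A sketch:

import pandas as pd

# Same read as in the question, but with the pure-Python parser instead
# of the C tokenizer that runs out of memory.
df = pd.read_csv('data.csv.gz',
                 compression='gzip', header=None,
                 names=['accession', 'seq_len', 'tax_id', 'seq'],
                 engine='python',
                 nrows=43452)  # just past the line that fails with the C engine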
