
I have a large CSV file of 3.5 GB and I want to read it using pandas.

This is my code:

import pandas as pd
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', iterator=True, chunksize=20000000, low_memory=False)
df = pd.concat(tp, ignore_index=True)

I get this error:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:8771)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:23325)()

CParserError: Error tokenizing data. C error: out of

My RAM capacity is 8 GB.

  • what about just pd.read_csv('train_2011_2012_2013.csv', sep=';') ? Commented Dec 23, 2016 at 14:35
  • In addition to any other suggestions, you should also specify dtypes. Commented Dec 23, 2016 at 14:49
  • @Boud my computer doesn't support it Commented Dec 23, 2016 at 21:42
  • Noobie's answer above uses even more memory because you are loading a chunk and appending it to mylist (creating a second copy of the data). You should read in a chunk, process it, store the result, then continue reading the next chunk. Also, setting dtype for columns will reduce memory; see the sketch just after this comment list. Commented May 23, 2017 at 18:55
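To illustrate that comment, here is a minimal sketch of per-chunk processing. The column names some_category_col and some_numeric_col are placeholders (the real schema of train_2011_2012_2013.csv isn't shown in the question), and the aggregation is just one example of what "process it, store the result" could look like:

import pandas as pd

# Hypothetical dtypes -- adjust to the actual columns of the file.
dtypes = {'some_category_col': 'category', 'some_numeric_col': 'float32'}

partial_sums = []
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';',
                         chunksize=200000, dtype=dtypes, usecols=list(dtypes)):
    # Process each chunk right away and keep only the small result ...
    partial_sums.append(chunk.groupby('some_category_col')['some_numeric_col'].sum())

# ... then combine the per-chunk results instead of the raw rows.
result = pd.concat(partial_sums).groupby(level=0).sum()

This way only one chunk plus the (small) accumulated results are ever in memory, instead of a second full copy of the data.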

4 Answers


try this bro:

import pandas as pd

# Read the file in chunks and collect them in a list ...
mylist = []
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=20000):
    mylist.append(chunk)

# ... then concatenate the chunks into a single DataFrame.
big_data = pd.concat(mylist, axis=0)
del mylist  # free the intermediate list

4 Comments

Thanks for your help, but I get an error in big_data = pd.concat(mylist, axis=0): a MemoryError is raised inside pandas at out = np.empty(out_shape, dtype=dtype).
Just came across this. Perfect!
I am reading two big csv files one after another. This is not working. Any suggestions please? My csv size is 980 MB
This worked for me. I had a CSV file with a size of 13.2 GB.

This error can also be caused by chunksize=20000000. Decreasing it fixed the issue in my case. In ℕʘʘḆḽḘ's solution the chunksize is also much smaller, which might have done the trick.
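A minimal sketch of that suggestion, reusing the call from the question with a much smaller (assumed) chunk size:

import pandas as pd

# Same call as in the question, but with 20,000 rows per chunk instead of
# 20,000,000, so each individual chunk comfortably fits in 8 GB of RAM.
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=20000, low_memory=False)
df = pd.concat(tp, ignore_index=True)

Note that the final concat still builds the full DataFrame, so this only helps if the complete dataset itself fits in memory.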

2 Comments

If it is already covered by ℕʘʘḆḽḘ's solution, then just leave this as a comment. No need to put it as an answer.
I wanted to do that but didn't have enough reputation. I just wanted to leave this info for future reference; I hadn't found it when I was googling for this error.

You may try setting error_bad_lines=False when reading the CSV file, i.e.:

import pandas as pd
# Skip malformed lines instead of raising a parser error on them.
df = pd.read_csv('my_big_file.csv', error_bad_lines=False)

1 Comment

Pandas version 1.3.4: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version.
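For pandas 1.3 and newer, the equivalent call uses the on_bad_lines parameter instead; a sketch with the same hypothetical file name as the answer above:

import pandas as pd

# In pandas >= 1.3, on_bad_lines='skip' replaces error_bad_lines=False:
# malformed lines are dropped instead of raising a parser error.
df = pd.read_csv('my_big_file.csv', on_bad_lines='skip')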

You may try adding the parameter engine='python'. It loads the data more slowly, but it helped in my situation.
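A sketch of that suggestion, applied to the file from the question:

import pandas as pd

# engine='python' selects the slower pure-Python parser instead of the
# default C parser; the answer above reports that this avoided the error.
df = pd.read_csv('train_2011_2012_2013.csv', sep=';', engine='python')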

Comments
