
I have a large CSV file of 3.5 GB and I want to read it using pandas.

This is my code:

import pandas as pd
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', iterator=True, chunksize=20000000, low_memory=False)
df = pd.concat(tp, ignore_index=True)

I get this error:

pandas/parser.pyx in pandas.parser.TextReader.read (pandas/parser.c:8771)()
pandas/parser.pyx in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)()
pandas/parser.pyx in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)()
pandas/parser.pyx in pandas.parser.raise_parser_error (pandas/parser.c:23325)()

CParserError: Error tokenizing data. C error: out of

My RAM capacity is 8 GB.

  • what about just pd.read_csv('train_2011_2012_2013.csv', sep=';') ? Commented Dec 23, 2016 at 14:35
  • In addition to any other suggestions, you should also specify dtypes. Commented Dec 23, 2016 at 14:49
  • @Boud my computer doesn't support it Commented Dec 23, 2016 at 21:42
  • Noobie's answer above uses even more memory because you are loading a chunk and appending it to mylist (creating a second copy of the data). You should read in a chunk, process it, store the result, then continue reading the next chunk. Also, setting dtype for columns will reduce memory; see the sketch just after this comment list. Commented May 23, 2017 at 18:55
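To illustrate that comment, here is a minimal sketch of per-chunk processing. The column names some_category_col and some_numeric_col are placeholders (the real schema of train_2011_2012_2013.csv isn't shown in the question), and the aggregation is just one example of what "process it, store the result" could look like:

import pandas as pd

# Hypothetical dtypes -- adjust to the actual columns of the file.
dtypes = {'some_category_col': 'category', 'some_numeric_col': 'float32'}

partial_sums = []
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';',
                         chunksize=200000, dtype=dtypes, usecols=list(dtypes)):
    # Process each chunk right away and keep only the small result ...
    partial_sums.append(chunk.groupby('some_category_col')['some_numeric_col'].sum())

# ... then combine the per-chunk results instead of the raw rows.
result = pd.concat(partial_sums).groupby(level=0).sum()

This way only one chunk plus the (small) accumulated results are ever in memory, instead of a second full copy of the data.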

4 Answers


try this bro:

import pandas as pd

# Read the file in chunks and collect them in a list ...
mylist = []
for chunk in pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=20000):
    mylist.append(chunk)

# ... then concatenate the chunks into a single DataFrame.
big_data = pd.concat(mylist, axis=0)
del mylist  # free the intermediate list

4 Comments

Thanks for your help, but I get an error in big_data = pd.concat(mylist, axis=0): a MemoryError is raised inside pandas at out = np.empty(out_shape, dtype=dtype).
Just came across this. Perfect!
I am reading two big csv files one after another. This is not working. Any suggestions please? My csv size is 980 MB
This worked for me. I had a CSV file with a size of 13.2 GB.

This error can also be caused by chunksize=20000000. Decreasing it fixed the issue in my case. In ℕʘʘḆḽḘ's solution the chunksize is also much smaller, which might have done the trick.
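A minimal sketch of that suggestion, reusing the call from the question with a much smaller (assumed) chunk size:

import pandas as pd

# Same call as in the question, but with 20,000 rows per chunk instead of
# 20,000,000, so each individual chunk comfortably fits in 8 GB of RAM.
tp = pd.read_csv('train_2011_2012_2013.csv', sep=';', chunksize=20000, low_memory=False)
df = pd.concat(tp, ignore_index=True)

Note that the final concat still builds the full DataFrame, so this only helps if the complete dataset itself fits in memory.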

2 Comments

If it is already covered by ℕʘʘḆḽḘ's solution, then just leave this as a comment. No need to put it as an answer.
I wanted to do that but didn't have enough reputation. I just wanted to leave this info for future reference; I hadn't found it when I was googling for this error.

You may try setting error_bad_lines=False when reading the CSV file, i.e.:

import pandas as pd
# Skip malformed lines instead of raising a parser error on them.
df = pd.read_csv('my_big_file.csv', error_bad_lines=False)

1 Comment

Pandas version 1.3.4: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version.
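For pandas 1.3 and newer, the equivalent call uses the on_bad_lines parameter instead; a sketch with the same hypothetical file name as the answer above:

import pandas as pd

# In pandas >= 1.3, on_bad_lines='skip' replaces error_bad_lines=False:
# malformed lines are dropped instead of raising a parser error.
df = pd.read_csv('my_big_file.csv', on_bad_lines='skip')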

You may try adding the parameter engine='python'. It loads the data more slowly, but it helped in my situation.
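A sketch of that suggestion, applied to the file from the question:

import pandas as pd

# engine='python' selects the slower pure-Python parser instead of the
# default C parser; the answer above reports that this avoided the error.
df = pd.read_csv('train_2011_2012_2013.csv', sep=';', engine='python')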

Comments
