I import a directory of |-delimited .dat files into a pandas DataFrame. The following code works, but I eventually run out of RAM and get a MemoryError:

import pandas as pd
import glob

temp = []
dataDir = 'C:/users/richard/research/data/edgar/masterfiles'
for dataFile in glob.glob(dataDir + '/master_*.dat'):
    print(dataFile)
    temp.append(pd.read_table(dataFile, delimiter='|', header=0))

masterAll = pd.concat(temp)

Is there a more memory-efficient approach? Or should I go whole hog to a database? (I will move to a database eventually, but I am taking baby steps in my move to pandas.) Thanks!

FWIW, here is the head of an example .dat file:

cik|cname|ftype|date|fileloc
1000032|BINCH JAMES G|4|2011-03-08|edgar/data/1000032/0001181431-11-016512.txt
1000045|NICHOLAS FINANCIAL INC|10-Q|2011-02-11|edgar/data/1000045/0001193125-11-031933.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-11|edgar/data/1000045/0001193125-11-005531.txt
1000045|NICHOLAS FINANCIAL INC|8-K|2011-01-27|edgar/data/1000045/0001193125-11-015631.txt
1000045|NICHOLAS FINANCIAL INC|SC 13G/A|2011-02-14|edgar/data/1000045/0000929638-11-00151.txt

1 Answer

Usually, if memory usage is a concern, it's better to use a generator instead of building a list up front. Something like this, with data_dir set to your data directory:

import glob
import os
import pandas as pd
dir_path = os.path.join(data_dir, 'master_*.dat')
master_all = pd.concat(pd.read_table(data_file, delimiter='|', header=0)
                       for data_file in glob.glob(dir_path))

Or you can write a generator function for a more verbose version.
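
A minimal sketch of that generator-function version could look like the following. The names iter_master_frames and load_master are made up for illustration, and I use pd.read_csv with sep='|' here, which should behave the same as read_table with delimiter='|' for these files. Keep in mind that the concatenated result still has to fit in memory, so this only avoids keeping your own explicit list of frames around.

```python
import glob
import os

import pandas as pd


def iter_master_frames(data_dir, pattern="master_*.dat"):
    """Yield one DataFrame per matching file (hypothetical helper)."""
    for data_file in sorted(glob.glob(os.path.join(data_dir, pattern))):
        # The file is parsed only when the consumer asks for the next item.
        yield pd.read_csv(data_file, sep="|", header=0)


def load_master(data_dir):
    """Concatenate every master file in data_dir into a single DataFrame."""
    # ignore_index=True gives the result one continuous row index.
    return pd.concat(iter_master_frames(data_dir), ignore_index=True)
```

You would call it as master_all = load_master(data_dir). If no files match the pattern, pd.concat raises a ValueError because there is nothing to concatenate.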

Anyway, this won't solve the problem if there isn't enough RAM to hold the final result plus some temporary space for at least one complete file (and probably more, depending on how the garbage collector behaves).
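
To get a rough idea of whether the final result can fit at all, you could measure one file in memory and scale by the total size on disk. This is only a sketch: estimate_total_bytes is a hypothetical helper, and it assumes every file has a layout similar to the first one, so treat the number as a ballpark figure.

```python
import glob
import os

import pandas as pd


def estimate_total_bytes(data_dir, pattern="master_*.dat"):
    """Rough estimate of the in-memory size of all matching files combined."""
    files = sorted(glob.glob(os.path.join(data_dir, pattern)))
    if not files:
        return 0
    sample = pd.read_csv(files[0], sep="|", header=0)
    # deep=True also counts the Python string objects, not just the pointers.
    sample_bytes = sample.memory_usage(deep=True).sum()
    # Assumption: the memory-to-disk ratio of the first file holds for the rest.
    ratio = sample_bytes / os.path.getsize(files[0])
    return int(ratio * sum(os.path.getsize(f) for f in files))
```

If the estimate is close to or above your available RAM, no way of arranging the concat will help, because the temporary space mentioned above comes on top of it.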

2 Comments

Thanks! This works better, but I still run out of memory, so I'm moving to a database. Thanks for the lesson on generators.
Well, generators are not really special. They are simply lazily evaluated, so you can process data and then throw it away, which leaves more memory free. But if you have to create a 6 GB string with only 4 GB of RAM, then there is simply nothing you can do to avoid memory errors or swapping.
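
As an illustration of "process data and then throw it away", here is a sketch that appends each file to a SQLite table so that only one file is held in memory at a time. It is just one possible way to do the database move mentioned above, not necessarily what was actually done. The function name load_into_sqlite, the table name master, and the database path are all assumptions.

```python
import glob
import os
import sqlite3

import pandas as pd


def load_into_sqlite(data_dir, db_path, table="master"):
    """Append each master_*.dat file to a SQLite table, one file at a time."""
    conn = sqlite3.connect(db_path)
    try:
        for data_file in sorted(glob.glob(os.path.join(data_dir, "master_*.dat"))):
            frame = pd.read_csv(data_file, sep="|", header=0)
            # if_exists="append" creates the table on the first file and
            # adds rows to it for every later file.
            frame.to_sql(table, conn, if_exists="append", index=False)
            # `frame` is rebound on the next pass, so the previous file's
            # data becomes eligible for garbage collection.
        conn.commit()
    finally:
        conn.close()
```

Afterwards you can pull back only the rows you need, for example with pd.read_sql_query and a WHERE clause, instead of loading everything at once.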
