
I guess this question requires some insight into the implementation of concat.

Say I have 30 files, 1 GB each, and I can only use up to 32 GB of memory. I loaded the files into a list of DataFrames called 'list_of_pieces'. This list_of_pieces should be ~30 GB in size, right?

If I do pd.concat(list_of_pieces), does concat allocate another 30 GB (or maybe 10 or 15 GB) on the heap and do some operations, or does it run the concatenation 'in place' without allocating new memory?
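Roughly, what I am doing looks like this (the file names and read_csv are just stand-ins for how the pieces are actually loaded):

    import pandas as pd

    # 30 files of ~1 GB each; the names are placeholders
    paths = ['part_{:02d}.csv'.format(i) for i in range(30)]
    list_of_pieces = [pd.read_csv(p) for p in paths]   # ~30 GB resident in memory

    # this is the step I am asking about
    big_df = pd.concat(list_of_pieces)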

Does anyone know?

Thanks!

  • I don't think it's in-place... as an aside, I don't think you actually want to read that much into memory (you're not going to leave much room for actually doing calculations)! I think an HDF5 store is a much better choice for you. Commented Jun 7, 2013 at 11:51
  • @AndyHayden, I'm afraid I do need that much data in memory; I need to do some interactive analysis on it :-( Commented Jun 7, 2013 at 12:49

2 Answers


The answer is no, this is not an in-place operation; np.concatenate is used under the hood, see here: Concatenate Numpy arrays without copying
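A quick way to convince yourself on a toy example (just a sketch, not the 30 GB case; the expected output is noted in the comment):

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame({'a': np.arange(5)})
    df2 = pd.DataFrame({'a': np.arange(5, 10)})

    result = pd.concat([df1, df2])

    # the result is backed by freshly allocated arrays, not views of the inputs
    print(np.shares_memory(result['a'].values, df1['a'].values))   # expect: False

So at peak you should budget for roughly the pieces plus the concatenated copy.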

A better approach to the problem is to write each of these pieces to an HDFStore table; see here: http://pandas.pydata.org/pandas-docs/dev/io.html#hdf5-pytables for docs, and here: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore for some recipes.

Then you can select whatever portions (or even the whole set) as needed, by query or even by row number.
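Roughly, that workflow looks like this (the file names, key, and column name below are just placeholders, and PyTables needs to be installed):

    import pandas as pd

    paths = ['part_{:02d}.csv'.format(i) for i in range(30)]   # placeholder file names

    # append each ~1 GB piece as it is read, instead of holding all 30 in memory
    store = pd.HDFStore('pieces.h5')
    for path in paths:
        piece = pd.read_csv(path)
        store.append('data', piece, data_columns=['some_column'])

    # later, pull back only what a given analysis needs
    by_query = store.select('data', where='some_column > 0')    # by query
    by_rows = store.select('data', start=0, stop=100000)        # by row number
    store.close()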

Certain types of operations can even be done while the data is on disk; see here: https://github.com/pydata/pandas/issues/3202?source=cc, and here: http://pytables.github.io/usersguide/libref/expr_class.html#
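In a similar spirit (this is a chunked-iteration sketch, not the Expr approach from those links; the chunk size and column name are arbitrary), you can run a reduction without ever materializing the whole table:

    import pandas as pd

    # continuing from the store built above
    store = pd.HDFStore('pieces.h5')
    total = 0.0
    for chunk in store.select('data', chunksize=1000000):
        total += chunk['some_column'].sum()
    store.close()
    print(total)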


Try this:

    import pandas as pd

    dfs = [df1, df2]

    # concatenate without an extra defensive copy of the blocks
    temp = pd.concat(dfs, copy=False, ignore_index=False)

    # empty df1 in place, then assign the combined columns back into it
    df1.drop(df1.index[0:], inplace=True)
    df1[temp.columns] = temp

4 Comments

Try adding code formatting for better readability
I've tested your solution with a 1.2 GB table. It's definitely slower; so slow that after waiting 10 minutes the script was still running. (Using plain pd.concat it takes 30 seconds.)
I find this a very clever way to handle the problem. Thanks. It wasn't that slow for me: I used it on around 1 GB of data on AWS, and it worked almost instantaneously.
This is a resourceful way to save limited [memory] resources.
