DataFrame of DataFrames in Python (Pandas)

Question

For every year, I can create three DataFrames (df1, df2, df3), each containing different firms and stock prices ('firm' and 'price' are the two columns in df1 through df3). I would like to use another DataFrame (named 'store' below) to store the three DataFrames for each year.

Here is my code:

import pandas as pd

store = pd.DataFrame(list(range(1967, 2014)), columns=['year'])
for year in range(1967, 2014):
    # ... code that generates df1, df2 and df3 correctly ...
    store.loc[store['year'] == year, 'df1'] = df1
    store.loc[store['year'] == year, 'df2'] = df2
    store.loc[store['year'] == year, 'df3'] = df3

I don't get any error or warning from this code, but in the 'store' DataFrame the columns 'df1', 'df2' and 'df3' contain only NaN values.

What is DataFrame - gvkey? And what is fyear? Can you add a sample of df1 and the desired output of store?
– jezrael, Commented Mar 11, 2016 at 6:13
Just based on the code, I think you should use three dictionaries instead of one DataFrame. I personally wouldn't store DataFrames in a DataFrame.
– Patrick the Cat, Commented Mar 11, 2016 at 13:39

Ami Tavory · Accepted Answer · 2016-03-11 14:04:41Z

I think that pandas offers better alternatives to what you're suggesting (rationale below).

For one, there's the pandas.Panel data structure, which was meant for things like what you're doing here (note that it was deprecated in pandas 0.20.0 and removed in 0.25.0).

However, as Wes McKinney (the Pandas author) noted in his book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, multi-dimensional indices, to a large extent, offer a better alternative.

Consider the following alternative to your code:

import pandas as pd

dfs = []
for year in range(1967, 2014):
    # ... code that generates df1, df2 and df3 ...
    df1['year'] = year
    df1['origin'] = 'df1'
    df2['year'] = year
    df2['origin'] = 'df2'
    df3['year'] = year
    df3['origin'] = 'df3'
    dfs.extend([df1, df2, df3])
df = pd.concat(dfs)

This gives you a DataFrame with 4 columns: 'firm', 'price', 'year', and 'origin'.
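
As a rough, runnable sketch of the loop above: the `make_frames` helper and all of its data are hypothetical stand-ins for the elided per-year code, and the year range is shortened for illustration. It uses `assign` rather than in-place column assignment, and passes `ignore_index=True` to `pd.concat`, both of which are choices made here and not part of the code above.

```python
import pandas as pd

def make_frames(year):
    # Hypothetical stand-in for the per-year code elided above; the data is made up.
    df1 = pd.DataFrame({"firm": ["A", "B"], "price": [10.0, 11.0]})
    df2 = pd.DataFrame({"firm": ["C"], "price": [20.0]})
    df3 = pd.DataFrame({"firm": ["D", "E"], "price": [30.0, 31.0]})
    return {"df1": df1, "df2": df2, "df3": df3}

dfs = []
for year in range(1967, 1970):  # shortened range, for illustration only
    for origin, frame in make_frames(year).items():
        # assign() returns a new frame, so the inputs are left unmodified
        dfs.append(frame.assign(year=year, origin=origin))

# ignore_index=True gives the combined frame a fresh 0..n-1 index
df = pd.concat(dfs, ignore_index=True)
print(df.columns.tolist())
```

Each row of the combined frame then records which year and which of the three source frames it came from.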

This gives you the flexibility to:

Organize hierarchically by, say, 'year' and 'origin' with df.set_index(['year', 'origin']), or by 'origin' and 'price' with df.set_index(['origin', 'price']).
Group by different levels.
In general, slice and dice the data in many different ways.
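
The options in this list might look like the following sketch. The small frame here is made-up data in the four-column shape described earlier, and the variable names (`indexed`, `subset`, `mean_price`) are illustrative only.

```python
import pandas as pd

# Made-up data in the long format: one row per firm, year and origin.
df = pd.DataFrame({
    "firm":   ["A", "B", "C", "A", "B", "C"],
    "price":  [10.0, 20.0, 30.0, 12.0, 22.0, 32.0],
    "year":   [1967, 1967, 1967, 1968, 1968, 1968],
    "origin": ["df1", "df1", "df2", "df1", "df1", "df2"],
})

# Hierarchical index on (year, origin); sorting keeps label lookups well-behaved.
indexed = df.set_index(["year", "origin"]).sort_index()

# All rows that came from df1 in 1967.
subset = indexed.loc[(1967, "df1"), :]

# Mean price per (year, origin) group.
mean_price = df.groupby(["year", "origin"])["price"].mean()
print(mean_price)
```

Because the year and origin are ordinary columns until you choose an index, the same frame can be re-indexed or regrouped in other ways without rebuilding it.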

What you're suggesting in the question makes one dimension (origin) arbitrarily different, and it's hard to think of an advantage to this. If a split along some dimension is necessary due to, e.g., performance, you can combine DataFrames better with standard Python data structures:

A dictionary mapping each year to a DataFrame with the other three dimensions.
Three DataFrames, one for each origin, each having three dimensions.
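
A minimal sketch of both splits, assuming the combined long-format frame already exists. The data is made up, and building the dictionaries with `groupby` is just one convenient way to do it; you could equally fill them inside the per-year loop.

```python
import pandas as pd

# Made-up data in the long format described above.
df = pd.DataFrame({
    "firm":   ["A", "B", "C", "A", "B", "C"],
    "price":  [10.0, 20.0, 30.0, 12.0, 22.0, 32.0],
    "year":   [1967, 1967, 1967, 1968, 1968, 1968],
    "origin": ["df1", "df1", "df2", "df1", "df1", "df2"],
})

# Option 1: a dictionary mapping each year to a DataFrame with the remaining columns.
by_year = {year: group.drop(columns="year") for year, group in df.groupby("year")}

# Option 2: one DataFrame per origin, each still carrying firm, price and year.
by_origin = {origin: group.drop(columns="origin") for origin, group in df.groupby("origin")}

print(by_year[1967])
```

A plain dictionary keeps each piece an ordinary DataFrame, which avoids placing DataFrame objects inside the cells of another DataFrame.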


3 Comments

Dan Over a year ago

This is very helpful!

Joey Baruch Over a year ago

Hi @Ami, can you please reference where in the book they discuss multi-dimensional indexes?

VoteCoffee Over a year ago

Note that pandas.Panel was deprecated in v0.20.0. The documentation notes that the recommended way to represent 3-D data is with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package.
