
I am running a basic script that loops over a nested dictionary, grabs data from each record, and appends it to a Pandas DataFrame. The data looks something like this:

data = {"SomeCity": {"Date1": [record1, record2, record3, ...], "Date2": [], ...}, ...}

In total it has a few million records. The script itself looks like this:

from pandas import DataFrame, Series

cities = ["SomeCity"]
df = DataFrame({}, columns=['Date', 'HouseID', 'Price'])
for city in cities:
    for dateRun in data[city]:
        for record in data[city][dateRun]:
            recSeries = Series([record['Timestamp'],
                                record['Id'],
                                record['Price']],
                               index=['Date', 'HouseID', 'Price'])
            df = df.append(recSeries, ignore_index=True)

However, this runs painfully slowly. Before I look for a way to parallelize it, I just want to make sure I'm not missing something obvious that would make it perform faster as it is, as I'm still quite new to Pandas.

3 Comments

  • Have you looked at from_dict?
  • Appending rows to DataFrames is inherently inefficient. Try to create the entire DataFrame with its final size in one go. As EdChum says, in this case you can probably do this with from_dict.
  • Thanks! I'll give both a try and see how it performs.

7 Answers

I also used the DataFrame's append function inside a loop and was perplexed by how slowly it ran.

A useful example for those who are suffering, based on the correct answer on this page.

Python version: 3

Pandas version: 0.20.3

import pandas as pd

# the dictionary to pass to DataFrame.from_dict
d = {}

# a counter used as the row key when adding entries to d
i = 0

# example data to loop over and append to a dataframe
data = [{"foo": "foo_val_1", "bar": "bar_val_1"},
        {"foo": "foo_val_2", "bar": "bar_val_2"}]

# the loop
for entry in data:

    # add a dictionary entry to the final dictionary
    d[i] = {"col_1_title": entry['foo'], "col_2_title": entry['bar']}

    # increment the counter
    i = i + 1

# create the dataframe using 'from_dict'
# important: set the 'orient' parameter to "index" so the dict keys become rows
df = pd.DataFrame.from_dict(d, orient="index")

The "from_dict" function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html


11 Comments

This example is definitely very helpful!
This is a fast approach for sure, but since Python's default dictionary was not ordered before Python 3.7, data exported to Excel might get mixed up randomly. I highly recommend using OrderedDict from the collections module.
This is really quick. Operation taking around 20 seconds now completes within milliseconds. Thanks a ton :)
Great tip. Incredibly useful. For my use case, I went from 45+ minutes down to less than 5 by using this method.
I dropped from nearly 2h to less than 5 seconds xD THANKS!

Appending rows to a list is far more efficient than appending to a DataFrame. Hence you would want to:

  1. append the rows to a list,
  2. then convert the list into a DataFrame, and
  3. set the index as required (see the sketch below).
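
A minimal sketch of this pattern, using stand-in records and column names (the input data here is just an assumption for illustration):

import pandas as pd

# stand-in input: any iterable of record dicts
records = [{'Timestamp': '2015-01-13', 'Id': 1, 'Price': 100.0},
           {'Timestamp': '2015-01-14', 'Id': 2, 'Price': 110.0}]

# 1. append plain dicts to a list (cheap, O(1) per row)
rows = []
for record in records:
    rows.append({'Date': record['Timestamp'],
                 'HouseID': record['Id'],
                 'Price': record['Price']})

# 2. build the DataFrame once from the list
df = pd.DataFrame(rows)

# 3. set the index as required
df = df.set_index('Date')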

2 Comments

Great and simple solution! For everyone searching for the implementation of step 2: simply do df = pd.DataFrame(my_list, columns=['col1', 'col2']).
It's a really good, simple solution!

Another way is to collect the rows into a list and then use pd.concat

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

def append(df):
    df_out = df.copy()
    for i in range(1000):
        df_out = df_out.append(df)
    return df_out

def concat(df):
    df_list = []
    for i in range(1001):
        df_list.append(df)

    return pd.concat(df_list)


# some testing
df2 = concat(df)
df3 = append(df)

pd.testing.assert_frame_equal(df2,df3)

%timeit concat(df)

20.2 ms ± 794 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit append(df)

275 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is now the recommended way to concatenate rows in pandas:

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. link

Comments


I think the best way to do it, if you know the data you are going to receive, is to allocate beforehand.

import numpy as np
import pandas as pd

random_matrix = np.random.randn(100, 100)
insert_df = pd.DataFrame(random_matrix)

df = pd.DataFrame(columns=range(100), index=range(200))
df.loc[range(100), df.columns] = random_matrix
df.loc[range(100, 200), df.columns] = random_matrix

This is the pattern that I think makes the most sense. append will be faster if you have a very small dataframe, but it doesn't scale.

In [1]: import numpy as np; import pandas as pd

In [2]: random_matrix = np.random.randn(100, 100)
   ...: insert_df = pd.DataFrame(random_matrix)
   ...: df = pd.DataFrame(np.random.randn(100, 100))

In [3]: %timeit df.append(insert_df)
272 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(100), df.columns] = random_matrix
493 µs ± 4.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(100), df.columns] = insert_df
821 µs ± 8.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

When we run this with a 100,000 row dataframe, we see much more dramatic results.

In [1]: df = pd.DataFrame(np.random.randn(100_000, 100))

In [2]: %timeit df.append(insert_df)
17.9 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: %timeit df.loc[range(100), df.columns] = random_matrix
465 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: %timeit df.loc[range(99_900, 100_000), df.columns] = random_matrix
465 µs ± 5.75 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: %timeit df.loc[range(99_900, 100_000), df.columns] = insert_df
1.02 ms ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So we can see that an append is about 17 times slower than an insert with a DataFrame, and about 38 times slower than an insert with a NumPy array.

Comments


I ran into a similar problem where I had to append to a DataFrame many times but did not know the values in advance of the appends. I wrote a lightweight, DataFrame-like data structure that is just blists under the hood. I use it to accumulate all of the data and then, when it is complete, transform the output into a Pandas DataFrame. Here is a link to my project, all open source, so I hope it helps others:

https://pypi.python.org/pypi/raccoon
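
The general accumulate-then-transform idea looks something like this (a plain-lists sketch of the pattern, not raccoon's actual API):

import pandas as pd

# cheap per-row appends into plain Python lists, one list per column
col_buffers = {'Date': [], 'HouseID': [], 'Price': []}

def add_row(date, house_id, price):
    # appending to a list is O(1); no DataFrame is touched here
    col_buffers['Date'].append(date)
    col_buffers['HouseID'].append(house_id)
    col_buffers['Price'].append(price)

add_row('2015-01-13', 1, 100.0)
add_row('2015-01-14', 2, 110.0)

# single transformation into a DataFrame once the data is complete
df = pd.DataFrame(col_buffers)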

1 Comment

nice library - adding it to my core MVPs

In my case I was loading a large number of dataframes with the same columns from different files and wanted to append them to create one large DataFrame.

My solution was to first load all the dataframes into a list, and then use

all_dfs = []
for i in all_files:
    df = ...  # load the dataframe from the file here
    all_dfs.append(df)

master_df = pd.concat(all_dfs, ignore_index=True)
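
As a concrete version, assuming the files are CSVs (the glob pattern here is hypothetical):

import glob
import pandas as pd

all_dfs = []
for path in glob.glob("data/*.csv"):    # hypothetical location of the input files
    all_dfs.append(pd.read_csv(path))   # every file shares the same columns

master_df = pd.concat(all_dfs, ignore_index=True)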

1 Comment

This worked really well for me.

A quick benchmark of three approaches: appending rows to a list and using from_records, building a dict of rows and using from_dict, and appending to the DataFrame one row at a time:

import time
import pandas as pd

N = 100000

# approach 1: append rows to a list, then build with from_records
t0 = time.time()
d = []
for i in range(N):
    d.append([i, i+1, i+2, i+3, i+0.1, 1+0.2])
testdf = pd.DataFrame.from_records(d, columns=["x1", "x2", "x3", "x4", "x5", "x6"])
print(time.time() - t0)

# approach 2: build a dict of row dicts, then build with from_dict
t0 = time.time()
d = {}
for i in range(N):
    d[len(d)+1] = {"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}
testdf = pd.DataFrame.from_dict(d, orient="index")
print(time.time() - t0)

# approach 3: append to the DataFrame one row at a time (slow)
t0 = time.time()
testdf = pd.DataFrame()
for i in range(N):
    testdf = testdf.append({"x1": i, "x2": i+1, "x3": i+2, "x4": i+3, "x5": i+0.1, "x6": 1+0.2}, ignore_index=True)
print(time.time() - t0)


=== result for N=10000 ===
list:0.016329050064086914
dict:0.03952217102050781
DataFrame:10.598219871520996

=== result for N=100000 ===
list: 0.4076499938964844
dict: 0.45696187019348145
DataFrame: 187.6609809398651
